How to split MongoDB Replica set without Initial Sync
- Vivek Shukla
- Aug 27, 2022
- 7 min read
As part of a data center migration project, we needed to split an existing MongoDB replica set in two while making sure an initial data synchronisation was avoided.
What is "Initial Sync" and when does it happen?
Sometimes a replica set member falls too far behind the primary's oplog and needs to be resynced. When this happens, an initial sync is performed, which does the following:
Clones all databases except the local database. To clone, the mongod scans every collection in each source database and inserts all data into its own copies of these collections.
Applies all changes to the data set. Using the oplog from the source, the mongod updates its data set to reflect the current state of the replica set.
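A quick way to see how much headroom a member has is rs.printReplicationInfo(), which prints the configured oplog size and the time window the oplog currently covers; a member that stays offline longer than that window falls off the oplog and can only recover via an initial sync.

// Run in the mongo shell on any member: prints the configured oplog size
// and the time span between the first and last oplog entries
rs.printReplicationInfo()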
Why avoid "Initial Sync"?
The primary reason to avoid an initial sync when splitting a MongoDB replica set is time: for a large dataset, an initial sync can take many hours, which would not meet the requirement of keeping the data available with minimal to no downtime.
To avoid it, we made sure all replica set members were caught up and in sync before proceeding with the following steps to split the replica set in two.
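One way to confirm this is rs.printSlaveReplicationInfo(), the name used by the v3.6 shell shown below (later versions rename it rs.printSecondaryReplicationInfo()); every secondary should report zero lag before you proceed.

// Run on the primary: shows each secondary's syncedTo time and its lag
// behind the primary; all members should report 0 secs behind
rs.printSlaveReplicationInfo()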
Step 1: Create a mongod replica set called "repl1" with the following 6 members
Replica set members:
127.0.0.1:27017
127.0.0.1:27018
127.0.0.1:27019
127.0.0.1:27037
127.0.0.1:27038
127.0.0.1:27039
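For reference, a minimal sketch of one member's config file (the dbPath and log path here are illustrative assumptions; replSetName and port must match the list above):

# /etc/mongod1.conf (sketch)
storage:
  dbPath: /data/mongod1              # assumed path
systemLog:
  destination: file
  path: /var/log/mongodb/mongod1.log # assumed path
  logAppend: true
net:
  bindIp: 127.0.0.1
  port: 27017
replication:
  replSetName: repl1
processManagement:
  fork: true

With all six mongods started, the set is initiated once from any member:

rs.initiate({
    _id: "repl1",
    members: [
        { _id: 0, host: "127.0.0.1:27017" },
        { _id: 1, host: "127.0.0.1:27018" },
        { _id: 2, host: "127.0.0.1:27019" },
        { _id: 3, host: "127.0.0.1:27037" },
        { _id: 4, host: "127.0.0.1:27038" },
        { _id: 5, host: "127.0.0.1:27039" }
    ]
})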
ps aux | grep mongod
root 88350 0.8 1.6 1508736 66112 ? Sl 08:01 0:06 mongod -f /etc/mongod1.conf
root 88428 0.8 1.6 1435240 66312 ? Sl 08:01 0:06 mongod -f /etc/mongod2.conf
root 88514 0.8 1.6 1435220 65568 ? Sl 08:01 0:06 mongod -f /etc/mongod3.conf
root 88719 5.6 1.5 1025972 61560 ? Sl 08:14 0:00 mongod -f /etc/mongod10.conf
root 88748 7.0 1.5 1025968 61568 ? Sl 08:14 0:00 mongod -f /etc/mongod11.conf
root 88777 10.0 1.5 1025968 61196 ? Sl 08:14 0:01 mongod -f /etc/mongod12.conf

Step 2: Remove mongod10, mongod11 and mongod12 from repl1 by running the following on the primary
cfg = rs.conf()
printjson(cfg)
// Keep only the three members that remain in the original replica set
// (with six members the valid array indices are 0-5; the first three
// positions hold 127.0.0.1:27017-27019 -- adjust to your own config)
cfg.members = [cfg.members[0] , cfg.members[1] , cfg.members[2]]
rs.reconfig(cfg, {force : true})

Step 3: Connect to mongod10, mongod11 and mongod12 and update the replica set configuration stored in the local database
At this stage, it is worth noting that all three removed members report a state of "OTHER" rather than "PRIMARY" or "SECONDARY".
root@osboxes:/# mongo --port 27037
MongoDB shell version v3.6.22
connecting to: mongodb://127.0.0.1:27037/?gssapiServiceName=mongodb
repl1:OTHER>
repl1:OTHER> use local
switched to db local
repl1:OTHER> db.system.replset.find()
{ "_id" : "repl1", "version" : 39358, "protocolVersion" : NumberLong(1), "members" : [ { "_id" : 0, "host" : "127.0.0.1:27017", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 1, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 }, { "_id" : 1, "host" : "127.0.0.1:27018", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 1, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 }, { "_id" : 2, "host" : "127.0.0.1:27019", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 1, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 } ], "settings" : { "chainingAllowed" : true, "heartbeatIntervalMillis" : 2000, "heartbeatTimeoutSecs" : 10, "electionTimeoutMillis" : 10000, "catchUpTimeoutMillis" : -1, "catchUpTakeoverDelayMillis" : 30000, "getLastErrorModes" : { }, "getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 }, "replicaSetId" : ObjectId("602be10a038eacce2c2b3469") } }
// Point each member's host at the new addresses for the second set
cfg = db.system.replset.findOne( { "_id": "repl1" } )
cfg.members[0].host = "127.0.0.1:27037"
cfg.members[1].host = "127.0.0.1:27038"
cfg.members[2].host = "127.0.0.1:27039"
db.system.replset.update( { "_id": "repl1" } , cfg )

Step 4: Restart mongod10, mongod11 and mongod12 to pick up the configuration changes
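The restart is what makes the members load the edited configuration from their local databases. The session below stops the three mongods with kill, which sends SIGTERM and triggers a clean mongod shutdown; an equivalent, more explicit approach (assuming the same config paths, Linux only) would be:

# Cleanly shut down and restart one member (repeat for the other two)
mongod --shutdown -f /etc/mongod10.conf
mongod -f /etc/mongod10.conf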
root@osboxes:/# ps aux | grep mongod
root 88350 0.7 1.7 1527992 71088 ? Sl 08:01 0:25 mongod -f /etc/mongod1.conf
root 88428 0.7 1.7 1524360 72220 ? Sl 08:01 0:24 mongod -f /etc/mongod2.conf
root 88514 5.7 2.2 1618956 91932 ? Sl 08:01 3:04 mongod -f /etc/mongod3.conf
root 89554 26.1 1.7 1489636 68936 ? Sl 08:40 3:48 mongod -f /etc/mongod10.conf
root 89664 0.9 1.6 1495964 67916 ? Sl 08:40 0:07 mongod -f /etc/mongod11.conf
root 89761 0.8 1.6 1446892 67228 ? Sl 08:41 0:07 mongod -f /etc/mongod12.conf
root@osboxes:/# kill 89761
root@osboxes:/# kill 89664
root@osboxes:/# kill 89554
root@osboxes:/# ps aux | grep mongod
root 88350 0.7 1.7 1527992 71112 ? Sl 08:01 0:25 mongod -f /etc/mongod1.conf
root 88428 0.7 1.7 1524360 72236 ? Sl 08:01 0:25 mongod -f /etc/mongod2.conf
root 88514 5.9 2.2 1618956 91824 ? Sl 08:01 3:12 mongod -f /etc/mongod3.conf
root 90070 0.0 0.0 14224 992 pts/17 R+ 08:55 0:00 grep --color=auto mongod
root@osboxes:/# mongod -f /etc/mongod10.conf
about to fork child process, waiting until server is ready for connections.
forked process: 90078
child process started successfully, parent exiting
root@osboxes:/# mongod -f /etc/mongod11.conf
about to fork child process, waiting until server is ready for connections.
forked process: 90160
child process started successfully, parent exiting
root@osboxes:/# mongod -f /etc/mongod12.conf
about to fork child process, waiting until server is ready for connections.
forked process: 90209
child process started successfully, parent exiting
root@osboxes:/# ps aux | grep mongod
root 88350 0.7 1.7 1527992 71160 ? Sl 08:01 0:25 mongod -f /etc/mongod1.conf
root 88428 0.7 1.8 1524360 72348 ? Sl 08:01 0:25 mongod -f /etc/mongod2.conf
root 88514 5.8 2.2 1610760 92248 ? Sl 08:01 3:12 mongod -f /etc/mongod3.conf
root 90078 6.8 1.5 1408588 61556 ? Sl 08:56 0:01 mongod -f /etc/mongod10.conf
root 90160 7.7 1.5 1049852 60508 ? Sl 08:56 0:01 mongod -f /etc/mongod11.conf
root 90209 16.2 1.4 1048840 59708 ? Sl 08:56 0:01 mongod -f /etc/mongod12.conf

Step 5: Verify the replica set after the restart and make sure we have one primary and two secondary members
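Before reading the full rs.status() output below, a one-liner in the shell gives a compact summary of each member's state:

// Prints one line per member, e.g. "127.0.0.1:27037 PRIMARY"
rs.status().members.forEach(function (m) {
    print(m.name + " " + m.stateStr);
})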
repl1:PRIMARY> rs.status()
{
	"set" : "repl1",
	"date" : ISODate("2021-03-02T14:00:43.170Z"),
	"myState" : 1,
	"term" : NumberLong(16),
	"syncingTo" : "",
	"syncSourceHost" : "",
	"syncSourceId" : -1,
	"heartbeatIntervalMillis" : NumberLong(2000),
	"optimes" : {
		"lastCommittedOpTime" : {
			"ts" : Timestamp(1614693637, 1),
			"t" : NumberLong(16)
		},
		"readConcernMajorityOpTime" : {
			"ts" : Timestamp(1614693637, 1),
			"t" : NumberLong(16)
		},
		"appliedOpTime" : {
			"ts" : Timestamp(1614693637, 1),
			"t" : NumberLong(16)
		},
		"durableOpTime" : {
			"ts" : Timestamp(1614693637, 1),
			"t" : NumberLong(16)
		}
	},
	"members" : [
		{
			"_id" : 0,
			"name" : "127.0.0.1:27037",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 256,
			"optime" : {
				"ts" : Timestamp(1614693637, 1),
				"t" : NumberLong(16)
			},
			"optimeDate" : ISODate("2021-03-02T14:00:37Z"),
			"syncingTo" : "",
			"syncSourceHost" : "",
			"syncSourceId" : -1,
			"infoMessage" : "could not find member to sync from",
			"electionTime" : Timestamp(1614693626, 1),
			"electionDate" : ISODate("2021-03-02T14:00:26Z"),
			"configVersion" : 39358,
			"self" : true,
			"lastHeartbeatMessage" : ""
		},
		{
			"_id" : 1,
			"name" : "127.0.0.1:27038",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 19,
			"optime" : {
				"ts" : Timestamp(1614693637, 1),
				"t" : NumberLong(16)
			},
			"optimeDurable" : {
				"ts" : Timestamp(1614693637, 1),
				"t" : NumberLong(16)
			},
			"optimeDate" : ISODate("2021-03-02T14:00:37Z"),
			"optimeDurableDate" : ISODate("2021-03-02T14:00:37Z"),
			"lastHeartbeat" : ISODate("2021-03-02T14:00:42.026Z"),
			"lastHeartbeatRecv" : ISODate("2021-03-02T14:00:42.767Z"),
			"pingMs" : NumberLong(0),
			"lastHeartbeatMessage" : "",
			"syncingTo" : "127.0.0.1:27037",
			"syncSourceHost" : "127.0.0.1:27037",
			"syncSourceId" : 0,
			"infoMessage" : "",
			"configVersion" : 39358
		},
		{
			"_id" : 2,
			"name" : "127.0.0.1:27039",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 15,
			"optime" : {
				"ts" : Timestamp(1614693637, 1),
				"t" : NumberLong(16)
			},
			"optimeDurable" : {
				"ts" : Timestamp(1614693637, 1),
				"t" : NumberLong(16)
			},
			"optimeDate" : ISODate("2021-03-02T14:00:37Z"),
			"optimeDurableDate" : ISODate("2021-03-02T14:00:37Z"),
			"lastHeartbeat" : ISODate("2021-03-02T14:00:42.025Z"),
			"lastHeartbeatRecv" : ISODate("2021-03-02T14:00:42.487Z"),
			"pingMs" : NumberLong(0),
			"lastHeartbeatMessage" : "",
			"syncingTo" : "127.0.0.1:27037",
			"syncSourceHost" : "127.0.0.1:27037",
			"syncSourceId" : 0,
			"infoMessage" : "",
			"configVersion" : 39358
		}
	],
	"ok" : 1,
	"operationTime" : Timestamp(1614693637, 1),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1614693637, 1),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
}
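Note that the members report uptimes of only seconds to minutes (they were just restarted), identical optimes across all three nodes, and the configVersion of 39358 carried over from the original set: the data and replication state survived the split intact.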
Step 6: Verify the logs of the secondary members to make sure that no initial synchronisation is happening. In the excerpt below, the node moves straight from RECOVERING to SECONDARY and begins tailing the primary's oplog; an initial sync would instead log explicit "initial sync" messages and a full re-clone of the data.
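Besides reading the log itself, a quick scan works too (the log path is an assumption; use whatever your systemLog.path is set to):

# No matches means the member never began an initial sync
grep -i "initial sync" /var/log/mongodb/mongod11.log

The relevant excerpt from mongod11's log: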
2021-03-02T09:00:23.747-0500 I REPL [replexec-0] New replica set config in use: { _id: "repl1", version: 39358, protocolVersion: 1, members: [ { _id: 0, host: "127.0.0.1:27037", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "127.0.0.1:27038", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "127.0.0.1:27039", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: -1, catchUpTakeoverDelayMillis: 30000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('602be10a038eacce2c2b3469') } }
2021-03-02T09:00:23.747-0500 I REPL [replexec-0] This node is 127.0.0.1:27038 in the config
2021-03-02T09:00:23.747-0500 I REPL [replexec-0] transition to STARTUP2 from STARTUP
2021-03-02T09:00:23.747-0500 I REPL [replexec-0] Starting replication storage threads
2021-03-02T09:00:23.748-0500 I REPL [replexec-0] transition to RECOVERING from STARTUP2
2021-03-02T09:00:23.748-0500 I REPL [replexec-0] Starting replication fetcher thread
2021-03-02T09:00:23.748-0500 I REPL [replexec-0] Starting replication applier thread
2021-03-02T09:00:23.748-0500 I REPL [replexec-0] Starting replication reporter thread
2021-03-02T09:00:23.748-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 127.0.0.1:27037
2021-03-02T09:00:23.748-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 127.0.0.1:27039
2021-03-02T09:00:23.749-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 127.0.0.1:27039 - HostUnreachable: Connection refused
2021-03-02T09:00:23.749-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 127.0.0.1:27039 due to failed operation on a connection
2021-03-02T09:00:23.749-0500 I REPL_HB [replexec-0] Error in heartbeat (requestId: 3) to 127.0.0.1:27039, response status: HostUnreachable: Connection refused
2021-03-02T09:00:23.749-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 127.0.0.1:27039
2021-03-02T09:00:23.750-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 127.0.0.1:27037, took 2ms (1 connections now open to 127.0.0.1:27037)
============================================
2021-03-02T09:00:23.750-0500 I REPL [rsSync] transition to SECONDARY from RECOVERING
2021-03-02T09:00:23.750-0500 I REPL [rsSync] Resetting sync source to empty, which was :27017
2021-03-02T09:00:27.474-0500 I NETWORK [listener] connection accepted from 127.0.0.1:57632 #2 (2 connections now open)
2021-03-02T09:00:27.474-0500 I NETWORK [conn2] end connection 127.0.0.1:57632 (1 connection now open)
2021-03-02T09:00:27.476-0500 I NETWORK [listener] connection accepted from 127.0.0.1:57636 #3 (2 connections now open)
2021-03-02T09:00:27.476-0500 I NETWORK [conn3] received client metadata from 127.0.0.1:57636 conn3: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.6.22" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "16.04" } }
2021-03-02T09:00:27.761-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 127.0.0.1:27039
2021-03-02T09:00:27.761-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 127.0.0.1:27039, took 0ms (1 connections now open to 127.0.0.1:27039)
2021-03-02T09:00:27.762-0500 I REPL [replexec-0] Member 127.0.0.1:27039 is now in state SECONDARY
2021-03-02T09:00:28.751-0500 I REPL [rsBackgroundSync] sync source candidate: 127.0.0.1:27037
2021-03-02T09:00:28.751-0500 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 127.0.0.1:27037
2021-03-02T09:00:28.752-0500 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 127.0.0.1:27037, took 1ms (1 connections now open to 127.0.0.1:27037)
2021-03-02T09:00:28.753-0500 I REPL [rsBackgroundSync] Changed sync source from empty to 127.0.0.1:27037
2021-03-02T09:00:28.754-0500 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 127.0.0.1:27037
2021-03-02T09:00:28.755-0500 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 127.0.0.1:27037, took 1ms (2 connections now open to 127.0.0.1:27037)
2021-03-02T09:00:47.512-0500 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 127.0.0.1:27037
2021-03-02T09:00:47.514-0500 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 127.0.0.1:27037, took 2ms (3 connections now open to 127.0.0.1:27037)
2021-03-02T09:01:47.514-0500 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 127.0.0.1:27037 because the pool meets constraints; 2 connections to that host remain open
2021-03-02T09:02:02.516-0500 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 127.0.0.1:27037
2021-03-02T09:02:02.518-0500 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 127.0.0.1:27037, took 2ms (3 connections now open to 127.0.0.1:27037)