
How to split MongoDB Replica set without Initial Sync

  • Writer: Vivek Shukla
  • Aug 27, 2022

As part of a data center migration project, we had a requirement to break down an existing MongoDB replica set while making sure an initial data synchronisation is avoided.


What is "Initial Sync" and when does it happen?


Sometimes replica set members fall off the oplog (their replication lag exceeds the time window the oplog covers) and the node needs to be resynced. When this happens, an initial sync is required, which does the following:

  1. Clones all databases except the local database. To clone, the mongod scans every collection in each source database and inserts all data into its own copies of these collections.

  2. Applies all changes to the data set. Using the oplog from the source, the mongod updates its data set to reflect the current state of the replica set.

When the initial sync finishes, the member transitions from STARTUP2 to SECONDARY.
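
A quick way to see how much headroom the oplog gives you is the built-in shell helper below; if a member's replication lag ever exceeds the window it reports, that member has fallen off the oplog and will need an initial sync.

// Run on any replica set member: prints the configured oplog size and the
// time span between the first and last oplog entries (the "oplog window").
rs.printReplicationInfo()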


Why avoid "Initial Sync"?


The primary reason to avoid an initial sync when splitting a MongoDB replica set is to save time. For large datasets, an initial sync can take a long time and may not meet the requirement of having the data available with minimal to no downtime.


To avoid an initial sync, we made sure all replica set members were caught up and in sync before proceeding with the following steps to split the replica set into two.
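
One way to confirm this from the primary (with the 3.6 shell used in this walkthrough) is the helper below, which prints each secondary's lag; every secondary should report 0 seconds behind the primary before you proceed.

// Run on the primary. Each secondary should show "0 secs (0 hrs) behind the primary".
rs.printSlaveReplicationInfo()    // renamed rs.printSecondaryReplicationInfo() in newer shells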


Step 1: Create a mongod replica set called "repl1" with the following 6 members


Replica set members:

  • 127.0.0.1:27017

  • 127.0.0.1:27018

  • 127.0.0.1:27019

  • 127.0.0.1:27037

  • 127.0.0.1:27038

  • 127.0.0.1:27039

ps aux | grep mongod

root      88350  0.8  1.6 1508736 66112 ?       Sl   08:01   0:06 mongod -f /etc/mongod1.conf
root      88428  0.8  1.6 1435240 66312 ?       Sl   08:01   0:06 mongod -f /etc/mongod2.conf
root      88514  0.8  1.6 1435220 65568 ?       Sl   08:01   0:06 mongod -f /etc/mongod3.conf
root      88719  5.6  1.5 1025972 61560 ?       Sl   08:14   0:00 mongod -f /etc/mongod10.conf
root      88748  7.0  1.5 1025968 61568 ?       Sl   08:14   0:00 mongod -f /etc/mongod11.conf
root      88777 10.0  1.5 1025968 61196 ?       Sl   08:14   0:01 mongod -f /etc/mongod12.conf
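
For completeness, here is a minimal sketch of how such a set can be initiated from the shell, assuming each /etc/mongodN.conf already sets replication.replSetName to repl1 and the corresponding port:

// Connect to any one of the six instances (e.g. mongo --port 27017) and run:
rs.initiate({
  _id: "repl1",
  members: [
    { _id: 0, host: "127.0.0.1:27017" },
    { _id: 1, host: "127.0.0.1:27018" },
    { _id: 2, host: "127.0.0.1:27019" },
    { _id: 3, host: "127.0.0.1:27037" },
    { _id: 4, host: "127.0.0.1:27038" },
    { _id: 5, host: "127.0.0.1:27039" }
  ]
})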

Step 2: Remove mongod10, mongod11 and mongod12 from "repl1", connected to the primary replica set member


cfg = rs.conf()
printjson(cfg)
// Keep only the three members staying in the original set (27017, 27018 and 27019);
// the array indices must match the positions of those members in the printed config.
cfg.members = [cfg.members[0] , cfg.members[1] , cfg.members[2]]
rs.reconfig(cfg, {force : true})
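
A quick sanity check on the primary of the original set confirms that only the three remaining members are left in the configuration:

// Should list only 127.0.0.1:27017, 127.0.0.1:27018 and 127.0.0.1:27019.
rs.conf().members.map(function(m) { return m.host })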

Step 3: Connect to mongod10, mongod11 and mongod12 and update the replica set configuration stored in the local database


At this stage, it is worth noting that all three removed members are showing "OTHER" status and not "PRIMARY" or "SECONDARY".

root@osboxes:/# mongo --port 27037
MongoDB shell version v3.6.22
connecting to: mongodb://127.0.0.1:27037/?gssapiServiceName=mongodb
repl1:OTHER> 

repl1:OTHER> use local
switched to db local
repl1:OTHER> db.system.replset.find()
{ "_id" : "repl1", "version" : 39358, "protocolVersion" : NumberLong(1), "members" : [ { "_id" : 0, "host" : "127.0.0.1:27017", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 1, "tags" : {  }, "slaveDelay" : NumberLong(0), "votes" : 1 }, { "_id" : 1, "host" : "127.0.0.1:27018", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 1, "tags" : {  }, "slaveDelay" : NumberLong(0), "votes" : 1 }, { "_id" : 2, "host" : "127.0.0.1:27019", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 1, "tags" : {  }, "slaveDelay" : NumberLong(0), "votes" : 1 } ], "settings" : { "chainingAllowed" : true, "heartbeatIntervalMillis" : 2000, "heartbeatTimeoutSecs" : 10, "electionTimeoutMillis" : 10000, "catchUpTimeoutMillis" : -1, "catchUpTakeoverDelayMillis" : 30000, "getLastErrorModes" : {  }, "getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 }, "replicaSetId" : ObjectId("602be10a038eacce2c2b3469") } }


cfg = db.system.replset.findOne( { "_id": "repl1" } )
cfg.members[0].host = "127.0.0.1:27037"
cfg.members[1].host = "127.0.0.1:27038"
cfg.members[2].host = "127.0.0.1:27039"
db.system.replset.update( { "_id": "repl1" } , cfg )
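
The same edit has to be repeated while connected to each of the three removed members (mongod10, mongod11 and mongod12), since every member keeps its own copy of the configuration in local.system.replset. Before restarting, it is worth re-reading the stored document to confirm the hosts were rewritten:

// Should now list only 127.0.0.1:27037, 127.0.0.1:27038 and 127.0.0.1:27039.
db.system.replset.findOne( { "_id": "repl1" } ).members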

Step 4: Restart mongod10, mongod11 and mongod12 after the configuration changes


root@osboxes:/# ps aux | grep mongod

root      88350  0.7  1.7 1527992 71088 ?       Sl   08:01   0:25 mongod -f /etc/mongod1.conf
root      88428  0.7  1.7 1524360 72220 ?       Sl   08:01   0:24 mongod -f /etc/mongod2.conf
root      88514  5.7  2.2 1618956 91932 ?       Sl   08:01   3:04 mongod -f /etc/mongod3.conf
root      89554 26.1  1.7 1489636 68936 ?       Sl   08:40   3:48 mongod -f /etc/mongod10.conf
root      89664  0.9  1.6 1495964 67916 ?       Sl   08:40   0:07 mongod -f /etc/mongod11.conf
root      89761  0.8  1.6 1446892 67228 ?       Sl   08:41   0:07 mongod -f /etc/mongod12.conf

root@osboxes:/# kill 89761
root@osboxes:/# kill 89664
root@osboxes:/# kill 89554

root@osboxes:/# ps aux | grep mongod
osboxes   10701  0.1  3.4 1063920 137408 ?      SLl  Mar01   1:09 /usr/lib/mongodb-compass/MongoDB Compass
osboxes   10706  0.0  0.4 379380 16564 ?        S    Mar01   0:00 /usr/lib/mongodb-compass/MongoDB Compass --type=zygote
osboxes   10708  0.0  0.1 379380  6484 ?        S    Mar01   0:00 /usr/lib/mongodb-compass/MongoDB Compass --type=zygote
osboxes   10733  1.4  1.1 578572 45548 ?        Sl   Mar01  14:40 /usr/lib/mongodb-compass/MongoDB Compass --type=gpu-process --field-trial-handle=5821276629990299928,13521613394572191784,131072 --disable-features=LayoutNG,SpareRendererForSitePerProcess --gpu-preferences=IAAAAAAAAAAgAACgAAAAAAAAYAAAAAAACAAAAAAAAAAIAAAAAAAAAA== --service-request-channel-token=7765904788148664306
osboxes   10740  3.5  7.2 7618780 290852 ?      Sl   Mar01  37:10 /usr/lib/mongodb-compass/MongoDB Compass --type=renderer --js-flags=--harmony --field-trial-handle=5821276629990299928,13521613394572191784,131072 --disable-features=LayoutNG,SpareRendererForSitePerProcess --lang=en-US --app-path=/usr/lib/mongodb-compass/resources/app.asar --node-integration --no-sandbox --no-zygote --background-color=#fff --num-raster-threads=1 --service-request-channel-token=11994226207720253685 --renderer-client-id=5 --no-v8-untrusted-code-mitigations --shared-files=v8_context_snapshot_data:100,v8_natives_data:101
osboxes   10754  0.0  0.3 511356 12988 ?        S    Mar01   0:00 /usr/lib/mongodb-compass/MongoDB Compass --type=broker
root      88350  0.7  1.7 1527992 71112 ?       Sl   08:01   0:25 mongod -f /etc/mongod1.conf
root      88428  0.7  1.7 1524360 72236 ?       Sl   08:01   0:25 mongod -f /etc/mongod2.conf
root      88514  5.9  2.2 1618956 91824 ?       Sl   08:01   3:12 mongod -f /etc/mongod3.conf
root      90070  0.0  0.0  14224   992 pts/17   R+   08:55   0:00 grep --color=auto mongod
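
As an aside, kill sends SIGTERM, which mongod handles as a clean shutdown; an equivalent, more explicit alternative on Linux is:

# --shutdown reads the dbPath from the config file and performs a clean shutdown (Linux only).
mongod -f /etc/mongod10.conf --shutdown
mongod -f /etc/mongod11.conf --shutdown
mongod -f /etc/mongod12.conf --shutdown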

root@osboxes:/# mongod -f /etc/mongod10.conf
about to fork child process, waiting until server is ready for connections.
forked process: 90078
child process started successfully, parent exiting
root@osboxes:/# mongod -f /etc/mongod11.conf
about to fork child process, waiting until server is ready for connections.
forked process: 90160
child process started successfully, parent exiting
root@osboxes:/# mongod -f /etc/mongod12.conf
about to fork child process, waiting until server is ready for connections.
forked process: 90209
child process started successfully, parent exiting

root@osboxes:/# ps aux | grep mongod
root      88350  0.7  1.7 1527992 71160 ?       Sl   08:01   0:25 mongod -f /etc/mongod1.conf
root      88428  0.7  1.8 1524360 72348 ?       Sl   08:01   0:25 mongod -f /etc/mongod2.conf
root      88514  5.8  2.2 1610760 92248 ?       Sl   08:01   3:12 mongod -f /etc/mongod3.conf
root      90078  6.8  1.5 1408588 61556 ?       Sl   08:56   0:01 mongod -f /etc/mongod10.conf
root      90160  7.7  1.5 1049852 60508 ?       Sl   08:56   0:01 mongod -f /etc/mongod11.conf
root      90209 16.2  1.4 1048840 59708 ?       Sl   08:56   0:01 mongod -f /etc/mongod12.conf

Step 5: Verify the replica set after the restart and make sure we have one primary and two secondary members


repl1:PRIMARY> rs.status()
{
	"set" : "repl1",
	"date" : ISODate("2021-03-02T14:00:43.170Z"),
	"myState" : 1,
	"term" : NumberLong(16),
	"syncingTo" : "",
	"syncSourceHost" : "",
	"syncSourceId" : -1,
	"heartbeatIntervalMillis" : NumberLong(2000),
	"optimes" : {
		"lastCommittedOpTime" : {
			"ts" : Timestamp(1614693637, 1),
			"t" : NumberLong(16)
		},
		"readConcernMajorityOpTime" : {
			"ts" : Timestamp(1614693637, 1),
			"t" : NumberLong(16)
		},
		"appliedOpTime" : {
			"ts" : Timestamp(1614693637, 1),
			"t" : NumberLong(16)
		},
		"durableOpTime" : {
			"ts" : Timestamp(1614693637, 1),
			"t" : NumberLong(16)
		}
	},
	"members" : [
		{
			"_id" : 0,
			"name" : "127.0.0.1:27037",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 256,
			"optime" : {
				"ts" : Timestamp(1614693637, 1),
				"t" : NumberLong(16)
			},
			"optimeDate" : ISODate("2021-03-02T14:00:37Z"),
			"syncingTo" : "",
			"syncSourceHost" : "",
			"syncSourceId" : -1,
			"infoMessage" : "could not find member to sync from",
			"electionTime" : Timestamp(1614693626, 1),
			"electionDate" : ISODate("2021-03-02T14:00:26Z"),
			"configVersion" : 39358,
			"self" : true,
			"lastHeartbeatMessage" : ""
		},
		{
			"_id" : 1,
			"name" : "127.0.0.1:27038",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 19,
			"optime" : {
				"ts" : Timestamp(1614693637, 1),
				"t" : NumberLong(16)
			},
			"optimeDurable" : {
				"ts" : Timestamp(1614693637, 1),
				"t" : NumberLong(16)
			},
			"optimeDate" : ISODate("2021-03-02T14:00:37Z"),
			"optimeDurableDate" : ISODate("2021-03-02T14:00:37Z"),
			"lastHeartbeat" : ISODate("2021-03-02T14:00:42.026Z"),
			"lastHeartbeatRecv" : ISODate("2021-03-02T14:00:42.767Z"),
			"pingMs" : NumberLong(0),
			"lastHeartbeatMessage" : "",
			"syncingTo" : "127.0.0.1:27037",
			"syncSourceHost" : "127.0.0.1:27037",
			"syncSourceId" : 0,
			"infoMessage" : "",
			"configVersion" : 39358
		},
		{
			"_id" : 2,
			"name" : "127.0.0.1:27039",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 15,
			"optime" : {
				"ts" : Timestamp(1614693637, 1),
				"t" : NumberLong(16)
			},
			"optimeDurable" : {
				"ts" : Timestamp(1614693637, 1),
				"t" : NumberLong(16)
			},
			"optimeDate" : ISODate("2021-03-02T14:00:37Z"),
			"optimeDurableDate" : ISODate("2021-03-02T14:00:37Z"),
			"lastHeartbeat" : ISODate("2021-03-02T14:00:42.025Z"),
			"lastHeartbeatRecv" : ISODate("2021-03-02T14:00:42.487Z"),
			"pingMs" : NumberLong(0),
			"lastHeartbeatMessage" : "",
			"syncingTo" : "127.0.0.1:27037",
			"syncSourceHost" : "127.0.0.1:27037",
			"syncSourceId" : 0,
			"infoMessage" : "",
			"configVersion" : 39358
		}
	],
	"ok" : 1,
	"operationTime" : Timestamp(1614693637, 1),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1614693637, 1),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
}
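
With a large rs.status() document, a condensed view of just the member states is often easier to read:

// Run on the new set's primary; expected output is one PRIMARY and two SECONDARY members.
rs.status().members.map(function(m) { return m.name + " : " + m.stateStr })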

Step 6: Verify the logs of the secondary members to make sure that no initial synchronisation is happening.


2021-03-02T09:00:23.747-0500 I REPL     [replexec-0] New replica set config in use: { _id: "repl1", version: 39358, protocolVersion: 1, members: [ { _id: 0, host: "127.0.0.1:27037", arbiterOnly: false, b
uildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "127.0.0.1:27038", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, sl
aveDelay: 0, votes: 1 }, { _id: 2, host: "127.0.0.1:27039", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true,
 heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: -1, catchUpTakeoverDelayMillis: 30000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, w
timeout: 0 }, replicaSetId: ObjectId('602be10a038eacce2c2b3469') } }
2021-03-02T09:00:23.747-0500 I REPL     [replexec-0] This node is 127.0.0.1:27038 in the config
2021-03-02T09:00:23.747-0500 I REPL     [replexec-0] transition to STARTUP2 from STARTUP
2021-03-02T09:00:23.747-0500 I REPL     [replexec-0] Starting replication storage threads
2021-03-02T09:00:23.748-0500 I REPL     [replexec-0] transition to RECOVERING from STARTUP2
2021-03-02T09:00:23.748-0500 I REPL     [replexec-0] Starting replication fetcher thread
2021-03-02T09:00:23.748-0500 I REPL     [replexec-0] Starting replication applier thread
2021-03-02T09:00:23.748-0500 I REPL     [replexec-0] Starting replication reporter thread
2021-03-02T09:00:23.748-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 127.0.0.1:27037
2021-03-02T09:00:23.748-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 127.0.0.1:27039
2021-03-02T09:00:23.749-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 127.0.0.1:27039 - HostUnreachable: Connection refused
2021-03-02T09:00:23.749-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 127.0.0.1:27039 due to failed operation on a connection
2021-03-02T09:00:23.749-0500 I REPL_HB  [replexec-0] Error in heartbeat (requestId: 3) to 127.0.0.1:27039, response status: HostUnreachable: Connection refused
2021-03-02T09:00:23.749-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 127.0.0.1:27039
2021-03-02T09:00:23.750-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to 127.0.0.1:27037, took 2ms (1 connections now open to 127.0.0.1:27037)
============================================
2021-03-02T09:00:23.750-0500 I REPL     [rsSync] transition to SECONDARY from RECOVERING

2021-03-02T09:00:23.750-0500 I REPL     [rsSync] Resetting sync source to empty, which was :27017

2021-03-02T09:00:27.474-0500 I NETWORK  [listener] connection accepted from 127.0.0.1:57632 #2 (2 connections now open)
2021-03-02T09:00:27.474-0500 I NETWORK  [conn2] end connection 127.0.0.1:57632 (1 connection now open)
2021-03-02T09:00:27.476-0500 I NETWORK  [listener] connection accepted from 127.0.0.1:57636 #3 (2 connections now open)
2021-03-02T09:00:27.476-0500 I NETWORK  [conn3] received client metadata from 127.0.0.1:57636 conn3: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.6.22" }, os: { type: "Linux", name: 
"Ubuntu", architecture: "x86_64", version: "16.04" } }
2021-03-02T09:00:27.761-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 127.0.0.1:27039
2021-03-02T09:00:27.761-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to 127.0.0.1:27039, took 0ms (1 connections now open to 127.0.0.1:27039)
2021-03-02T09:00:27.762-0500 I REPL     [replexec-0] Member 127.0.0.1:27039 is now in state SECONDARY
2021-03-02T09:00:28.751-0500 I REPL     [rsBackgroundSync] sync source candidate: 127.0.0.1:27037
2021-03-02T09:00:28.751-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to 127.0.0.1:27037
2021-03-02T09:00:28.752-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to 127.0.0.1:27037, took 1ms (1 connections now open to 127.0.0.1:27037)
2021-03-02T09:00:28.753-0500 I REPL     [rsBackgroundSync] Changed sync source from empty to 127.0.0.1:27037
2021-03-02T09:00:28.754-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to 127.0.0.1:27037
2021-03-02T09:00:28.755-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to 127.0.0.1:27037, took 1ms (2 connections now open to 127.0.0.1:27037)
2021-03-02T09:00:47.512-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to 127.0.0.1:27037
2021-03-02T09:00:47.514-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to 127.0.0.1:27037, took 2ms (3 connections now open to 127.0.0.1:27037)
2021-03-02T09:01:47.514-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Ending idle connection to host 127.0.0.1:27037 because the pool meets constraints; 2 connections to that host remain open
2021-03-02T09:02:02.516-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to 127.0.0.1:27037
2021-03-02T09:02:02.518-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to 127.0.0.1:27037, took 2ms (3 connections now open to 127.0.0.1:27037)
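
A simple check is to search each removed member's log for initial sync activity; there should be no matches. The log paths below are placeholders for whatever systemLog.path is set to in /etc/mongod10.conf, /etc/mongod11.conf and /etc/mongod12.conf.

# No output expected if the members came up without an initial sync.
grep -i "initial sync" /path/to/mongod10.log /path/to/mongod11.log /path/to/mongod12.log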

