How I understood intuition behind Replication and Sharding...
Problem Statement :-
Have you ever imagined how a large database system serves each one of user the data they want in a jiffy. Lets take example of Whatsapp, we all globally use this app for communication, daily billions of messages are passed on to and fro. Now the question one can ask is how does a DB server achieves this and gives a best in class user experience. The moment you open your Whatsapp you see all the recent contacts, and inside them all your recent chats right there in a blink of a blink.
So us being an engineer should think what all can be the pain points here while serving to our users. There can be :
1. Say too many reads happening and the DB can outload itself and crash at high read times.
2. Or one can think the user base grew so much that the messages are outnumbering the DB capacity or DB server processing power.
Lets explore on how above two problems is dealt in most of the high scale platforms serving billions of data points reliably and in a blink of a blink.
Initial Thinking and My learnings
Lets take above two situation one by one and build a intuition on what best can be done to solve above with the trade-offs of solution.
1. Too Many Reads
We all get it serving too many users from a single point is not a good idea. Say you have a popular Ice-Cream shop and the customer grew alot in numbers. You being owner of shop is thinking how to reduce this traffic of customers, you don't want to loose on customer also you want to serve each one of them. So you think of creating multiple Kiosk serving the exact same flavours of ice-creams at different location. So you have a primary store where you make all of ice-cream and give out to Kiosk to serve customers.
So above we have reduced the customer traffic by placing some stores at multiple places, and the user will go to the nearest available store to get the Ice-Cream he wants. Hence, the load on a single shop got reduced and we kept care of serving each one of willing to have the Ice-Cream.
We follow similar practise in code world, instead of one primary DB server serving requests of all the users the message they need, we create replicas of the primary DB. These replicas then serve the users and share the load among themselves with clever techniques. This is called Replication and is a base fix to the problem where lots of read is happening from a single source.
Thus, we can create multiple replicas of primary DB and this reduces the load from the primary DB.
How this benefits in crux we can say:
Better Read: now users wont face downtime and we can smartly reflect the user to any of the available replica to have them their messages.
High Availability: system is now say 99 % of the time available. Very less downtime happens. If one replica is under migration load or simply not available then the other replica's can share the load and serve the user.
Less load on Primary DB: our master DB aka primary DB will be having less read load, and it can focus on other important aspects, say writes and making the replica's in sync.
2. Billions of data point !!! The user base grew alot
Here, the pain point is storing all the messages in a single place is not a wise thing to do. Naturally one can think for storing parts of data instead storing all of the data in one place. Lets say if you keep on storing the messages in one single Primary DB, there can be a situation where:
The DB might run out of space.
It can take time to write in large DB, the processing power takes a hit.
Hence, instead of storing whole user's messages in one place. One can smartly assign which DB write messages for what set of users. And while serving we fetch it from that assigned DB for that user.
This parting of primary DB is formally known as Sharding of DB. A very fundamental technique used to solve large growing data problem.
Thus,
Now comes a thought how do we distribute the users among different Sharded DB? We can create a simple equation,
Shard to assign = hash(user_id) % no_of_shard
***One should be curios why we did hashing of user_id, instead of direct "user_id % no_of_shard"
How this benefits in crux we can say:
Horizontal Scaling: We thus have scaled our DB horizontally, and now we can serve more number of users than we could achieve with a single primary DB. PS: do not look above figure and say ain't it vertical scaling !!!
More writes: Now we can handle lot more number of writes into our each Sharded DB. Thus solving high number of write problem.
One thing to note down here is that in practise we use mix of both Replication and Sharding... Cause combining both solves both high read and high write of our concerned data unit Whatsapp messages in our case.
Thus,
Takeaway
In above read we have build intuition for need of Replication and Sharding of DB and what problem does it solves. Which enables us to serve millions of users with their billions of message in a blink of a blink. Keeping high availability and low latency. And this both technique combined forms the base on which Whatsapp builds up their fundamental to serve each users the message they deserve to see in a jiffy...
