Thursday, 24 December 2020

Replication for Beginners

Replication helps improve read throughput by boosting data redundancy even if it comes at the expense of having to manage data consistency between different replicas. 


Replication strategies - 


Leader based replication - write to a leader, followers catch up with leader eventually


A more durable approach is for write from client to be waiting till the write has propagated to at least one follower in addition to the leader.


Purely Synchronous replication, Most Durable- Client writes to leader and blocks until all followers have gotten a copy of the write from the leader


Semi-Synchronous replication, Less Durable than Purely synchronous but More Available - Client writes to a leader and blocks until some number of followers, “w” have obtained a copy of the write.



How does follower catch up with leader ?


  1. When writes happen, one of the parties, client or the server (leader/ cluster manager) generates a sequence id for the write that is either auto-incrementing or corresponds to the timestamp when the write was received from the client.

    In 1, we mentioned leader or cluster manager does this sequence id generation operation but there are merits to both parties discharging this responsibility. In typical distributed systems that do not have a dedicated cluster manager, leader could assume this responsibility. In others, cluster manager may be a better bet to introduce greater separation and distribution of concerns. There are challenges with both approaches, when leader generates sequence ids, the reliability of sequence id generation are more tightly dependent on the reliability of leader election and subsequent new leader catching up with the former one’s write logs. When cluster manager is introduced in a set up, it is recommended that a high-performant, costly instance is used to reduce the likelihood of this system not failing or otherwise reliability of the system needs to be improved with a secondary cluster manager that will be catching up with the primary one from time to time to maintain durability of writes at all times. 


  1. Going back to 1, after leader has received a write, it typically writes it to a write log (Similar to a write log in databases) first before writing it to its own index and subsequently to the disk. Note that individual implementations may differ based on the challenges involved in keeping writes durable in its environment or application scenarios. For example, some systems may treat a write to leader as completed immediately after a memory-based index has been written to while others may wait until the memory-based index write has also been written to a storage system or write block on disk.


  1. Once the write has completed on the leader, the writes happen to followers at a very simplistic level, based on a push or pull model.

    In Push-based replication models, leader typically writes to a durable and robust queue that guarantees writes to it are never lost and that reads from it also never fail to deliver to any of the consumers registered to it as subscribers. If the robustness of a separate queue aren’t needed, writes can also propagate from the leader to followers in a peering-based or star-based model or ring-based model. Note that replication from leader to followers themselves may be architected in a multitude of different ways based on the specific application scenarios and conditions.

    In Pull-based replication models, followers could poll the leader for all writes that happened on the leader since a sequence-id passed to it by the follower and the changes received from the leader may be applied to the followers’ own indexes and storage systems. 

Sunday, 13 December 2020

NoSQL vs SQL For Beginners

Often times, we compare and contrast two popular DBMS paradigms, NoSQL and SQL. While we are all conversant with the basics of these two kinds of systems, significant developments in the recent versions of two technologies themselves has rendered a lot of our understanding moot.

5 years back, the following would have been the most common distinction between NoSQL and SQL as far as their applicable use cases concern-

NoSQL -

  1. You have a read/write heavy system with little to no updates.
  2. You want horizontal scaling flexibility
  3. You have a simple lookup use case with no/minimal joins between related collections/tables.
  4. You need the flexibility of a schema-less design and have arbitrary schema for each document in your collection.
  5. You believe that denormalized data storage eliminates need for joins at runtime.
SQL - 
  1. Your system is transactional in nature and as such needs to leverage ACID capabilities offered by a typical SQL solution
  2. You conceive little to no changes in data model/schema.
  3. You have planned your design for at least 3 years ahead into the future and any further performance improvements needed then shall be addressed simply with vertical scaling of your database servers.
  4. Oracle is the leading enterprise SQL solution and you don't need all the bells and whistles of a paid solution for a backend system that supports clients within your own enterprise.

Many of the above are simply true even today and need no further explanation. But here are a few that I believe would benefit from explicit reiteration - 

The premise that SQL databases cannot be horizontally scaled is simply not true. You can horizontally scale even your SQL databases by sharding your indexes if you think it will grow too huge for an index that can be completely held in memory. 

I guess about 15 years back, we were living in a time where SQL and NoSQL had a strict distinction over their respective capabilities but today the gap between them has bridged greatly. Many characteristics of SQL have been borrowed into NoSQL solutions (think of joins, various consistency models other than eventually consistent, transactional capabilities etc.) and vice-versa (think of schema-less design with json/clobs).

So next time you get asked this question in an interview, feel free to confidently clarify any misconceptions the interviewers may themselves have.

Appendix- 
ACID - Atomicity, Consistency, Isolation, Durability
BASE - Basically Available, Soft state, Eventually consistent

Friday, 19 June 2020

How naive implementation of find in Union-Find is quadratic in number of nodes

For naive implementation of find in union-find, how is the cost O(N^2)? Are all nodes in the graph at the same depth? Is that even possible if they are connected?

Let's take an example of graph of "n" nodes with maximum height, which looks somewhat like a left-skewed or right-skewed tree.

In a skewed tree, the leaf itself is at a depth N-1. For every node above it, depth is decreasing by one, i.e. N-2, N-3, N-4, ...,3,2,1, 0.
If we add this together, 0 + 1+2,+3+...N-1 => (ignoring 0, we get) (N-1) (N)/2 = O(N^2)
One of those instances when math clarifies logic. I will add code snippets to the the above shortly.