Sunday, 19 September 2021

Difference between VMs and Docker images


Having long wondered how Docker set out to differ from VMs, I spent some time on it as a thought experiment. I finally got some leads today, and here, for posterity, is my best-effort explanation of the difference between the two.

Let's begin by giving due credit to the team at dotCloud, a startup from the 2008 era that built Docker and has since come to be known by the same name. This team realized that managing application dependencies and binaries across a number of customers is cumbersome and unreliable.

Think of a product such as Adobe Dreamweaver that is deployed to a number of clusters across different organizations, each with a custom environment setup. If Dreamweaver were built as a standalone application deployed on bare-metal instances, OS-level differences, such as differing memory management and I/O access strategies, could make it behave differently from one bare-metal cluster to the next.
 
Now think of an improvement that would give developers at Adobe control over the environment at each deployment site. That can be understood as one possible motivation for creating VMs, apart from the commonly stated benefits such as lower operational costs through better resource management. But a company that builds software products, like Adobe, has little say over the deployment environment in its customers' clusters. Hence what is preferable, and in fact feasible, is a mechanism that lets application builders package their application together with the environment it needs: the OS userland (the base image) and the specific dependency versions required to run it reliably. This is Docker. The environment is contained in the Docker image, which also includes the application binary and its dependencies. Unlike a VM, a container started from that image does not carry its own kernel; it shares the kernel of the host. This is arguably one of the pioneering moments in the history of software build, packaging and deployment.

Monday, 13 September 2021

Keeping datacenters in sync with asynchronous replication using a durable queue

I had often wondered what eventual consistency actually implied in the context of large-scale distributed systems. Does it mean that data will eventually be replicated, achieving consistency at a later point in time? Or does it mean that every attempt is made to keep disjoint data storage systems in sync before eventually giving up, which in effect leads to a quasi-consistency scheme?

Over the years, having worked with AWS and third-party clouds, and after conceiving a potential working scheme for consumers of such distributed cloud vendors, I began to understand that eventual consistency in practice mirrors what the theory says about it more closely than it resembles the cynical view I first held: a best-effort, possibly vain scheme to keep data in sync.

Here is a relatively simplistic, high-level depiction of a backend architecture that could keep geographically distributed, trans-continental (just indulging my artistic freedom now) data centers in sync. You can think of the asynchronous queue forwarder as a process or service that collects the writes destined for the individual data centers and delivers them indirectly, via a queuing system that supports multiple concurrent consumers (such as Kafka, or SQS with one queue per consumer). The consumers in this case are the data centers that need to be kept in sync. The durability guarantees of the queue should be sufficient to ensure that a write, once accepted by the queue, is not lost before every data center has applied it.
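To make the flow above concrete, here is a minimal Python sketch. It is not a real Kafka or SQS client: `DurableLog`, `DataCenter` and `QueueForwarder` are hypothetical names, and the "durable queue" is just an in-memory append-only list standing in for one. The sketch also assumes that each data center tracks its own read offset and that writes to the same key are applied in log order.

```python
from dataclasses import dataclass, field


@dataclass
class DurableLog:
    """In-memory stand-in for a durable, multi-consumer queue (hypothetical)."""
    entries: list = field(default_factory=list)

    def append(self, key, value):
        self.entries.append((key, value))
        return len(self.entries) - 1  # offset of the entry just written

    def read_from(self, offset, max_items=10):
        # Reading never removes entries, so every consumer sees every write.
        return self.entries[offset:offset + max_items]


@dataclass
class DataCenter:
    """One consumer of the log; keeps its own offset (assumption)."""
    name: str
    store: dict = field(default_factory=dict)
    offset: int = 0  # next log position this data center has not applied yet

    def catch_up(self, log, max_items=10):
        batch = log.read_from(self.offset, max_items)
        for key, value in batch:
            self.store[key] = value  # applied in log order
        self.offset += len(batch)
        return len(batch)  # 0 means fully caught up


class QueueForwarder:
    """Accepts application writes and hands them to the queue."""

    def __init__(self, log):
        self.log = log

    def write(self, key, value):
        return self.log.append(key, value)


log = DurableLog()
forwarder = QueueForwarder(log)
us_east = DataCenter("us-east")
eu_west = DataCenter("eu-west")

forwarder.write("user:1", "alice")
forwarder.write("user:2", "bob")
forwarder.write("user:1", "alice-v2")

us_east.catch_up(log)               # applies everything written so far
eu_west.catch_up(log, max_items=1)  # lags behind: applies one write only
diverged = us_east.store != eu_west.store

while eu_west.catch_up(log):        # keep consuming until nothing is left
    pass
converged = us_east.store == eu_west.store
```

The two data centers disagree while `eu_west` lags and agree once it has drained the log, which is the "eventual" part of eventual consistency. Nothing is lost in between because the writes stay in the log until each consumer has read past them.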

The queue forwarder service could be a standalone microservice or a daemon running locally inside your application container or pod. I have more to say on the microservice versus daemon design, so stay tuned for a follow-on post. At this point, though, I don't see why one design should necessarily be better than the other.