Friday, 19 June 2020

How a naive implementation of find in Union-Find is quadratic in the number of nodes

For a naive implementation of find in union-find, how is the total cost O(N^2)? Are all nodes in the graph at the same depth? Is that even possible if they are connected?

Let's take the example of a graph of N nodes with maximum height, which looks somewhat like a left-skewed or right-skewed tree.

In a skewed tree, the deepest leaf is at depth N-1. For every node above it, the depth decreases by one: N-2, N-3, N-4, ..., 3, 2, 1, 0.
Adding these together: 0 + 1 + 2 + 3 + ... + (N-1) = (N-1)N/2 = O(N^2), so N find operations on a maximally skewed tree cost O(N^2) in total.
One of those instances when math clarifies logic. Here is a code snippet to make the above concrete.
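A minimal sketch of that worst case, assuming the standard array-based quick-union representation without union by rank or path compression (my own illustration, not from a particular textbook):

// Naive quick-union: parent[i] points to i's parent; a root points to itself.
class NaiveUnionFind {
    private final int[] parent;

    NaiveUnionFind(int n) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    // find walks up to the root: O(depth), which is O(N) on a skewed tree.
    int find(int x) {
        while (parent[x] != x) x = parent[x];
        return x;
    }

    // union without rank or path compression can keep growing one long chain.
    void union(int a, int b) {
        parent[find(a)] = find(b);
    }
}

// Worst case: union(0, 1), union(1, 2), ..., union(N-2, N-1) builds a chain of
// depth N-1, so N finds cost 0 + 1 + ... + (N-1) = N(N-1)/2 = O(N^2).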




Sunday, 23 October 2016

Key insight- Consistent hashing

I had glossed over the overview of Consistent Hashing many times in the past, and as much as I knew it was a highly important distributed-system hashing technique, I never really appreciated the myriad applications it underlies.

Today, I started reading Facebook's paper on scaling memcached to support its highly scalable and distributed architecture. I came across the term Consistent Hashing once again, and this time I took it upon myself to read about it in depth, to understand the specifics as well as the high-level overview I had already read many times before. For those of you reading this, I will try to keep it as technical as possible, with key mathematical insights from related sources I have already read. Additionally, I will provide references to the original posts as always, so that you can catch a glimpse of the material from the original sources themselves. Let's get started.

Most of you know that naive hashing techniques hash a key, say 'k' to a bucketing system with 'n' buckets, 0,1,2,3,...,n-1 using the formula:

hash(k) mod (n)

While this is a good, simple technique for single-machine, in-memory hash tables, it doesn't scale well in multi-node, distributed system architectures. Let me illustrate with an example.

Suppose you add a new bucket, numbered n, to the system. You now have n+1 buckets, and this requires you to re-hash each of your existing K keys using the new formula:

hash(k) mod (n+1)

Mathematically speaking, this moves a fraction n/(n+1) of the keys in your existing system to a new bucket. That involves an awful lot of network transfers, and the communication overhead is significant for most real-time production applications. If buckets are added or removed frequently enough, as is the case with most distributed systems that scale horizontally by adding nodes or lose nodes to failures, such re-hashing and the consequent movement of keys can bring down whole back-end content servers by bombarding them with a great number of write requests within a short interval of time.
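To see that fraction in practice, here is a tiny, self-contained experiment; the random values stand in for hash(k), and the class name and numbers are purely illustrative:

import java.util.Random;

// With hash(k) mod n, adding one bucket (n -> n+1) moves roughly n/(n+1) of the keys.
public class RehashDemo {
    public static void main(String[] args) {
        int n = 10, keys = 100_000, moved = 0;
        Random rnd = new Random(42);
        for (int i = 0; i < keys; i++) {
            int h = rnd.nextInt(Integer.MAX_VALUE);   // stand-in for hash(k)
            if (h % n != h % (n + 1)) {
                moved++;
            }
        }
        // Expect roughly n/(n+1) = 10/11 (about 91%) of the keys to move.
        System.out.printf("moved %d of %d keys%n", moved, keys);
    }
}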

Such was the need to revisit hashing from a new perspective, and in 1997, David Karger and his co-authors published a paper on Consistent Hashing that still retains wide applicability in most real-time, massive, distributed hashing systems today.

With consistent hashing, the idea is to keep the hash value of a key nearly the same, largely independent of the number of buckets/nodes in your distributed system. The simplest and earliest implementations of consistent hashing hashed both keys and buckets with the same hash function, which normalized both to the interval [0, 1).

Suppose the hash of your key is 0.6 and you have three buckets with hash values 0.2, 0.5 and 0.8, respectively. You pick the bucket whose hash value is closest to your key's hash value in the clockwise direction, i.e., 0.8. Let me quickly illustrate this idea:

Going clockwise from 0 (think of 0 as 00:00 on a clock face), the above hash values appear in the order 0.2, 0.5, 0.6 and 0.8. Your key therefore maps to the bucket corresponding to hash value 0.8, since you pick the nearest bucket in the clockwise direction.

If you have followed so far, you can reasonably be expected to be concerned by some apparent limitations of this approach, particularly its rather non-uniform mapping of keys to buckets. That concern is understandable and valid, and it will be addressed later in this post. For the time being, I will present a key property that makes this approach far more attractive than the naive hashing technique discussed earlier: if a new bucket n is added to a system that already has n buckets, 0, 1, 2, ..., n-1, only about a 1/(n+1) fraction of the keys needs to move to the new bucket, while the remaining keys stay mapped to their original buckets.

Let me make sure you understand this with an illustration. Any time you add a new bucket to this system, only a subset of the keys that used to map to the bucket immediately following it in the clockwise direction now map to the new bucket. As a result, it is straightforward to derive the set of keys that have to move as a result of the change in the hashing system itself.
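To make the clockwise lookup concrete, here is a bare-bones sketch; the class and method names, and the use of a TreeMap as the ring, are my own choices for illustration:

import java.util.SortedMap;
import java.util.TreeMap;

// A bare-bones consistent hash ring: buckets and keys share one hash space,
// and a key maps to the first bucket clockwise from its own position.
class SimpleHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    void addBucket(String bucket) {
        ring.put(hash(bucket), bucket);
    }

    void removeBucket(String bucket) {
        ring.remove(hash(bucket));
    }

    String bucketFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no buckets");
        }
        // First bucket at or after the key's position; wrap around past the end.
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        Integer position = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(position);
    }

    private int hash(String s) {
        // Stand-in hash; a real ring would use a stronger, well-mixed hash.
        return s.hashCode() & 0x7fffffff;
    }
}

Adding or removing a bucket in this sketch only touches the keys that fall between that bucket and its neighbour on the ring, which is exactly the limited key movement described above.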

Now I will proceed to discuss how a more sophisticated variant of consistent hashing addresses the non-uniform distribution of keys among the existing buckets. This design is called consistent hashing with virtual nodes.

The idea behind this approach is to assign each bucket a set of disjoint key ranges, or partitions as they are usually called, as opposed to the single range it owns in the basic implementation above. To visualize this assignment of partitions to buckets, check this link on how consistent hashing is implemented in Cassandra: Cassandra Consistent Hashing.

As you can see, it is relatively simple to view this organization as a pre-configured mapping from key partitions to nodes. Such a table-like mapping can be built at application start-up and stored for later lookup in cluster leaders, node managers, master nodes or other distributed-system entities entrusted with managing a set of machines.
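A possible sketch of the virtual-node idea, where each physical node is hashed to several positions on the ring but every position maps back to the owning node; the names and the replica count are my own illustration, not a recommendation:

import java.util.SortedMap;
import java.util.TreeMap;

// Each physical node owns many positions ("virtual nodes") on the ring, which
// evens out the key distribution and spreads a departing node's load over
// several successors instead of a single neighbour.
class VirtualNodeRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private final int replicas;

    VirtualNodeRing(int replicas) {
        this.replicas = replicas;
    }

    void addNode(String node) {
        for (int v = 0; v < replicas; v++) {
            ring.put(hash(node + "#" + v), node);   // many positions, one owner
        }
    }

    // Assumes at least one node has been added.
    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String s) {
        return s.hashCode() & 0x7fffffff;   // stand-in hash
    }
}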

I hope this post managed to educate you about consistent hashing, or at least aroused an interest in learning more about hashing systems in general.

Related sources:






Sunday, 9 October 2016

Enumerate Primes from Elements of programming interviews

I read the solution to this problem for the first time about a year back, and while one reading was enough to come to terms with its ingenuity, one reading was never enough to appreciate how it really works.

I must have solved it roughly 5 times over the span of a year, and today I note how a subtle idea can come between thinking you got it right and actually getting it right. When it comes to problem solving, there are questions you cannot solve, questions you can solve, and questions you think you can solve.

This is likely to be one of those problems you believe you can solve: you even get output on your console that, unless examined carefully and tested rigorously, would pass off as correct, and that only appears invalid when seen through the lens of the authors of a great book like EPI.

Aziz, Lee and Prakash state that once a number has been identified as prime, it suffices to mark all of its multiples as non-prime, starting with its square. A number p itself sits at index (p - 3)/2 of a lookup list, isPrime; in other words, the value represented at index i is 2 * i + 3. Now, the square of (2 * i + 3) is 4 * i * i + 12 * i + 9, and this square sits at index (4 * i * i + 12 * i + 9 - 3) / 2 = 2 * i * i + 6 * i + 3.

Trust me, the genius of these three authors precludes them from providing step-by-step explanations like this, but it took me a good amount of time to decode these identities for myself.

Now, just keep reminding yourself that the lookup list isPrime only tracks whether the odd numbers starting with 3 are prime, because no even number after 2 is prime.

if (isPrime.get(i)) {
    // add the number represented at index i to the result
    int p = 2 * i + 3;
    primes.add(p);

    // start at the index of p squared -- how obvious is this?
    for (long j = 2L * i * i + 6 * i + 3; j < size; j += p) {
        isPrime.set((int) j, Boolean.FALSE);
    }
}

The comment on the code snippet marks exactly the little subtlety that comes between actually getting this one right and merely thinking you got it right.

Interestingly enough, this works. What the authors are saying is that, starting with the index of the prime's square, it suffices to mark every index at steps of the prime itself; a step of p in index space corresponds to a step of 2p in number space.

E.g. p = 3, so p^2 = 9

index of p:   p_i = (3 - 3)/2 = 0
index of p^2: p^2_i = (9 - 3)/2 = 3
p^2_i + 3 = 6 corresponds to 2 * 6 + 3 = 15
p^2_i + 3 + 3 = 9 corresponds to 2 * 9 + 3 = 21

If you look carefully, all the even multiples between 9 and 15, between 15 and 21, and so on, are being skipped. Test this identity for any other prime and you will see the brilliance of this idea.
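For reference, here is a self-contained sketch of the whole routine following the same odd-only layout; the variable and method names are mine, not necessarily the book's:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// isPrime[i] stands for the odd number 2*i + 3, so the array covers 3, 5, 7, ...
static List<Integer> generatePrimes(int n) {
    List<Integer> primes = new ArrayList<>();
    if (n < 2) {
        return primes;
    }
    primes.add(2);                             // the only even prime
    int size = n < 3 ? 0 : (n - 3) / 2 + 1;    // number of odd candidates >= 3
    boolean[] isPrime = new boolean[size];
    Arrays.fill(isPrime, true);
    for (int i = 0; i < size; i++) {
        if (isPrime[i]) {
            int p = 2 * i + 3;
            primes.add(p);
            // p * p = (2i + 3)^2 sits at index 2*i*i + 6*i + 3; stepping the index
            // by p moves the represented number by 2*p, skipping even multiples.
            for (long j = 2L * i * i + 6 * i + 3; j < size; j += p) {
                isPrime[(int) j] = false;
            }
        }
    }
    return primes;
}

// generatePrimes(20) -> [2, 3, 5, 7, 11, 13, 17, 19]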


  

Thursday, 6 October 2016

Python safeguards


  • Never create a resource with open() or with open() inside the run method of a worker thread. Create the resource in the main thread and pass a reference to it to the worker instead.
  • with open() is better than a bare open() for writers because it closes the file automatically when control exits its scope, which flushes the write buffers to disk.
  • If a bare open() is used for write(), explicitly call f.close() or f.flush(), just as you would in Java.

Friday, 9 September 2016

Long.parseLong(String val, 16) gives incorrect Hex value

I recently noted a bug in the Java Long.parseLong function that sometimes causes the returned hexadecimal value to be incorrect when run on different machines. I presume this is likely due to inconsistencies between implementations in different Java versions; however, this needs to be verified.

Interestingly enough, I also verified that the alternate Long.toHexString function gives the correct hexadecimal value.

Just a heads up!

Friday, 6 May 2016

Brian Kernighan's technique to count set bits

Brian Kernighan's iterative technique to count set bits in an integer is described in the link below. Just a little help here to see how it works.

Let's start with an example.
Say for x = 9, you want to count the number of set bits (9 = 1001 in base 2).
Initialize a count variable to 0.

while (x > 0) {
    x = x & (x - 1);   // first iteration: x = 9 & 8 = 8
    count++;
}

The number of set bits is the final value of the "count" variable. Now, to see how it works, it is a good idea to reason backwards from the stopping condition, which is x = 0. Each x & (x - 1) clears exactly the lowest set bit of x, so x becomes zero during the iteration in which only one bit is still set, i.e., when x is a power of 2. For x = 9: 9 & 8 = 8 (count = 1), then 8 & 7 = 0 (count = 2), so 9 has two set bits.



Wednesday, 4 May 2016

WORKING!!! A decent hack for an unresolvable Gmail issue!

It has been long and I have something interesting to share here.

About 6 months back, one of my Gmail accounts fell victim to a series of spoofing attacks that continues to this day. For starters, spoofing attacks are not absolutely worrisome: your account does not get compromised. They happen when an unknown attacker starts spamming people with your email address in the From or Sender field, due to which you receive a great many email bounces every day and also get your account blocked from composing further email that day. This is because Gmail has an outbound mail threshold of 500 per day, after which any further outbound activity from your mailbox is temporarily blocked.

What is interesting about spoofing itself is that there is no known technique to remediate the attack once it is initiated on your account. As for me, I tried everything possible, from filing a bug to reporting it to Gmail, and I finally understood that I would simply have to let it go. That was until my account itself got suspended by Gmail, at which point I realized I had to either shut the account down or work around the problem myself.

If you ever find yourself in a similar predicament, here is a simple thing you can do to work around the account suspension. Gmail lets you set a default email address for all your outbound mail; if you create a new sender account and configure it as the default address for all your outgoing mail, you are in essence letting that account absorb all the unwanted outgoing spam that would otherwise have gone from your mailbox. Of course, this is not a one-time job: you will need to repeat the exercise when, inevitably, the designated sender account gets suspended too at some point. But at least you have something to begin with.

I have been watching the change for about a week now and I am excited to report that it has been working so far :) :)

Saturday, 8 August 2015

Compute Binomial Coefficient


The crux of the solution is recalling an important mathematical identity:
nCr = (n-1)Cr + (n-1)C(r-1)

// To find nCk, define an array a of size n + 1 (size k + 1 actually suffices);
// in Java a newly allocated long[] is zero-initialized.
long[] a = new long[n + 1];
a[0] = 1;
for (int i = 1; i <= n; i++) {
    for (int j = Math.min(i, k); j >= 1; j--) {
        a[j] = a[j] + a[j - 1];   // C(i, j) = C(i-1, j) + C(i-1, j-1)
    }
}

return a[k];
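For completeness, the same routine wrapped into a method with a quick sanity check (the method name is my own):

static long binomial(int n, int k) {
    long[] a = new long[k + 1];       // a[j] holds C(i, j) after processing row i
    a[0] = 1;
    for (int i = 1; i <= n; i++) {
        for (int j = Math.min(i, k); j >= 1; j--) {
            a[j] += a[j - 1];         // C(i, j) = C(i-1, j) + C(i-1, j-1)
        }
    }
    return a[k];
}

// binomial(5, 2) == 10, binomial(52, 5) == 2598960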




Levenshtein's Distance

For starters, Levenshtein distance is the number of edits you need to make to a string A to transform it into a string B. The edits can be: removing a character from A, adding a character to A, or substituting a character in A with the corresponding character in B.
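A minimal dynamic-programming sketch of the distance itself (my own illustration, not the implementation from the link below):

// dp[i][j] is the edit distance between the first i characters of a
// and the first j characters of b.
static int levenshtein(String a, String b) {
    int[][] dp = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) dp[i][0] = i;   // delete all of a
    for (int j = 0; j <= b.length(); j++) dp[0][j] = j;   // insert all of b
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            dp[i][j] = Math.min(Math.min(
                    dp[i - 1][j] + 1,                  // remove a character
                    dp[i][j - 1] + 1),                 // add a character
                    dp[i - 1][j - 1] + substitution);  // substitute a character
        }
    }
    return dp[a.length()][b.length()];
}

// levenshtein("kitten", "sitting") == 3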

Application: Spell Checker

Not many sources explain this algorithm well, but I happened to find one that explains it superbly.

Check this out:
http://oldfashionedsoftware.com/tag/levenshtein-distance/



Friday, 17 April 2015

MongoDB backup tipoff

MongoDB is a suitable choice for distributed storage and management when the data is largely going to be looked up, occasionally updated, and in any case not going to serve online transactions. But, as with any other database system, backup mechanisms are just as important here.

I have always used mongoexport and mongoimport to export MongoDB collection data and import it back into another collection. What I didn't know is that this is not a reliable way to back up collection data and restore it when needed.

Here is what I observed. With mongoexport on version 2.6, some fields go missing for some documents. Further, mongoexport doesn't reliably capture data type information either; for example, a NumberLong is not imported back as a NumberLong when the exporting and importing machines run on different platforms. For my own use case, this later led to data lookup issues when using Morphia.

Always use mongodump and mongorestore when you want to back up your data and import it back again. I found them reliable, as they also store data type information explicitly.




Sunday, 1 March 2015

Validating sequence of intermixed push and pop operations on Stack

I was reading this interesting problem on Stackoverflow:


The problem is not as complex as it seems at first glance. It clearly stipulates a condition for legitimate push: you may only push values in ascending order. 

Here, the key is to look at the sequence, and when you see a number k in it, understand that all numbers 0, 1, 2, ..., k-1 (every m < k) must already have been pushed by then. That is, you may not push any r > k before k itself has been pushed onto the stack.

With that simple idea retained, you can land at the solution pretty easily.
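A small sketch of that check, assuming the values pushed are 0, 1, 2, ... in ascending order and the input array is the claimed pop order (the method name is my own):

import java.util.ArrayDeque;
import java.util.Deque;

// Valid iff each popped value k can sit on top after pushing everything up to k.
static boolean isValidPopSequence(int[] popped) {
    Deque<Integer> stack = new ArrayDeque<>();
    int next = 0;                        // next value that may legally be pushed
    for (int k : popped) {
        while (next <= k) {              // push 0..k before k can be popped
            stack.push(next++);
        }
        if (stack.isEmpty() || stack.pop() != k) {
            return false;                // k is not on top of the stack
        }
    }
    return true;
}

// isValidPopSequence(new int[]{1, 0, 3, 2}) == true
// isValidPopSequence(new int[]{1, 3, 0, 2}) == false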



Thursday, 26 February 2015

ORM using Morphia: Tip off

Morphia is a great ORM framework that maps your Java objects to mongo documents.

A general piece of advice is to refrain from using primitive types like int, long and double in the DO, and to substitute them with their wrapper class counterparts. This is because Morphia assigns default values to fields with primitive types when no value has been stored for them, which is undesirable most of the time, especially when those default values carry other meaning in your domain context.
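A tiny, hypothetical document object showing the difference; the import paths follow the older org.mongodb.morphia packages and may differ in newer Morphia releases:

import org.bson.types.ObjectId;
import org.mongodb.morphia.annotations.Entity;
import org.mongodb.morphia.annotations.Id;

@Entity("orders")
public class OrderDO {

    @Id
    private ObjectId id;

    // private long quantity;   // a document without this field deserializes as 0
    private Long quantity;      // with the wrapper it stays null instead

    public Long getQuantity() {
        return quantity;
    }
}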


SpringBatch Chunking: Tip off

I have been working with the Spring Batch framework for a while now. As much as I find it comfortable to work with, there are instances when I get confounded by some interesting batch-specific behaviour.

For instance, when you use chunking for data processing, it is a common thing to use a reader, processor and writer for processing the data chunks. 

Note here that the chunk size is not 1 unless it has been explicitly specified in the batch description XML.

Remember that private data in the chunk processing pipeline is exclusive only between items of different chunks, not between items of the same chunk.

For example:

class MockProcessor implements ItemProcessor<K, V> {

    private Map<K, V> cache = new HashMap<>();

    @Override
    public V process(K item) throws Exception {
        // use the cache; its contents are visible to other items of the same chunk
        V value = transform(item);   // transform() is a placeholder for real work
        cache.put(item, value);
        return value;
    }
}

Here, the cache contents are private and exclusive only for items belonging to different chunks, not for items in the same chunk. So, if you have any item-specific processing to do, remember to set the chunk size to 1, or, better, clear and re-initialize your cache on each process invocation.