Monday 31 December 2012

What is SaaS? Why is it making it big now?

SaaS, or Software as a Service, is yet another modern-day trend in IT that seems to be making news just about everywhere. Like most other contemporary trends, I never really got the hang of SaaS, and that's precisely why I cared so little about it. This is yet another concept that does not really stir a techie's mind until he can actually associate himself with it, in other words, not until he gets a first-hand opportunity to experience and employ the technique in the industrial world. But that's not a hundred percent true. Really!

I just happened to watch a series of videos on SaaS last week. With each video, I gained a better perspective on this beautiful, modern-day trend, and I finally seemed to grasp the crux of the matter. My understanding peaked with a great YouTube video this morning, and believe me, I now think I know enough to start writing a page about it. For my readers' reference, here is the link to the video:


As the video points out, the IT world is moving towards more and more centralized, cloud-level administration. Until a few years back, companies that sold software services typically deployed the software to the client, and the software then worked within the purview and premises of the client organization. This sounds like a reasonable idea, but on closer reflection, one can see that it is a model in which human potential, time and effort are largely wasted and seldom re-used.

Most of us have some insight into computer science and engineering concepts, so we should be aware of the object-oriented principles of abstraction and reusability. While the former is not really significant in this context, the latter is possibly what stirred an IT revolution and gave birth to SaaS.

Reusability is all about reusing existing domain knowledge and thereby saving the effort directed towards achieving the same principal goal. In other words, never re-invent the wheel. Formerly, software systems were designed for each client organization independently, and deployment and maintenance efforts were then targeted at that organization independently of the other client organizations that availed of the same software service. I do not really need to ramble any further, as it is fairly intuitive what is wasted here: human effort and time, principally. Let me make it clearer with an example. Here you are, a software service provider, and you take all the effort to deploy and manage the product in one organization; now you have to repeat the same efforts for another client organization. Just plain old, same work, and really boring!

This results from a heterogeneous deployment strategy, wherein the software system is installed and configured on the server of each client organization separately. I really do not know who masterminded or championed the cause of SaaS, which instead aims at centralized deployment on a server held and managed by the service provider itself. The key idea here is a multi-tenanted architecture, wherein each client is charged a recurring fee that is a direct measure of how much they used the software service exported by the provider. Now, when the provider issues a version update to the system, it can be delivered to all subscribers with just a single update to the centralized server that houses the system used by all clients.
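To make the multi-tenant idea concrete, here is a minimal sketch in Python of one shared deployment metering per-tenant usage. Every name in it (SharedService, handle_request, the tenant ids and the rate) is my own invention for illustration, not the API of any real SaaS platform:

    # A toy sketch of multi-tenant metering: one shared deployment,
    # with per-tenant usage records driving the recurring bill.
    from collections import defaultdict

    class SharedService:
        """One centrally hosted instance serving many tenants."""

        def __init__(self, rate_per_hour):
            self.rate_per_hour = rate_per_hour
            self.usage_hours = defaultdict(float)  # tenant_id -> hours used

        def handle_request(self, tenant_id, hours):
            # Every request is tagged with its tenant, so a single
            # deployment can serve (and meter) all clients at once.
            self.usage_hours[tenant_id] += hours

        def monthly_bill(self, tenant_id):
            return self.usage_hours[tenant_id] * self.rate_per_hour

    service = SharedService(rate_per_hour=2.50)
    service.handle_request("acme-corp", hours=10)
    service.handle_request("globex", hours=4)
    print(service.monthly_bill("acme-corp"))  # 25.0

Notice that a bug fix or a new feature in SharedService reaches every tenant at once; that is the single-update advantage in miniature.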

It's just so beautiful, isn't it? An idea that can change your life!

Friday 28 December 2012

Harnessing BIG DATA: How important is it really?

There is this buzz about BIG DATA and the great potential it has to uncover little-known, hidden truths. So how important is it really?

For starters, here is a great video I found on YouTube by LinkAnalytics: 



Like everybody else, I will start off by stating that times have changed and that yesterday was incomparably different from today. Formerly, the amount of data in circulation on the web was limited. This was partly due to the relatively small number of internet users even ten years back. Also, back in the 90s, the web was a corporate tool (to help companies achieve their business goals) and a luxury of a privileged and wealthy few. The purpose it served then and the multifarious new applications it serves today are not even worth comparing. From business needs to entertainment, from informal social networking to professional, corporate-level business integration, the web today does pretty much anything, far beyond what experts could have envisaged.

What is inevitable with its increased use is the monstrous amount of new data being added to it. Until at least five years back, nobody really cared about all that data. We can attest to this from our own lives. We have been using free e-mail services for over a decade now. Ten years back, we used to periodically delete old, unimportant messages to make room for the new messages coming in. That was purely because of the storage constraints imposed upon users like you and me by the service providers. Things are completely different today. With more web advertising and cheaper storage infrastructure, we now have insane amounts of storage that render message deletion irrelevant!

All the messages are there, and until lately, neither you nor the service provider paid much attention to them. But then an idea dawned upon clever minds: rather than abandoning all the data that flows in every day, why not analyze it, discover contemporary trends, and use them effectively to maximize business revenues? Now, that's what data analytics, or data mining, is all about! I know little about it, as I have not yet attained corporate or industrial expertise, but I can surely vouch for this promising field, which is going to change our lives and open doors to a more intelligent tomorrow!
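Here is a tiny, made-up taste in Python of what "analyzing data for trends" can mean; the messages are fabricated purely for illustration:

    # Count which terms come up most often in a pile of old messages:
    # the simplest possible form of trend discovery.
    from collections import Counter

    messages = [
        "flight deals to paris this weekend",
        "your paris hotel booking is confirmed",
        "weekend sale on electronics",
    ]

    terms = Counter(word for msg in messages for word in msg.split())
    print(terms.most_common(3))
    # [('paris', 2), ('weekend', 2), ('flight', 1)]

Scale that idea up from three messages to billions, and you have the business case for big data analytics.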



Thursday 27 December 2012

Exploring Apache Hadoop and the MapReduce paradigm

So what's Hadoop?

For those of you with patience, you can watch an hour-long video by Jacob at LinkedIn:


Video source: [VS1]


For the others, I present here a brief summary. In the late 90s and mid-2000s, Google was faced with the challenge of indexing the gargantuan amount of data referred to by its search engine. Until the early 90s, the amount of data on the web was comparatively small, and relational databases could address the contemporary requirements. But with the decade starting in 2000, things were completely different. The number of users on the internet was large, and so was the amount of data being added to the web every day. Building an index for the entire web's data was going to be tremendously complicated.

What Google did to address this problem was distribute the computation and storage across several nodes. For most of us with some insight into CS topics like distributed computing, this might seem trivial; however, you will learn to appreciate the beauty of their solution as the blog post unfolds. Google developed a technique called MapReduce, which undertook the responsibility of distributing the computation across several computers and combining the results obtained from each of them. I am just giving this superficial headline to keep you guys reading.

In the MapReduce paradigm, two programs are written for two kinds of computing nodes, viz. the mapper and the reducer. The mapper program keeps dividing the large task that needs to be computed into smaller and smaller sub-tasks, and these smaller tasks are computed independently by different nodes in parallel. The results of these smaller tasks are then consolidated and aggregated to obtain the result of the large, original task.


Image source: [IS1]

The reducer is pretty much what does the consolidation and aggregation. So hang on: are the mappers and the reducers software or hardware? The answer is both. You have some nodes that perform the mapping function; these nodes, called mappers, are loaded with the mapper program. You have another bunch of nodes doing the aggregation of results; these nodes are the reducers, and the reducer program is loaded onto them.

Here is an example:

Just consider that you want to count the number of characters in the sentence: I go to school. A first mapper does the task of identifying the different words within the sentence. These words, {I, go, to, school}, are handled independently by 4 more mappers that each count the number of characters only in the word assigned to them, i.e. mapper 2 counts the characters in I, mapper 3 counts the characters in go, mapper 4 takes care of to, and mapper 5 handles the word school. After the distribution of words to mappers 2, 3, 4 and 5 is done by mapper 1, the mappers 2, 3, 4 and 5 work in parallel. This provides a speed-up as against the conventional method of serial computation at a single node. Now, after mappers 2, 3, 4 and 5 write their results, a reducer combines the results, i.e. adds them together to obtain the final count of the number of characters.
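Here is a tiny, single-process Python imitation of that example. In a real MapReduce job the mappers and reducers are separate programs on separate nodes; here, plain functions (named by me purely for illustration) stand in for them:

    # A toy, single-process imitation of the character-count example.
    # Plain functions stand in for the mapper and reducer nodes.

    def split_phase(sentence):
        # "Mapper 1": split the sentence into independent words.
        return sentence.split()

    def map_phase(word):
        # "Mappers 2 to 5": each counts characters in its own word only.
        # On a real cluster these calls would run in parallel on
        # different nodes.
        return len(word)

    def reduce_phase(partial_counts):
        # The reducer consolidates the partial results into one answer.
        return sum(partial_counts)

    words = split_phase("I go to school")      # ['I', 'go', 'to', 'school']
    partials = [map_phase(w) for w in words]   # [1, 2, 2, 6]
    print(reduce_phase(partials))              # 11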

Now, I do understand that some of you might be asking yourselves: so what? If that's the case, I wouldn't really be very surprised. Just keep reading and you will learn to appreciate the crux of the discussion here.

Where exactly does Hadoop fit in? I haven't spoken about it in a while; I was just talking about nodes and parallel computation. So what about them? Hadoop provides an effective way to organize the nodes tasked with computation, and also to distribute the data across these nodes (this is pretty much the kind of support that MapReduce needs to accomplish its goals).

The MapReduce paradigm first implemented by Google was documented by two scientists at its research labs, Jeffrey Dean and Sanjay Ghemawat, in a paper that provides a complete description of the node deployment and topology, in essence the infrastructure supporting the MapReduce technique. This was Google's proprietary system. The documentation published by Google was used by the Apache open source project to develop Hadoop, an open source implementation of Google's proprietary MapReduce system. Later, research scientists at Yahoo did a significant amount of work to give Hadoop the shape it has attained today. Now, that's just an intro to Hadoop, to tell you where it came from and who is to be credited with its development.

The Hadoop system comprises two key components serving two different purposes: HDFS (the Hadoop Distributed File System), which takes care of the storage of data, and the MapReduce engine, which handles the processing and computation in the manner described above.

In HDFS, the data that is going to be accessed is stored beforehand in a way that supports the MapReduce paradigm. The data is fragmented into large blocks (typically 64 MB or 128 MB by default, configurable to larger sizes such as 256 MB). Now, the nodes I talked about above have their own disk storage. This is to say that when the nodes perform computations, they do not have to fetch data remotely from another source; they only need to access the data resident on their own disks. Typically, each block of data is replicated three times (at three nodes) to make the system fault tolerant. A directory mapping each block to its locations (the nodes where the block can be found) is held by a manager node called the NameNode.
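Here is a little Python sketch of that block directory, assuming a replication factor of 3. The structures and names are my own simplification, not HDFS's actual data model:

    # A toy model of the NameNode's directory: every block of a file
    # is mapped to the nodes holding a replica of that block.
    import random

    REPLICATION_FACTOR = 3
    datanodes = ["node-%02d" % i for i in range(10)]

    # block_id -> list of nodes holding a copy of that block
    block_directory = {}

    def add_file(filename, num_blocks):
        for i in range(num_blocks):
            block_id = "%s-block-%d" % (filename, i)
            # Place each replica on a different node for fault tolerance.
            block_directory[block_id] = random.sample(
                datanodes, REPLICATION_FACTOR)

    add_file("weblog.txt", num_blocks=4)
    for block, locations in block_directory.items():
        print(block, "->", locations)
    # If one node dies, two replicas of every block still survive.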

Now, when you search with the keyword 'apple' in the Google search engine, every page or link on the web where the substring 'apple' can be found is brought together and presented to you as the search result. What is in fact happening in the background is that each of these links containing 'apple' (which are distributed and stored beforehand across thousands of nodes) is aggregated by a bunch of reducer nodes and displayed to you. So what are the mapper nodes doing here? They store the links or pages and, of course, undertake some processing, such as locating the searched keyword in the pages and forwarding their results to the reducers.
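A toy Python version of that flow might look like this; the pages, links and function names are all fabricated for illustration, and a real search index is of course far more involved:

    # Each "mapper" scans only the pages stored on its own node and
    # emits the links that match; one "reducer" gathers the links.

    node_pages = {  # pages as stored locally on each node
        "node-1": {"http://a.example": "apple pie recipes",
                   "http://b.example": "car reviews"},
        "node-2": {"http://c.example": "apple orchards in fall"},
    }

    def map_search(pages, keyword):
        # Runs where the data lives: emit links whose page text
        # contains the keyword.
        return [link for link, text in pages.items() if keyword in text]

    def reduce_search(partial_results):
        # Aggregate the matching links forwarded by every mapper.
        return [link for partial in partial_results for link in partial]

    partials = [map_search(pages, "apple") for pages in node_pages.values()]
    print(reduce_search(partials))
    # ['http://a.example', 'http://c.example']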

What next? I just want to answer one final question: why is there all this hype about Hadoop and MapReduce? It is because they can handle amounts of data on the order of terabytes to petabytes, which is otherwise practically impossible. Hadoop is scalable, i.e. as the amount of data keeps increasing, you just keep increasing the number of nodes, and there you are, with complete access to the data that accretes every day. One more thing: if not for cloud computing techniques like Hadoop, we wouldn't have many of today's businesses. Twitter, Facebook and LinkedIn all use Hadoop. Small businesses generally do not have the capital to invest in thousands of computers with disks and processors. Cloud computing services enable these small businesses to pay a monthly fee to avail of the service, thereby encouraging many promising and enterprising small businesses. (Twitter and Facebook wouldn't be here today if not for these cloud services.)

So who has invested in the cloud infrastructure? I have heard of only two or three firms, like Nivio, Nebula and Rackspace, that rent commodity hardware to customers for a fee. I do not want to ramble any further, and I want to end this post as quickly as possible, but not all that abruptly, so I will conclude by telling you where this is taking us. In the times ahead, IT solutions are going to be largely cloud based, and whoever undertakes research in cloud computing infrastructure will be heavily rewarded. So keep following cloud technologies and you could well be a billionaire of tomorrow!

References:
[VS1] http://www.youtube.com/watch?v=SS27F-hYWfU

[IS1] http://www.pnexpert.com/images/Hadoop.gif (via http://www.pnexpert.com/Analytics.html)