Wednesday 16 July 2014

No implicit type cast between Integer and Double

It is interesting that a collection once defined as List <Double> does not accept Integer objects.

Message: No implicit type cast from Integer to Double.

Tuesday 10 June 2014

Understanding how Immutable variables really help make your programs thread-safe

I have read some times before that choice of immutable variables over mutable ones is a good idea from the perspective of building thread-safe applications. It's not as if I did not understand this. But, there were some intricacies in the basics that I had trouble recollecting which is why I thought I would write it down here.

Talking about shared memory, data consistency, visibility and synchronization when you have multiple flows of control in your program; it is important to recollect that these are really issues only when the values change. Also, for values to change, you need write operations. So, it is easy to infer that when values do not change, there are no write operations at all. Thus, we need to remember that concurrent/even simultaneous reads by multiple threads of a value that does not ever change is actually not a problem at all!

I asked myself this question recently when I was trying to access an array with multiple threads, each thread accessing a different index of the array (mutually exclusive access). As simple as it is to understand that it is a thread-safe operation, it took me a while to actually be sure that I was doing right.


Friday 30 May 2014

Do away with System.out.println and Logger.info and your code now runs really faster!

I wrote a 10-threaded script last evening and was astounded to note that the performance hadn't really improved as I had expected. I knew for sure that I had parallelized all computation to get that 10x boost in speed from my initial single-threaded code and I was struck by this weird idea that 'sysout' and Logger.info were to blame. 

My good friend told me that if I were printing feedback out to the console on every single operation, that was a good reason why my code didn't perform as had been anticipated initially. I removed all those statements and bang! the code now run smooth and fast. 


Thursday 29 May 2014

Arrays are implicitly thread-safe

I am currently tasked with writing to a 2-dimensional array from multiple threads. Let me briefly explain the context of the problem.

I have a 2-D array which looks somewhat like this:

__ __ __ __ __
__ __ __ __ __
__ __ __ __ __
__ __ __ __ __
__ __ __ __ __
__ __ __ __ __
__ __ __ __ __
__ __ __ __ __

So, what I had to do was use 8 threads, one each to write to each row of the array. I was thinking whether I would need to synchronize write access to the array wherein each thread is writing to a separate memory space independent of the other threads. While the array itself is common for the threads, we can see that each thread is in fact writing to a different space and no thread is dependent on any other thread for reads/writes.

As I had expected, you do not need synchronization in such a scenario. While it may sound really trivial, sometimes basic concepts like these keep you thinking and wondering till you figure out that common sense is indeed the best friend.


Thursday 22 May 2014

Idea behind setting a color to transparent

Incredibly fascinating but works:

To set a color to transparent, we need to :

1. First render the color in 4 channel ARGB model

A (alpha channel) -> MSB  (bits 31-24)

R (Red) -> (bits 23 - 16)

G (Green) -> (bits 15 - 8)

B (Blue) -> (bits 7 to 0)

2. Then AND the ARGB color (argb) with 0x00FFFFFF and voila there you are!

What this does is maintains the color itself in RGB model while setting the alpha channel to zero.

In Computer Graphics, what all an zero alpha channel means is do not render the color! This functions like a directive to image formats like .png giving it the flexibility to either render or not render a specific color. Also note that JPEG cannot handle color transparency for its color model consists of 3 channels only (RGB).






Wednesday 14 May 2014

StringBuffer re-initialization with setLength method

Often, we need to re-initialize a string with a value other than the one it was initialized with. Using StringBuffer here is recommended if the original string is not needed once its value has been changed.

A common way to re-initialize a StringBuffer obj is to write:

obj = new StringBuffer ();

What this does is create a new object itself and reassigns the reference obj to refer to the new object.

An optimized approach of achieving the same goal is to set the length of the obj to zero.

obj.setLength(0);

This does not create a new object but instead resets the value of the same StringBuffer object, obj.





Singleton Nested Class Holder idiom tamed!

Singleton Nested Class Holder idiom works by on-demand initialization of singleton objects during class loading. This gives the impression that it does not work when the object initialization depends on another resource for initialization. Here, we show that is not the case with an interesting example.

For example, here is a stream reader class to initialize a singleton stream reader object that could then be used to parse text contained in a URL. The URL to be parsed is input at run-time by the user.


/*attempt to make reader a singleton */

public class Reader {


private DataBean data;
private InputStream ir;

private Reader (DataBean d) {
data = d;
try {
ir = new URL(data.getResource()).openStream();
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}

********\ Oh oh, Databean not available yet **********/
private static class ReaderHolder {

private static final Reader reader = new Reader();

}

public Reader getInstance (Databean d) {
               return ReaderHolder.reader;
        }
}

Now, the problem here seems to result from the fact that the Databean object is not available yet as it will only be made available later by the user.

We work around this with a Context object that encapsulates the Databean object and may be either be passed around easily or making Context itself a singleton with early-initialization.


/*Context class*/
public class Context {

public static final Context INSTANCE = new Context();

private DataBean dataBean;

private Context() {

}


/**
* @return the db
*/
public DataBean getDataBean() {
return dataBean;
}

/**

* @param db the db to set
*/
public void setDataBean(DataBean db) {
dataBean = db;
}


}


Now, our reader is modified to look like this:

/*Reader is a singleton */
public class Reader {

private InputStreamReader ir;

private static class ReaderHolder {

private static final Reader reader = new Reader();

}

private Reader () {

try {
InputStream stream = new URL(Context.INSTANCE.getDataBean().getResource()).openStream();
ir = new InputStreamReader(stream);
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}

/**
* @return the ir
*/
public InputStreamReader getIr() {
return ir;
}

public Reader getInstance () {
return ReaderHolder.reader;
}


}



This is actually an interesting enhancement opportunity to plain Java programs derived from J2EE and Spring framework.








Some high-level Hadoop information!

1. It is a good idea to minimize the amount of data transferred between Mapper and Reducer in keeping with the Bandwidth constraints of the network. Thus, a combiner function is employed at the mapper to combine or aggregate mapper output before it is passed on to reducer.


2. Job profiling and tuning a job: make a Hadoop job run faster
3. Job tracker and Task tracker - > Job tracker assigns  tasks to one or more task trackers that actually run the job on its own split.


4. Hadoop Streaming: An interface provided by Hadoop framework that enables programmers to write MapReduce code in any language that supports standard streams. The technique that underpins this ease/convenience is Hadoop Streaming.


5. What is pig? A high level scripting language to write Hadoop programs faster and easily. It automatically maps the problem in MapReduce mode, saving developer efforts.

Sunday 6 April 2014

Heads up tips on Boolean Retrieval

Here's an excellent page to get started: http://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html

Now, here are some heads up tips:

1. In Fig 1.1, we see a natural boolean representation of the postings list. Each row contains a binary 1-0 sequence, with '1' telling us that the term is present in the document and 0 suggesting that it is not present.

As we typically work on billions of documents on the web, it is a great idea to employ memory optimized solutions. One such solution is to store only the documents that contain the term in the postings list. 


2. When we evaluate a query, for example, Brutus AND Caesar AND Calpurnia,  it is a great idea to evaluate Brutus AND Calpurnia first. As we can see, the documents containing all three keywords will be best determined by the one that occurs least frequently. This way, we can eliminate irrelevant documents rather smartly in the first step itself.

3. How about evaluating (Brutus OR Calpurnia)?

An OR operation deserves some special attention to itself. Of course, it simply evaluates a Union of the two postings lists. Here is a smart way of achieving the same result.

Set either of the two postings lists as the solution,  in technical terms, initialize the solution as any one of the two postings lists. Now, you could use a set type for storing the solution. Then traverse the other posting list from start to end and add it to the solution set. When you will have traversed the entire second list, you get the solution set. The use of set type eliminates the possibility of storing the same document id more than once. 

The above method of computing Unions also finds an interesting application in the polynomial addition.


Wednesday 12 March 2014

Left to Right binary exponentiation

I did study this algorithm last year when I took my advanced algorithms course but I always missed the forest for the trees.

It is said that left to right binary exponentiation is faster than naive multiplication. For example, x^108 = x*x*x.... (108 times), now, that's 108 multiplications.

A wee bit faster technique is left to right binary exponentiation. x^108 = (x^54)^2 = ((x^27)^2))^2 = ((x* (x^13)^2)^2)^2 and so on.  Apparently, this achieves the same result in logN multiplications, more precisely, 5. I wondered what sense it actually made and I was always told that squaring was not to be treated as multiplication. In technical terms, I thought squaring was just as computationally intensive as multiplication. In software terms, I was right but I was wrong otherwise, in hardware respect.

Actually, squaring a number is equivalent to a 1 bit left shift of its binary representation. That can be achieved using bit shifters and that's where the speed up is achieved. This is programming in such a way so as to exploit hardware capacity.


Tuesday 11 March 2014

Where Algorithmic Complexity is not a function of the number of loops


When we take a look at the above algorithm, we can see that there is an outer loop that runs from 1 to n, where n is the number of vertices. For each vertex, i, the algorithm calls DFS routine if i is not explored yet.

Here, the outer loop runs n times, and each inner loop examines on an average of the order of n vertices, but that said, the algorithmic complexity is not O(n^2).

While that may be how it looks like, on closer inspection, we can see that the outer loop doesn't invoke the DFS subroutine if the vertex i is already examined. So, here, for the cases where i is already explored, the work done by the outer loop is in fact, O(1). 

So, for example, the complexity is more like O(n/4) + O(n/2) + O(n/8) +  O (n/8) + m * O(1) which is O(n) or linear.

BFS for Undirected Graphs

Both BFS and DFS may be used to compute connected components of an undirected graph. 



BFS

Here, s is the source vertex.  We use a queue and put s into it. Remove s first, mark as explored and insert into it a and b when we remove s. Recall that we can use an adjacency list to maintain a list of neighbours of a given node. Let us assume, a is put in first and then b.

Queue state: a,b

Now, a is taken out, marked as explored and it's adjacent nodes, c,b,s are inserted back into the queue. Queue : b,c,b,s

Take out b, mark b as explored and put it's adjacent nodes, a,c,d,s into the queue.

Queue state: c,b,s,a,c,d,s

Take out c, mark as explored and put it's neighbours, a,b,d,e into the queue.

Queue state: b,s,a,c,d,s,a,b,d,e

Take out b. But b is explored so simply proceed to the next queue element.
Take out s, s was the first item to be explored. So go on to the next element.
Take out a, explored so proceed.
Take out c, explored so proceed.
Take out d, unexplored, so mark as explored and put it's neighbours, b,c,e into the queue

Queue state: s,a,b,d,e,b,c,e

Take out s, explored so proceed.
Take out a, explored, move on.
Take out b, explored, move on.
Take out d, explored, so move on.
Take out e, unexplored, mark as explored and put it's neighbours c,d into the queue.

Queue state: b,c,e,c,d

Take out b, explored, move on.
Take out c, explored, move on.
Take out e, explored, proceed.
Take out c, explored, so proceed.
Take out d, explored, proceed.


Queue is empty. BFS completed.








Abstract classes and Interfaces

Both abstract classes and interfaces facilitate polymorphism but we see their differences here and then move on to examining valid use cases for each.

1. Abstract classes may not be instantiated, this for a fact, holds true for interfaces as well.
2. Abstract classes may have implementations for some of their methods, but interfaces have no method implementations at all.

So, that's essentially one difference only. But this critical difference also governs their suitability and applicability in programming contexts.

1. Use abstract classes if you anticipate some level of versioning for your components. To explain briefly, if you anticipate that your implementations will all have newer features tomorrow, use an abstract class for it may be updated to correspond to the newer functionality. Interfaces, on the other hand, once created cannot be updated and you will have to create a whole new interface.

2. Use abstract classes if you believe there will be some level of identical functionality among all implementations of your component. This is because, the common functionality may be implemented in the abstract class method  whereas, interfaces do not support any implementations at all.

3. It's a good idea to use an abstract class to provide common functionality among all implementations of an interface. One manifestation of this idea lies in the abstract classes of  Java Collections. 

Friday 7 March 2014

HTTP Get and Post Methods

Use Get only for safe and idempotent operations. Safe operations are those which do not update a state, for example, a database table. Idempotent operations are those which return the same result no matter how many times you call them. 
It's evident that Get is suitable for retrieving the result of a query and in this regard, similar to a SQL select.

Use Post where the operation is neither safe nor idempotent. For example, imagine you want the client to register on your registration page. Here, if the registration succeeds, a new entry will be added to the database table. In this case, of course, the state is being modified and a Post would be the better thing to do. We can see that such registration operations are not idempotent either. While it is true, that the system would not permit a user to register with the same credentials over and over again, this is a business logic concern. In theory, a database would not disallow two objects to have identical values. We can see that a Post is similar to a SQL insert.

There is also a Put operation that is suited to unsafe operations that are idempotent but I will not discuss that here.

As for other differences, a Get causes the url to bear the entire query and as such not suitable for transmitting sensitive data like passwords. In contrast, the query is transmitted invisibly with Post, so it may be used to safely transmit passwords and other information that require some element of confidentiality. There is also a limit to the amount of data that Get supports. There is apparently, no such limit for Post request.


Randomized selection to return the i'th order statistic

Suppose you have an unsorted array of size N and that you want to determine the ith order statistic for the same array. This is nothing but the ith smallest element in the array. While a simple way to do this would be to sort the array using an efficient sorting technique like say merge sort in O(NlogN) and then return the element in the ith position in O(1) time, apparently, there exists a much faster algorithm to achieve the same end. It's called QuickSelect.

int quickSelect ( int[] a , int lb , int ub , int i ) throws Exception {

if ( lb >= ub )
throw new Exception ("invalid i");

p = partition ( a , lb , ub );

if ( p > i )  //look to the left
    quickSelect ( a , lb , p-1 , i );

if ( p < i )  //look to the right
    quickSelect ( a , p + 1 , ub , i - p );

if ( p == i )
    return a[p];

}

Now, what is randomized here is the selection of the pivot in the partition routine. The pivot for this pass of the quickSelect is chosen randomly among ub-lb elements of the array and the partitioning is done around this pivot.

The difference between quicksort and the quickselect routine is that while quicksort recursively applies the partition toboth halves of the sub-array, quickselect determines which side of the pivot the ith order statistic resides and applies the recursive routine to only that side of the sub-array.

T(n) = O(n) + T(n/2)
While this may be suggestive of O(nlogn), it is not. It reduces to 8cn which is in fact O(n).





QuickSort Variants

QuickSort is a rather well-known sorting algorithm and for many reasons I think it is also fairly deceptive. There are several variants of it and I think each one is interesting in its own right.

Version 1

void quickSort ( int[] a , int lb , int ub ) {

if ( lb >= ub )
    return;

int pIndex = partition ( a , lb , ub );
quickSort ( a , lb , pIndex );
quickSort ( a , pIndex + 1 , ub );

}

int partition ( int[] a , int lb , int ub ) {

int pivot = a [ ( lb + ub ) /2 ];

int left = lb;
int right = ub;


while ( left <= right ) {


          while ( x[left] < pivot ) 
                    left++;

          while ( x[right] > pivot )
                    right--;

          if ( left <= right ) {
                    int t = x[left];
                    x[left] = x[right];
                    x[right] = t;
                    left++;
                    right--;
          }
}

int p = right;
return p;
}

Note: This one does work but note that for some inputs, the split may not be precise. Example: Consider the case where 8,4,1,7,6,2,3,5 is input and 1 is chosen as the pivot first. After 1 pass, 1 would be in its right position, but say 7 is chosen thereafter. Here, in the subsequent pass, 7 will be in its appropriate position but the implementation passed the sublist 7,8 yet again. While this may not incur significant performance penalty, it's worth a note.

Version 2: Canonical QuickSort

int partition ( int[] a , int lb , int ub ) {

int pivot = a[lb]

int i = lb+1;

for ( int j = lb+1; j <= ub; j++ ) {

if ( a[j] < pivot ) {
    swap (a[i], a[j]);
    i++;
}

swap ( a[lb], a[i-1 ); 
}

What I did like about the canonical version was its inherent simplicity. It also makes apparent that each partition has a computational complexity of O(n). As for a counter example for this one, there isn't.

Version 3: Another canonical scheme

int partition ( int[] a , int lb , int ub ) {

int mid = ( lb + ub )/2;
swap ( a[lb] , a[mid] );

int pivot = a[lb];
int left = lb + 1;
int right = ub;

while ( left <= right ) {

while ( a[right] > pivot )
          right--;

while ( left <= right && a[left] <= pivot )
          left++;

if ( left < right ) {
       swap ( a[left] , a[right] );
       left++;
       right--
}

}//end while

swap (a[left],a[right]);
return right;

}

This works fine too. The right returned is correct and it does not work on sub-arrays with pivot already selected previously.






              




Thursday 6 March 2014

JSP <=> Servlets

A servlet is in essence, a Java class that is executed by the server in response to a client request. A JSP file on the other plays many different roles in the context of a web application. For one role, it could simply be a view that renders the server side response in HTML, XML, JSON or other formats. For another, it could be the intermediary code like ASP or PHP that does some kind of input validation and thereafter forwards this input data to a server for persisting the same in permanent storage. Whatever the role, a JSP is compiled into a servlet and the JSP scriptlet code itself constitutes a part of the doGet() or doPost() methods of the servlet. 

What one needs to internalize here is this interesting relationship between JSPs and Servlets for the former eventually gets rendered into the latter. This leads to a critical inference regarding JSPs and Java classes. A servlet being a Java class, the ultimate form that a JSP takes is a Java class. This relationship between JSPs and Java classes leads to yet another interesting result. A class is composed of member functions and data, so a JSP that eventually gets compiled into a class must provide some mechanism to declare and define class members. The answer is of course, yes. Using the declaration tag, one may declare and define class members in the very same way as in a Java class itself. An example is shown below:

<%!

private String name;
private Integer roll;

public void setName (String name) {
this.name = name;
}

%>

Now one may call a function as shown above from a scriptlet as shown below:

<% 
setName ("Joe");
%>

It's important not to miss the forest for the trees. The crux of the idea is that every web servlet is nothing more than a simple class and the information being passed around is nothing more than a simple object. When you learn object-oriented programming for the first time, it is not very uncommon to read that object passing facilitates message communication. While then it appealed to the senses as nothing more than some kind of an abstract fantasy, we are now seeing rather elegant manifestations of that concept.






Forwarding and Redirecting

While both redirect and forward seem to show the same effect, they are in essence two extremely different things:

Forwarding is done when a client submits a request for a page (url) and the server side code assesses that the page should not be accessed directly by the client in such fashion. Imagine an account dashboard page that shows up after a login. Such a page, let's say, "www.domain-name.my-account.html" should only be displayed after a successful login in the login page, say, "www.domain-name.login.html". If the client attempts to bypass login and access the dashboard directly, the natural thing to do for the server would be to forward the user to the login page where he would be prompted to login and thereby access his dashboard. Note that forwarding takes place entirely on the server side and it does not involve the browser at all. In fact, the browser would not even be aware of the actual page forwarding.

Client - > Server : Access page B
Server -> Client : Shows page A as the requested page, B 

Redirect is done when a client requests a page and the server sees that the requested page is no longer accessible may be on account of page expiration or on account of change of url or even absolute removal of the requested page. In such cases, the server responds with a page that mocks the requested page but that includes a header that may be decoded by the client to access a different page altogether. The client browser then needs to make another request to access the specified page. Note that client side browsers are aware of redirects unlike forwards.

Client - > Server : Access page B
Server - > Client : Page B | Header: Page C
Client -> Server : Access page C
Server -> Client : Page C

This distinction also has another interesting ramification. If the client browser has disabled cookies, then cookies may not be sent to the client. Thus session information cannot be transmitted between successive client requests in the regular way. This browser behavior does not however affect server-side forwards .This may be worked around by URL encoding which essentially involves embedding the session id in the URL to maintain session state. 




JSP Include directive and Dynamic include

The interesting difference between the include directive and the dynamic include is that the former is a static import and the latter is a dynamic import. 

To explain, the include directive is similar to a macro call while the dynamic include is like a function call. While a macro call simply inserts or includes the contents of the macro definition in the body of the program where it is called at compile-time, a function call is more than a mere include. Function calls perform some kind of computation on the data that is passed to them as arguments. 

In our own context of JSPs, use of the static import using <@include file = " " /> essentially inserts the contents of the file specified at that place and compiles it with the JSP into a servlet. The dynamic import using <jsp: include page = " " / > on the other hand, compiles the included JSP page as a separate JSP and embeds a call to it at the point of dynamic import statement.


Wednesday 5 March 2014

Heads up tip on Java web application deployment

It's not unusual to compile your Java web application with JDK 7 and then deploying it on a platform that runs it on an earlier version of Java or for that matter, a later version.

Make sure that you compile your application with the same JDK version as that used by the platform that you deploy it on.  

Java String overrides Object.equals

The Java String object extends the Java Object like any other Java object does. It must however be noted that the Java String overrides Object.equals method to compare two String objects character by character.

Let's briefly look at the Object.equals code:

public boolean equals(Object other)
{
   return this == other;
}



We can see that Object.equals compares the references of the objects themselves and as such returns true only if the two references are in essence referring to the same object in heap. The equals is overriden by Java String to return true if the two objects are logically equal.

Consider,
String a = "abc";
String b = new String ("abc");
String c = "abc";

System.out.println (a.equals(b));
System.out.println (a.equals(c));

System.out.println (a == b);
System.out.println (a == c);

The output would be:

true
true
false
true

Getting started with caveofprogramming.com

For beginners and intermediates alike, I would recommend that you check out these awesome video tutorials by John Purcell:

http://www.caveofprogramming.com/

Tuesday 4 March 2014

Anonymous classes

Over the last week, I have been working with Java multi-threading in great detail. During this time, I started passively working with anonymous classes, all the while, without realizing that this was yet another worthwhile Java concept. 

Anonymous classes enable you to write programs that are concise. While a regular class has a name, anonymous classes don't. Just look at the following segment of code:

Thread t = new Thread ( new Runnable() {

public void run () {

/*do something */
}

});

Now, we can see that the Thread constructor is passed an instantiated Runnable with the action to be performed in tact and encapsulated in the "run" method.

What we are in fact doing is creating an instance of a class that implements the Runnable interface and then declaring and defining the methods of this class. This is the idea of anonymous class. It's particularly useful if you do not wish to instantiate a class more than once. 


Monday 3 March 2014

Key note on Java Generics

1. While you may safely assign a generic type object to a raw type, the converse is not true.
e.g. : Box <Integer> b = new Box <Integer> (); //Generic type
Box newBox = b;

This is legal. As for the converse, Box newBox = new Box ();// raw type
b = newBox;

The above doesn't result in a compile-time error but leads to a compile-time warning. Note that raw types bypass generic type checks at compile-time.

2. When using bounded generic type parameters, note that you may invoke method of the class being extended.

e.g. :
public class Box <T extends Integer>  {
     
       public T getDouble (T t) {
              return t.intValue() * 2;
       }
}
This is to note that the application of intValue() as shown in the preceding example is legal.

3. From 1, we can see that it is perfectly legal to cast an Integer to an Object. This is by virtue of their super type - sub type relationship or "isa" relationship as it is sometimes called. This does somehow lead to a common misconception which is shown below:

public void someMethod ( Box <Number> boxN ) {
/*do something */
}

Now, the parameter boxN here accepts any Box <Number> object. It must be noted though that boxN cannot accept objects of Box <Integer> types even if Integer extends Number. This is because Box <Integer> is not a subtype of Box <Number> and this is true even if Integer itself is a subclass of Number.





Sunday 2 March 2014

Thread.sleep() and Thread.yield()

An interesting comparison to make with no real loud, resounding solution at all.

Thread.sleep(long millis) typically causes the currently running thread to go to a sleep state. What this essentially means is that it actively uses CPU cycles to sleep for the duration specified in millis. After this sleep, however, there is no guarantee that the thread will continue to execute and may be moved to runnable state where it will join the queue of runnable threads waiting for execution.

Thread.yield() causes a current thread to defer its execution state to another thread of equal or higher priority. Apparently, the JLS does not provide guarantees that this thread will indeed be moved to the runnable state from running state.

It must be noted that neither of these methods result  in a thread forfeiting its locks, so they should be used with caution.

Immutable and Final

It is a known fact that String objects are immutable? What does this mean? Does this mean that String objects are implicitly final?

  • String objects are indeed immutable. Now, this means that once a String object has been initialized, it's state cannot be changed.

String str = "India". Once such an assignment has been made, you can consider the heap allocated with an object "India" as shown with str being the reference.



Now, if we make an assignment, str = "China", what we are in fact doing is reassigning str to another object, China on the heap.




Here, the reference str has changed value. It was formerly pointing to object with state "India" and now it has been reassigned to refer to object with state, "China". In fact, the assignment of form, str = "somestring" creates a new string object on the heap each time. This is why String object is said to be immutable, immutable in that once initialized with a value, it can never change its state.

  • What if we had initialized with the final keyword:  final String str = "India"?

What we have done now is made the reference str immutable. 'Immutable' may not be the best choice of a word here for what I had intended to say was that we have made the reference a constant, constant in that it will now point to the object "India" on the heap and can never be reassigned to refer to any other object on the same. In fact, the statement, str = "China" is now illegal.

  • Why is it better to initialize a string object with the statement, String str = "India" as against String str = new String ("India")

Well, apparently, String is implemented in Java using a String pool. Now, what happens here is that once you create an object using the statement, str = "India" and subsequently reassign str with the statement, str = "China"; the former object "India" on the heap loses its reference but is not immediately garbage collected. If you then had to create another string with the former value "India", the already created object is reused. This obviates the need to recreate already existent objects. This is in fact one of the reasons why immutable objects do not provide the clone() or copy constructor method.

Now, let's say you initialize a string using the statement,
String str = new String ("India"). The above idea of deferred garbage collection applies here as well. But if you happened to write another String str2 = new String ("India"), Java would in fact create yet another object with value, "India" on the heap. This is to say that the existing String object would not be reused with a duplicate reference. This may be visualized as shown below:



Such duplicate object creation may not be in keeping with the space efficiency of the program as evidenced by the above example. A simple way to check the above result would be to compare str with str2: str == str2 which would return false. The object hash codes would also be different as they are two different objects. It must be noted that str.equals(str2) would return true.

Just another subtle addition to the above note. This scheme of deferred garbage collection applies only with some object types, String being one of them. For those to which it does not apply, (Integer, for example), the two initialization statements are in fact, equivalent.



Wednesday 26 February 2014

Misconceptions about Object.finalize()

I have been studying Java for about 6 years now and all along I have dissected something interesting about it at various levels. To be more general, I think the beauty of OOP lies in its inherent simplicity to hide relevant complexities and one does not unlock these complexities till the time he explicitly starts working on them.

I had always learnt that the Object.finalize() method is used to clean up resources before an object is garbage collected by the JVM. My professor simplified it all the more for me and my then innocent friends who were led to believe that in simple terms, "Object.finalize() is just the same thing as Destructors in C++". Woah, now, that's one of those instances of reckless simplification founded upon specious reasoning.

I will just explain what it really is and nothing more. Object.finalize() is a function or a method that may be overriden by any Java object (provided its class implementation provides the necessary overriding definition, that is) and this method is called by the JVM when it notes that no more references exist to this object and that it may be safely garbage collected. 

The finalize() method definition itself may include code  to cleanup resources that are not going to be used, say, file.close() method, for example. If the method has been overriden, then the JVM invokes the finalize() before the object itself will be deallocated, that is, the finalize() is in effect, the last operation that will be performed by an object along its lifecycle from birth to termination.

Now there is an essential caveat here that needs special attention. For one thing, never rely on finalize() to do anything that you believe is critical from the system functioning perspective. Well, that is because Object.finalize() is called by a JVM GC (Garbage Collector) thread while the JVM itself provides no real guarantees in regard to whether the finalize() will ever be called. 

Now, what does this really mean? It does sound fairly paradoxical in that on the one hand, you hear it will be called by the JVM before the object will be deallocated and on the other, you note that the JVM cannot guarantee that it will be called ever. Well, here is where a little extra information can help. Well, we must recollect that we can never ascertain when exactly the JVM does its garbage collection or to be more clear, one can never guess when the JVM GC thread itself will be scheduled and eventually dispatched. So, it is safe to assume that at times, the application or the program itself will run from start to completion and the JVM GC thead has not managed to run during the application program life cycle. In such cases, those objects used in the program will only be deallocated after the program itself has terminated and when the JVM GC thread will on its part, reclaim the space used by those concerned objects. So, it's now evident here that the finalize() has not been executed for the program has terminated prior to its invocation by the JVM.

So when do we really use finalize()? Well, the general idea is that do not use it to accomplish tasks or execute operations that determine the fate of the application itself. It is safe to use it to accomplish less critical functions, for say, logging information that could later be used for debugging. Here, the key idea is to never really rely on it to do anything, for that may not necessarily serve the purpose.

Now, with regard to some drawing parallels with the the C++ destructor, I must add that the destructor is guaranteed to be called during the lifetime of the program. So yes, you may use to perform some last minute work that will need to be done prior to destruction of the object itself. It's the presence of this guarantee in the case of destructors and the absence of it in the case of finalize() that makes all the difference. 

Monday 24 February 2014

Insight into Web Services for beginners

There have been a lot of questions I have been asking myself ever since I started studying computers in my undergrad. I got answers to some of them pretty early while some lingered on in my mind. Fortunately, over the last two years, I have slowly started uncovering answers to more of them.

What is a web service?
This is a pretty interesting question, actually. Now, looking at the two words in isolation, "web" and "service"; one would most likely understand that as a service offered over the web. Well! Yes, but not quite complete and convincing! This is with due consideration to the buzz around the terms when you naively start reading tech posts and keep wondering all along as to what justified the writer's extreme excitement and optimism while talking about something that sounds as trivial as a "web service".

That surely made me understand that there was a void in my understanding of "web service" and it surely is something much more than a service offered over the web. Back then, about 4 years back, I happened to ask my professor about the same thing and I could see that even this was not really helping me look at the thing from a wider perspective and attain that big picture I was looking for.

This morning, I happened to stumble upon this fine post, http://www.dummies.com/how-to/content/getting-a-look-at-web-services.html.

As it suggests, a service offered over the web that may be freely used by another application is called a web service. Sometimes, you may want to use a particular service exported by another service in a specific component of your application. Imagine, you would like to provide a search facility in your website and would like to incorporate the Google Search toolbox for the same reason. How the Google search engine works is of no concern to you and all you care about is that the Google Search toolbox accomplishes its job. Now, the search service provided by Google is an example of web service that is freely exported by Google and may be imported by anybody for use in his/her application.

In case you are a web programmer, you may be familiar with REST and SOAP calls. Typically, the web service is queried by the importer application using the REST or SOAP calls and the response of the web service called service response is rendered and shown to the user. These days, a format such as JSON (JavaScript Object Notation) is used to return the service response and this is rendered by a view-level language such as JSP. For those familiar with AJAX, it is a robust framework and technique that may be used to make the web service call and wait on its response.


Revisiting the UNIX Process


A process in UNIX, or for that matter, any operating system may be defined as a program in execution. While the statement is fairly complete and clear in itself, there are some other things about processes that I will highlight today.
1. Program and Process
A program is in essence, an executable sequence of code that is run by the processor. The program or the code is not the process itself. The program containing the process code constitutes only the content of the text segment of a process. The process itself is more than the program and the following figure is an illustration lets us get a deeper picture of this idea.


Process in memory [1]

We can see from the above figure that a process in memory comprises various segments such as the stack, heap, data and text (code).

Stack segment: Holds references to local variables in the process. Generally, the stack is further split into several stack frames, one for each function or method in the process. Also recall that each thread of the process has its own copy or instance of the stack that is not visible to other threads.

Heap: Dynamic portion of the memory assigned to a process that is used to allocate memory to objects on demand during their creation. The heap memory is divided into arenas, one for each thread of the process. The arena of one thread is not visible to another thread, but it is visible to all functions and methods in the same thread.

Data: Holds global variables shared between various threads of the process. The program or code is the passive entity containing the lines of instructions governing the execution by the processor of the active entity, process that resides in memory and undergoes several transitions during its lifetime, since its creation until its death.

·        The process may also be viewed as an instance of the program in execution. This is similar to the idea of classes and objects. Different processes are different instances of the same program that share the same code segment and have their own independent heap, stack and data segments.

2. Process representation in operating system
Each process is represented using a data structure called the Process Control Block or Task Control Block (PCB). The PCB has information regarding a process such as its identifier (pid), the identifier of the parent process that created it (ppid) state (new, ready, running, blocked, terminated), reference to the next instruction to be executed for the process (IP), the list of files opened by the process, the reference to the process address space (this is OS dependent and this is typically a memory management concern), the values of the CPU flags, the remaining amount of time for executing the process (in a time slice) and other process specific information that needs to be logged. 

A typical PCB in Linux is represented using a structure, task_struct as shown below:
struct task_struct {
pid_t pid;
long state;
unsigned int time_slice;
struct files_struct *files;
struct mm_struct *mm;
}
[2]

The operating system maintains a doubly linked list of PCBs. This is typically used while doing a context switch from one process to the other. When the CPU switches from the execution of one process to another, the current state of the PCB of the process in execution is updated and saved and the PCB of the process which is going to be executed is loaded in memory. It is vital to understand the role of memory management schemes in this context. It is of particular importance if physical memory constraints prevent the maintenance of some PCBs in memory when the processes they correspond to are currently not in execution. In such cases, it is safe to assume that the operating system maintains references to PCBs (using virtual memory addressing techniques) if not the PCBs themselves. 

3. Fork, Memory Overlaying, Zombie state
A process in UNIX/Linux may create a new process using the fork system call. Upon execution of the fork call, the operating system creates a new process and loads the address space of the newly spawned child process with the current state of the address space of the parent process. This is similar to object cloning in programming.

Upon successful invocation of the fork(), a new process is created and its process identifier is returned to the parent whereas the child process itself is returned 0.
Typically, after creation, the new process is tasked with some objective as reflected from its own code segment. However, keep in mind that the code segment of the newly created process is identical to that of its parent process immediately after its creation as its address space is a mere replica of that of its parent. Now, to get the child process to execute some other instructions and accomplish a different objective, we use system calls in the exec family to which we pass reference to another sequence of code. This is shown in the following program:

#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
int main() {
pid_t pid;
pid = fork();
if ( pid < 0 ) {
fprintf ( stderr , “Fork failed” );
exit (EXIT_FAILURE);        
}
else if ( pid == 0 ) { /*in child */
execlp (“/bin/ls” , “ls” , NULL);
}
else { /*parent*/
wait (NULL);
printf (“Execution of child process is complete”);
exit ( EXIT_SUCCESS );
}

This particular example uses execlp whose signature is
int execlp(const char *file, const char *arg, ...);

We can see that the first argument is pointer to a character stream, the second parameter onward are pointers to strings that serve as instruction arguments. We can see that in our example, execlp is called with the argument,
“/bin/ls” that is itself a pointer to the file containing the instruction, “ls” in the second argument.

Now, upon execution of the execlp system call, the operating system loads the code segment in the address space of the child process with the code contained in the pointer to the code stream referenced in the first argument of execlp and the heap, stack and data segments are also refreshed to reflect to the new program. This technique wherein the address space is entirely replaced without the creation of a new process identifier is called Overlay. It is important to understand here that the changes to the address space of the process are done within the context of the process itself.

Generally, before forking a process, a shared memory Pipe communication is established so as to facilitate communication between a process and its child. This detail has been ignored in the above example.
It is also important to remember that the parent process waits until the completion of the child process using the system call, pid_t wait(int* status). The wait system call is used to prevent the child process being in Zombie state for too long.

In UNIX, when a process finishes executing, the memory allocated to it is reclaimed but the entry of the process in the process table is not immediately removed. Such a process is called a Zombie process for it has terminated execution but is not actually dead. If the parent of the process does not execute wait(), the process would remain a Zombie process for long periods of time (until the death of its own parent when it will be adopted by ‘init’, pid = 1. Once it has been adopted by init, it will eventually be killed as init periodically executes the wait()). The argument to the wait is NULL in our example which means that the status of the child process completion is not stored. Generally, a pointer is passed by the parent to store the value of the status code of the completion of child process.



References
[2]. Silberschatz, Galvin and Gagne (Operating Systems Concepts, 7th edition)