Monday, 24 February 2014

Revisiting the UNIX Process


A process in UNIX, or for that matter in any operating system, may be defined as a program in execution. While that statement is fairly complete and clear in itself, there are some other things about processes that I will highlight today.
1. Program and Process
A program is, in essence, an executable sequence of code that is run by the processor. The program or the code is not the process itself: the program's code constitutes only the content of the text segment of a process. The process itself is more than the program, and the following figure gives us a deeper picture of this idea.


Process in memory [1]

We can see from the above figure that a process in memory comprises various segments such as the stack, heap, data and text (code).

Stack segment: Holds the local variables, function parameters and return addresses of the process. The stack is split into stack frames, one for each function call that is currently active. Also recall that each thread of the process has its own stack. All of these stacks live in the same address space, so a thread's stack is private by convention rather than by protection.

Heap: The dynamic portion of the memory assigned to a process, used to allocate memory on demand at run time (for example, through malloc). The heap is shared by all threads of the process: a block allocated by one thread can be used by any other thread that holds a pointer to it. Some allocators (glibc's malloc, for example) divide the heap into several arenas and assign threads to them, but that is done to reduce lock contention and is not a visibility boundary.

Data: Holds global and static variables, which are shared between the threads of the process.

The program or code is the passive entity: it contains the instructions that govern execution by the processor. The process is the active entity: it resides in memory and undergoes several transitions during its lifetime, from its creation until its death.

The process may also be viewed as an instance of the program in execution. This is similar to the idea of classes and objects. Several processes may be instances of the same program: they can share the same code segment, but each has its own independent heap, stack and data segments.
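To make the segments above a little more concrete, here is a minimal sketch that prints one sample address from each region. The names initialized_global, uninitialized_global and print_segment_addresses are made up for this illustration, and the actual addresses (and their relative order) depend on the platform and on address space layout randomization.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int initialized_global = 42;  /* initialized data segment */
int uninitialized_global;     /* BSS, usually counted as part of the data segment */

/* Prints one sample address per region.
 * Returns 0 on success, -1 if the heap allocation fails. */
int print_segment_addresses(void)
{
    int local = 0;                           /* in this call's stack frame */
    int *dynamic = malloc(sizeof *dynamic);  /* on the heap */

    if (dynamic == NULL)
        return -1;

    /* A function's address lies in the text segment. The detour through
     * uintptr_t avoids a direct function-pointer-to-void* cast. */
    printf("text : %p\n", (void *)(uintptr_t)print_segment_addresses);
    printf("data : %p\n", (void *)&initialized_global);
    printf("bss  : %p\n", (void *)&uninitialized_global);
    printf("heap : %p\n", (void *)dynamic);
    printf("stack: %p\n", (void *)&local);

    free(dynamic);
    return 0;
}
```

Calling print_segment_addresses() from main on a typical Linux system shows the text, data and BSS addresses clustered together, with the heap and stack addresses far away from them.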

2. Process representation in operating system
Each process is represented using a data structure called the Process Control Block (PCB), also known as the Task Control Block. The PCB holds information about a process such as its identifier (pid), the identifier of the parent process that created it (ppid), its state (new, ready, running, blocked, terminated), a reference to the next instruction to be executed (the instruction pointer, IP), the list of files opened by the process, a reference to the process address space (this is OS dependent and is typically a memory management concern), the values of the CPU registers and flags, the amount of time remaining in its current time slice, and other process-specific bookkeeping information.

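In Linux, the PCB is represented by a structure called task_struct. A heavily simplified view of it is shown below [2]:
struct task_struct {
    pid_t pid;                  /* process identifier */
    long state;                 /* state of the process */
    unsigned int time_slice;    /* scheduling information */
    struct files_struct *files; /* list of open files */
    struct mm_struct *mm;       /* address space of this process */
};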

The operating system maintains a doubly linked list of PCBs, and it is the PCB that makes a context switch possible. When the CPU switches from one process to another, the state of the running process (registers, instruction pointer, flags) is saved into its PCB, and the saved state from the PCB of the process that is about to run is loaded into the CPU. It is worth understanding the role of memory management in this context. Even when physical memory is tight and the pages of an idle process are swapped out, the operating system must still be able to find that process's PCB. In Linux, for instance, the task_struct is kept in kernel memory, which is not swapped out.
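The save-and-restore step described above can be sketched with a toy model. Everything here (toy_pcb, toy_cpu, pcb_list_init, pcb_list_append, context_switch) is hypothetical and invented for this post; a real kernel saves the full register set and does so in architecture-specific code. Only the instruction pointer is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified PCB; not the kernel's task_struct. */
struct toy_pcb {
    int pid;
    unsigned long saved_ip;   /* instruction pointer saved at the last switch */
    struct toy_pcb *prev;     /* doubly linked, circular list of PCBs */
    struct toy_pcb *next;
};

/* Stand-in for the CPU: only the instruction pointer is modelled. */
struct toy_cpu {
    unsigned long ip;
    struct toy_pcb *current;  /* PCB of the running process, or NULL */
};

/* Make head a one-element circular list. */
void pcb_list_init(struct toy_pcb *head)
{
    head->prev = head;
    head->next = head;
}

/* Insert node just before head, i.e. at the tail of the list. */
void pcb_list_append(struct toy_pcb *head, struct toy_pcb *node)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Save the outgoing process's state into its PCB, then load the
 * incoming process's saved state from its PCB into the CPU. */
void context_switch(struct toy_cpu *cpu, struct toy_pcb *next)
{
    assert(cpu != NULL && next != NULL);
    if (cpu->current != NULL)
        cpu->current->saved_ip = cpu->ip;
    cpu->ip = next->saved_ip;
    cpu->current = next;
}
```

A scheduler built on this sketch would pick cpu->current->next as the next PCB and pass it to context_switch, which is roughly what a round-robin policy does.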

3. Fork, Memory Overlaying, Zombie state
A process in UNIX/Linux may create a new process using the fork system call. Upon execution of fork, the operating system creates a new process whose address space is a copy of the parent's address space at the time of the call (modern systems typically share the pages copy-on-write rather than copying them eagerly). This is similar to object cloning in programming.
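The cloning idea can be checked with a small experiment: the child changes a global variable, and the parent then looks at its own copy. This is only a sketch; counter and parent_value_after_child_write are names I made up for the illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 1;  /* lives in the data segment; the child gets its own copy */

/* Returns the parent's view of counter after the child has changed
 * its own copy, or -1 if fork or waitpid fails. */
int parent_value_after_child_write(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        counter = 99;   /* modifies only the child's copy */
        _exit(0);       /* leave without flushing inherited stdio buffers */
    }
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return counter;     /* unchanged in the parent */
}
```

When fork succeeds, the function returns 1: the write to counter happened in the child's address space and never reached the parent.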

Upon successful invocation of fork(), a new process is created; its process identifier is returned to the parent, whereas the child receives a return value of 0.
Typically, after creation, the new process is tasked with some objective as reflected in its own code segment. However, keep in mind that the code segment of the newly created process is identical to that of its parent immediately after its creation, as its address space is a mere replica of that of its parent. To get the child process to execute some other instructions and accomplish a different objective, we use one of the calls in the exec family, to which we pass the path of another program. This is shown in the following program:

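#include <sys/types.h>
#include <sys/wait.h>   /* wait */
#include <stdio.h>
#include <stdlib.h>     /* exit, EXIT_FAILURE, EXIT_SUCCESS */
#include <unistd.h>     /* fork, execlp, _exit */

int main(void) {
    pid_t pid;

    pid = fork();
    if (pid < 0) {              /* fork failed, no child was created */
        fprintf(stderr, "Fork failed\n");
        exit(EXIT_FAILURE);
    }
    else if (pid == 0) {        /* in child */
        execlp("/bin/ls", "ls", (char *)NULL);
        perror("execlp");       /* reached only if execlp failed */
        _exit(EXIT_FAILURE);
    }
    else {                      /* in parent */
        wait(NULL);             /* block until the child terminates */
        printf("Execution of child process is complete\n");
        exit(EXIT_SUCCESS);
    }
}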

This particular example uses execlp whose signature is
int execlp(const char *file, const char *arg, ...);

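We can see that the first argument is a pointer to a string naming the file to execute (if the name contains no slash, execlp searches the directories in PATH for it). The arguments from the second onward are pointers to strings that become the argument list (argv) of the new program, and the list is terminated by a null pointer. In our example, execlp is called with "/bin/ls" as the file to execute and "ls" as the second argument, which becomes argv[0], the name under which the new program sees itself invoked.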

Now, upon execution of the execlp call, the operating system replaces the code segment in the address space of the child process with the code of the program file named in the first argument, and the heap, stack and data segments are reinitialized for the new program. This technique, wherein the address space is entirely replaced while the process identifier stays the same, is called overlaying. It is important to understand that the changes to the address space are made within the context of the process itself: no new process is created.
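One way to convince ourselves that an overlay keeps the process identifier is to let the new program report its own pid. The sketch below assumes a POSIX sh is available on PATH; the child overlays itself with sh, which prints its pid ($$) into a pipe, and the parent compares that with the value fork returned. The function name exec_keeps_pid is invented for this example.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the program started by exec reports the same pid that
 * fork() gave the parent, 0 if it differs, -1 on error. */
int exec_keeps_pid(void)
{
    int fd[2];
    char buf[32];
    ssize_t n;
    pid_t pid;

    if (pipe(fd) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: send stdout into the pipe, then overlay with sh. */
        close(fd[0]);
        dup2(fd[1], STDOUT_FILENO);
        close(fd[1]);
        execlp("sh", "sh", "-c", "echo $$", (char *)NULL);
        _exit(127);  /* reached only if exec failed */
    }

    /* Parent: read the pid printed by the overlaid program. */
    close(fd[1]);
    n = read(fd[0], buf, sizeof buf - 1);
    close(fd[0]);
    if (waitpid(pid, NULL, 0) < 0 || n <= 0)
        return -1;
    buf[n] = '\0';
    return strtol(buf, NULL, 10) == (long)pid;
}
```

The pid that sh prints is the pid of the forked child, because exec replaced the child's program without creating a new process.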

Often, before forking, the parent sets up a communication channel such as a pipe (or a shared memory region) so that it can communicate with its child. This detail has been ignored in the above example.
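For completeness, here is a minimal sketch of the pipe approach: the pipe is created before fork, so both processes inherit its two ends. The function name receive_from_child and its parameters are hypothetical, chosen only for this illustration, and the sketch assumes the message is short enough to arrive in a single read.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child writes msg into a pipe; the parent reads it into buf.
 * Returns the number of bytes read, or -1 on error. */
ssize_t receive_from_child(const char *msg, char *buf, size_t buflen)
{
    int fd[2];
    ssize_t n;
    pid_t pid;

    if (buflen == 0 || pipe(fd) < 0)  /* pipe is created before fork */
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {
        close(fd[0]);                 /* child only writes */
        if (write(fd[1], msg, strlen(msg)) < 0)
            _exit(1);
        close(fd[1]);
        _exit(0);
    }

    close(fd[1]);                     /* parent only reads */
    n = read(fd[0], buf, buflen - 1);
    close(fd[0]);
    waitpid(pid, NULL, 0);            /* reap the child */
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

Each side closes the end it does not use; otherwise the reader would never see end-of-file once the writer is done.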
It is also important to remember that the parent process waits for the completion of the child process using the system call pid_t wait(int *status). Calling wait also keeps the child from remaining in the Zombie state for too long.

In UNIX, when a process finishes executing, the memory allocated to it is reclaimed, but its entry in the process table is not immediately removed, because the parent may still want to read its exit status. Such a process is called a Zombie process, for it has terminated execution but is not yet fully gone. If the parent never executes wait(), the child remains a Zombie for as long as the parent lives. When the parent itself dies, the child is adopted by init (pid = 1), and since init routinely executes wait(), the Zombie is then reaped and its process table entry removed. The argument to wait is NULL in our example, which means that the completion status of the child is not stored. Generally, the parent passes a pointer to an int in which the completion status of the child is stored.
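Passing a status pointer instead of NULL looks roughly like the sketch below. It uses waitpid rather than wait so that it reaps exactly the child it created, and the function name reap_child_exit_code is made up for this post.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits with the given code, reaps it, and returns
 * the exit code as seen by the parent, or -1 on error. */
int reap_child_exit_code(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);              /* child terminates immediately */

    /* Until this call returns, the terminated child is a zombie. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status))
        return -1;                /* for example, killed by a signal */
    return WEXITSTATUS(status);   /* low 8 bits of the child's exit code */
}
```

The raw status word packs more than the exit code, which is why it is decoded with the WIFEXITED and WEXITSTATUS macros instead of being read directly.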



References
[2] Silberschatz, Galvin and Gagne, Operating System Concepts, 7th edition.

