codingfreak

1.16.2012

Notification Chains in Linux Kernel - Part 02

Continuation after PART-1.

Check the PART-3.

Blocking Notifier chains
A blocking notifier chain runs in the process context. The calls in the notification list could be blocked as it runs in the process context. Notifications that are not highly time critical could use blocking notifier chains.

Linux modules use blocking notifier chains to inform the modules on a change in QOS value or the addition of a new device.

<kernel/notifier.c>

186 int blocking_notifier_chain_register(struct blocking_notifier_head *nh,
187         struct notifier_block *n)
188 {
.
199     down_write(&nh->rwsem);
200     ret = notifier_chain_register(&nh->head, n);
201     up_write(&nh->rwsem);
202     return ret;
203 }
204 EXPORT_SYMBOL_GPL(blocking_notifier_chain_register)
.
216 int blocking_notifier_chain_unregister(struct blocking_notifier_head *nh,
217         struct notifier_block *n)
218 {
.
229     down_write(&nh->rwsem);
230     ret = notifier_chain_unregister(&nh->head, n);
231     up_write(&nh->rwsem);
232     return ret;
233 }
234 EXPORT_SYMBOL_GPL(blocking_notifier_chain_unregister);

Notification Chains in Linux Kernel - Part 01

Linux is a monolithic kernel. Its subsystems or modules help to keep the kernel light by being flexible enough to load and unload at runtime. In most cases, the kernel modules are interconnected to one another. An event captured by a certain module might be of interest to another module.

Typically, communication systems implement request-reply messaging, or polling. In such models, a program that receives a request will have to send the data available since the last transaction. Such methods sometimes require high bandwidth or they waste polling cycles.

To fulfill the need for interaction, Linux uses so called notification chains. These notifier chains work in a Publish-Subscribe model. This model is more effective when compared to polling or the request-reply model.

For each notification chain there is a passive side (the notified) and an active side (the notifier), as in the so-called publish-and-subscribe model:

The notified are the subsystems that ask to be notified about the event and that provide a callback function to invoke.

The notifier is the subsystem that experiences an event and calls the callback function.

NOTE: All the code samples are taken from Linux 2.6.24 kernel.

struct notifier_block
The elements of the notification chain's list are of type notifier_block:

<include/linux/notifier.h>

 50 struct notifier_block {
 51     int (*notifier_call)
(struct notifier_block *, unsigned long, void *);
 52     struct notifier_block *next;
 53     int priority;
 54 };

notifier_call - function to execute.

next - used to link together the elements of the list.

priority - the priority of the function. Functions with higher priority are executed first. But in practice, almost all registrations leave the priority out of the notifier_block definition, which means it gets the default value of 0 and execution order ends up depending only on the registration order (i.e., it is a semirandom order).

The notifier_block data structure is a simple linked list of function pointers. The function pointers are registered with ‘functions’ that are to be called when an event occurs. Each module needs to maintain a notifier list. The functions are registered to this notification list. The notification module (publisher) maintains a list head that is used to manage and traverse the notifier block list. The function that subscribes to a module is added to the head of the module’s list by using the register_xxxxxx_notifier API and deletion from the list is done using unregister_xxxxxx_notifier.

Printing logs based on log levels in C

LOG LEVELS ??
As per my definition LOG LEVEL means a way to differentiate the importance of logs in our application. We can divide the logs into categories based on their importance and effect for e.g. ERROR logs are more important than DEBUG logs.

Why do we need to print logs based on LOG LEVEL ??
It is really helpful in projects with millions of lines of source code where the user can't use #defines or #ifdef's in order to maintain DEBUG prints. It is really tiresome to maintain #defines and #ifdef atleast for printing logs.

printk which is a part of LINUX KERNEL supports printing logs based on LOG LEVEL and it is really helpful in debugging kernel.

printf or any of its brothers & sisters don't support the option to print logs depending upon the log levels.

Implementation of Stack using Singly Linked Lists

Stacks are linear data structures which means the data is stored in what looks like a line (although vertically). In simple words we can say

A stack is a last in, first out (LIFO) abstract data type and data structure.

Basic usage of stack at the Architecture level is as a means of allocating and accessing memory.

We can only perform two fundamental operations on a stack: push and pop.

The push operation adds to the top of the list, hiding any items already on the stack, or initializing the stack if it is empty. The pop operation removes an item from the top of the list, and returns this value to the caller. A pop either reveals previously concealed items, or results in an empty list.

A stack is a restricted data structure, because only a small number of operations are performed on it.

strace - diagnostic, debugging and reverse engineering tool

Many times we come across hopeless situations where a program when compiled and installed in GNU/Linux just fails to run. Then we have to trace the output of the misbehaving program. But tracing the output of a program throws up a lot of data and it is a daunting task to go through volumes of data. Still there are cases where we are not fruitful in pin pointing the cause of error.

In this situation strace also known as system-call tracer comes for rescue. It is a debugging tool that monitors the system calls used by a program and all the signals it receives.

A system call is the most common way programs communicate with the kernel. System calls include reading and writing data, opening and closing files and all kinds of network communication. Under Linux, a system call is done by calling a special interrupt with the number of the system call and its parameters stored in the CPU's registers.

Using strace is quite simple. There are two ways to let strace monitor a program.

Method 1:

To start strace along with a program, just run the executable with strace as shown below.

strace program-name

For example let us trace ls command.

$ strace ls
execve("/bin/ls", ["ls"], [/* 39 vars */]) = 0
brk(0)                                  = 0x82d4000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7787000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=76503, ...}) = 0
mmap2(NULL, 76503, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7774000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/libselinux.so.1", O_RDONLY)  = 3
read(3, "177ELF111���������3�3�1���@G��004���"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=104148, ...}) = 0
mmap2(NULL, 109432, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x41d000
mmap2(0x436000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18) = 0x436000
close(3)                                = 0
.
.
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7613000
write(1, "01.c  a.outn", 1201.c  a.out
)           = 12
close(1)                                = 0
munmap(0xb7613000, 4096)                = 0
close(2)                                = 0
exit_group(0)                           = ?

In the above example we are not displaying the complete output of strace command. Even though output from strace looks very complicated, this is only due to many system calls made when loading shared libraries. However, once we have found which system calls are the important ones (mainly open, read, write and the like), the results will look fairly intuitive to us.

Method 2:

If we want to monitor a process which is currently running we can attach to the process using –p option. Thus we can even debug a daemon process.

strace –p <pid-of-the-application>

For e.g

#include <stdio.h>
#include <unistd.h>

int main()
{
   sleep(20);
   return 0;
}

We will compile the above code and run it as a background process. Then we try to monitor the program using its process id as shown below.

$ gcc main.c

$ ./a.out &
[1] 1885

$ strace -p 1885
Process 1885 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
exit_group(0)                           = ?
Process 1885 detached
[1]+  Done                    ./a.out

In contrast to a debugger, strace does not need a program's source code to produce human-readable output.

Some handy options

Below example is used in the discussion of other important options supported by strace.

#include <stdio.h>

int main(void)
{
  FILE *fd = NULL;

  if(fd = fopen("test","rw"))
    {   
      printf("TEST file openedn");
      fclose(fd);
    }   
  else
    {   
      printf("Failed to open the filen");
    }   

  return 0;
}

Providing the time taken by multiple system calls in a program

Using –c option strace provides summary information on executing a program.

It provides information like number of times a system call is used, time spent executing various system calls, number of times errors returned as shown below.

$ strace -c ./a.out 
Failed to open the file
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.47    0.004000        4000         1           execve
  8.53    0.000373         124         3         3 access
  0.00    0.000000           0         1           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         3         1 open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         7           mmap2
  0.00    0.000000           0         3           fstat64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.004373                    29         4 total

Redirecting the output to a file

Using -o option we can redirect the complex output of strace into a file.

$ strace -o <output-file-name> <program-name>

Time spent per system call

Using –T option we can get time spent per system call. In the below example we can see time spent per system call is printed at the end of the line.

$ strace -T ./a.out 
execve("./a.out", ["./a.out"], [/* 39 vars */]) = 0 <0.003256>
.
brk(0x9db0000)                          = 0x9db0000 <0.000123>
open("test", O_RDONLY)                  = -1 ENOENT (No such file or directory) <0.000154>
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000125>
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77d4000 <0.000121>
write(1, "Failed to open the filen", 24Failed to open the file
) = 24 <0.000258>
exit_group(0)                           = ?

Prefixing time of the day for every line in trace

It is useful sometimes to track at what time a particular is triggered. By using -t option strace will prefix each line of the trace with the time of day, which will be really helpful to find out at particular time at which call is the process blocked.

$ strace -t ./a.out 
execve("./a.out", ["./a.out"], [/* 39 vars */]) = 0 <0.003256>
.
brk(0x9db0000)                          = 0x9db0000 <0.000123>
open("test", O_RDONLY)                  = -1 ENOENT (No such file or directory) <0.000154>
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000125>
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77d4000 <0.000121>
write(1, "Failed to open the filen", 24Failed to open the file
) = 24 <0.000258>
exit_group(0)                           = ?

Tracing only specific system calls

Using –e option we can also specify which system calls to be traced. To trace only open() and close() system calls use the following command:

$ strace –e trace=’open,close’ <program-name>

Similarly we can also use negation option to not trace specific system calls. If we don’t want to trace open() system call in previous example we can give the below command

$ strace -e trace='!open,close' ./a.out

Check the man page of strace for other options.