2.10 Profiling the kernel

There are several facilities to see where the kernel spends its resources. A simple one is the profiling function, that stores the current EIP (instruction pointer) at each clock tick.

Boot the kernel with command line option profile=2 (or some other number instead of 2). This will cause a file /proc/profile to be created. The number given after profile= is the number of positions EIP is shifted right when profiling. So a large number gives a coarse profile. The counters are reset by writing to /proc/profile. The utility readprofile will output statistics for you. It does not sort - you have to invoke sort explicitly. But given a memory map it will translate addresses to kernel symbols.

See kernel/profile.c and fs/proc/proc_misc.c and readprofile(1).

For example:

# echo > /proc/profile
...
# readprofile -m System.map-2.5.59 | sort -nr | head -2
510502 total                                      0.1534
508548 default_idle                           10594.7500

The first column gives the number of timer ticks. The last column gives the number of ticks divided by the size of the function.

The command readprofile -r is equivalent to echo > /proc/profile.

2.10.1 Oprofile

A more advanced mechanism is given by oprofile.

Prepare kernel

Build a kernel (2.5.43 or later) with CONFIG_PROFILING=y and CONFIG_OPROFILE=y. Now the kernel knows about the oprofilefs virtual filesystem. The utilities mentioned below will mount it on /dev/oprofile. It is a good idea to add idle=poll to the kernel command line; this will make sure time spent in the idle thread is properly accounted for.

Install oprofile

Get the oprofile utility from http://oprofile.sourceforge.net/. Configure with ./configure --with-kernel-support, then make and (as root) make install. This yields binaries and a man page:

% ls /usr/local/bin
op_dump  op_merge    op_start  op_time       opcontrol    oprofiled
op_help  op_session  op_stop   op_to_source  oprof_start  oprofpp
% ls /usr/local/share/oprofile
stl.pat
% ls /usr/local/share/doc/oprofile
oprofile.html
% ls /usr/local/man/man1
op_help.1   op_time.1       opcontrol.1  oprofiled.1
op_merge.1  op_to_source.1  oprofile.1   oprofpp.1
%
All man pages are links to the oprofile man page. Thus, there are two sources of information:
% man oprofile
% mozilla /usr/local/share/doc/oprofile/oprofile.html

Setup oprofile

Make sure you are root and your PATH contains /usr/local/bin. The first action required is invoking opcontrol --setup .... This will create a setup file /root/.oprofile/daemonrc. You need the big kernel toplevel vmlinux file (not bzImage, and not the smaller arch/i386/boot/compressed/vmlinux).

# opcontrol --setup --vmlinux=/foo/vmlinux
# cat /root/.oprofile/daemonrc
IGNORE_MYSELF=0
SEPARATE_LIB_SAMPLES=0
SEPARATE_KERNEL_SAMPLES=0
VMLINUX=/foo/vmlinux
BUF_SIZE=0
one_enabled=1
#
The precise setup call needed depends on your hardware. The above example is for an old Pentium without hardware performance counters. For a P3, use
opcontrol --setup --vmlinux=/foo/vmlinux --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=100000
For a P4, use
opcontrol --setup --vmlinux=/foo/vmlinux --ctr0-event=GLOBAL_POWER_EVENTS --ctr0-unit-mask=1 --ctr0-count=100000
For an Athlon or x86-64, use
opcontrol --setup --vmlinux=/foo/vmlinux --ctr0-event=RETIRED_INSNS --ctr0-count=100000
There are many other possible setup options. The command op_help will list them once you have done the setup :-). (See also the sourceforge site for Intel P6/PII/PIII, Intel P4 and AMD events.)

The general idea is that one specifies a certain type of event (such as CPU_CLK_UNHALTED, a CPU clock cycle), and a count (like the 100000 above). Now once every count occurrences of this event the value of EIP is recorded. Pick the value of count suitably: with a 400 MHz machine a count of 100000 for CPU_CLK_UNHALTED means 4000 interrupts/second. Too small a count and the machine dies an interrupt death. Too large a count and the profile will be very coarse.

Use of oprofile

First start the oprofile daemon.

# opcontrol --start-daemon
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
# ps ax | grep oprofile
... /usr/local/bin/oprofiled ...

Next clear out old profiling data.

# opcontrol --reset

Next start measuring, do whatever should be measured, and stop measuring.

# opcontrol --start
Profiler running.
# do_something
# opcontrol --stop
Stopping profiling.
#

One now has profiling data below /var/lib/oprofile/. See below for what to do with the data.

When no more profiling is needed, kill the daemon:

# opcontrol --shutdown
Stopping profiling.
Killing daemon.
#

To clean up all data generated by oprofile (after generating any desired output):

# rm -r /var/lib/oprofile

A partial cleanup is done by the opcontrol --reset mentioned above.

Output

To get a printout of the data for /bin/foo, use, e.g. oprofpp -l -i /bin/foo. The output will be boring, unless /bin/foo was not stripped after compilation, so that it contains symbol information.

The programs and libraries that were invoked (and hence are suitable arguments for oprofpp -l -i ...) and the numbers of ticks spent in each are given by the command op_time. With the -l option the output will be split according to symbol.

# op_time | tail -3
3134       9.0531  0.0000 /lib/i686/libc-2.2.4.so
4813      13.9032  0.0000 /usr/bin/find
26212     75.7178  0.0000 /foo/vmlinux
# op_time -l | tail -6
c012c7c0 703      2.04509     kmem_cache_alloc        /foo/vmlinux
c02146c0 786      2.28655     __copy_to_user_ll       /foo/vmlinux
c0169b70 809      2.35345     ext3_readdir            /foo/vmlinux
c01476f0 854      2.48436     link_path_walk          /foo/vmlinux
c016fcd0 1446     4.20655     ext3_find_entry         /foo/vmlinux
00000000 4591     13.3556     (no symbol)             /usr/bin/find
#

To get statistics for the kernel only, use the vmlinux name specified.

# oprofpp -l -i /foo/vmlinux | tail
c012ca30 488      1.86174     kmem_cache_free
c010e280 496      1.89226     mask_and_ack_8259A
c010a61a 506      1.93041     restore_all
c0119220 603      2.30047     do_softirq
c0110b30 663      2.52938     delay_tsc
c012c7c0 703      2.68198     kmem_cache_alloc
c02146c0 786      2.99863     __copy_to_user_ll
c0169b70 809      3.08637     ext3_readdir
c01476f0 854      3.25805     link_path_walk
c016fcd0 1446     5.51656     ext3_find_entry
#

One can get disassembly or annotated source with indication on where the counts occurred.

# op_to_source -a -i /foo/vmlinux
...
/* 424 1.618% */
c0169ac0 <ext3_check_dir_entry>:
 /* 6 0.02289% */
c0169ac0:       push   %edi
 /* 56 0.2136% */
c0169ac1:       push   %esi
c0169ac2:       push   %ebx
c0169ac3:       mov    0x18(%esp,1),%esi
 /* 43 0.164% */
c0169ac7:       xor    %ebx,%ebx
c0169ac9:       mov    0x14(%esp,1),%edi
c0169acd:       movzwl 0x4(%esi),%ecx
 /* 23 0.08775% */
c0169ad1:       cmp    $0xb,%ecx
c0169ad4:       jg     c0169ae0 <ext3_check_dir_entry+0x20>
c0169ad6:       mov    $0xc032b160,%ebx
...

# op_to_source --source-dir=. --output-dir=/tmp -i /path/binary
# diff -u ./source.c /tmp
...
 int get_line(register FILE *f, int *length)
+/* get_line 24 54.55% */
...
+       /* 5 11.36% */
        c = Getc (f);
...

This profile with annotated source is available only for binaries compiled with -g2 (and not stripped) so that they still contain source line numbers.