There are several facilities to see where the kernel spends its resources. A simple one is the profiling function, that stores the current EIP (instruction pointer) at each clock tick.
Boot the kernel with command line option profile=2
(or some other number instead of 2). This will cause
a file /proc/profile
to be created.
The number given after profile=
is the number of positions
EIP is shifted right when profiling. So a large number gives a
coarse profile.
The counters are reset by writing to /proc/profile
.
The utility readprofile
will output statistics for you.
It does not sort - you have to invoke sort
explicitly.
But given a memory map it will translate addresses to kernel symbols.
See kernel/profile.c
and fs/proc/proc_misc.c
and readprofile(1)
.
For example:
# echo > /proc/profile ... # readprofile -m System.map-2.5.59 | sort -nr | head -2 510502 total 0.1534 508548 default_idle 10594.7500
The first column gives the number of timer ticks. The last column gives the number of ticks divided by the size of the function.
The command readprofile -r
is equivalent to
echo > /proc/profile
.
A more advanced mechanism is given by oprofile.
Build a kernel (2.5.43 or later) with CONFIG_PROFILING=y
and CONFIG_OPROFILE=y. Now the kernel knows about the
oprofilefs
virtual filesystem. The utilities
mentioned below will mount it on /dev/oprofile
.
It is a good idea to add idle=poll
to the kernel
command line; this will make sure time spent in the idle thread
is properly accounted for.
Get the oprofile utility from
http://oprofile.sourceforge.net/.
Configure with ./configure --with-kernel-support
,
then make
and (as root) make install
.
This yields binaries and a man page:
% ls /usr/local/bin op_dump op_merge op_start op_time opcontrol oprofiled op_help op_session op_stop op_to_source oprof_start oprofpp % ls /usr/local/share/oprofile stl.pat % ls /usr/local/share/doc/oprofile oprofile.html % ls /usr/local/man/man1 op_help.1 op_time.1 opcontrol.1 oprofiled.1 op_merge.1 op_to_source.1 oprofile.1 oprofpp.1 %All man pages are links to the
oprofile
man page.
Thus, there are two sources of information:
% man oprofile % mozilla /usr/local/share/doc/oprofile/oprofile.html
Make sure you are root and your PATH contains /usr/local/bin
.
The first action required is invoking opcontrol --setup ...
.
This will create a setup file /root/.oprofile/daemonrc
.
You need the big kernel toplevel vmlinux
file (not bzImage
,
and not the smaller arch/i386/boot/compressed/vmlinux
).
# opcontrol --setup --vmlinux=/foo/vmlinux # cat /root/.oprofile/daemonrc IGNORE_MYSELF=0 SEPARATE_LIB_SAMPLES=0 SEPARATE_KERNEL_SAMPLES=0 VMLINUX=/foo/vmlinux BUF_SIZE=0 one_enabled=1 #The precise setup call needed depends on your hardware. The above example is for an old Pentium without hardware performance counters. For a P3, use
opcontrol --setup --vmlinux=/foo/vmlinux --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=100000For a P4, use
opcontrol --setup --vmlinux=/foo/vmlinux --ctr0-event=GLOBAL_POWER_EVENTS --ctr0-unit-mask=1 --ctr0-count=100000For an Athlon or x86-64, use
opcontrol --setup --vmlinux=/foo/vmlinux --ctr0-event=RETIRED_INSNS --ctr0-count=100000There are many other possible setup options. The command
op_help
will list them once you have done the setup :-).
(See also the sourceforge site for
Intel P6/PII/PIII,
Intel P4 and
AMD events.)
The general idea is that one specifies a certain type of event (such as CPU_CLK_UNHALTED, a CPU clock cycle), and a count (like the 100000 above). Now once every count occurrences of this event the value of EIP is recorded. Pick the value of count suitably: with a 400 MHz machine a count of 100000 for CPU_CLK_UNHALTED means 4000 interrupts/second. Too small a count and the machine dies an interrupt death. Too large a count and the profile will be very coarse.
First start the oprofile daemon.
# opcontrol --start-daemon Using log file /var/lib/oprofile/oprofiled.log Daemon started. # ps ax | grep oprofile ... /usr/local/bin/oprofiled ...
Next clear out old profiling data.
# opcontrol --reset
Next start measuring, do whatever should be measured, and stop measuring.
# opcontrol --start Profiler running. # do_something # opcontrol --stop Stopping profiling. #
One now has profiling data below /var/lib/oprofile/
.
See below for what to do with the data.
When no more profiling is needed, kill the daemon:
# opcontrol --shutdown Stopping profiling. Killing daemon. #
To clean up all data generated by oprofile (after generating any desired output):
# rm -r /var/lib/oprofile
A partial cleanup is done by the opcontrol --reset
mentioned above.
To get a printout of the data for /bin/foo
, use, e.g.
oprofpp -l -i /bin/foo
. The output will be boring,
unless /bin/foo
was not stripped after compilation,
so that it contains symbol information.
The programs and libraries that were invoked (and hence are suitable
arguments for oprofpp -l -i ...
) and the numbers of ticks spent
in each are given by the command op_time
.
With the -l
option the output will be split according to symbol.
# op_time | tail -3 3134 9.0531 0.0000 /lib/i686/libc-2.2.4.so 4813 13.9032 0.0000 /usr/bin/find 26212 75.7178 0.0000 /foo/vmlinux # op_time -l | tail -6 c012c7c0 703 2.04509 kmem_cache_alloc /foo/vmlinux c02146c0 786 2.28655 __copy_to_user_ll /foo/vmlinux c0169b70 809 2.35345 ext3_readdir /foo/vmlinux c01476f0 854 2.48436 link_path_walk /foo/vmlinux c016fcd0 1446 4.20655 ext3_find_entry /foo/vmlinux 00000000 4591 13.3556 (no symbol) /usr/bin/find #
To get statistics for the kernel only, use the vmlinux
name specified.
# oprofpp -l -i /foo/vmlinux | tail c012ca30 488 1.86174 kmem_cache_free c010e280 496 1.89226 mask_and_ack_8259A c010a61a 506 1.93041 restore_all c0119220 603 2.30047 do_softirq c0110b30 663 2.52938 delay_tsc c012c7c0 703 2.68198 kmem_cache_alloc c02146c0 786 2.99863 __copy_to_user_ll c0169b70 809 3.08637 ext3_readdir c01476f0 854 3.25805 link_path_walk c016fcd0 1446 5.51656 ext3_find_entry #
One can get disassembly or annotated source with indication on where the counts occurred.
# op_to_source -a -i /foo/vmlinux ... /* 424 1.618% */ c0169ac0 <ext3_check_dir_entry>: /* 6 0.02289% */ c0169ac0: push %edi /* 56 0.2136% */ c0169ac1: push %esi c0169ac2: push %ebx c0169ac3: mov 0x18(%esp,1),%esi /* 43 0.164% */ c0169ac7: xor %ebx,%ebx c0169ac9: mov 0x14(%esp,1),%edi c0169acd: movzwl 0x4(%esi),%ecx /* 23 0.08775% */ c0169ad1: cmp $0xb,%ecx c0169ad4: jg c0169ae0 <ext3_check_dir_entry+0x20> c0169ad6: mov $0xc032b160,%ebx ...
# op_to_source --source-dir=. --output-dir=/tmp -i /path/binary # diff -u ./source.c /tmp ... int get_line(register FILE *f, int *length) +/* get_line 24 54.55% */ ... + /* 5 11.36% */ c = Getc (f); ...
This profile with annotated source is available only for binaries
compiled with -g2
(and not stripped) so that they still
contain source line numbers.