3.1.2.1. Using perf to do Basic Profiling

As a simple test case, we'll profile the 'wget' of a fairly large file, which is a minimally interesting case because it has both file and network I/O aspects, and at least in the case of standard Yocto images, it's implemented as part of busybox, so the methods we use to analyze it can be used in a very similar way to the whole host of supported busybox applets in Yocto.

     root@crownbay:~# rm linux-2.6.19.2.tar.bz2; \
     wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
                

The quickest and easiest way to get some basic overall data about what's going on for a particular workload it to profile it using 'perf stat'. 'perf stat' basically profiles using a few default counters and displays the summed counts at the end of the run:

     root@crownbay:~# perf stat wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
     Connecting to downloads.yoctoproject.org (140.211.169.59:80)
     linux-2.6.19.2.tar.b 100% |***************************************************| 41727k  0:00:00 ETA

     Performance counter stats for 'wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2':

           4597.223902 task-clock                #    0.077 CPUs utilized
                 23568 context-switches          #    0.005 M/sec
                    68 CPU-migrations            #    0.015 K/sec
                   241 page-faults               #    0.052 K/sec
            3045817293 cycles                    #    0.663 GHz
       <not supported> stalled-cycles-frontend
       <not supported> stalled-cycles-backend
             858909167 instructions              #    0.28  insns per cycle
             165441165 branches                  #   35.987 M/sec
              19550329 branch-misses             #   11.82% of all branches

          59.836627620 seconds time elapsed
                

Many times such a simple-minded test doesn't yield much of interest, but sometimes it does (see Real-world Yocto bug (slow loop-mounted write speed)).

Also, note that 'perf stat' isn't restricted to a fixed set of counters - basically any event listed in the output of 'perf list' can be tallied by 'perf stat'. For example, suppose we wanted to see a summary of all the events related to kernel memory allocation/freeing along with cache hits and misses:

     root@crownbay:~# perf stat -e kmem:* -e cache-references -e cache-misses wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
     Connecting to downloads.yoctoproject.org (140.211.169.59:80)
     linux-2.6.19.2.tar.b 100% |***************************************************| 41727k  0:00:00 ETA

     Performance counter stats for 'wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2':

                  5566 kmem:kmalloc
                125517 kmem:kmem_cache_alloc
                     0 kmem:kmalloc_node
                     0 kmem:kmem_cache_alloc_node
                 34401 kmem:kfree
                 69920 kmem:kmem_cache_free
                   133 kmem:mm_page_free
                    41 kmem:mm_page_free_batched
                 11502 kmem:mm_page_alloc
                 11375 kmem:mm_page_alloc_zone_locked
                     0 kmem:mm_page_pcpu_drain
                     0 kmem:mm_page_alloc_extfrag
              66848602 cache-references
               2917740 cache-misses              #    4.365 % of all cache refs

          44.831023415 seconds time elapsed
                

So 'perf stat' gives us a nice easy way to get a quick overview of what might be happening for a set of events, but normally we'd need a little more detail in order to understand what's going on in a way that we can act on in a useful way.

To dive down into a next level of detail, we can use 'perf record'/'perf report' which will collect profiling data and present it to use using an interactive text-based UI (or simply as text if we specify --stdio to 'perf report').

As our first attempt at profiling this workload, we'll simply run 'perf record', handing it the workload we want to profile (everything after 'perf record' and any perf options we hand it - here none - will be executedin a new shell). perf collects samples until the process exits and records them in a file named 'perf.data' in the current working directory.

     root@crownbay:~# perf record wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2

     Connecting to downloads.yoctoproject.org (140.211.169.59:80)
     linux-2.6.19.2.tar.b 100% |************************************************| 41727k  0:00:00 ETA
     [ perf record: Woken up 1 times to write data ]
     [ perf record: Captured and wrote 0.176 MB perf.data (~7700 samples) ]
            

To see the results in a 'text-based UI' (tui), simply run 'perf report', which will read the perf.data file in the current working directory and display the results in an interactive UI:

     root@crownbay:~# perf report
                

The above screenshot displays a 'flat' profile, one entry for each 'bucket' corresponding to the functions that were profiled during the profiling run, ordered from the most popular to the least (perf has options to sort in various orders and keys as well as display entries only above a certain threshold and so on - see the perf documentation for details). Note that this includes both userspace functions (entries containing a [.]) and kernel functions accounted to the process (entries containing a [k]). (perf has command-line modifiers that can be used to restrict the profiling to kernel or userspace, among others).

Notice also that the above report shows an entry for 'busybox', which is the executable that implements 'wget' in Yocto, but that instead of a useful function name in that entry, it displays an not-so-friendly hex value instead. The steps below will show how to fix that problem.

Before we do that, however, let's try running a different profile, one which shows something a little more interesting. The only difference between the new profile and the previous one is that we'll add the -g option, which will record not just the address of a sampled function, but the entire callchain to the sampled function as well:

     root@crownbay:~# perf record -g wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
     Connecting to downloads.yoctoproject.org (140.211.169.59:80)
     linux-2.6.19.2.tar.b 100% |************************************************| 41727k  0:00:00 ETA
     [ perf record: Woken up 3 times to write data ]
     [ perf record: Captured and wrote 0.652 MB perf.data (~28476 samples) ]


     root@crownbay:~# perf report
                

Using the callgraph view, we can actually see not only which functions took the most time, but we can also see a summary of how those functions were called and learn something about how the program interacts with the kernel in the process.

Notice that each entry in the above screenshot now contains a '+' on the left-hand side. This means that we can expand the entry and drill down into the callchains that feed into that entry. Pressing 'enter' on any one of them will expand the callchain (you can also press 'E' to expand them all at the same time or 'C' to collapse them all).

In the screenshot above, we've toggled the __copy_to_user_ll() entry and several subnodes all the way down. This lets us see which callchains contributed to the profiled __copy_to_user_ll() function which contributed 1.77% to the total profile.

As a bit of background explanation for these callchains, think about what happens at a high level when you run wget to get a file out on the network. Basically what happens is that the data comes into the kernel via the network connection (socket) and is passed to the userspace program 'wget' (which is actually a part of busybox, but that's not important for now), which takes the buffers the kernel passes to it and writes it to a disk file to save it.

The part of this process that we're looking at in the above call stacks is the part where the kernel passes the data it's read from the socket down to wget i.e. a copy-to-user.

Notice also that here there's also a case where the a hex value is displayed in the callstack, here in the expanded sys_clock_gettime() function. Later we'll see it resolve to a userspace function call in busybox.

The above screenshot shows the other half of the journey for the data - from the wget program's userspace buffers to disk. To get the buffers to disk, the wget program issues a write(2), which does a copy-from-user to the kernel, which then takes care via some circuitous path (probably also present somewhere in the profile data), to get it safely to disk.

Now that we've seen the basic layout of the profile data and the basics of how to extract useful information out of it, let's get back to the task at hand and see if we can get some basic idea about where the time is spent in the program we're profiling, wget. Remember that wget is actually implemented as an applet in busybox, so while the process name is 'wget', the executable we're actually interested in is busybox. So let's expand the first entry containing busybox:

Again, before we expanded we saw that the function was labeled with a hex value instead of a symbol as with most of the kernel entries. Expanding the busybox entry doesn't make it any better.

The problem is that perf can't find the symbol information for the busybox binary, which is actually stripped out by the Yocto build system.

One way around that is to put the following in your local.conf when you build the image:

     INHIBIT_PACKAGE_STRIP = "1"
                

However, we already have an image with the binaries stripped, so what can we do to get perf to resolve the symbols? Basically we need to install the debuginfo for the busybox package.

To generate the debug info for the packages in the image, we can to add dbg-pkgs to EXTRA_IMAGE_FEATURES in local.conf. For example:

     EXTRA_IMAGE_FEATURES = "debug-tweaks tools-profile dbg-pkgs"
                

Additionally, in order to generate the type of debuginfo that perf understands, we also need to add the following to local.conf:

     PACKAGE_DEBUG_SPLIT_STYLE = 'debug-file-directory'
                

Once we've done that, we can install the debuginfo for busybox. The debug packages once built can be found in build/tmp/deploy/rpm/* on the host system. Find the busybox-dbg-...rpm file and copy it to the target. For example:

     [trz@empanada core2]$ scp /home/trz/yocto/crownbay-tracing-dbg/build/tmp/deploy/rpm/core2/busybox-dbg-1.20.2-r2.core2.rpm root@192.168.1.31:
     root@192.168.1.31's password:
     busybox-dbg-1.20.2-r2.core2.rpm                     100% 1826KB   1.8MB/s   00:01
                

Now install the debug rpm on the target:

     root@crownbay:~# rpm -i busybox-dbg-1.20.2-r2.core2.rpm
                

Now that the debuginfo is installed, we see that the busybox entries now display their functions symbolically:

If we expand one of the entries and press 'enter' on a leaf node, we're presented with a menu of actions we can take to get more information related to that entry:

One of these actions allows us to show a view that displays a busybox-centric view of the profiled functions (in this case we've also expanded all the nodes using the 'E' key):

Finally, we can see that now that the busybox debuginfo is installed, the previously unresolved symbol in the sys_clock_gettime() entry mentioned previously is now resolved, and shows that the sys_clock_gettime system call that was the source of 6.75% of the copy-to-user overhead was initiated by the handle_input() busybox function:

At the lowest level of detail, we can dive down to the assembly level and see which instructions caused the most overhead in a function. Pressing 'enter' on the 'udhcpc_main' function, we're again presented with a menu:

Selecting 'Annotate udhcpc_main', we get a detailed listing of percentages by instruction for the udhcpc_main function. From the display, we can see that over 50% of the time spent in this function is taken up by a couple tests and the move of a constant (1) to a register:

As a segue into tracing, let's try another profile using a different counter, something other than the default 'cycles'.

The tracing and profiling infrastructure in Linux has become unified in a way that allows us to use the same tool with a completely different set of counters, not just the standard hardware counters that traditionally tools have had to restrict themselves to (of course the traditional tools can also make use of the expanded possibilities now available to them, and in some cases have, as mentioned previously).

We can get a list of the available events that can be used to profile a workload via 'perf list':

     root@crownbay:~# perf list

     List of pre-defined events (to be used in -e):
      cpu-cycles OR cycles                               [Hardware event]
      stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
      stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
      instructions                                       [Hardware event]
      cache-references                                   [Hardware event]
      cache-misses                                       [Hardware event]
      branch-instructions OR branches                    [Hardware event]
      branch-misses                                      [Hardware event]
      bus-cycles                                         [Hardware event]
      ref-cycles                                         [Hardware event]

      cpu-clock                                          [Software event]
      task-clock                                         [Software event]
      page-faults OR faults                              [Software event]
      minor-faults                                       [Software event]
      major-faults                                       [Software event]
      context-switches OR cs                             [Software event]
      cpu-migrations OR migrations                       [Software event]
      alignment-faults                                   [Software event]
      emulation-faults                                   [Software event]

      L1-dcache-loads                                    [Hardware cache event]
      L1-dcache-load-misses                              [Hardware cache event]
      L1-dcache-prefetch-misses                          [Hardware cache event]
      L1-icache-loads                                    [Hardware cache event]
      L1-icache-load-misses                              [Hardware cache event]
      .
      .
      .
      rNNN                                               [Raw hardware event descriptor]
      cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
       (see 'perf list --help' on how to encode it)

      mem:<addr>[:access]                                [Hardware breakpoint]

      sunrpc:rpc_call_status                             [Tracepoint event]
      sunrpc:rpc_bind_status                             [Tracepoint event]
      sunrpc:rpc_connect_status                          [Tracepoint event]
      sunrpc:rpc_task_begin                              [Tracepoint event]
      skb:kfree_skb                                      [Tracepoint event]
      skb:consume_skb                                    [Tracepoint event]
      skb:skb_copy_datagram_iovec                        [Tracepoint event]
      net:net_dev_xmit                                   [Tracepoint event]
      net:net_dev_queue                                  [Tracepoint event]
      net:netif_receive_skb                              [Tracepoint event]
      net:netif_rx                                       [Tracepoint event]
      napi:napi_poll                                     [Tracepoint event]
      sock:sock_rcvqueue_full                            [Tracepoint event]
      sock:sock_exceed_buf_limit                         [Tracepoint event]
      udp:udp_fail_queue_rcv_skb                         [Tracepoint event]
      hda:hda_send_cmd                                   [Tracepoint event]
      hda:hda_get_response                               [Tracepoint event]
      hda:hda_bus_reset                                  [Tracepoint event]
      scsi:scsi_dispatch_cmd_start                       [Tracepoint event]
      scsi:scsi_dispatch_cmd_error                       [Tracepoint event]
      scsi:scsi_eh_wakeup                                [Tracepoint event]
      drm:drm_vblank_event                               [Tracepoint event]
      drm:drm_vblank_event_queued                        [Tracepoint event]
      drm:drm_vblank_event_delivered                     [Tracepoint event]
      random:mix_pool_bytes                              [Tracepoint event]
      random:mix_pool_bytes_nolock                       [Tracepoint event]
      random:credit_entropy_bits                         [Tracepoint event]
      gpio:gpio_direction                                [Tracepoint event]
      gpio:gpio_value                                    [Tracepoint event]
      block:block_rq_abort                               [Tracepoint event]
      block:block_rq_requeue                             [Tracepoint event]
      block:block_rq_issue                               [Tracepoint event]
      block:block_bio_bounce                             [Tracepoint event]
      block:block_bio_complete                           [Tracepoint event]
      block:block_bio_backmerge                          [Tracepoint event]
      .
      .
      writeback:writeback_wake_thread                    [Tracepoint event]
      writeback:writeback_wake_forker_thread             [Tracepoint event]
      writeback:writeback_bdi_register                   [Tracepoint event]
      .
      .
      writeback:writeback_single_inode_requeue           [Tracepoint event]
      writeback:writeback_single_inode                   [Tracepoint event]
      kmem:kmalloc                                       [Tracepoint event]
      kmem:kmem_cache_alloc                              [Tracepoint event]
      kmem:mm_page_alloc                                 [Tracepoint event]
      kmem:mm_page_alloc_zone_locked                     [Tracepoint event]
      kmem:mm_page_pcpu_drain                            [Tracepoint event]
      kmem:mm_page_alloc_extfrag                         [Tracepoint event]
      vmscan:mm_vmscan_kswapd_sleep                      [Tracepoint event]
      vmscan:mm_vmscan_kswapd_wake                       [Tracepoint event]
      vmscan:mm_vmscan_wakeup_kswapd                     [Tracepoint event]
      vmscan:mm_vmscan_direct_reclaim_begin              [Tracepoint event]
      .
      .
      module:module_get                                  [Tracepoint event]
      module:module_put                                  [Tracepoint event]
      module:module_request                              [Tracepoint event]
      sched:sched_kthread_stop                           [Tracepoint event]
      sched:sched_wakeup                                 [Tracepoint event]
      sched:sched_wakeup_new                             [Tracepoint event]
      sched:sched_process_fork                           [Tracepoint event]
      sched:sched_process_exec                           [Tracepoint event]
      sched:sched_stat_runtime                           [Tracepoint event]
      rcu:rcu_utilization                                [Tracepoint event]
      workqueue:workqueue_queue_work                     [Tracepoint event]
      workqueue:workqueue_execute_end                    [Tracepoint event]
      signal:signal_generate                             [Tracepoint event]
      signal:signal_deliver                              [Tracepoint event]
      timer:timer_init                                   [Tracepoint event]
      timer:timer_start                                  [Tracepoint event]
      timer:hrtimer_cancel                               [Tracepoint event]
      timer:itimer_state                                 [Tracepoint event]
      timer:itimer_expire                                [Tracepoint event]
      irq:irq_handler_entry                              [Tracepoint event]
      irq:irq_handler_exit                               [Tracepoint event]
      irq:softirq_entry                                  [Tracepoint event]
      irq:softirq_exit                                   [Tracepoint event]
      irq:softirq_raise                                  [Tracepoint event]
      printk:console                                     [Tracepoint event]
      task:task_newtask                                  [Tracepoint event]
      task:task_rename                                   [Tracepoint event]
      syscalls:sys_enter_socketcall                      [Tracepoint event]
      syscalls:sys_exit_socketcall                       [Tracepoint event]
      .
      .
      .
      syscalls:sys_enter_unshare                         [Tracepoint event]
      syscalls:sys_exit_unshare                          [Tracepoint event]
      raw_syscalls:sys_enter                             [Tracepoint event]
      raw_syscalls:sys_exit                              [Tracepoint event]
                

Tying it Together: These are exactly the same set of events defined by the trace event subsystem and exposed by ftrace/tracecmd/kernelshark as files in /sys/kernel/debug/tracing/events, by SystemTap as kernel.trace("tracepoint_name") and (partially) accessed by LTTng.

Only a subset of these would be of interest to us when looking at this workload, so let's choose the most likely subsystems (identified by the string before the colon in the Tracepoint events) and do a 'perf stat' run using only those wildcarded subsystems:

     root@crownbay:~# perf stat -e skb:* -e net:* -e napi:* -e sched:* -e workqueue:* -e irq:* -e syscalls:* wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
     Performance counter stats for 'wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2':

                 23323 skb:kfree_skb
                     0 skb:consume_skb
                 49897 skb:skb_copy_datagram_iovec
                  6217 net:net_dev_xmit
                  6217 net:net_dev_queue
                  7962 net:netif_receive_skb
                     2 net:netif_rx
                  8340 napi:napi_poll
                     0 sched:sched_kthread_stop
                     0 sched:sched_kthread_stop_ret
                  3749 sched:sched_wakeup
                     0 sched:sched_wakeup_new
                     0 sched:sched_switch
                    29 sched:sched_migrate_task
                     0 sched:sched_process_free
                     1 sched:sched_process_exit
                     0 sched:sched_wait_task
                     0 sched:sched_process_wait
                     0 sched:sched_process_fork
                     1 sched:sched_process_exec
                     0 sched:sched_stat_wait
         2106519415641 sched:sched_stat_sleep
                     0 sched:sched_stat_iowait
             147453613 sched:sched_stat_blocked
           12903026955 sched:sched_stat_runtime
                     0 sched:sched_pi_setprio
                  3574 workqueue:workqueue_queue_work
                  3574 workqueue:workqueue_activate_work
                     0 workqueue:workqueue_execute_start
                     0 workqueue:workqueue_execute_end
                 16631 irq:irq_handler_entry
                 16631 irq:irq_handler_exit
                 28521 irq:softirq_entry
                 28521 irq:softirq_exit
                 28728 irq:softirq_raise
                     1 syscalls:sys_enter_sendmmsg
                     1 syscalls:sys_exit_sendmmsg
                     0 syscalls:sys_enter_recvmmsg
                     0 syscalls:sys_exit_recvmmsg
                    14 syscalls:sys_enter_socketcall
                    14 syscalls:sys_exit_socketcall
                       .
                       .
                       .
                 16965 syscalls:sys_enter_read
                 16965 syscalls:sys_exit_read
                 12854 syscalls:sys_enter_write
                 12854 syscalls:sys_exit_write
                       .
                       .
                       .

          58.029710972 seconds time elapsed
                

Let's pick one of these tracepoints and tell perf to do a profile using it as the sampling event:

     root@crownbay:~# perf record -g -e sched:sched_wakeup wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
                

The screenshot above shows the results of running a profile using sched:sched_switch tracepoint, which shows the relative costs of various paths to sched_wakeup (note that sched_wakeup is the name of the tracepoint - it's actually defined just inside ttwu_do_wakeup(), which accounts for the function name actually displayed in the profile:

     /*
      * Mark the task runnable and perform wakeup-preemption.
      */
     static void
     ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
     {
          trace_sched_wakeup(p, true);
          .
          .
          .
     }
                

A couple of the more interesting callchains are expanded and displayed above, basically some network receive paths that presumably end up waking up wget (busybox) when network data is ready.

Note that because tracepoints are normally used for tracing, the default sampling period for tracepoints is 1 i.e. for tracepoints perf will sample on every event occurrence (this can be changed using the -c option). This is in contrast to hardware counters such as for example the default 'cycles' hardware counter used for normal profiling, where sampling periods are much higher (in the thousands) because profiling should have as low an overhead as possible and sampling on every cycle would be prohibitively expensive.