WPA–Xperf Trace Analysis Reimagined
For many years xperfview.exe has been the main tool for analyzing xperf/ETW traces. However starting last fall the Windows Performance Toolkit started including wpa.exe as an alternative. While the early versions had some significant rough edges, the latest version (6.2.8400.0, released in tandem with Windows 8 RC) is now superior to xperfview in most ways.
In this post I’ll briefly explain how to switch from using xperfview to WPA, and why this is worthwhile.
Before proceeding you should be sure to get the latest version of WPA, to avoid being frustrated by bugs that are already fixed. The latest version as of June 19, 2012, is available as part of the ADK or as part of the Windows Software Development Kit (SDK) for Windows 8 Release Preview. Gotta love those Microsoft names.
This post assumes some familiarity with xperf/ETW and xperfview. In particular you should definitely know now to record a trace. If you don’t follow those instructions then you may find that some of the graphs below are empty. Don’t say I didn’t warn you. You can also grab the sample trace from my wait analysis post.
It may be useful to understand xperf analysis basics, xperf for excess CPU consumption, and xperf wait analysis, but much of the UI flow is quite different with WPA, so some of the details will be different.
The two articles in the xperf documentation series, CPU sampling and CPU scheduling, are still relevant – understanding the column names is as important as ever. Those articles are timeless classics. The UI for selecting columns has changed, but that is easy enough to adjust to.
When you first launch WPA it is dauntingly austere and blank. All of the graphs are collapsed and hidden:
The first step is to start dragging some graphs into the analysis area. The exact set of graphs depends on what type of analysis you are planning to do, and I don’t claim that my recommendations are one-size-fits-all, but they should be a reasonable starting point.
App-specific generic events can be crucial for navigating a trace. I use them to identify frame boundaries, user input, and other key events. With them I can see when the frame-rate drops, and I can see what events the drop correlates to. If you have followed the instructions at “Xperf Basics: Recording a Trace” then you should be calling functions like ETWMark() and ETWRenderFrameMark() – or whatever alternative functions you created – and the time-stamped data from these functions shows up in Generic Events.
The display of generic events in xperfview, the old trace viewing tool, left much to be desired:
If I looked in the ProviderIds drop-down then I could see that these blue diamonds corresponded to my Valve-FrameRate provider, but that provider emits events for simulation ticks as well as render ticks, and the distinction is invisible in xperfview.
The display of generic events in WPA is much improved. The events are displayed hierarchically so that I can drill down and see the simulation and render ticks on separate lines. The layout is configurable so if I want to I can save a bit of space by removing “Task Name” from the hierarchy:
If Generic Events are relevant to your trace analysis then open up the System Activity section and drag Generic Events onto the analysis area. On the other hand, if you haven’t created and registered your own providers and put in calls to their functions then there won’t be much of interest in this graph and you should probably not bother with it.
Window in Focus
This graph shows what process’ window is active. This can be helpful to choose where to analyze. Most games lower their frame rate when they are not active and investigating a poor frame rate when you’re intentionally going idle is pointless.
In the graph shown below the user switched from explorer to WLXPhotoGallery, and then DWM owned the active window for a while.
Aside: DWM becomes active whenever Windows detects that a program has hung – DWM takes over window management so that the user can still move the hung window around.
The Windows in Focus graph can be found in the System Activity section. Drag it over if you think it will be useful.
CPU Usage (Precise)
CPU Usage (Precise) is the graph formerly known as CPU Usage (there were three variants for grouping by thread/process/CPU). This graph is constructed from context switch data (and interrupt and DPC data) which means it is a sub-microsecond accurate measurement of what thread is running when and why, as well as exactly how much CPU time each thread and process is consuming. That seems pretty important so you probably want this one available. If you open up the Computation section and then within that open up the CPU Usage (Precise) section then you will find several graphs. I normally use the “Utilization by Process, Thread” graph, but the “Timeline by Process, Thread” graph also looks interesting, and makes some patterns more obvious. In particular, if two threads or processes are ping-ponging (taking turns executing) this behavior is much easier to follow on the timeline graph. However the timeline graph devolves into solid bars of color when zoomed out – it really only works when examining fine details.
It is unfortunate that the color coding for the utilization and timeline graphs are not consistent.
The screen shot to the right shows the utilization and timeline graphs over a 20 ms time period.
CPU Usage (Sampled)
CPU Usage (Sampled) is the graph formerly known as CPU Sampling (there were three variants for grouping by thread/process/CPU). This data is constructed using data from the profile provider, which periodically interrupts all CPUs to see what they are doing, with a default rate of 1 KHz. Because this data is sampled it cannot tell you what is going on between samples. However with enough samples and with well behaved code the sampling data can be extremely useful. In particular, if you have call stack collection enabled for the profile provider then each sample includes a call stack.
In general I would say that the main purpose of the sampled CPU data is looking at the call stacks, because this tells you (subject to sampling artifacts) what your code is actually doing. Therefore it seems passing strange that WPA defaults to not showing you call stacks.
Since most programs are CPU bound at least part of the time I think this data should always be enabled, so open up the Computation Section, then open up the CPU Usage (Sampled section) and grab one of the graphs. I’d recommend “Utilization by Process”.
When you drag this graph over you now have two graphs showing CPU usage – one precise and one sampled. That’s a waste of space. The precise CPU usage is, well, more precise so we want its graph, and all we really want from the sampled data is call stacks, and those are on the summary table. That means that we now make our first use of the buttons on the right side of the graph’s title bar. The three left buttons let you control whether to display “graph and table”, “graph only”, or “table only”. The default is graph only, but for CPU Usage (Sampled) I think we want table only, as shown above. This gives us a table of precise CPU usage, and a graph of sample CPU usage. Perfect.
The default set of columns for CPU Usage (Sampled) is similar to what is was for xperfview, which means that I think that it is wrong. In particular, the stack column is off by default which is peculiar since it is the whole raison d'être for the sampled table. For details on the available columns see my CPU sampling documentation, or just trust me and use this set of columns for a starting point:
- Orange bar
To set the columns click on the View Editor (Ctrl+E) button that is circled in red in the screen shot below:
When you drill in to the sampled data call stacks the graphs are highlighted to show where the samples came from, which can be invaluable data once you learn how to interpret it.
At this point you should have something beautiful like this:
What about wait analysis?
Wait analysis is crucial for solving some types of performance problems and the table necessary for doing this type of analysis is attached to CPU Usage (Precise), since wait analysis uses context switch data. The first step is to enable the graph by clicking “Display graph and table”. Then you will need to use the View Editor (Ctrl+E) to dramatically change the set of columns. A full description of the columns can be found here, or just go with this wait-analysis-appropriate set:
- Orange bar
- TimeSinceLast (us) – Sum (sort by descending)
- Ready (us) – Sum
- Waits (us) – Sum
- Freeze bar
- CPU Usage (ms)
The freeze bar means that CPU Usage (ms) will continue to be displayed as the right-most column even when the window is too narrow for all columns, which makes the summary table useful for an accurate per-process/thread measure of CPU consumption.
I normally have the CPU usage (precise) summary table turned off and I make it visible when I need it. However I always have my favorite column orderings set up, ready and waiting for me.
Profiles-> Save Startup Profile
Once you get everything perfectly arranged, don’t forget to save your startup profile. If you create a profile that is well suited to the type of analysis that your company does then you can share it with your coworkers – the data is stored in an XML file in “Documents\WPA Files\Startup.wpaProfile”.
To save you some time, and to make creating tutorials easier, I’ve created a startup profile that you can install. I don’t claim that it is the one-true-startup-profile, but I think it’s a good place to begin and it contains everything I’ve discussed above. You can find this startup profile at ftp://ftp.cygnus-software.com/pub/WPAStartup.zip. Just download it and unzip the file into your “Documents\WPA Files” directory.
As with xperfview you can zoom in/out with ctrl+mouse-wheel. You can also select a region and then zoom to that region, either in the main window (right-click, Zoom), in a new tab (right-click, Zoom all in new view), or just the current graph in a new tab (right-click, Zoom graph in new view). The workflow is a bit different from xperfview but ultimately the ability to do deep analysis in a single window makes it worthwhile.
In addition to mouse and mouse-wheel based zooming you can use ctrl+”+” and ctrl+”-“ for keyboard based zooming. They don’t zoom around the mouse, but at least I can finally zoom in/out a bit when I don’t have a mouse-wheel available.
WPA continues the fine xperfview tradition of hiding some of the disk analysis graphs. In particular the cool chart showing disk-head movements is now found by opening up the Storage section, right-clicking on Disk Usage, and selecting Disk Offset Graph. There’s something amazing and terrifying about seeing how far the physical read/write head on your poor disk sometimes ends up moving.
WPA has improved the trace analysis experience in several ways. Some of these include:
- Asynchronous symbol loading – keep working during the (sometimes slow) symbol loading process
- Easy saving and sharing of startup profiles
- Deep analysis in one window
- Better display of Generic events
- Ctrl+”+” and ctrl+”-“ for keyboard based zooming
- Highlighting of relevant graph sections – makes viewing profile sample locations easier (more on this later)
WPA hasn’t quite reached its first full release and it does have some frustrating bugs and glitches, particularly around the use of Generic Events. Some of these are serious enough that they force me to use xperfview to extract some information, but I still use WPA for most day-to-day work. Your mileage may vary. The main glitches I have seen are:
- Sorting of numeric fields in Generic Events is incorrect – the sorting is done alphabetically instead of numerically
- Some Generic Events payloads aren’t decoded
- Snapping to events when selecting a time range doesn’t work
- Some parts of the UI still hang or are anomalously unresponsive during symbol loading
The xperf team has been responsive to bug reports and I assume that most of these bugs will be fixed in the final Windows 8 version. Fingers crossed.
I keep finding that ETW/xperf/WPA give me access to information that most other developers don’t have. This lets me find and fix performance problems that are otherwise invisible or intractable. I continue to enjoy having x-ray glasses that actually work.