"Calculation of CPU Utilization" or "What Can be Simpler"?
A few days ago our testing team asked me how they could collect the CPU utilization of our apps and then build cool plots based on that information. My answer was simple: "Give me 15 minutes and I'll send you back a short instruction or a snippet". After 3 days of thinking and experiments I understood that I knew absolutely nothing about CPU time utilization, how to calculate it and what it means. Anyway, I got quite interesting results and I'm ready to share them.
Disclaimer
Like everything in this blog, this post can be absolutely cruel and contain terrible things. Some of them may be done in an absolutely incorrect and irrational way and may even destroy your brain. If you have another opinion or know a more logical way to do something relevant to the current topic, I will be very glad if you express your thoughts in comments or pull requests.
Let's Use Standard Tools
Before writing any code, it is a good habit to check how the task has already been solved by smarter people in other applications. And the first problem appeared right away: Windows Task Manager, Sysinternals Process Explorer and Windows Performance Counters show completely different values:
- Task Manager - Processes, average value is 10.5%;
- Task Manager - Details, average value is 7%;
- Sysinternals Process Explorer, average value is 9.5%;
- Windows Performance Counters, average value is... 106.854%.
Windows Task Manager was even the leader among them because it managed to show two significantly different values in two different tabs... And yes, there is some noise, but I tried to be honest and took the screenshots at average values.
The most counterintuitive value was shown by Windows Performance Counter.
"CPU Time, you should be in range [0, 100], go home, you are drunk!" was my first thought when I saw it.
But if we forget that it is a percentage and try to understand what might be behind it, the puzzle is easy to solve: it is the ratio of the time the process spent on all cores (16 in my case) to the interval between measurements (1 second). So if we divide it by the number of cores, we get 6.67%, which is very close to the value of Task Manager - Details.
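For completeness, here is a minimal sketch of that normalization, assuming the counter instance name matches the process name used in this post ("Server") and that the System.Diagnostics.PerformanceCounter type is available (it ships with .NET Framework; on .NET Core it comes from the System.Diagnostics.PerformanceCounter package):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class CounterDemo
{
    static void Main()
    {
        // "Process \ % Processor Time" reports the sum over all cores,
        // so it can legitimately exceed 100% on a multi-core machine.
        // The instance name "Server" is an assumption matching the process from this post.
        var counter = new PerformanceCounter("Process", "% Processor Time", "Server");

        counter.NextValue();    // the first call only primes the counter and returns 0
        Thread.Sleep(1000);     // measure over a one-second interval

        var raw = counter.NextValue();                      // e.g. ~106.854 on a 16-core box
        var normalized = raw / Environment.ProcessorCount;  // ~6.67%, comparable to Task Manager - Details

        Console.WriteLine("Raw counter: {0:F3}%", raw);
        Console.WriteLine("Normalized:  {0:F2}%", normalized);
    }
}
```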
OK, we already have two votes for the 7% value, and one of the sources can be easily automated. It is also the most optimistic value because it is the lowest of them all; we can even start dreaming that our app consumes so few resources that it could run on a calculator!..
On the other hand, my scientific background doesn't allow me to stop here, so let's continue the experiments.
It is good that we already have some background knowledge of what the CPU ratio calculation should look like:
CPU = (total_process_time[N] - total_process_time[N-1]) / interval / cpu_count * 100
The Windows API provides the GetProcessTimes function, which reports how much time has been spent on a process, and it has a very nice managed wrapper: Process.TotalProcessorTime. Let's make a small app that calculates the CPU ratio based on it:
```csharp
static void Main(string[] args)
{
    const int REP_INTERVAL = 1000;

    // Find the process of interest
    var serverProcess = Process.GetProcesses().FirstOrDefault(x => x.ProcessName == "Server");
    if (serverProcess == null)
        return;

    var nextReport = Environment.TickCount + REP_INTERVAL;
    var oldProcessorTime = serverProcess.TotalProcessorTime.TotalMilliseconds;
    for (;;)
    {
        // Calculate delta of Total Process Time between measurements
        var newProcessorTime = serverProcess.TotalProcessorTime.TotalMilliseconds;
        Console.WriteLine("CPU {0}%",
            (float)(newProcessorTime - oldProcessorTime) / REP_INTERVAL / Environment.ProcessorCount * 100);
        oldProcessorTime = newProcessorTime;

        // Sleep until next measurement, if it is necessary
        nextReport += REP_INTERVAL;
        var timeToSleep = nextReport - Environment.TickCount;
        if (timeToSleep > 0)
            Thread.Sleep(timeToSleep);
    }
}
```
The output is pretty similar to Task Manager - Details:
```
CPU 6.640625%
CPU 5.371094%
CPU 5.078125%
CPU 5.371094%
CPU 7.324219%
```
Three votes for the 7% value against two votes for other sources; maybe we can stop here?
Of course not: Sysinternals Process Explorer has too much credibility, its word alone is worth 5 votes and it can still win!
At the very least, some googling shows that not everything is fine with GetProcessTimes and performance counters:
- The author of Why GetThreadTimes is wrong? shows the background of GetProcessTimes/GetThreadTimes and that the values returned by these functions can be less than the real time the CPU spends on a process (a small sketch after this list demonstrates the effect).
- The author of How to get CPU usage of processes and threads has faced the same problem and tries to solve it with some kind of magic: because the error should be distributed quite evenly among all processes, we can calculate the "real cost of 1% of CPU usage" for the whole CPU and apply a correction to the calculated value. Anyway, I think it is a dead end:
  - It is too magical even for me.
  - The original code shows a crazy value for the tested application: -1057% 😱.
  - The value of the performance counter for the whole processor (Processor Information - % Processor Time - _Total) is still smaller than the value for just one process in Task Manager - Processes, so this kind of magic will not give any results here.
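To see the under-reporting with your own eyes, here is a minimal toy repro of my own (not code from those articles): it burns short bursts of CPU separated by sleeps, so on systems where thread times are sampled at timer ticks the value reported through Process.TotalProcessorTime (i.e. GetProcessTimes) can come out noticeably lower than the CPU time actually burned; the exact gap depends on the OS version and timer resolution, so treat the numbers as an illustration only:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class UnderReportDemo
{
    // Burn roughly `ms` milliseconds of CPU in one short busy-spin burst
    static void Burn(double ms)
    {
        var sw = Stopwatch.StartNew();
        while (sw.Elapsed.TotalMilliseconds < ms) { /* spin */ }
    }

    static void Main()
    {
        var process = Process.GetCurrentProcess();
        var before = process.TotalProcessorTime;
        var wall = Stopwatch.StartNew();

        // 1000 bursts of ~2 ms of real work, each followed by a short sleep,
        // so the thread is often asleep when the accounting tick fires
        for (var i = 0; i < 1000; i++)
        {
            Burn(2);
            Thread.Sleep(1);
        }

        var reported = (process.TotalProcessorTime - before).TotalMilliseconds;
        Console.WriteLine("Wall clock:               {0:F0} ms", wall.Elapsed.TotalMilliseconds);
        Console.WriteLine("CPU actually burned:      ~{0} ms", 1000 * 2);
        Console.WriteLine("GetProcessTimes reported: {0:F0} ms", reported);
    }
}
```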
So, even if Task Manager - Processes overestimates the CPU ratio and shows an incorrect value, the correct one should be bigger than a value based on GetProcessTimes, and from that perspective Sysinternals Process Explorer should be very close to it too.
Did I Ever Tell You the Definition of "Insanity"?
OK, we already know that GetThreadTimes lies, with the intention of economizing on system resources.
Do we have an alternative way to calculate the real CPU time when we can afford it?
Yeeees, because Windows provides a very efficient tracing facility - ETW:
Event Tracing for Windows (ETW) is a high speed tracing facility built into Windows. Using a buffering and logging mechanism implemented in the operating system kernel, ETW provides an infrastructure for events raised by both user mode (apps) and kernel mode components (drivers). ETW can be used for system and app diagnosis, troubleshooting, and performance monitoring.
This highly efficient event log facility:

- contains information about almost all activities in a lot of applications and, what's important for us, in the CLR runtime and the Windows kernel;
- has outstanding performance and time-measurement accuracy;
- can be easily consumed in user applications;
- is so high-speed and cool that it even contains information about all thread CSwitch events!!!
And as I said before, you can start using this power very easily:

- watch the awesome presentation Building your own profiling and diagnosis tools with Event Tracing for Windows by Pavel Yosifovich, who is an expert in Windows internals and has even written a book about it;
- read the documentation... or throw it away and look at the samples for the awesome C# library Tracing.TraceEvent, which is described in the presentation above and is the basis of the very powerful CPU and memory performance-analysis tool PerfView.
It can be a little scary that there are a lot of events, but as my experiment shows, even very naive, exhaustive code like this can handle them:
```csharp
class ThreadInfo
{
    public double TotalTime;
    public double StartTime;
}

static void Main(string[] args)
{
    const int REP_INTERVAL = 1000;
    const double TOLERANCE = 0.0001;

    // Find the PID of the process of interest
    var serverProcess = Process.GetProcesses().FirstOrDefault(x => x.ProcessName == "Server");
    if (serverProcess == null)
        return;
    var serverPid = serverProcess.Id;

    var nextReport = Environment.TickCount + REP_INTERVAL;
    var threadInfos = new Dictionary<int, ThreadInfo>();

    // Create an ETW session
    using (var session = new TraceEventSession("MySimpleSession"))
    {
        // "Subscribe" to the following thread events: ThreadStart, ThreadStop, ThreadCSwitch
        session.EnableKernelProvider(KernelTraceEventParser.Keywords.Thread | KernelTraceEventParser.Keywords.ThreadTime);

        // Sometimes a thread can be stopped without switching away from it, so handle that too
        session.Source.Kernel.ThreadStop += e =>
        {
            if (e.ProcessID == serverPid && threadInfos.TryGetValue(e.ThreadID, out var info))
            {
                if (info.StartTime > TOLERANCE)
                    info.TotalTime += e.TimeStampRelativeMSec - info.StartTime;
                info.StartTime = 0.0;
            }
        };

        // The "heart" of the logic
        session.Source.Kernel.ThreadCSwitch += e =>
        {
            // Filter out the bunch of non-relevant events
            if (e.ProcessID != serverPid && e.OldProcessID != serverPid)
                return;

            // If it is a switch away from our process
            if (e.OldProcessID == serverPid && threadInfos.TryGetValue(e.OldThreadID, out var info))
            {
                // If the thread was active, add the latest quantum of time to its total time
                if (info.StartTime > TOLERANCE)
                    info.TotalTime += e.TimeStampRelativeMSec - info.StartTime;
                // And mark the thread as inactive
                info.StartTime = 0.0;
            }

            // If it is a switch to our process
            if (e.ProcessID == serverPid)
            {
                // Create a new ThreadInfo item if necessary
                if (!threadInfos.TryGetValue(e.ThreadID, out info))
                {
                    info = new ThreadInfo();
                    threadInfos[e.ThreadID] = info;
                }
                // And mark it as active
                info.StartTime = e.TimeStampRelativeMSec;
            }

            if (Environment.TickCount > nextReport)
            {
                // When it is time to report
                var procTime = 0.0;
                foreach (var entry in threadInfos.Values)
                {
                    // Some threads can still be active at report time, so split their time into two parts:
                    if (entry.StartTime > TOLERANCE)
                    {
                        // one for the previous interval
                        entry.TotalTime += e.TimeStampRelativeMSec - entry.StartTime;
                        // and one for the next interval
                        entry.StartTime = e.TimeStampRelativeMSec;
                    }
                    // Sum all thread times since the latest report
                    procTime += entry.TotalTime;
                    // Clean the state
                    entry.TotalTime = 0.0;
                }
                Console.WriteLine("Processor {0}%", procTime / REP_INTERVAL / Environment.ProcessorCount * 100);
                nextReport += REP_INTERVAL;
            }
        };

        // Start the ETW event loop
        session.Source.Process();
    }
}
```
Yes, there is a little more code than in the previous example but... Dammit, it is an additional vote for Sysinternals Process Explorer!
```
Threads Time 1403.24430000104
Processor 8.77027687500649%
Threads Time 1498.37270000152
Processor 9.3648293750095%
Threads Time 1512.49889999974
Processor 9.4531181249984%
Threads Time 1361.1477999998
Processor 8.50717374999876%
```
On the one hand, it is an awesome result: we have found a way to calculate CPU time with the accuracy of Sysinternals Process Explorer, and it now has 6 votes against 3, but... On the other hand, it's time to start thinking that I'm doing everything incorrectly.
CPU Heater Ratio vs CPU Bottleneck Ratio
OK, 10% of CPU usage is a good enough result for an application which has almost exclusive access to all the hardware and shares it only with a few small apps like SQL Server. But what does 10% of CPU usage mean for a modern processor? Just imagine a very simple processor with 16 cores and a very simple OS task scheduler with a quantum resolution of 62.5 ms, which can therefore assign only 16 quanta per core per second. Now imagine two different simple applications running on that CPU and OS with different models of task distribution:
I made them this simple on purpose; in real applications the situation can be more complicated and can even lull you into believing that everything is fine.
The main idea behind these apps is that both have the same CPU ratio, 6.25%, but demonstrate two opposite cases (a toy calculation after the list puts numbers on both of them):
Application "I do some work from time to time and can distribute it very efficiently". In the best case, you can multiply loading of this application by 16 and everything will continue to work fine.
Application "I work as hard as I can, every additional piece of work will lead to ignitions of server hardware and a pair of engineers 😫!!!". This application especially has some big problems but you will not see them on a classic CPU Ratio metric.
So, I made a few experiments... some of them broke down in some cases (app 3 below), but I could formulate a new, good enough metric of CPU usage which is more appropriate for the tested application (in comparison with the old one):
- CPU Heater Ratio, or the classic CPU ratio. The formula was already given above:

  ratio = (total_process_time[N] - total_process_time[N-1]) / interval / cpu_count * 100

- CPU Bottleneck Ratio, or CPU antiratio:

  antiratio = (time when at least one thread is working) / interval * 100
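And here is a minimal sketch of the antiratio formula itself, computed from hypothetical per-thread busy intervals (the numbers are made up; the ETW-based implementation comes later). The key step is taking the union of the intervals so that overlapping threads are not counted twice:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class AntiratioFromIntervals
{
    // (Start, End) in ms, relative to the measurement interval; intervals may overlap
    static double AtLeastOneThreadBusy(IEnumerable<(double Start, double End)> busy)
    {
        double covered = 0, cursor = double.NegativeInfinity;
        foreach (var (start, end) in busy.OrderBy(i => i.Start))
        {
            // Only count the part of this interval that is not already covered
            var from = Math.Max(start, cursor);
            if (end > from) covered += end - from;
            cursor = Math.Max(cursor, end);
        }
        return covered;
    }

    static void Main()
    {
        const double interval = 1000.0;
        // Two threads of the same process with partially overlapping bursts (hypothetical numbers)
        var busy = new[] { (0.0, 200.0), (150.0, 400.0), (900.0, 1000.0) };

        var antiratio = AtLeastOneThreadBusy(busy) / interval * 100;
        Console.WriteLine("antiratio = {0:F1}%", antiratio);   // union = 400 + 100 = 500 ms -> 50%
    }
}
```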
That formula can be a little unintuitive, so let's look at the other samples and collect all the ratio and antiratio values in one table:
|       | ratio  | antiratio |
|-------|--------|-----------|
| app 1 | 6.25%  | 6.25%     |
| app 2 | 6.25%  | 100%      |
| app 3 | 6.25%  | 100%      |
| app 4 | 28.12% | 50%       |
This table clearly shows that the antiratio is much more useful for revealing applications of the second type than the classic ratio. It is not ideal, and most probably a deeper analysis will be necessary when a problem happens, but it is still very good for an overall understanding of what happens to your process and how close it is to failure, and it can be used for constant monitoring of your application.
To implement it, let's introduce a new term, "simultaneity": the number of simultaneously running threads at some moment of time. To make it clearer, here is an image which demonstrates it:
Sorry, the labels 10 and 14 made all the pictures float, so read A as 10 and D as 14, just like the usual decimal values.
This can be represented as a one-dimensional array, and based on it we can efficiently calculate both ratios:
```csharp
static void Main(string[] args)
{
    const int REP_INTERVAL = 1000;

    // Find the PID of the process of interest
    var serverProcess = Process.GetProcesses().FirstOrDefault(x => x.ProcessName == "Server");
    if (serverProcess == null)
        return;
    var serverPid = serverProcess.Id;

    var nextReport = Environment.TickCount + REP_INTERVAL;
    var concurrency = new double[Environment.ProcessorCount + 1];
    var concurrencyLast = 0.0;
    var concurrencyCounter = 0;

    // Create an ETW session
    using (var session = new TraceEventSession("MySimpleSession"))
    {
        // "Subscribe" to the following thread events: ThreadStart, ThreadStop, ThreadCSwitch
        session.EnableKernelProvider(KernelTraceEventParser.Keywords.Thread | KernelTraceEventParser.Keywords.ThreadTime);

        void SwitchOn(double timestamp)
        {
            // When we switch to a thread of our process, add the elapsed time to the old simultaneity level
            concurrency[concurrencyCounter] += timestamp - concurrencyLast;
            // Reset the timestamp of the latest event
            concurrencyLast = timestamp;
            // And switch to the new simultaneity value
            concurrencyCounter += 1;
        }

        void SwitchOff(double timestamp)
        {
            // When we switch away from a thread of our process, add the elapsed time to the old simultaneity level
            concurrency[concurrencyCounter] += timestamp - concurrencyLast;
            // Reset the timestamp of the latest event
            concurrencyLast = timestamp;
            // And switch to the new simultaneity value... with a bit of magic which is equivalent to
            // concurrencyCounter = concurrencyCounter > 0 ? concurrencyCounter - 1 : 0;, but a little bit faster
            concurrencyCounter = concurrencyCounter + (-concurrencyCounter >> 31);
        }

        // Sometimes a thread can be stopped without switching away from it, so handle that too
        session.Source.Kernel.ThreadStop += e =>
        {
            if (e.ProcessID == serverPid)
                SwitchOff(e.TimeStampRelativeMSec);
        };

        session.Source.Kernel.ThreadCSwitch += e =>
        {
            // Filter out the bunch of non-relevant events
            if (e.ProcessID != serverPid && e.OldProcessID != serverPid)
                return;

            if (e.OldProcessID == serverPid)
                SwitchOff(e.TimeStampRelativeMSec);
            if (e.ProcessID == serverPid)
                SwitchOn(e.TimeStampRelativeMSec);

            if (Environment.TickCount > nextReport)
            {
                var procTimeTotal = 0.0;
                var procTimeTotalAnti = 0.0;
                for (var i = 1; i < concurrency.Length; ++i)
                {
                    // For the ratio, sum with the simultaneity level in mind
                    procTimeTotal += i * concurrency[i];
                    // For the antiratio, sum in a "flat" way
                    procTimeTotalAnti += concurrency[i];
                }
                // Divide by Environment.ProcessorCount for the ratio
                Console.WriteLine("Processor {0}%", procTimeTotal / REP_INTERVAL / Environment.ProcessorCount * 100);
                // And ignore Environment.ProcessorCount for the antiratio
                Console.WriteLine("ProcessorAnti {0}%", procTimeTotalAnti / REP_INTERVAL * 100);
                // Reset the state
                Array.Clear(concurrency, 0, concurrency.Length);
                nextReport += REP_INTERVAL;
            }
        };

        // Start the ETW event loop
        session.Source.Process();
    }
}
```
This algorithm formed the basis of my project antimetrics, which also contains other useful metrics for profiling and monitoring multi-threaded network applications.
Conclusion
I absolutely had no idea what I was walking into a few days ago, but I'm glad that I've found at least something. It should teach me that not every task can be solved in a 15-minute interval and that small creatures can have an MVP behind them, but right now I need to go and find other useful metrics and ways to monitor our application. So, it is a good moment to unsubscribe from the blog updates, leave a comment that the whole post is bullshit, or at least downvote the post. And as always, happy hacking!