"Calculation of CPU Utilization" or "What Can be Simpler"?

A few days ago our testing team asked me how they could collect the CPU utilization of our apps and then build cool plots based on that information. My answer was simple: "Give me 15 minutes and I'll send you back a simple instruction or snippet". After 3 days of thinking and experiments I understood that I knew absolutely nothing about CPU time utilization, how to calculate it, and what it means. Anyway, I have received quite interesting results and am ready to share them.

Disclaimer

Like everything in this blog, this post can be absolutely cruel and contain terrible things. Some of them may be done in an absolutely incorrect and irrational way and may even destroy your brain. If you have another opinion or know a more logical way to do something relevant to this topic, I will be very glad if you express your thoughts in comments or pull requests.

Let's Use Standard Tools

Before writing any code, it is a good habit to check how the task has already been solved by smarter people in other applications. The first problem: Windows Task Manager, Sysinternals Process Explorer and Windows Performance Counters show absolutely different values:

  • Task Manager - Processes, average value is 10.5%:

    task-manager-1.png
  • Task Manager - Details, average value is 7%:

    task-manager-2.png
  • Sysinternals Process Explorer, average value is 9.5%:

    process-explorer.png
  • Windows Performance Counters, average value is... 106.854%:

    performance-counter.png

Windows Task Manager was even a leader among them because it showed two significantly different values in different tabs... And yes, there is some noise, but I tried to be honest and took the screenshots at average values.

The most counterintuitive value was shown by the Windows Performance Counter. "% Processor Time, you should be in the range [0, 100]; go home, you are drunk!" was my first thought when I saw it. But if we forget that it is a percentage and try to understand what is behind it, the puzzle is quite easy to solve: it is the ratio between the time the process spent on all cores (16 of them) and the interval (1 second) between measurements. So if we divide it by the number of cores, we get 6.68%, which is very close to the value from Task Manager - Details.
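A quick sanity check of that arithmetic in Python (the 16-core count and the 106.854% reading are the values from my screenshots):

```python
# "% Processor Time" sums CPU time across all cores, so on a multi-core
# machine it can legitimately exceed 100%.
counter_value = 106.854  # raw "% Processor Time" reading from the screenshot
cpu_count = 16           # logical cores on the test machine

normalized = counter_value / cpu_count
print(f"{normalized:.2f}%")  # → 6.68%, close to Task Manager - Details
```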

OK, we already have two votes for the 7% value, and one of the sources can be easily automated. It is also the most optimistic value because it is the lowest of them all; we can even start dreaming that our app takes so few resources that it could run on a calculator!.. On the other hand, my scientific background doesn't allow me to stop here, so let's continue the experiments. It is good that we already have some background knowledge of what the CPU ratio calculation should look like:

CPU = (total_process_time[N] - total_process_time[N-1]) / interval / cpu_count * 100

The Windows API provides the GetProcessTimes function, which reports the time spent on a process, and it has a very nice managed wrapper, Process.TotalProcessorTime. Let's make a small app that calculates the CPU ratio based on that:

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

static void Main(string[] args)
{
    const int REP_INTERVAL = 1000;

    // Find the process of interest
    var serverProcess = Process.GetProcesses().FirstOrDefault(x => x.ProcessName == "Server");
    if (serverProcess == null)
        return;

    var nextReport = Environment.TickCount + REP_INTERVAL;
    var oldProcessorTime = serverProcess.TotalProcessorTime.TotalMilliseconds;

    for (;;)
    {
        // Sleep until the next measurement, if necessary
        var timeToSleep = nextReport - Environment.TickCount;
        if (timeToSleep > 0)
            Thread.Sleep(timeToSleep);
        nextReport += REP_INTERVAL;

        // Calculate the delta of Total Processor Time between measurements
        var newProcessorTime = serverProcess.TotalProcessorTime.TotalMilliseconds;
        Console.WriteLine("CPU {0}%", (newProcessorTime - oldProcessorTime) / REP_INTERVAL / Environment.ProcessorCount * 100);
        oldProcessorTime = newProcessorTime;
    }
}

The output is pretty similar to Task Manager - Details:

CPU 6.640625%
CPU 5.371094%
CPU 5.078125%
CPU 5.371094%
CPU 7.324219%

Three votes for the 7% value against two votes for different sources — maybe we can stop here? Of course not: Sysinternals Process Explorer has too strong a credibility and can still win the vote! At the very least, some googling shows that not everything is fine with GetProcessTimes and performance counters:

  • The author of Why GetThreadTimes is wrong? shows the background of GetProcessTimes/GetThreadTimes and that the values of those functions can be less than the real time the CPU spent on a process.

  • The author of How to get CPU usage of processes and threads has faced the same problem and tries to solve it with some kind of magic: because the error should be distributed quite evenly among all processes, we can calculate the "real cost of 1% of CPU usage" for the whole CPU and apply a correction to the calculated value. Anyway, I think that it is a dead end:

    • It is too magical even for me.

    • The original code shows a crazy value of -1057% for the tested application 😱.

    • The value of the performance counter for the whole processor (Processor Information - % Processor Time - _Total) is still smaller than the value for only one process in Task Manager - Processes, so that kind of magic will not give any results here.

So, even if Task Manager - Processes overestimates the CPU ratio and shows an incorrect value, the correct one should be bigger than a value based on GetProcessTimes, and from that perspective Sysinternals Process Explorer should be very close to it too.

Did I Ever Tell You the Definition of "Insanity"?

OK, we already know that GetThreadTimes lies with the intention to economize the usage of system resources. Do we have an alternative way to calculate real CPU time when we can afford it? Yeeees, because Windows provides a very efficient tracing facility - ETW:

Event Tracing for Windows (ETW) is a high speed tracing facility built into Windows. Using a buffering and logging mechanism implemented in the operating system kernel, ETW provides an infrastructure for events raised by both user mode (apps) and kernel mode components (drivers). ETW can be used for system and app diagnosis, troubleshooting, and performance monitoring.

This highly efficient event log facility:

  • contains information about almost all activities in a lot of applications and, what's important for us, in the CLR Runtime and the Windows Kernel;

  • has outstanding performance and time-measurement accuracy;

  • can be easily used (consumed) in user applications;

  • is so high-speed and cool that it even contains information about all thread CSwitch (context switch) events!!!

And as I said before, you can start using this power very easily.

It can be a little scary that there are a lot of events, but as my experiment shows, even very straightforward exhaustive code like this can handle them:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Session;

class ThreadInfo
{
    public double TotalTime;   // accumulated CPU time since the last report, ms
    public double StartTime;   // timestamp of the switch to this thread; 0 if inactive
}

static void Main(string[] args)
{
    const int REP_INTERVAL = 1000;
    const double TOLERANCE = 0.0001;

    // Find the PID of the process of interest
    var serverProcess = Process.GetProcesses().FirstOrDefault(x => x.ProcessName == "Server");
    if (serverProcess == null)
        return;
    var serverPid = serverProcess.Id;

    var nextReport = Environment.TickCount + REP_INTERVAL;
    var threadInfos = new Dictionary<int, ThreadInfo>();

    // Create ETW session
    using (var session = new TraceEventSession("MySimpleSession")) {
        // "Subscribe" to the following thread events: ThreadStart, ThreadStop, ThreadCSwitch
        session.EnableKernelProvider(KernelTraceEventParser.Keywords.Thread |
                                     KernelTraceEventParser.Keywords.ThreadTime);

        // Sometimes a thread can be stopped without switching away from it, so handle that case
        session.Source.Kernel.ThreadStop += e =>
        {
            if (e.ProcessID == serverPid && threadInfos.TryGetValue(e.ThreadID, out var info))
            {
                if (info.StartTime > TOLERANCE)
                    info.TotalTime += e.TimeStampRelativeMSec - info.StartTime;
                info.StartTime = 0.0;
            }
        };

        // "Heart" of logic
        session.Source.Kernel.ThreadCSwitch += e =>
        {
            // Filter bunch of non-relevant events
            if (e.ProcessID != serverPid && e.OldProcessID != serverPid)
                return;

            // If it is switch from our process
            if (e.OldProcessID == serverPid && threadInfos.TryGetValue(e.OldThreadID, out var info))
            {
                // If a thread was active, increase total time on latest quantum of time
                if (info.StartTime > TOLERANCE)
                    info.TotalTime += e.TimeStampRelativeMSec - info.StartTime;
                // And mark a thread as inactive
                info.StartTime = 0.0;
            }

            // If it is a switch to our process
            if (e.ProcessID == serverPid)
            {
                // Create a new threadInfo item if necessary
                if (!threadInfos.TryGetValue(e.ThreadID, out info))
                {
                    info = new ThreadInfo();
                    threadInfos[e.ThreadID] = info;
                }
                // And mark it as active
                info.StartTime = e.TimeStampRelativeMSec;
            }

            if (Environment.TickCount > nextReport)
            {
                // When it is time to report
                var procTime = 0.0;
                foreach (var entry in threadInfos.Values)
                {
                    // Because some threads can be active at report time, split this time into two parts
                    if (entry.StartTime > TOLERANCE)
                    {
                        // For the previous interval
                        entry.TotalTime += e.TimeStampRelativeMSec - entry.StartTime;
                        // And for the next one
                        entry.StartTime = e.TimeStampRelativeMSec;
                    }
                    // Sum all thread times from latest report
                    procTime += entry.TotalTime;
                    // Clean state
                    entry.TotalTime = 0.0;
                }

                Console.WriteLine("Processor {0}%", procTime / REP_INTERVAL / Environment.ProcessorCount * 100);

                nextReport += REP_INTERVAL;
            }
        };

        // Start ETW event-loop
        session.Source.Process();
    }
}

Yes, there is a little more code than in the previous example but... Dammit, it is an additional vote for Sysinternals Process Explorer!

Threads Time 1403.24430000104
Processor 8.77027687500649%
Threads Time 1498.37270000152
Processor 9.3648293750095%
Threads Time 1512.49889999974
Processor 9.4531181249984%
Threads Time 1361.1477999998
Processor 8.50717374999876%

On the one hand, it is an awesome result: we have found a way to calculate CPU time with the accuracy of Sysinternals Process Explorer, and it already has 6 votes against 3. On the other hand, it's time to start thinking that I am doing everything incorrectly.

CPU Heater Ratio vs CPU Bottleneck Ratio

OK, 10% of CPU usage is a good enough result for an application which has almost exclusive access to all the hardware and shares it only with a few small apps like SQL Server. But what does 10% of CPU usage mean for a modern processor? Just imagine a very simple processor with 16 cores and a very simple OS task scheduler which has a quantum resolution of 62.5 ms and can therefore assign only 16 tasks per core per second. Imagine two different simple applications which run on that CPU and OS with different models of task distribution:

scheduling1.svg

I made them this simple on purpose; in real applications the situation can be more complicated and can even lull you into believing that everything is fine. The main idea is that both applications have the same CPU ratio of 6.25% but demonstrate two opposite cases:

  1. Application "I do some work from time to time and can distribute it very efficiently". In the best case, you can multiply loading of this application by 16 and everything will continue to work fine.

  2. Application "I work as hard as I can, every additional piece of work will lead to ignitions of server hardware and a pair of engineers 😫!!!". This application especially has some big problems but you will not see them on a classic CPU Ratio metric.

So, I made a few experiments... some of them broke down in some cases (app 3 below), but I could formulate a new, good enough metric of CPU usage which is more appropriate for the tested application (in comparison with the old one):

  1. CPU Heater Ratio, or the classic CPU ratio. The formula was already shown above:

    ratio = (total_process_time[N] - total_process_time[N-1]) / interval / cpu_count * 100
  2. CPU Bottleneck Ratio, or CPU Antiratio (remember where you are):

    antiratio = (time when at least one thread is working) / interval * 100

That formula can be a little unintuitive, so let's look at other samples and collect all the ratio and antiratio values in one table:

scheduling2.svg
ratio vs antiratio comparison:

           ratio     antiratio
    app 1  6.25%     6.25%
    app 2  6.25%     100%
    app 3  6.25%     100%
    app 4  28.12%    50%
This table clearly shows that the antiratio is much more useful for revealing applications of the second kind than the classic ratio. It is not ideal, and most probably a deeper analysis will be necessary when a problem actually happens, but it is still very good for an overall understanding of what happens with your process and how close it is to failure, and it can be used for constant monitoring of your application.

To implement it, let's introduce a new term, "simultaneity" - the number of simultaneously running threads at a given moment of time. To make it clearer, here is an image which demonstrates it:

scheduling3.svg

Sorry, 10 and 14 made all the pictures overflow, so they are shown as A and D; read them just like the usual decimal values. Simultaneity can be represented in a one-dimensional array, and based on it we can efficiently calculate both ratios:

using System;
using System.Diagnostics;
using System.Linq;
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Session;

static void Main(string[] args)
{
    const int REP_INTERVAL = 1000;

    // Find the PID of the process of interest
    var serverProcess = Process.GetProcesses().FirstOrDefault(x => x.ProcessName == "Server");
    if (serverProcess == null)
        return;
    var serverPid = serverProcess.Id;

    var nextReport = Environment.TickCount + REP_INTERVAL;

    var concurrency = new double[Environment.ProcessorCount + 1];
    var concurrencyLast = 0.0;
    var concurrencyCounter = 0;

    // Create an ETW session
    using (var session = new TraceEventSession("MySimpleSession")) {
        // "Subscribe" to the following thread events: ThreadStart, ThreadStop, ThreadCSwitch
        session.EnableKernelProvider(KernelTraceEventParser.Keywords.Thread |
                                     KernelTraceEventParser.Keywords.ThreadTime);

        void SwitchOn(double timestamp)
        {
            // When we switch to a thread of our process, credit the elapsed time to the old simultaneity bucket
            concurrency[concurrencyCounter] += timestamp - concurrencyLast;
            // Remember the timestamp of the latest event
            concurrencyLast = timestamp;
            // And switch to the new simultaneity value
            concurrencyCounter += 1;
        }

        void SwitchOff(double timestamp)
        {
            // When we switch away from a thread of our process, credit the elapsed time to the old simultaneity bucket
            concurrency[concurrencyCounter] += timestamp - concurrencyLast;
            // Remember the timestamp of the latest event
            concurrencyLast = timestamp;
            // And switch to the new simultaneity value... with a bit of magic which is equivalent to
            // concurrencyCounter = concurrencyCounter > 0 ? concurrencyCounter - 1 : 0; but a little faster (branch-free)
            concurrencyCounter = concurrencyCounter + (-concurrencyCounter >> 31);
        }

        // Sometimes a thread can be stopped without switching away from it, so handle that case
        session.Source.Kernel.ThreadStop += e =>
        {
            if (e.ProcessID == serverPid)
                SwitchOff(e.TimeStampRelativeMSec);
        };

        session.Source.Kernel.ThreadCSwitch += e =>
        {
            // Filter bunch of non-relevant events
            if (e.ProcessID != serverPid && e.OldProcessID != serverPid)
                return;

            if (e.OldProcessID == serverPid)
                SwitchOff(e.TimeStampRelativeMSec);
            if (e.ProcessID == serverPid)
                SwitchOn(e.TimeStampRelativeMSec);

            if (Environment.TickCount > nextReport)
            {
                var procTimeTotal = 0.0;
                var procTimeTotalAnti = 0.0;
                for (var i = 1; i < concurrency.Length; ++i)
                {
                    // For ratio summarize it with keeping simultaneity in mind
                    procTimeTotal += i * concurrency[i];
                    // For antiratio summarize it in a "flat" way
                    procTimeTotalAnti += concurrency[i];
                }

                // Divide by Environment.ProcessorCount for ratio
                Console.WriteLine("Processor {0}%", procTimeTotal / REP_INTERVAL / Environment.ProcessorCount * 100);
                // And ignore Environment.ProcessorCount for antiratio
                Console.WriteLine("ProcessorAnti {0}%", procTimeTotalAnti / REP_INTERVAL  * 100);
                // Reset state
                Array.Clear(concurrency, 0, concurrency.Length);

                nextReport += REP_INTERVAL;
            }
        };

        // Start ETW event-loop
        session.Source.Process();
    }
}
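A side note on the bit trick in SwitchOff: for a non-negative 32-bit counter, -x >> 31 is an arithmetic shift that yields -1 when x > 0 and 0 when x == 0, so the whole expression is a branch-free "decrement, but not below zero". A quick check in Python (whose >> is also arithmetic on negative integers, so the same expression works for values below 2^31):

```python
def saturating_decrement(x):
    # Branch-free equivalent of: x - 1 if x > 0 else 0 (valid for 0 <= x < 2**31)
    return x + (-x >> 31)

for x in [0, 1, 2, 5]:
    print(x, "->", saturating_decrement(x))  # 0 -> 0, 1 -> 0, 2 -> 1, 5 -> 4
```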

This algorithm formed the basis of my project antimetrics, which also contains other useful metrics for profiling and monitoring networked multi-threaded applications.

Conclusion

I absolutely had no idea what I was walking into a few days ago, but I'm glad that I found at least something. It should teach me that not every task can be solved in a 15-minute interval and that small creatures can have an MVP behind them, but right now I need to go and find other useful metrics and ways to monitor our application. So, it is a good moment to unsubscribe from the blog updates, leave a comment that the whole post is bullshit, or at least downvote the post. And as always, happy hacking!
