A while back I wrote a post, Analyzing Linux System Performance and Finding Bottlenecks. I didn't really give a good explanation of how to determine whether you are CPU bound, so I am writing this post to clear that up.

As I noted previously, the sysstat package can provide a wealth of information. One of my favorite utilities is sar. By default, sar outputs a CPU utilization report. Let's take a look at some example output.
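
For reference, something like the following pulls this report from the day's sysstat data file on a typical install; the path under /var/log/sa and the sa15 file name (the day of the month) are just placeholders, and plain sar with no arguments will show today's data as long as collection is enabled.

sar -u -f /var/log/sa/sa15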

07:55:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
08:05:01 AM       all      3.95      0.00      2.22      0.13      0.00     93.70
08:15:01 AM       all      4.32      0.00      2.37      0.08      0.00     93.23
08:25:01 AM       all      5.70      0.00      2.79      0.06      0.00     91.44
08:35:01 AM       all      6.13      0.00      3.85      0.07      0.00     89.94
08:45:01 AM       all     10.27      0.00      6.07      0.09      0.00     83.56
08:55:01 AM       all     13.50      0.00      8.15      0.21      0.00     78.14
09:05:01 AM       all     14.53      0.00     10.39      0.30      0.00     74.78
09:15:01 AM       all     12.24      0.00      9.02      0.44      0.00     78.30
09:25:01 AM       all     12.66      0.00      8.80      0.50      0.00     78.03
09:35:01 AM       all     12.67      0.00      9.43      0.49      0.00     77.40
09:45:01 AM       all     13.51      0.00      9.62      0.44      0.00     76.43
09:55:01 AM       all     13.67      0.00     10.66      0.59      0.00     75.08
10:05:01 AM       all     14.47      0.00     10.99      0.66      0.00     73.88
10:15:01 AM       all     12.32      0.00      9.15      0.44      0.00     78.09
10:25:01 AM       all     25.71      0.00     14.17      0.51      0.00     59.61
10:35:01 AM       all     13.65      0.00     10.04      1.30      0.00     75.01
10:45:01 AM       all     12.36      0.00      8.86      0.65      0.00     78.12
10:55:03 AM       all     25.34      0.00     19.41      0.56      0.00     54.69
11:05:01 AM       all     24.10      0.00     19.04      0.67      0.00     56.19
11:15:01 AM       all     16.63      0.00     12.51      0.60      0.00     70.26
11:25:01 AM       all     25.83      0.00     22.10      1.73      0.00     50.34
11:35:01 AM       all     22.80      0.00     16.90      1.06      0.00     59.24
11:45:01 AM       all     31.48      0.00     21.74      1.08      0.00     45.69
11:55:01 AM       all     18.10      0.00     13.53      0.82      0.00     67.55
12:05:02 PM       all     19.07      0.00     14.74      0.94      0.00     65.26
12:15:01 PM       all     20.48      0.00     16.32      1.00      0.00     62.19
12:25:01 PM       all     23.83      0.00     20.03      0.80      0.00     55.33
12:35:01 PM       all     22.97      0.00     18.57      1.43      0.00     57.03
12:45:02 PM       all     25.65      0.00     20.55      0.67      0.00     53.12

The quickest way to see if you need more horsepower is to check the idle column. You can see in my example output that as the work day starts the server gets a bit stressed, and by the time we reach midday I've only got about 50% idle.
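
If you don't want to eyeball every row, a quick awk filter over the same report can flag the stressed intervals for you. This is only a sketch: it assumes %idle is the last column (field positions can shift between sysstat versions), and the 60% threshold is just what happens to be interesting on this box.

sar -u | awk 'NR > 3 && NF && $1 != "Average:" && $NF+0 < 60 { print $1, $2, "idle:", $NF }'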

You might think that is a bit low. Really, though, most servers are underutilized. I'm sure you've heard the oft-quoted figure that over 80% of production servers run at less than 20% capacity. That is a good reason to look into server consolidation with a virtualization platform, perhaps something like Xen.

Since this is a virtual machine and I have spare cores that could be assigned, let's keep investigating whether I have a CPU bottleneck.
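
As a quick sanity check on what the guest currently sees, nproc (or lscpu for a bit more detail) shows how many vCPUs are assigned right now; nothing fancy, just a baseline before deciding to add more.

nproc
lscpu | grep '^CPU(s):'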

We have already established that an idle time of 50% isn't too bad. An idle time habitually below 20% indicates that the system will not be able to handle a much higher load. If your idle time is regularly below 5%, you might need to add some processing power.
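
Keep in mind the report above is historical data sampled every ten minutes; if you want to see whether idle is habitually that low right now, sar will happily take live samples too, e.g. one every 60 seconds for ten minutes:

sar -u 60 10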

A low idle time in and of itself does not mean you need shiny new CPUs (or, in this case, additional virtual CPUs). In addition to idle, you should look at several other metrics to get a clear picture. First off, if your iowait is high you can stop looking for a CPU bottleneck until you have tracked down your I/O bottleneck (faster hard drives, more drives in your RAID, or moving the disk-heavy service to a faster box). An iowait consistently higher than about 15% likely indicates an I/O bottleneck. (I have seen some badly neglected servers with iowait in the high 60s. Boy were they pokey, especially the IMAP server that was in that state.)
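
If iowait does look high, iostat (also from the sysstat package) is a good next stop; the extended report shows per-device utilization and wait times so you can see which disk is actually the problem. Note that the first report it prints is an average since boot, so pay attention to the later ones.

iostat -x 5 3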

Moving on past the quick iowait check: even if your idle time is really low, you can check whether the run queue is heavy with sar -q.
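
As before, this came from the day's data file; running sar -q against the same sa file (same placeholder name as above) lines it up with the CPU report.

sar -q -f /var/log/sa/sa15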

07:55:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
08:05:01 AM         1        52      0.08      0.04      0.00
08:15:01 AM         1        53      0.06      0.02      0.00
08:25:01 AM         1        64      0.00      0.00      0.00
08:35:01 AM         1        55      0.09      0.05      0.01
08:45:01 AM         1        52      0.06      0.07      0.01
08:55:01 AM         1        52      0.00      0.00      0.00
09:05:01 AM         1        54      0.01      0.05      0.01
09:15:01 AM         0        53      0.01      0.06      0.01
09:25:01 AM         1        55      0.01      0.03      0.00
09:35:01 AM         1        54      0.04      0.05      0.00
09:45:01 AM         1        53      0.00      0.00      0.00
09:55:01 AM         1        80      0.16      0.08      0.01
10:05:01 AM         1        61      0.07      0.06      0.00
10:15:01 AM         3        62      0.06      0.10      0.05
10:25:01 AM         1        55      0.06      0.13      0.10
10:35:01 AM         1        59      0.17      0.12      0.09
10:45:01 AM         1        55      0.00      0.01      0.04
10:55:01 AM         1        74      0.23      0.17      0.09
11:05:01 AM         1        61      0.08      0.07      0.07
11:15:01 AM         1        56      0.16      0.09      0.07
11:25:01 AM         0        53      0.12      0.12      0.08
11:35:01 AM         1        53      0.01      0.04      0.06
11:45:01 AM         1        59      0.01      0.06      0.07
11:55:01 AM         1        60      0.11      0.08      0.07
12:05:01 PM         1        79      0.30      0.16      0.11
12:15:01 PM         1        63      0.03      0.07      0.08
12:25:01 PM         1        71      0.10      0.06      0.07
12:35:01 PM         1        58      0.01      0.03      0.03
12:45:01 PM         1        66      0.07      0.08      0.03

Here is the run queue for the same time period. We are looking for a runq-sz (number of tasks waiting for run time) consistently greater than 2. Well, in my case I look okay. I'm starting to see an increase in demand for CPU time, but it's being handled well. Still, since I have some extra cores, and since I know this server has not yet seen the busy part of the year, I will probably attach an extra vCPU so I have less to worry about in the coming months.
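
For what it's worth, if this happened to be a Xen guest as hinted at above, attaching that vCPU is a one-liner on the host, provided the domain was configured with a high enough maximum vCPU count. The domain name and count here are made up for illustration; older toolstacks use xm vcpu-set instead of xl.

xl vcpu-set mydomain 3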

Whew, it took a bit to spit that out. Hope you enjoyed it!

Note:

A friend of mine (thanks, Scott) commented on something that I neglected to point out: you might want to do a bit of application profiling. Something like a database that has not had OPTIMIZE TABLE or VACUUM run on it in a long time may present as high iowait. Moving the service might very well alleviate your problem, because in the act of moving it you may defragment the database as you put it on the shiny new hardware. The point is that when you are looking for performance bottlenecks, metrics can't really tell you whether some service or code you have written is simply inefficient, whether from poor code or from lack of maintenance. Just be careful not to sell the farm for a new server when the old one still works just fine.
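
On that note, the maintenance itself is usually far cheaper than new hardware. As an example (and only an example; run it in a quiet window, with backups), something like the following covers the neglected-database case Scott was describing, for MySQL and PostgreSQL respectively:

mysqlcheck -o --all-databases
vacuumdb --all --analyze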