Troubleshooting CPU problems in production for the cloud

  • February 15, 2019

Diagnosing and troubleshooting CPU problems in production in a cloud environment can be both tricky and tedious. Your application might have millions of lines of code, so identifying the exact line that is causing the CPU to spike is like finding a needle in a haystack. In this article, we’ll learn how to find that needle in a matter of seconds.

To help readers better understand this troubleshooting technique, we built a sample application and deployed it on an Amazon Elastic Compute Cloud (Amazon EC2) instance. Once this application was launched, it caused CPU consumption to spike to 199.1%. Now, let’s walk through the steps that we followed to troubleshoot this problem. There are three simple steps:

  1. Identify the threads that consume CPU
  2. Capture thread dumps
  3. Identify the lines of code that are causing the CPU to spike up

Let’s dive right in!

1. Identify the threads that consume CPU

On the EC2 instance, multiple processes could be running. The first step is to identify the process that is causing the CPU to spike. The best way to do this is to use the top command, which is available on *nix operating systems.

Issue the top command from the console:

$ top 

This command displays the processes that are running on the Amazon EC2 instance, sorted so that the processes consuming the most CPU appear at the top. When we issued the command on the Amazon EC2 instance, we got the following output:

Fig: ‘top’ command issued from an AWS EC2 instance

From the output, you should notice that process #31294 is consuming 199.1% of the CPU. That’s a pretty high consumption. (On Linux, top by default reports this figure relative to a single core, so a value above 100% means the process is keeping more than one core busy.) So, now we have identified the process on the Amazon EC2 instance that is causing the CPU to spike. The next step is to identify the threads in this process that are responsible.

Issue the command top -H -p {pid} from the console. For example:

$ top -H -p 31294 

This command displays the threads of process #31294 along with the CPU that each one consumes. When we issued this command on the Amazon EC2 instance, we saw the following output:

Fig: top -H -p {pid} command issued from an AWS EC2 instance

From this output, you should notice that:

  • Thread ID #31306 consumes 69.3% of CPU
  • Thread ID #31307 consumes 65.6% of CPU
  • Thread ID #31308 consumes 64.0% of CPU

The remaining threads all consume a negligible amount of CPU.

This is a good step forward, as we have identified the threads that are causing the CPU to spike. In the next step, we need to capture thread dumps to identify the lines of code that are causing the CPU to spike.
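One practical detail before moving on: top -H prints thread IDs in decimal, whereas a raw jstack dump typically shows the native thread ID as a hexadecimal nid field (the exact header format can vary between JDK releases). The following is a minimal sketch of that conversion; the class and method names are made up for illustration.

```java
// Converts the decimal thread IDs shown by top -H into the hexadecimal
// form used by the nid field of a jstack thread dump.
final class NidConverter {

    // Hypothetical helper: formats a decimal native thread ID as "0x<hex>".
    static String toNid(long decimalThreadId) {
        return "0x" + Long.toHexString(decimalThreadId);
    }

    public static void main(String[] args) {
        // The three hot thread IDs identified with top -H in this article.
        long[] hotThreads = {31306L, 31307L, 31308L};
        for (long id : hotThreads) {
            System.out.println(id + " -> nid=" + toNid(id));
        }
    }
}
```

For the three threads above, this gives 0x7a4a, 0x7a4b, and 0x7a4c.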

2. Capture thread dumps

A thread dump is a snapshot of all the threads that are present in the application. It reports the thread state, the stack trace (i.e., the code path that the thread is executing), and thread ID-related information for every thread in the application.

There are eight different options for capturing thread dumps; you can choose whichever is most convenient for you. One of the simplest is the jstack tool, which is packaged with the JDK and can be found in the $JAVA_HOME/bin folder. Here’s the command to capture a thread dump:
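To make the contents of a thread dump more concrete, here is a rough sketch that builds a simplified dump from inside a running JVM using the standard java.lang.management API. It is only an illustration of what a dump contains, not a replacement for the tools discussed in this article: the class name is hypothetical, and the ID it prints is the Java-level thread ID, not the native thread ID that top -H shows.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative only: prints name, Java thread ID, state, and stack trace
// for every live thread in the current JVM.
final class InProcessThreadDump {

    static String capture() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked-monitor and ownable-synchronizer details.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append('"')
               .append(" id=").append(info.getThreadId())
               .append(" state=").append(info.getThreadState())
               .append(System.lineSeparator());
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append(System.lineSeparator());
            }
            out.append(System.lineSeparator());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(capture());
    }
}
```

Because this runs inside the application, it is mostly useful for experimenting; for a production process, an external tool such as jstack avoids touching the application code.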

jstack -l {pid} > {file-path} 

Here, pid is the process ID of the application whose thread dump should be captured, and file-path is the path of the file that the thread dump will be written to.

In the example below, the thread dump of the process is written to the /opt/tmp/threadDump.txt file.

jstack -l 31294 > /opt/tmp/threadDump.txt 

3. Identify lines of code that are causing the CPU to spike up

The next step is to analyze the thread dump to identify the lines of code that are causing the CPU to spike. We recommend analyzing thread dumps with fastThread, a free online thread dump analysis tool.

Now, we upload the captured thread dump to the fastThread tool. The tool generates a visual report with multiple sections. There is a search box in the top right corner of the report, where we can enter the IDs of the threads that have been consuming a high amount of CPU, i.e., the thread IDs that we identified in step #1. In this case, those are #31306, #31307, and #31308.

Here’s how the fastThread tool displayed the stack traces of the three threads:
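If you would rather search the raw dump file yourself, the same lookup can be sketched in a few lines of Java. This sketch assumes the dump uses the common jstack layout, in which each thread entry starts with a quoted thread name, carries a hexadecimal nid=0x... field, and is separated from the next entry by a blank line; that layout can differ between JDK releases. The class name and the sample dump are hypothetical and made up for illustration; only the stack frame is taken from this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class DumpSearch {

    // Made-up sample in the assumed jstack layout; tid and address values are placeholders.
    static final String SAMPLE_DUMP = String.join("\n",
        "\"Thread-0\" #10 prio=5 os_prio=0 tid=0x00007f0000000001 nid=0x7a4a runnable [0x00007f0000001000]",
        "   java.lang.Thread.State: RUNNABLE",
        "\tat com.buggyapp.cpuspike.Object1.execute(Object1.java:13)",
        "",
        "\"Thread-1\" #11 prio=5 os_prio=0 tid=0x00007f0000000002 nid=0x7a4b runnable [0x00007f0000002000]",
        "   java.lang.Thread.State: RUNNABLE",
        "\tat com.buggyapp.cpuspike.Object1.execute(Object1.java:13)");

    // Returns the lines of the entry whose header carries the given native thread ID
    // (decimal, as shown by top -H), or an empty list if no entry matches.
    static List<String> findThread(String dump, long decimalThreadId) {
        Pattern nid = Pattern.compile("\\bnid=0x" + Long.toHexString(decimalThreadId) + "\\b");
        List<String> entry = new ArrayList<>();
        for (String line : dump.split("\\R")) {
            if (entry.isEmpty()) {
                // Still looking for the header line of the matching thread.
                if (line.startsWith("\"") && nid.matcher(line).find()) {
                    entry.add(line);
                }
            } else if (line.trim().isEmpty()) {
                break; // a blank line ends the entry
            } else {
                entry.add(line);
            }
        }
        return entry;
    }

    public static void main(String[] args) {
        findThread(SAMPLE_DUMP, 31306L).forEach(System.out::println);
    }
}
```

In practice you would read the file written by jstack (for example with java.nio.file.Files) and pass its contents in place of the sample string.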

Fig: fastThread tool displaying the CPU-consuming threads

Notice that all three threads are in the RUNNABLE state and are executing this line of code:

com.buggyapp.cpuspike.Object1.execute(Object1.java:13) 

The following is the application source code:

package com.buggyapp.cpuspike;

/**
* 
* @author Test User
*/
public class Object1 {
	
	public static void execute() {
		
		while (true) {
		
			doSomething();
		}		
	}
	
	public static void doSomething() {
		
	}
} 

You can see that line #13 in Object1.java is the call to doSomething(), and that the doSomething() method does nothing. However, it is invoked an infinite number of times because of the non-terminating loop on line #11. When a thread loops indefinitely like this, it keeps a CPU core busy and the CPU spikes. That is exactly what is happening in this sample program. If the non-terminating loop on line #11 is fixed, this CPU spike will go away.
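What the right fix looks like depends on what the loop was meant to do, which the sample program does not say. As one possible sketch, assuming the loop should simply run until someone asks it to stop, the while (true) condition could be replaced with a check of the thread's interrupt flag. The class name Object1Fixed and the returned iteration count are additions for illustration only.

```java
// Hypothetical variant of Object1: the loop now has an exit condition.
final class Object1Fixed {

    // Loops until the calling thread is interrupted; returns how many times it ran.
    static long execute() {
        long iterations = 0;
        while (!Thread.currentThread().isInterrupted()) {
            doSomething();
            iterations++;
        }
        return iterations;
    }

    static void doSomething() {
        // Intentionally empty, as in the original sample.
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> execute());
        worker.start();
        Thread.sleep(100);   // let the worker run briefly
        worker.interrupt();  // request termination
        worker.join();       // returns once the loop has exited
        System.out.println("worker stopped");
    }
}
```

Note that this loop still consumes CPU while it runs; the difference is only that it can now terminate.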

Conclusion

So, if you are troubleshooting a CPU problem in production, there are a few simple things to do. First, use the top tool to identify the process, and then the thread IDs, that are causing the CPU to spike. Then, capture thread dumps. Finally, analyze the thread dumps to identify the exact lines of code that are causing the CPU to spike. Enjoy troubleshooting, happy hacking!

The post Troubleshooting CPU problems in production for the cloud appeared first on JAXenter.

Source : JAXenter