Troubleshooting CPU problems in production for the cloud

Diagnosing and troubleshooting CPU problems in production in a cloud environment can be both tricky and tedious. Your application might have millions of lines of code, so trying to identify the exact line of code that is causing the CPU to spike up is basically the equivalent of finding a needle in a haystack. In this article, we’ll learn how to find that needle in a matter of seconds.

To help readers better understand this troubleshooting technique, we built a sample application and deployed it on an Amazon Elastic Compute Cloud (aka Amazon EC2) instance. Once this application was launched, it caused the CPU consumption to spike up to 199.1%. Now, let’s walk through the steps that we followed to troubleshoot this problem. Basically, there are three simple steps:

1. Identify the threads that consume CPU
2. Capture thread dumps
3. Identify the lines of code that are causing the CPU to spike up

Let’s dive right in!

1. Identify the threads that consume CPU

Multiple processes could be running in the EC2 instance, so the first step is to identify the process that is causing the CPU to spike up. The best way to do this is with the top command, which is available on *nix operating systems.

Issue the top command from the console:

$ top

This command displays all the processes that are running in the Amazon EC2 instance, sorted by CPU consumption, with the highest consumers at the top. When we issued the command in our Amazon EC2 instance, we got the following output:

Fig: ‘top’ command issued from an AWS EC2 instance

From the output, you can see that process #31294 is consuming 199.1% of the CPU. That’s pretty high. By default, top reports CPU usage relative to a single core, so a figure close to 200% means the process is keeping roughly two cores busy. So, now we have identified the process in the Amazon EC2 instance that is causing the CPU to spike up. The next step is to identify the threads in this process that are causing the CPU to spike up.

Issue the command top -H -p {pid} from the console. For example:

$ top -H -p 31294

This command displays the threads of process #31294 along with the CPU consumption of each one. When we issued this command in the Amazon EC2 instance, we saw the following output:

Fig: top -H -p {pid} command issued from an AWS EC2 instance

From this output, you should notice that:

Thread ID #31306 consumes 69.3% of CPU
Thread ID #31307 consumes 65.6% of CPU
Thread ID #31308 consumes 64.0% of CPU

The remaining threads all consume a negligible amount of CPU. Together, these three threads account for 198.9% of the 199.1% consumed by the process.

This is a good step forward, as we have identified the threads that are causing the CPU to spike. In the next step, we need to capture thread dumps to identify the lines of code that are causing the CPU to spike up.
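If you would rather not read the top output by eye, the same check can be scripted. The following is a minimal Java sketch, not part of the original sample application. It assumes the procps version of top (batch mode via -b and -n), a header row containing the PID and %CPU columns, and a hypothetical class name HotThreads with a default threshold of 50% that we picked for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HotThreads {

    /** Returns the thread IDs whose %CPU is at least minCpu, in the order top printed them. */
    static List<Integer> parse(String topOutput, double minCpu) {
        List<Integer> hot = new ArrayList<>();
        int pidCol = -1;
        int cpuCol = -1;
        for (String line : topOutput.split("\\R")) {
            String[] cols = line.trim().split("\\s+");
            List<String> tokens = Arrays.asList(cols);
            if (tokens.contains("PID") && tokens.contains("%CPU")) {
                // Header row: remember where the columns are.
                // If the output holds several frames, keep only the last one.
                pidCol = tokens.indexOf("PID");
                cpuCol = tokens.indexOf("%CPU");
                hot.clear();
                continue;
            }
            if (cpuCol < 0 || cols.length <= Math.max(pidCol, cpuCol)) {
                continue; // summary lines before the header, or blank lines
            }
            try {
                int tid = Integer.parseInt(cols[pidCol]);
                double cpu = Double.parseDouble(cols[cpuCol]);
                if (cpu >= minCpu) {
                    hot.add(tid);
                }
            } catch (NumberFormatException e) {
                // Not a thread row (assumption: thread rows have numeric PID and %CPU).
            }
        }
        return hot;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java HotThreads.java <pid> [minCpuPercent]
        String pid = args[0];
        double minCpu = args.length > 1 ? Double.parseDouble(args[1]) : 50.0;
        Process top = new ProcessBuilder("top", "-H", "-b", "-n", "1", "-p", pid)
                .redirectErrorStream(true)
                .start();
        String output = new String(top.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        top.waitFor();
        System.out.println("Threads at or above " + minCpu + "% CPU: " + parse(output, minCpu));
    }
}
```

With the numbers from our example, a threshold of 50% would pick out threads 31306, 31307, and 31308 and skip the rest.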

2. Capture thread dumps

A thread dump is a snapshot of all the threads that are present in the application. It reports things like the thread state, the stack trace (i.e., the code path that the thread is executing), and ID-related information for every thread in the application.
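To make the idea concrete, here is a small Java sketch that prints a thread-dump-like listing of the JVM it runs in, using the standard Thread.getAllStackTraces() call. The class name MiniThreadDump and the output layout are our own illustration, not the jstack format. Note that the ID printed here is the Java-level thread ID, which is not the operating system thread ID that top shows.

```java
import java.util.Map;

final class MiniThreadDump {

    /** Builds a simplified listing: one header line per thread, followed by its stack frames. */
    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" id=").append(thread.getId())       // Java thread ID, not the OS thread ID
               .append(" state=").append(thread.getState())
               .append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("\tat ").append(frame).append('\n'); // code path the thread is executing
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Every entry carries the three pieces of information described above: an identifier, a state such as RUNNABLE or WAITING, and the stack trace.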

There are eight different options for capturing thread dumps, and you can choose whichever is most convenient for you. One of the simplest is the jstack tool, which is packaged with the JDK and can be found in the $JAVA_HOME/bin folder. Here’s the command to capture a thread dump:

jstack -l {pid} > {file-path}

Here, pid is the process ID of the application whose thread dump should be captured, and file-path is the path of the file that the thread dump will be written to.

In the example below, the thread dump of process 31294 is written to the file /opt/tmp/threadDump.txt.

jstack -l 31294 > /opt/tmp/threadDump.txt
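If you want to trigger the capture from a Java program (for example, from a small monitoring utility), the same command can be launched with ProcessBuilder. This is only a sketch under a few assumptions: the class name JstackCapture is hypothetical, JAVA_HOME is expected to point at a JDK (otherwise jstack must be on the PATH), and the program should run as the same user that owns the target JVM.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

final class JstackCapture {

    /** Builds the command line used above: jstack -l {pid}. */
    static List<String> command(String jstackPath, long pid) {
        return List.of(jstackPath, "-l", Long.toString(pid));
    }

    /** Runs jstack, writes its output to outFile, and returns the exit code. */
    static int capture(String jstackPath, long pid, File outFile)
            throws IOException, InterruptedException {
        Process jstack = new ProcessBuilder(command(jstackPath, pid))
                .redirectErrorStream(true)   // keep any error text in the same file
                .redirectOutput(outFile)     // plays the role of "> {file-path}"
                .start();
        return jstack.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java JstackCapture.java <pid> <output-file>
        long pid = Long.parseLong(args[0]);
        // Assumption: JAVA_HOME points at a JDK; if it is unset, fall back to the PATH.
        String javaHome = System.getenv("JAVA_HOME");
        String jstackPath = javaHome == null
                ? "jstack"
                : javaHome + File.separator + "bin" + File.separator + "jstack";
        int exitCode = capture(jstackPath, pid, new File(args[1]));
        System.out.println("jstack exited with code " + exitCode);
    }
}
```

For our example, the arguments would be 31294 and /opt/tmp/threadDump.txt.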

3. Identify lines of code that are causing the CPU to spike up

The next step is to analyze the thread dump to identify the lines of code that are causing the CPU to spike up. We would recommend analyzing thread dumps through fastThread, a free online thread dump analysis tool.

Now, we upload the captured thread dump to the fastThread tool. This tool generates a visual report with multiple sections. There is a search box in the top right corner of the report, where we can enter the IDs of the threads that have been consuming a high amount of CPU, i.e., the thread IDs that we identified in step #1. In this case, that would be #31306, #31307, and #31308.

Here’s how the fastThread tool displayed the stack traces of the three threads:
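If you prefer to look through the raw dump file yourself, keep in mind that jstack identifies the native thread in a field called nid, and many JDK versions print it in hexadecimal, while top prints thread IDs in decimal. Thread 31306, for instance, is 0x7a4a in hexadecimal. The following Java sketch converts the IDs and pulls the matching entries out of a dump. The class name NidLookup is hypothetical, and the matching also accepts a decimal nid, on the assumption that some JDK builds print it that way.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class NidLookup {

    /** Converts a decimal thread ID from top to hexadecimal form, e.g. 31306 becomes 0x7a4a. */
    static String toNid(int tid) {
        return "0x" + Integer.toHexString(tid);
    }

    /** Returns the dump entries (header line plus stack trace) whose nid matches one of the IDs. */
    static List<String> find(String dump, List<Integer> tids) {
        List<Pattern> wanted = new ArrayList<>();
        for (int tid : tids) {
            // Accept the hexadecimal form and, as an assumption, a decimal form too.
            wanted.add(Pattern.compile("\\bnid=(?:" + toNid(tid) + "|" + tid + ")\\b"));
        }
        List<String> hits = new ArrayList<>();
        StringBuilder entry = null;
        for (String line : dump.split("\\R")) {
            if (line.startsWith("\"")) {
                // A quoted thread name starts a new entry; close the previous one first.
                if (entry != null) {
                    hits.add(entry.toString());
                }
                entry = null;
                for (Pattern pattern : wanted) {
                    if (pattern.matcher(line).find()) {
                        entry = new StringBuilder();
                        break;
                    }
                }
            }
            if (entry != null) {
                entry.append(line).append('\n');
            }
        }
        if (entry != null) {
            hits.add(entry.toString());
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java NidLookup.java <thread-dump-file> <tid> [<tid> ...]
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        List<Integer> tids = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            tids.add(Integer.parseInt(args[i]));
        }
        find(dump, tids).forEach(System.out::println);
    }
}
```

For our example, the arguments would be /opt/tmp/threadDump.txt followed by 31306 31307 31308.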

Fig: FastThread tool displaying CPU consuming thread.

You can see that all three threads are in the RUNNABLE state and are executing this line of code:

com.buggyapp.cpuspike.Object1.execute(Object1.java:13)

The following is the application source code:

package com.buggyapp.cpuspike;

/**
* 
* @author Test User
*/
public class Object1 {
	
	public static void execute() {
		
		while (true) {
		
			doSomething();
		}		
	}
	
	public static void doSomething() {
		
	}
}

You can see that line #13 in Object1.java is doSomething();. The doSomething() method does nothing, but it is invoked an infinite number of times because of the non-terminating loop in line #11. When a thread loops forever without ever blocking or sleeping, it keeps a CPU core busy, and the CPU starts to spike up. That is exactly what is happening in this sample program. Once the non-terminating loop in line #11 is fixed, this CPU spike will go away.
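What the right fix looks like depends on what the loop was actually meant to do, so the following is only one possible shape, not the fix for the sample application. In this Java sketch, the hypothetical BoundedWorker class gives the loop an exit condition (thread interruption) and a short pause between iterations, so the thread can be stopped and does not spin at full speed. The 10 ms pause is an arbitrary value chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

final class BoundedWorker implements Runnable {

    private final AtomicLong iterations = new AtomicLong();

    @Override
    public void run() {
        // Exit condition: leave the loop as soon as the thread is interrupted.
        while (!Thread.currentThread().isInterrupted()) {
            doSomething();
            try {
                // Hypothetical pacing: give up the CPU between iterations.
                Thread.sleep(10);
            } catch (InterruptedException e) {
                // sleep() clears the interrupt flag, so restore it and let the loop end.
                Thread.currentThread().interrupt();
            }
        }
    }

    private void doSomething() {
        iterations.incrementAndGet();
    }

    long iterations() {
        return iterations.get();
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedWorker worker = new BoundedWorker();
        Thread thread = new Thread(worker, "bounded-worker");
        thread.start();
        Thread.sleep(100);      // let it run briefly
        thread.interrupt();     // ask the loop to stop
        thread.join(1000);
        System.out.println("alive=" + thread.isAlive() + ", iterations=" + worker.iterations());
    }
}
```

A thread written this way spends most of its time sleeping rather than in the RUNNABLE state, so it should no longer show up near the top of the top -H output.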

Conclusion

So, if you are troubleshooting a CPU problem in production, there are a few simple things to do. First, use the top tool to identify the threads that are causing the CPU to spike up. Then, capture thread dumps. Finally, analyze the thread dumps to identify the exact lines of code that are causing the CPU to spike up. Enjoy troubleshooting, and happy hacking!

The post Troubleshooting CPU problems in production for the cloud appeared first on JAXenter.
