
Debugging slow systems

Most of this post is blindingly obvious since it's just the scientific method, and quite frankly, I wouldn't be writing it if someone hadn't asked me why something was slow. Most of the time this is a game of whack-a-mole, especially if you don't have something like New Relic installed. A better question to ask is not "why is something slow?" but "when will something become slow?" So let's start with a basic "OMG, the website is slow!" This deserves a reference to the immortal "The Website is Down!"

The user

Well, it started with a person's browser, so the first thing to find out is whether it's reproducible. Points to check:

OK, let's say everything checks out and there is some slowness. What is the exact page or query that is slow? Do you know where it occurs in your stack (assuming a standard stack of web, app, db, search)? Is it a particular feature or page, or general slowness? If you know it's a feature, you can go straight to profiling it, and hopefully that is what happens most of the time. But you would still need to debug how that query behaves in whatever tier you're looking at, so it's essentially the same methodology. The bottleneck will most likely be one of the following, whether you're looking at a database, an Elasticsearch cluster, or an application server: disk IO, memory, network, or CPU.

Now hopefully you have some basic monitoring on all these systems through something like statsd or otherwise. The first thing to check is the basic server statistics to see if there is anything out of the ordinary, whether or not you know where to look.

Disk IO

So how do you know IO is your problem? The easiest method is to check top. It will give a nice line like:

%Cpu(s):  0.0 us,  0.5 sy,  3.7 ni, 72.8 id,  25.0 wa,  0.0 hi,  0.0 si,  0.0 st

That "25.0 wa" is iowait: it means something is hitting the disk and waiting for it. top also shows which process is chewing up the CPU, so hopefully that isolates the culprit. But if not, there's always sysdig.

See the top processes in terms of disk bandwidth usage:

sysdig -c topprocs_file

See the top files in terms of read+write bytes:

sysdig -c topfiles_bytes

This will tell you what is using the disk, but not why since that would be more application specific.

You can also use iostat -xz and look at "%util" in the per-device stats or "%user" in the CPU summary.
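For example (assuming the sysstat package is installed):

# extended (-x) per-device stats, skipping idle devices (-z), refreshed every second
iostat -xz 1
# a device sitting near 100 %util with a climbing await column is saturated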

Now since we're already down the rabbit hole, there's also capacity planning. How much the disk subsystem can handle is mostly defined by IOPS, or IO operations per second. The theoretical IOPS for a spinning disk is based on the seek time plus rotational latency. So for 3 ms of latency and a 4 ms seek time, that would be 1 / (0.003 s + 0.004 s) = 1 / 0.007 s ≈ 140 IOPS. I have no idea how you'd calculate this for SSDs, but hey, there's a Wikipedia entry for that. And if you want much more detailed calculations of IOPS, here's another link. So back to the point of capacity planning, your current IOPS are shown in the tps (transactions per second) column of iostat.

iostat
Linux 3.13.0-46-generic (ubuntu)        05/01/2015      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    5.05    1.01    0.06    0.00   93.86

           Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
           sda              12.03       397.29       153.71  105857517   40956996
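To sanity-check the seek-plus-latency arithmetic above, a quick bc one-liner works (the 4 ms seek and 3 ms latency are just the assumed numbers from the example):

echo "scale=1; 1 / (0.004 + 0.003)" | bc
# 142.8 -- roughly the 140 IOPS quoted above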

After this point, it all depends on the application load profile to know when the disks are going to collapse, which is a much trickier matter to deduce.

And finally, if you absolutely know that there is a problem with the disks, freeze-ups or whatnot, it could be an imminent failure or a driver issue. Check dmesg or smartctl for errors.
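For example (the device name is just a placeholder; assumes smartmontools is installed):

dmesg | grep -i error           # kernel-level IO errors
sudo smartctl -H /dev/sda       # overall SMART health verdict
sudo smartctl -A /dev/sda       # look for reallocated/pending sectors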

Memory

The main two commands are top and free.

$ free -m
                          total       used       free     shared    buffers     cached
             Mem:          2993       2858        134         10          0        341
             -/+ buffers/cache:       2516        476
             Swap:         1159        164        995

Some main points here: used memory is good. In the above example, there is 3 GB of memory and 2858 MB of it is in use. However, the memory actually available to programs is given by the second line, -/+ buffers/cache: instead of 134 MB free, there's really 476 MB free. If the application gets into swap, that's a different story; in this case 164 MB is swapped out. You can find out whether swap is actively being used by checking the si/so (swap in/swap out) columns in vmstat.
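For example:

# five one-second samples; sustained non-zero si/so means the box is actively swapping
vmstat 1 5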

Which processes are hitting swap? Although one should hopefully never hit swap, maybe we don't care for some reason.

sudo sysdig -S -s 100 -p "%proc.name %proc.vmswap" "proc.vmswap>100" |head -n 1000|sort|uniq
firefox 5516
JS 5516
Media 5516
mosh-client 528
mosh-client 748
pcscd 404
Socket 5516
Timer 5516
tmux 12148
vmtoolsd 1036
vmtoolsd 2808
xfce4-terminal 5676
Xorg 14028

Now if you sort top by memory usage, you might get some lines like this.

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    43873 tjheeta   25   5 2408116 1.403g  27404 S   1.7 48.0 122:29.71 firefox   

What is actually being used is resident memory (RES). The virtual memory is a ton, but that doesn't mean much: Linux lets you malloc whatever you want, and the memory doesn't actually get allocated until it's touched. A complete and total digression for the curious:

#include <stdlib.h>   /* malloc */
#include <unistd.h>   /* usleep */

int main() {
  char *mem;
  long i;
  long allocsize = 1000000000;   /* ask for ~1 GB up front */
  mem = malloc(allocsize);       /* VIRT jumps immediately... */
  for (i = 0; i < allocsize; i++) {
    *(mem + i) = 1;              /* ...but RES only grows as pages are touched */
    usleep(100);
  }
  return 0;
}

The resident memory of each process will slowly grow toward 1 GB, while the three processes below have collectively reserved almost as much virtual memory as the machine physically has, using almost none of it.

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10583 tjheeta   25   5  980756    612    408 R   2.9  0.0   0:03.78 a.out
10758 tjheeta   25   5  980756    356    276 S   2.9  0.0   0:00.42 a.out
 9990 tjheeta   25   5  980756    872    408 S   4.9  0.0   0:07.88 a.out

Back to the point: that should give some understanding of when memory is actually utilized, and since we want to avoid swap at all costs, swap usage is the metric to watch when doing capacity planning. Other types of memory failure involve the OOM killer, so check dmesg if a process disappears or restarts for no apparent reason. The apparent slowness could just be an application restarting after being OOM-killed, so it is good to monitor dmesg for that.
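For example:

# the OOM killer logs its victims to the kernel ring buffer
dmesg | grep -iE 'out of memory|oom'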

Network

Network gets a bit trickier, especially on cloud platforms. Latency problems might be intermittent, and switches may be over capacity without you ever knowing.

So the first thing is to check the utilization. We could use something like iftop, but here is the relevant sysdig command:

sudo sysdig -c iobytes_net    
09:39:58 in:218 out:221 tot:439
09:39:59 in:952 out:227 tot:1179
09:40:00 in:75 out:6 tot:81
09:40:01 in:484 out:317 tot:801
09:40:02 in:225 out:222 tot:447
09:40:03 in:80 out:0 tot:80
09:40:04 in:703 out:308 tot:1011

In fact, there are some more handy sysdig chisels if you want to dig deeper into what is using the bandwidth:

sudo sysdig -c topports_server
sudo sysdig -c topprocs_net

Monitoring the overall bandwidth used is the main item on the list, but it's also important to check the network interfaces for errors. These show up in the errors, dropped, overruns, and collisions counters.

$ ifconfig lxcbr0
lxcbr0    Link encap:Ethernet  HWaddr 4a:95:66:1a:79:ed  
          inet addr:10.0.3.1  Bcast:10.0.3.255  Mask:255.255.255.0
          inet6 addr: fe80::4895:66ff:fe1a:79ed/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:146 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:6444 (6.4 KB)

Now normally I would recommend checking all the switches too, but in a cloud environment you can't do that. Installing smokeping can at least help with knowing if and when something went wrong.
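smokeping gives you continuous latency graphs; for a quick one-off look at loss and latency along a path, something like mtr also works (the target host here is just a placeholder):

# 20 probe cycles, then print a summary of loss and latency per hop
mtr --report --report-cycles 20 example.com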

CPU

Again, back to top. It is perhaps handy to know what the columns mean, but it's pretty self-evident when CPU is being used and which process is using it.
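If top isn't handy, a minimal sketch for a quick snapshot of the worst offenders:

# processes sorted by CPU usage, highest first
ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head -n 10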

Applications

So after checking the basics, which should all be available on a dashboard if you have your monitoring down, what happens next? Well, you have to dig into the actual application behaviour and trace through that. That can be rather tricky, especially if you are waiting on external responses. External and data calls should be monitored with an increment and a timer in statsd if you don't have New Relic (see the sketch at the end of this section). But some basic things to monitor for your stack:

Webserver:

Database:

The problem with intermittent errors is that they're quite difficult to track down. It could be your log server blocking, a database connection pool that is not large enough, a badly indexed query that is reading tons of disk, a saturated network link to your DR site that things are for some reason being rebalanced over, badly written caches, or just bad code. Hopefully you'll get lucky. The scientific method, after all, is a fancy way of saying guess and test.
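For the statsd point above, a minimal sketch of what instrumenting an external call looks like at the protocol level (the metric names and the localhost:8125 address are assumptions; statsd speaks a tiny text protocol over UDP):

# increment a counter for every call to the external API
echo "myapp.external_api.calls:1|c" | nc -u -w1 localhost 8125

# record how long the call took, in milliseconds
echo "myapp.external_api.duration:250|ms" | nc -u -w1 localhost 8125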
