If you notice a Linux server behaving weirdly, what are you going to check in the first minute after jumping on it? Assuming it’s not so bad that it is denying all incoming requests. :)
If you are alerted by some pre-established metric thresholds, like “low disk space” or “low free mem”, the best practice is probably just to fill the prescription: throw out some logs or kill some long-running processes. Here I want to talk about more general debugging checks that give you a quick understanding of what is happening on the box at the moment.
Among so many metrics, which are the most important ones to look at? Or which ones deliver the most information you need about a Linux server? Brendan Gregg proposed a method called USE (Utilization, Saturation, Errors) in his book Systems Performance: Enterprise and the Cloud. The key concept of this method is: for every resource, check utilization, saturation, and errors (0).
- Resource: all physical server functional components (CPUs, disks, busses, …)
- Utilization: the average time that the resource was busy servicing work
- Saturation: the degree to which the resource has extra work which it can’t service, often queued
- Errors: the count of error events
uptime is a very lightweight command that quickly shows you how long the system has been running, and the load average.
Load average is a very important set of 3 numbers that show system load over the past 1, 5 and 15 minutes respectively. It ranges from 0.00 to 1.00 or higher, where 0.00 means no load and 1.00 means full load capacity, per core of your CPU.
For the above example, my system load is 1.61 during the past 1 minute, which is not at capacity because it’s a dual-core system (capacity is 2 in this case).
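To make the "load vs. number of cores" comparison concrete, here is a minimal sketch. The function name load_status is made up for illustration; it just compares a load figure against a core count, following the rule of thumb above.

```shell
# load_status LOAD CORES: compare a load average against the core count.
# Hypothetical helper, not a standard tool.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 > cores + 0) print "over capacity"
        else print "within capacity"
    }'
}

# Usage on a live box (first field of /proc/loadavg is the 1-minute load):
#   load_status "$(cut -d " " -f1 /proc/loadavg)" "$(nproc)"
```

With the numbers from the example, load_status 1.61 2 reports "within capacity".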
~$ vmstat 1
1 means print out a snapshot every second.
si, so mean swap mem in/out; if they are not 0, the system is swapping, which usually means it is running out of free memory.
r means # of processes running or waiting for run time, so r > # of CPUs means load is beyond capacity.
id means CPU idle time;
us, sy mean user CPU time and system CPU time respectively;
st means steal time; it happens when the VM’s hypervisor takes CPU away from the current system and uses it elsewhere (other VMs etc).
CPU idle and user time + system time both tell you if the CPU is busy, but us (user time) and sy (system time) can give you more info. For example, if sy is much higher than us, it possibly means the kernel is handling I/O inefficiently (time spent waiting on I/O itself shows up in the wa column).
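If you would rather not eyeball the columns, the two checks above (r against the core count, and non-zero si/so) can be sketched with awk. check_vmstat is a hypothetical name, and the sketch assumes the usual vmstat layout where the header row starts with r.

```shell
# check_vmstat CORES: read vmstat output on stdin and print a warning for
# each sample where the run queue exceeds CORES or swapping is happening.
# Hypothetical helper; columns are located by name from the header row.
check_vmstat() {
    awk -v cores="$1" '
        $1 == "r" {                                   # header row
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        !("r" in col) || $1 !~ /^[0-9]+$/ { next }    # skip banner lines
        $(col["r"]) + 0 > cores + 0 {
            print "run queue " $(col["r"]) " exceeds " cores " cores"
        }
        $(col["si"]) + $(col["so"]) > 0 {
            print "swapping: si=" $(col["si"]) " so=" $(col["so"])
        }
    '
}

# Usage: vmstat 1 5 | check_vmstat "$(nproc)"
```

Keep in mind the first sample vmstat prints is an average since boot, so warnings on that line are less interesting than the ones that follow.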
~$ mpstat -P ALL 1
Similar metrics to vmstat, but displayed per core so you can check whether load is balanced across cores.
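One rough way to judge "balanced" is to look at the gap between the most idle and the least idle core. The sketch below does that with awk; core_idle_spread is a made-up name, and it assumes the mpstat header contains CPU and %idle columns.

```shell
# core_idle_spread: read `mpstat -P ALL` output on stdin and report the
# lowest and highest per-core %idle seen, plus the gap between them.
# Hypothetical helper; it aggregates over every sample it is fed.
core_idle_spread() {
    awk '
        /%idle/ {                                  # header row: locate columns
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU")   c = i
                if ($i == "%idle") d = i
            }
            next
        }
        c && $c ~ /^[0-9]+$/ {                     # per-core rows only, skips "all"
            if (!seen || $d + 0 < min) min = $d + 0
            if (!seen || $d + 0 > max) max = $d + 0
            seen = 1
        }
        END {
            if (seen) printf "min_idle=%.2f max_idle=%.2f spread=%.2f\n", min, max, max - min
        }
    '
}

# Usage: mpstat -P ALL 1 1 | core_idle_spread
```

A large spread suggests one core is doing most of the work, for example a single-threaded process pinning one CPU.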
~$ iostat -dmx 5
This is a very useful command because you are likely to run into I/O-bound workloads very often.
-d means only display device utilization (exclude CPU),
-m means in megabyte format,
-x means extra stats (always good :).
r/s, w/s, rMB/s, wMB/s mean # of R/W requests, and size of R/W sent to the device per second.
avgqu-sz shows the average # of requests waiting in the device’s queue. Values greater than 1 can be evidence of saturation, although devices that serve requests in parallel can tolerate more.
%util is the % of time the device was busy servicing requests. When it gets close to 100%, the device is saturated.
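When a box has many disks, picking out the busy ones by eye gets tedious. Here is a small sketch that filters iostat output by %util; busy_devices and its default threshold of 90 are my own assumptions, not anything iostat provides.

```shell
# busy_devices [LIMIT]: read `iostat -dmx` output on stdin and print each
# device whose %util is at or above LIMIT (default 90).
# Hypothetical helper; the %util column is located from the header row.
busy_devices() {
    awk -v limit="${1:-90}" '
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") u = i
            next
        }
        u && NF >= u && $u + 0 >= limit + 0 { print $1, $u }
    '
}

# Usage: iostat -dmx 5 3 | busy_devices 80
```

As with vmstat, the first report from iostat covers the time since boot, so the later reports are the ones that describe what is happening now.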
~$ dmesg | tail
This displays the last 10 system messages, helpful for finding possible system failures.
Everyone knows to check the IP address with ifconfig, but few know it also shows your # of errors and dropped packets.
~$ ifconfig | grep error
RX packets:262418135 errors:0 dropped:0 overruns:0 frame:0
TX packets:53768807 errors:0 dropped:0 overruns:0 carrier:0
RX packets:6416646 errors:0 dropped:0 overruns:0 frame:0
TX packets:6416646 errors:0 dropped:0 overruns:0 carrier:0
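Since all-zero counters are the boring case, a small filter that only prints lines with non-zero errors or drops can save some squinting. This is a sketch that assumes the older net-tools output format shown above (key:value pairs on the RX/TX lines); newer ifconfig versions print these counters differently. nonzero_net_errors is a made-up name.

```shell
# nonzero_net_errors: read ifconfig output on stdin and print only the
# RX/TX lines whose errors or dropped counters are non-zero, prefixed
# with the interface name. Hypothetical helper for the old output format.
nonzero_net_errors() {
    awk '
        /^[^ \t]/ { ifc = $1 }                     # interface header line
        /^[ \t]*(RX|TX) packets/ {
            for (i = 1; i <= NF; i++) {
                split($i, kv, ":")
                if ((kv[1] == "errors" || kv[1] == "dropped") && kv[2] + 0 > 0) {
                    line = $0
                    sub(/^[ \t]+/, "", line)
                    print ifc ": " line
                    break
                }
            }
        }
    '
}

# Usage: ifconfig | nonzero_net_errors
```

No output means no interface has reported errors or drops.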
~$ free -m
While you can retrieve memory usage in a lot of ways, free can quickly show you stats of buffers and cache:
- A buffer is a temporary location, assigned to an application and used by the system, that buffers data prior to I/O;
- Unlike a buffer, the cache is used for filesystems and stores frequently used data from your disk for fast access, a similar idea to LRU in many software caching applications.
If both values are approaching 0, it means the system is likely under high disk I/O, which can be verified with iostat.
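If you want just these two numbers in a script, they can also be read from /proc/meminfo, which reports Buffers and Cached in kB. A minimal sketch follows; buffers_cache_mb is a made-up name, and the optional file argument exists only so the function can be tried against a sample file. The figures may not match the columns of free exactly, depending on the free version.

```shell
# buffers_cache_mb [FILE]: print buffer and page cache sizes in MB.
# FILE defaults to /proc/meminfo. Hypothetical helper.
buffers_cache_mb() {
    awk '
        $1 == "Buffers:" { b = $2 }               # values are in kB
        $1 == "Cached:"  { c = $2 }
        END { printf "buffers=%dMB cache=%dMB\n", b / 1024, c / 1024 }
    ' "${1:-/proc/meminfo}"
}

# Usage: buffers_cache_mb
```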
For more about free, see this post at linuxnix.
~$ netstat -tulpn
There are a lot of arguments, but each has its meaning.
-tu means show TCP and UDP only, because those are what you are looking for most of the time;
-l means display listening sockets only;
-n means show numeric addresses and ports instead of resolving names;
-p will display the PID/Program of the application that owns the port. Keep in mind that you might need sudo access to view this column. The above result was run by me (a regular user), which shows no Program name, but it can be viewed by the super user:
~$ sudo netstat -tulpn | grep :3306
tcp 0 0 :::3306 :::* LISTEN 22571/mysqld
Now it shows that mysqld is listening on port 3306.
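One catch with grep :3306 is that it would also match a port like 33060. A slightly stricter filter can compare the port exactly, as in this sketch. listener_on_port is a made-up name, and it assumes the program name is the last field, which breaks if the name contains spaces.

```shell
# listener_on_port PORT: read `netstat -tulpn` output on stdin and print
# protocol, local address and PID/Program for sockets bound to PORT.
# Hypothetical helper; matches the port exactly rather than as a substring.
listener_on_port() {
    awk -v port="$1" '
        $1 ~ /^(tcp|udp)/ {
            n = split($4, a, ":")      # port is the last ":"-separated piece
            if (a[n] == port) print $1, $4, $NF
        }
    '
}

# Usage: sudo netstat -tulpn | listener_on_port 3306
```

Run against the line above, this prints tcp :::3306 22571/mysqld.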
See more about netstat in this post at cyberciti.
top: IMO this is the most powerful tool that gives you a lot of information about your Linux server, and it probably covers/overlaps most of the command output listed above. Unlike top, which is like printing a snapshot of a collection of your system stats, commands like iostat can roll out stats continuously so you can view a metric over a duration, for debugging/logging purposes or what not. Furthermore, in combination with awk/sed etc, you will be able to do more cool things.
You can do a lot of things with top, but usually I use it when I am trying to get a high-level idea of the resource utilization of each application.
Conclusion: I often find it overwhelming to learn so many different Linux shell commands, many of which overlap with each other’s functionality too. I find it useful to learn them based on what you want to do, with a specific purpose in mind. Check out this post from Brendan Gregg to find the best shell commands for debugging different system components.