Wednesday, March 12, 2014

Using Python To Prevent A System Crash

An issue with a critical server had been plaguing the IT department for upwards of a year without any resolution.

The server kept running out of virtual memory and crashing, roughly once a month.  The IT staff on board at the time were unable to find the cause and resorted to rebooting the server during the monthly maintenance window as a workaround.

The server was running CentOS 6.3 and was being monitored using Nagios.

Shortly after I came aboard, I cross-referenced the time Nagios reported the server as unavailable against the syslog file (/var/log/messages) and found the following entries from a crash:

Sep 27 03:34:47 sit-admin kernel: smbd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
Sep 27 04:03:55 sit-admin kernel: smbd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
Sep 27 04:03:55 sit-admin kernel: smbd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
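Entries like these can also be picked out of the log programmatically rather than by eye. What follows is only a rough sketch, not anything I ran at the time: the function name find_oom_events and the regular expression are my own illustration, and the pattern assumes the classic syslog line layout shown above.

```python
import re

# Matches syslog lines such as:
#   Sep 27 03:34:47 sit-admin kernel: smbd invoked oom-killer: ...
# Assumes the traditional "Mon DD HH:MM:SS host kernel:" prefix.
OOM_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+"
    r"(?P<process>\S+) invoked oom-killer"
)


def find_oom_events(log_text):
    """Return a (timestamp, host, process) tuple for each oom-killer line."""
    events = []
    for line in log_text.splitlines():
        match = OOM_LINE.match(line)
        if match:
            events.append(
                (match.group("timestamp"),
                 match.group("host"),
                 match.group("process"))
            )
    return events
```

Feeding it the contents of /var/log/messages would give a list of timestamps to compare against the outage times Nagios reports.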

The oom-killer messages were the result of a memory leak in the version of Samba running on the server.  The smbd daemon was creating a new process every time a connection was attempted, successful or not.

The Linux kernel by default allows processes to request virtual memory whether there is enough memory available or not.  When memory actually runs out, the kernel's Out-Of-Memory Killer (oom-killer) kills processes, chosen by a per-process score, in order to keep the system running.

After many connections, more than 100 orphaned smbd processes had accumulated on the system, exhausting memory and crashing it.
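A pile-up like this is easy to confirm by counting processes by name. The sketch below is an illustration only: count_processes is a hypothetical helper of mine, and it assumes a Linux /proc layout where the first line of /proc/<pid>/status holds the process name.

```python
import os


def count_processes(name, proc_root="/proc"):
    """Count running processes whose name in /proc/<pid>/status equals name."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are process IDs
        try:
            with open(os.path.join(proc_root, entry, "status")) as status:
                first_line = status.readline()
        except (IOError, OSError):
            continue  # process exited between listing and reading
        # The first line looks like "Name:\tsmbd"
        if first_line.startswith("Name:"):
            if first_line.split(":", 1)[1].strip() == name:
                count += 1
    return count
```

Calling count_processes("smbd") on the affected server would have returned a number well above what the handful of real client connections could explain.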

Due to some (mis)configuration issues on another person's part, we were unable to update the version of Samba in a straightforward manner.

In order to circumvent further crashes, I wrote a Python script that checks the amount of swap space available and restarts the Samba daemon (or whatever daemon you specify) if the amount is below the specified value.
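The idea can be sketched in a few lines. This is not the actual script, just a minimal illustration of the approach under some assumptions: free swap is read from the SwapFree line of /proc/meminfo, the daemon is restarted through the service command, and the names swap_free_kb, restart_service and check_swap are made up for this example.

```python
import subprocess


def swap_free_kb(meminfo_text):
    """Extract the SwapFree value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapFree:"):
            # Line format: "SwapFree:   2097144 kB"
            return int(line.split()[1])
    raise ValueError("SwapFree not found in meminfo text")


def restart_service(name):
    """Restart a daemon; assumes the SysV `service` command is available."""
    return subprocess.call(["service", name, "restart"])


def check_swap(meminfo_text, min_free_kb, service, restart=restart_service):
    """Restart `service` if free swap is below min_free_kb.

    Returns True if a restart was triggered, False otherwise.
    """
    if swap_free_kb(meminfo_text) < min_free_kb:
        restart(service)
        return True
    return False
```

In use, the text would come from reading /proc/meminfo, and the threshold and service name (for example smb, which I am assuming is the init script name on the box in question) would be passed in as arguments. Passing the restart function as a parameter keeps the decision logic testable without actually bouncing a daemon.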

The script can be found here.  I scheduled the script to run as a cron job, and no further crashes have occurred.