When your Server dies…

Having recently upgraded my server by an external housing and an additional SCSI controller (an LSI 9750-8e in addition to the 9690SA-8i), I was experiencing unexpected server crashes that had never happened before on that server.

The server just became unresponsive, sometimes still responded to a ping, sometimes not. Before, it had an uptime since fall 2008, only interrupted by the occasional kernel upgrade.

The server runs on an UPS, and an additional Filter. It has a redundant power supply. For that reason I was excluding problems on that side.

Server Specs: dual 3.00 GHz Xeon E5472 with 32 GB RAM, 7 x Seagate ST310006400SS 1TB drives running in a RAID 10 on the internal controller (1 hot spare; leaving 2.7 TB net space); 5 x SEAGATE ST33000650SS 3TB drives running in a RAID 10 on the external controller (1 hot spare; leaving 5.8 TB net space); both controllers have 512 MB cache each and are both battery buffered. Of course, the NIC is also redundant. The server was provided by Silicon Mechanics.

Now, since the problems arose after the upgrade I immediately thought about the controller. I dumped a log and fired it off to Silicon Mechanics, who showed an excellent level of responsiveness – yet they could not see any problems. They suggested memory to be failing. Yet, because of the correlation with the upgrade, I wasn’t really buying into the idea of memory modules failing just by accident around the same time.

Someone on #linuxger suggested using netconsole. Netconsole is basically a driver that fires off logging information (dmesg) to a remote host using UDP – beginning at a very early stage of the boot process. Since – once the server had failed – I could not log on to the system, I really wanted to get some “pre mortem” error messages. So I took out an old linux notebook, configured netconsole on the server and syslogd on the notebook, and waited.

For several days, nothing happened.

Now since I had to do some productive work, I took up my development VMs and continued my normal day job. I recently started a new project on the nice “Swiss Riviera”, close to my home, where I am to test and automate all sorts of installation procedures for Business Objects products. Since large customers are slow to provide machines, I was as usual resorting to using my own server for setting up all those testbeds. Each of those VMs consumes 8 GB memory. I was using snapshots a lot in order to roll back and try “the next scenario.”

The reason I’m giving you this context is this: In addition to the modification to the server hardware, there was also a modifiation in how the server was used. Even though I had previously used large VMs on that server, I had rarely used a number of them at the same time. Since VMWare is able to release unused memory to the host operating system, I could have started about any number of VMs.

As you can imagine, the server did fail.

Back to netconsole. I saw all sorts of things, but the most relevant were messages like this:

vmware-vmx invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
Out of memory: Kill process 332 (upstart-udev-br) score 1 or sacrifice child
Killed process 332 (upstart-udev-br) total-vm:17052kB, anon-rss:40kB, file-rss:440kB
smbd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
Out of memory: Kill process 622 (portmap) score 1 or sacrifice child
Killed process 622 (portmap) total-vm:8376kB, anon-rss:0kB, file-rss:400kB
rsyslogd invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0

etc. etc.

For some reason, someone was killing my processes arbitrarily, which did explain the fact that a) the server was becoming unresponsive and b) the server was sometimes responding to ping, sometimes not.

What was going on? Google revealed several hits of oom-killer in context with VMWare. The basic idea is that this process kicks in as a last resort of the operating system should it have to reclaim memory. It should start with killing the big memory hogs; in my case, it ended up killing about everything useful. I assume that the actual breakdown happened when e.g. all client machines decided at the same time to do some windows upgrade, leading to a spike in memory usage and thus the oom-killer to kick in. What I found out was that you can actually switch off the whole process by saying

echo 1 >/proc/sys/vm/overcommit_memory

Having said that, I decided to stress test the server a little. I started some of those VMs, used a tool to force memory allocation inside some of those VMs to let them shoot up in memory usage. I was barely using that tool in the first VM when the system resorted to coming to a screaching halt:

top - 11:51:57 up  1:02,  1 user,  load average: 64.24, 49.45, 29.38
Tasks: 187 total,  19 running, 166 sleeping,   0 stopped,   2 zombie
Cpu(s):  0.0%us, 100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  33014440k total, 32857916k used,   156524k free,    87920k buffers
Swap: 31455228k total,     2968k used, 31452260k free, 30577144k cached

  294 root      20   0     0    0    0 R   34  0.0   0:51.89 jbd2/sda1-8
 1337 root      20   0  586m  41m 6216 S   34  0.1   4:28.28 vmware-vmx

A load of 64… Wow. That’s quite impressive I thought. I started investigating some more. And mind the fact that memory was basically unused (total – cache!), and the swap was also not used.

The conclusion is that the extreme system load is a result uf iowaits while the disk backed memory files (*vmem) were updated. Some people say that you should set mainMem.useNamedFile=”FALSE” in the .vmx files; the idea behind that is to turn of the usage of *vmem files. However, this does not solve the problem, as VMWare Server will use an unlinked (deleted) file in $TMPDIR instead.

In addition, since I was using snapshots a lot, this made matters worse, as apparently this doubles the load on the disks. I went on to fiddle with the vm.swappiness and related parameters, and am giving you here the values that I ended up using (set through /etc/rc.local):

# cat /proc/sys/vm/overcommit_memory
# 0
echo 1 >/proc/sys/vm/overcommit_memory 

# cat /proc/sys/vm/dirty_background_ratio
# 5
echo 2 >/proc/sys/vm/dirty_background_ratio 

# cat /proc/sys/vm/dirty_ratio
# 10
echo 100 >/proc/sys/vm/dirty_ratio 

# cat /proc/sys/vm/swappiness
# 60
echo 0 >/proc/sys/vm/swappiness 

echo 9000 >/proc/sys/vm/dirty_expire_centisecs
echo 3000 >/proc/sys/vm/dirty_writeback_centisecs

From this post, which is a mirror from here and which I’m going reuse here, I learned that first of all, vm.dirty_background_ratio defines the percentage of memory that can become dirty before a background flushing of the pages to disk starts. Until this percentage is reached, no pages are flushed to disk. When the flushing starts, it is done in the background without disrupting the running foreground services.

The parameter vm.dirty_ratio defines the percentage of memory which can be occupied by dirty pages before a forced flush starts. If the percentage of dirty pages reaches this number, all the processes become synchronous, and they are not allowed to continue until the io operation they have requested has actually been performed and the data has been sent to disk.

Now, my understanding is that every virtual machine’s RAM is mapped (with mmap in 1MB blocks) to a file on disk.

I don’t know why VMWare server does not actually use the hosts’s RAM and rather puts the client’s RAM into a file on disk. While this does come in handy if you have to give to your client machines more memory that you physically have on the host, the tradeoff is – well – the extreme load I was experiencing.

Imagine now, that one of my virtual machines thinks it has 8 GB RAM, and I use my memory hog inside the VM to actually claim that memory. Every time the guest RAM is written, the Linux host sees a new dirty page that must then be written to the mmap’ed file on disk. Now 8 GB RAM in the VM is a lot of pages on the host, and therefore the dirty page count can increase rapidly even when the host does simple tasks like caching disk reads, which the guest kernel believes is improving performance, when in fact it is actually slowing down things since it is causing the linux host to perform slower disk writes as a result.

So that’s the onset: The guest changes its RAM, creates dirty pages, and the Linux host kernel needs to decide when to commit these dirty pages back to disk. This can happen in two ways: in the background by a pdflush process, with other processes still running, or synchronously, causing any processes that were dirtying RAM (e.g., vmware-vmx) to wait for the dirty page writes to complete. Clearly the synchronous write option is going to have severe consequences for the guest VM process, explaining the system coming to a standstill.

When the oom-killer was still active, he was at some point taking care of the “situation” by shooting from the hip.

Now as a solution, we need to find a better way to managing writing back the vmem files,

  1. avoiding synchronous writes, i.e. having all writing done by the background pdflush process,
  2. avoiding accumulation of dirty pages as long as synchronous write becomes unavoidable, and
  3. not updating the memory backing file too often since guest RAM can change fast, and we do not want to cause a disk write for every change.

The first point calls for a high value for vm.dirty_ratio. There will surely be good reasons why this is set to a default of 10% in recent kernels. However, for the configuration we’re having here, those 10% are quickly exhausted. For this reason, I maxed this value out at 100%, believing that all suspending processes for disk writes should be a last resort.

The second point calls for a low value for vm.dirty_background_ratio. The pdflush process does not appear to have the same adverse effect on the system performance, which is why I set this down to 2%, assuming that the earlier the background writes are started, the lower the chances are for reaching vm.dirty_ratio and hence forced synchronous writes.

The third point can be controlled by vm_dirty_expire_centisecs which I set to 9000, causing Linux to wait 90 seconds before committing changes to disk. This does affect the whole system, not just VMWare vmem, and caching disk writes for 90 seconds could be considered a “bad thing”. But since both controllers are battery backed and also, the whole system is running on a decent UPS that keeps everything up and running for 30 minutes, and finally we’re on redundant power supplies, this should be safe to do. In addition, this value seems to be ignored once vm.dirty_background_ratio is reached, and thus it does not counter point 1.

Having made those changes, I did the above stress test again, running loads of VMs and maxing out their memory (but just to about 80 % of their memory in order to avoid swapping inside the client machines). While before, already one single memory hog was bringing the system to a grinding halt, I am now able to max out memory allocation for all VMs without any problems.

I hope this information is of use for others. I’m particularly thankful for Graham who’s given me the right way. On the above link, there’s an additional discussion, in particular w.r.t caching on the hardware raid controller: “There is an issue, particularly with later distro’s, using (hardware?) RAID with VMware. This manifests itself with huge iowaits (64s not uncommon). The underlying cause seems to be vmware writing all dirty memory to disk even though everything is configured so that no swapping should take place.” The person posting that has a very similar hardware to this one here. According to 3dm2, caching is active on both controllers here.

A related discussion went on here, and the solution (towards the end of the thread) is in the same context as what I wrote above.

I’m going to investigate the caching further, but the above settings seem to already have improved the situation a lot.