Solving problems

Frontpage of vg.no

Edit: I have posted a follow-up to this post: RE: Solving problems

At work we have an application called Dr. Front to manage the front pages of different sections on our site. If you look at our main front page you can clearly see that it is not automated. This page is generated in Dr. Front and then moved to our content servers.

The machine that this application runs on has been acting very strangely for the last couple of weeks. Suddenly it’s using up all available memory, the machine starts to swap, and then it dies. Since I love solving problems I thought I could have a look at this to try and fix it. Keep in mind that I’m a developer and not a sysadmin. My approach to solving this problem is probably not the best one, but at least I learned something along the way. To fully understand this post you should have some basic Linux knowledge. Hopefully you’ll be able to pick up a few tricks too.

I’ll throw in a small drawing of this part of our setup (don’t you dare laugh at my dia skills!)

Network overview

When a user makes a request to, for instance, http://www.vg.no, the request hits one of the two load balancers, then one of the Varnish machines (which contain rules describing which request goes to which backend server) and then the backend server for the content you want. When one of the Apache servers requests content from http://www.vg.no it also has to go through Varnish, since http://www.vg.no/foo might hit a different backend server than http://www.vg.no/bar. So, any request to http://www.vg.no (or any of our other hosts for that matter) made from an Apache goes all the way out to one of the load balancers, and then back inside the network. When I started looking at this problem I was not aware that our setup was like this. The BIG-IP load balancer has just been installed and the two Varnish machines have been split so that each load balancer only talks to one Varnish. I thought we only had one load balancer that talked to both Varnishes. Let’s move on…

appfront is the machine that runs Dr. Front (written in PHP by Aptoma) and wildbill is the machine that contains the main front (among other things). When a front editor saves a front it gets moved over from appfront to wildbill. When you guys look at some page on http://www.vg.no/, wildbill is most likely the backend server you get content from. Now that you know some more about our setup, let’s get on with hunting down the problem.

The very first thing I did was to log into appfront and look at top. I saw a bunch of httpd processes there so I decided to take a look at /server-status (Apache module mod_status). This module gives you detailed information on the requests that are currently active. Keep in mind that appfront has ExtendedStatus On in its Apache configuration. Some of the information I use in this post might not be available from /server-status if ExtendedStatus is Off.
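For reference, enabling this looks roughly like the following in the Apache configuration. This is only a sketch; the exact access rules (and whether the module is already loaded) will differ on appfront:

ExtendedStatus On

<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>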

Some of the processes that /server-status showed me had been running for hours. To check /server-status I simply ran the following command on appfront:

links -dump http://localhost/server-status|less

I used a one-liner that Audun at work gave me (modified by yours truly to be even more magic) to sort the processes that were waiting:

links -dump-width 200 -dump http://localhost/server-status|grep -P '^([ ]+)([\d]+)-[\d]'|awk 'BEGIN {print "Seconds, PID, Method, Url\n--"}  $4 ~ /W/ {print $6, $2, $13, $14|"sort -n"}'

This command dumps the content from /server-status using links (with a custom width so it doesn’t wrap longer lines). The content is fed into grep, which finds lines starting with some spaces, a number, a hyphen and then another number. These are the lines in the output that hold info on the processes. The filtered output gets fed to Awk, which changes the output a bit so it’s easier to read. The $4 ~ /W/ part of the Awk command makes sure that the 4th column (the worker’s mode) matches W, which means the worker is busy sending a reply. Read more about Apache’s /server-status to learn more about the output. The content is then sorted on the number of seconds each process has been running before it’s displayed. Here is an example of the output:

Seconds, PID, Method, Url
--
0 27485 GET /server-status
668 27480 GET /drfront-1.3/
1784 925 GET /drfront-1.3/drfront/?do=login
2167 29437 POST /drfront-1.3/drfront/?do=keep-alive&fronttype_id=25
2197 29434 POST /drfront-1.3/drfront/?do=keep-alive&fronttype_id=25
2238 926 POST /drfront-1.3/drfront/?do=saveAndBurnFront&fronttype_id=25

Now I had the PIDs of a lot of processes. These PIDs can be used with strace, which traces the system calls and signals of a running process, to see what they are up to. As root I did:

strace -p PID

where PID is the ID of the process that is hanging. First I took a peek at the one that had been hanging for the longest amount of time (PID 926). The output I got from strace was:

poll([{fd=25, events=POLLIN}], 1, 1000) = 0
poll([{fd=25, events=POLLIN}], 1, 0)    = 0
poll([{fd=25, events=POLLIN}], 1, 1000) = 0
poll([{fd=25, events=POLLIN}], 1, 0)    = 0

It just kept on going like that forever. poll is a system call that waits for an event on a file descriptor. This means that the process is waiting for an event to happen on some file. In the output from strace I saw {fd=25, events=POLLIN}. If we take a peek in the man page for poll we can see that the first argument is a struct where fd is the file descriptor and events is the set of events to wait for. The event in this case is POLLIN, which simply means “There is data to read”. The process is simply waiting to read data from the file, which clearly never happens.

Now that I knew the id of the descriptor I could look in /proc on the file system to see which file it was. Before I did that I looked at the rest of the processes that were hanging. Next up was the process with PID 29434. The output this time was:

flock(18, LOCK_EX

The rest of the troubled processes gave the same output. It seemed that the process that had been hanging the longest held an exclusive lock on a file, and since the process never died it never released the lock. All the other processes that also wanted this exclusive lock would wait for it forever. If we look at the man page for flock we can see that the first argument is a file descriptor. Since I knew the PID and the id of the file descriptor I did:

ls -la /proc/29434/fd/18

The result was

/tmp/drfront_sessions/<session file>

This meant that the session file of a specific user was locked forever. Any request this user made after the first process hung would result in a system call that would wait forever. The request that hung first was made using AJAX. The logged-in user did not see this and made a request to some other page in Dr. Front. This request would then hang forever since it couldn’t get the exclusive lock on the session file. The user then had to log in once more to get a new session ID. The problem with this was that the processes sitting there waiting for the file to be unlocked took up a lot of memory. After a while there was no more available memory and the machine stopped responding.
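By the way, if you want to confirm which process is actually holding a lock like this, the kernel lists every file lock in /proc/locks along with the PID that owns it, and lsof can show who has the session files open. A rough sketch (the session directory is the one mentioned above):

# every file lock on the system; each line shows the lock type, mode and the PID holding it
cat /proc/locks

# which processes have files open under the session directory
lsof +D /tmp/drfront_sessions/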

Let’s go back to the first process and find out some more about the file descriptor it wants data from.

ls -la /proc/926/fd/25

The output this time was

lrwx------ 1 root root 64 Jul 22 12:21 /proc/926/fd/25 -> socket:[21172073]

This meant that the file descriptor was in fact a socket. I actually have no idea what the number in the brackets means, but I figured out that this number could be used to find out some more about the socket connection using netstat. I ran the following command on appfront:

netstat -ae | grep 21172073

The output I got was:

tcp        0      0 appfront.m323.vgnett.:37333 www.vg.no:http           ESTABLISHED apache     21172073
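As a side note, the number in the brackets from the /proc listing is the inode number of the socket, and the same number shows up in the inode column of /proc/net/tcp, so the connection could also have been found there (the addresses in that file are hex encoded, which is why netstat is friendlier):

grep 21172073 /proc/net/tcp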

The netstat output showed that Dr. Front made an outgoing connection to http://www.vg.no on port 80. I don’t know how all the bits and pieces are connected in Dr. Front, so I didn’t know why it was making outgoing connections at all. I fired up tshark to take a look at these connections:

tshark -N ntC -d tcp.port==80,http src 10.0.0.106 and dst port 80 and dst host www.vg.no

This gave me all connections made from 10.0.0.106 (appfront) to the http://www.vg.no host on port 80 (http). Here is a small part of the output from tshark:

0.106286 appfront.m323.vgnett.no -> www.vg.no    HTTP HEAD /drfront/images/front-25-image-c801b964.jpeg?2203138e92f24003197c4af8f7185aa8 HTTP/1.1
0.107729 appfront.m323.vgnett.no -> www.vg.no    HTTP HEAD /drfront/images/front-25-image-dddc9b55.jpeg?0d6d9fce05cf80b34d00c518ce11c8f8 HTTP/1.1

There was a whole lot of HTTP HEAD going on. According to the developer who wrote Dr. Front, these outgoing requests are made using cURL in PHP. Why would this cause so much trouble? I decided to take a look at the output of these requests on my own computer. I ran the following command on my laptop (HEAD is part of libwww-perl):

HEAD http://www.vg.no/drfront/images/front-25-image-c801b964.jpeg?2203138e92f24003197c4af8f7185aa8

First I hit joanie (Varnish), which used wildbill (Apache) as the backend. Everything looked fine:

200 OK
Cache-Control: must-revalidate
Connection: close
...
X-VG-WebCache: joanie
X-VG-WebServer: wildbill

I tried a few more times to see if I could hit fritz. When I finally hit fritz I could see a small difference in the output from HEAD:

200 OK
Cache-Control: must-revalidate
...
X-VG-WebCache: fritz
X-VG-WebServer: wildbill
Xonnection: close

As you can see, the Connection: close header that joanie produced is now Xonnection: close. This was strange because fritz and joanie use the exact same configuration and they run the same version of Varnish. Were the load balancers doing some funky stuff? At this point I was not aware that each load balancer only used one Varnish. One of the sysadmins (who is on vacation) told me that one of the load balancers had just been installed, and that they each had their own Varnish. Since fritz was the only one that responded with the bogus header I logged into fritz and ran the HEAD command locally to skip the BIG-IP load balancer:

HEAD -H "Host:www.vg.no" http://localhost/drfront/images/front-25-image-c801b964.jpeg?2203138e92f24003197c4af8f7185aa8

This time I got the correct Connection: close header. It looked like the new load balancer was doing something very weird.
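A quick way to keep sampling which Varnish answers, and whether the header gets mangled on the way, is a small loop around the HEAD command used above (the egrep pattern deliberately matches both Connection and Xonnection):

for i in $(seq 1 20); do
  HEAD 'http://www.vg.no/drfront/images/front-25-image-c801b964.jpeg?2203138e92f24003197c4af8f7185aa8' | egrep 'X-VG-WebCache|onnection: close'
  echo --
done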

I’m not sure why an Xonnection: close header would cause cURL in PHP to hang forever, but my guess is that when cURL does not get a Connection: close header it keeps the connection open. Another thing I haven’t figured out is why appfront waits forever for data to read on some sockets. It might be that wildbill decides to drop the connection for some reason, and since appfront never gets a Connection: close it just keeps on waiting for data on a socket that wildbill has already closed.
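One way to keep a hung connection like this from blocking a process forever would be to put a hard timeout on the request. I haven’t verified that this is the right fix for Dr. Front, but the idea looks something like this with command-line curl (in PHP’s cURL bindings the corresponding options are CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT):

# give up if connecting takes more than 5 seconds or the whole request takes more than 30
curl -I --connect-timeout 5 --max-time 30 \
  'http://www.vg.no/drfront/images/front-25-image-c801b964.jpeg?2203138e92f24003197c4af8f7185aa8'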

To fix this problem temporarily I edited the /etc/hosts file on appfront so that whenever it makes a request to http://www.vg.no it will only hit joanie. It has been running for some time now without any processes hanging, so hopefully I have managed to figure out what the source of the problem was.
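The change itself is just one line in /etc/hosts pinning the hostname to joanie, something along these lines (the IP address below is a placeholder, not joanie’s real address):

# /etc/hosts on appfront: force www.vg.no to resolve to joanie
203.0.113.10    www.vg.no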

I have not yet figured out why a request through BIG-IP results in an Xonnection: close header though. It might be a software bug or simply a spelling error in a config file. The problem is not really solved until both load balancers respond with the correct header, so I’m not done yet!

The process of tracking down this problem was new to me and I feel that I picked up some neat tricks along the way. If any of my readers know of a better way to hunt down these kinds of problems, or can shed some more light on the problems I found, feel free to leave a comment!



3 Responses to Solving problems

  1. Rexxars says:

    Very well written post, I learned a couple of tricks :-)

    Should be interesting to see what the cause of the Xonnection bug is – keep us updated :-)

  2. Pingback: RE: Solving problems « Christer’s blog o’ fun

  3. haberler says:

    nice job Christer. i just wanna say thank you.
