Christer’s blog o’ fun

July 23, 2009

RE: Solving problems

Filed under: PHP, Technology, Work related — Tags: , , , , — christer @ 12:10 pm

This morning I learned some more about the issues described in my previous post: Solving problems.

The BIG-IP load balancer we are using has a feature called OneConnect. This feature can increase network throughput by managing connections between the BIG-IP and the backend servers more efficiently.

After browsing the support pages over at f5.com I found a document that had an overview of OneConnect. The document states that the FastHTTP profile (a profile in the BIG-IP) uses an implementation of OneConnect that transforms Connection: close headers to Xonnection: Close. fritz (the Varnish that BIG-IP was using) was using the FastHTTP profile, and that is why it returned these Xonnection: close headers. I thought this was a bug, but it is actually a feature (it’s not a bug, it’s a feature).

I guess the cURL library used from PHP in the Dr. Front application on appfront does not understand what this header is and keeps a connection open to the BIG-IP. The connection between the BIG-IP and wildbill (the backend server that Dr. Front did some HTTP HEAD requests to) is probably closed so Dr. Front never receives the data it should. I thought cURL had some sort of timeout pr. default but it doesn’t seem like this is the case.

I took a quick peek at the part of the code in Dr. Front that does these HEAD requests:

// ...
$ch = curl_init();
 curl_setopt($ch, CURLOPT_URL, $url);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
 curl_setopt($ch, CURLOPT_NOBODY, true);
 $data = curl_exec($ch);
// ...

There is nothing in that code that has anything to do with timeouts. cURL has an option called CURLOPT_TIMEOUT that can be used for this. One line of code should be added before the call to curl_exec:

curl_setopt($ch, CURLOPT_TIMEOUT, 10);

This line would then tell curl_exec that it had at least 10 seconds to do it’s magic. I have not tested this yet but once the developer of Dr. Front is back from his vacation I’ll get him to try this in combination with the Varnish that returns the Xonnection: close header.

Resources:

July 22, 2009

Solving problems

Filed under: Technology, Work related — Tags: , , , , , , , , , , , , — christer @ 4:32 pm
Frontpage of vg.no

Frontpage of vg.no

Edit: I have posted a follow-up to this post: RE: Solving problems

At work we have an application called Dr. Front to manage the front pages of different sections on our site. If you look at our main front page you can clearly see that it is not automated. This page is generated in Dr. Front and then moved to our content servers.

The machine that this application runs on has been acting very strange the last couple of weeks. Suddenly it’s using up all available memory and the machine starts to swap, and then it dies. Since I love solving problems I thought I could have a look at this to try and fix it. Keep in mind that I’m a developer and no sysadmin. My approach to solving this problem is probably not the best one, but at least I learned something along the way. To fully understand this post you should have some basic Linux knowledge. Hopefully you’ll be able to pick up a few tricks along the way.

I’ll throw in a small drawing of this part of our setup (don’t you dare to laugh of my dia skills!)

Network overview

Network overview

When a user makes a request to for instance www.vg.no the request hits one of the two load balancers, then one of the Varnish machines (which contain rules that describes which request goes to which backend server) and then the backend server for the content you want. When one of the Apache servers request content from www.vg.no they also have to go through Varnish since www.vg.no/foo might hit a different backend server than www.vg.no/bar. So, any request to www.vg.no (or any of our other hosts for that matter) made from an Apache goes all the way out to one of the load balancers, and then back inside the network. When I started looking at this problem I was not aware that our setup was like this. The BIG-IP load balancer has just been been installed and the two Varnish machines has been split so each load balancer only talks to one Varnish. I thought we only had one load balancer that talked to both Varnishes. Let’s move on…

appfront is the machine that runs Dr. Front (written in PHP by Aptoma) and wildbill is the machine that contains the main front (among other things). When a front editor saves a front it gets moved over to wildbill from appfront. When you guys look at some page on http://www.vg.no/, wildbill is most likely the backend server you get content from. Now that you know some more about our setup, lets get on with hunting down the problem.

The very first thing I did was to log into appfront and look at top. I saw a bunch of httpd processes there so I decided to take a look at /server-status (Apache Module mod_status). This module gives you detailed information on the requests that is currently active. Keep in mind that appfront has ExtendedStatus On in its Apache configuration. Some information that I use in this post might not be available from /server-status if ExtendedStatus is Off.

Some of the processes that /server-status showed me had been running for hours. To check /server-status I simply ran the following command on appfront:

links -dump http://localhost/server-status|less

I used a one-liner that Audun at work gave me (and modified by yours truly to be even more magic) to sort the processes that was waiting:

links -dump-width 200 -dump http://localhost/server-status|grep -P '^([ ]+)([\d]+)-[\d]'|awk 'BEGIN {print "Seconds, PID, Method, Url\n--"}  $4 ~ /W/ {print $6, $2, $13, $14|"sort -n"}

This command dumps the content from /server-status using links (with a custom width to not make it wrap longer lines). The content is fed into grep that finds lines starting with some spaces, a number followed by a hyphen and then another number. These are the lines in the output that holds info on the processes. The filtered output gets fed to Awk which changes the output a bit so it’s more easy to read. The $4 ~ /W/ part of the Awk command makes sure that the 4th column in the input to Awk is equal to W. Read more about Apache’s /server-status to learn more about the output. The content is then sorted on the number of seconds each process has been running for before it’s displayed. Here is an example of the output:

Seconds, PID, Method, Url
--
0 27485 GET /server-status
668 27480 GET /drfront-1.3/
1784 925 GET /drfront-1.3/drfront/?do=login
2167 29437 POST /drfront-1.3/drfront/?do=keep-alive&fronttype_id=25
2197 29434 POST /drfront-1.3/drfront/?do=keep-alive&fronttype_id=25
2238 926 POST /drfront-1.3/drfront/?do=saveAndBurnFront&fronttype_id=25

Now I had the PID’s of a lot of processes. These PID’s can be used with strace to see what they are up to. strace can be used to trace system calls and signals. As root I did:

strace -p PID

where PID is the ID of the process that is hanging. First I took a peak at the one who had been hanging for the longest amount of time (PID 926). The output I got from strace was:

poll([{fd=25, events=POLLIN}], 1, 1000) = 0
poll([{fd=25, events=POLLIN}], 1, 0)    = 0
poll([{fd=25, events=POLLIN}], 1, 1000) = 0
poll([{fd=25, events=POLLIN}], 1, 0)    = 0

It just kept on going like that forever. poll is a C function that waits for an event on a file descriptor. This means that the process is waiting for an event to happen on some file. In the output from strace I saw {fd=25, events=POLLIN}. If we take a peek in the man page for poll we can see that the first argument is a struct and fd is the file descriptor and events is which events to wait for. The event in this case is POLLIN which simply means “There is data to read”. The process is simply waiting to read data from the file, which clearly never happens.

Now that I knew the id of the descriptor I could look in /proc on the file system to see which file it was. Before I did that I looked at the rest of the processes that was hanging. Next up was the process with PID 29434. The output this time was:

flock(18, LOCK_EX

The rest of the troubled processes gave this output. It seemed that the process hanging for the longest amount of time had an exclusive lock on a file, and since the process never died it never removed the lock. All other processes that also wanted this exclusive lock would wait for it forever. If we look at the man page for flock we can see that the first argument is a file descriptor. Since I knew the PID and the id of the file descriptor I did:

ls -la /proc/29434/fd/18

The result was

/tmp/drfront_sessions/<session file>

This meant that the session file of a specific user was locked forever. Any request this user made after the first process hung would result in a system call that would wait forever. The process that hung first was requested using AJAX. The logged in user did not see this and made a request to some other page in Dr. Front. This request would then hang forever since it didn’t get the exclusive lock on the session file. The user then had to login once more to get a new session id. The problem with this was that the processes that sat there waiting for the file to get unlocked took up a lot of memory. After a while there was no more available memory and the machine stopped responding.

Let’s go back to the first process and find out some more about the file descriptor it wants data from.

ls -la /proc/926/fd/25

The output this time was

lrwx------ 1 root root 64 Jul 22 12:21 /proc/926/fd/25 -> socket:[21172073]

This meant that the file descriptor was in fact a socket. I actually have no idea what the number in the brackets mean, but I figured out that this number could be used to find out some more about the socket connection using netstat. I ran the following command on appfront:

netstat -ae | grep 21172073

The output I got was:

tcp        0      0 appfront.m323.vgnett.:37333 www.vg.no:http           ESTABLISHED apache     21172073

This meant that Dr. Front made an outgoing connection to www.vg.no on port 80. I don’t know how all the bits and pieces are connected in Dr. Front so I didn’t know why it was doing outgoing connections at all. I fired up tshark to take a look at these connections:

tshark -N ntC -d tcp.port==80,http src 10.0.0.106 and dst port 80 and dst host www.vg.no

This gave me all connections made from 10.0.0.106 (appfront) to the www.vg.no host on port 80 (http). Here is a small part of the output from tshark:

0.106286 appfront.m323.vgnett.no -> www.vg.no    HTTP HEAD /drfront/images/front-25-image-c801b964.jpeg?2203138e92f24003197c4af8f7185aa8 HTTP/1.1
0.107729 appfront.m323.vgnett.no -> www.vg.no    HTTP HEAD /drfront/images/front-25-image-dddc9b55.jpeg?0d6d9fce05cf80b34d00c518ce11c8f8 HTTP/1.1

There was a whole lot of HTTP HEAD going on. According to the developer that wrote Dr. Front, these outgoing requests are made using cURL in PHP. Why would this cause so much trouble? I decided to take a look at the output from these requests on my own computer. I ran the following command on my laptop:

HEAD http://www.vg.no/drfront/images/front-25-image-c801b964.jpeg?2203138e92f24003197c4af8f7185aa8

First I hit joanie (Varnish) that used wildbill (Apache) as backend. Everything looked fine:

200 OK
Cache-Control: must-revalidate
Connection: close
...
X-VG-WebCache: joanie
X-VG-WebServer: wildbill

I tried some more to see if I could hit fritz. When it finally hit fritz I could see a small difference in the output from HEAD:

200 OK
Cache-Control: must-revalidate
...
X-VG-WebCache: fritz
X-VG-WebServer: wildbill
Xonnection: close

As you can see the Connection: close header that joanie produced is now Xonnection: close. This was strange because fritz and joanie use the exact same configuration and they are the same version of Varnish. Was it the load balancers that did some funky stuff? At this point I was not aware that each load balancer only used one Varnish each. One of the sysadmins (who is on vacation) told me that one of the load balancers had just been installed, and that they each had their own Varnish. Since fritz was the only one that responded with the bogus header I logged into fritz and ran the HEAD command locally to skip the BIG-IP load balancer:

HEAD -H "Host:www.vg.no" http://localhost/drfront/images/front-25-image-c801b964.jpeg?2203138e92f24003197c4af8f7185aa8

This time I got the correct Connection: close header. It looked like the new load balancer was doing something very weird.

I’m not sure why an Xonnection: close header would cause cURL in PHP to hang forever but my guess is that when cURL does not get a Connection: close header it keeps the connection open. Another thing I haven’t figured out is why appfront waits forever for some data to read on some sockets. It might be that wildbill decides to drop the connection for some reason, and since appfront never gets a Connection: close it just keeps on waiting for data on the socket that wildbill has already closed.

To fix this problem temporarily I edited the /etc/hosts file on appfront so that whenever it makes a request to www.vg.no it will only hit joanie. It has been running for some time now without any processes hanging, so hopefully I have managed to figure out the source of the problem was.

I have not yet figured out why a request through BIG-IP results in an Xonnection: close header though. It might be a software bug or simply a spelling error in a config file. The problem is not really solved until both load balancers respond with the correct header, so I’m not done yet!

The process of finding this problem was new to me and I feel that I picked up some neat tricks along the way. If any of my readers know of a better way to hunt down these kind of problems, or could shed some more light on the problems I found feel free to leave a comment!

Some good resources I used along the way:

July 20, 2009

AOTW 27, 2009: The Clash – London Calling

Filed under: AOTW, Music — Tags: , , — christer @ 3:21 pm

I guess most of you have heard of this album: London Calling from 1979 by The Clash. According to Wikipedia this album was released on my birthday!

Favorite tracks:

Spotify: http://open.spotify.com/album/2FUf660uj42vZcgf2DvtBn

AOTW 26, 2009: Bigbang – Electric Psalmbook

Filed under: AOTW, Music — Tags: , , — christer @ 3:13 pm

Some Norwegian Rock and Roll for you this week. Electric Psalmbook is Bigbang’s second album and in my opinion the best.

My favorite tracks:

Spotify: http://open.spotify.com/album/2cjboZnbleOXeMpXrsLwPA

July 16, 2009

AOTW 25, 2009: Carcass – Heartwork

Filed under: AOTW, Music — Tags: , , — christer @ 8:45 pm

The AOTW award this week goes to the English metal band Carcass and their Heartwork album.

Favorite tracks:

Spotify: http://open.spotify.com/album/6BZjyJZKG6D1HiowOwIT95

AOTW 24, 2009: Abhinanda – Senseless

Filed under: AOTW, Music — Tags: , , — christer @ 7:36 pm

This week the award goes to the Swedish straight edge hardcore band Abhinanda and their Senseless album from 1994.

Favorite tracks:

  • Needle
  • Fallen
  • Dragon

Spotify: N/A

AOTW 23, 2009: Sunny Day Real Estate – Diary

Filed under: AOTW, Music — Tags: , , — christer @ 6:21 pm

This week the award goes to Sunny Day Real Estate’s first album: Diary. Wikipedia states that they were instrumental in establishing the Emo genre. That doesn’t make me an Emo does it?!

Favorite tracks:

  • In Circles
  • 47
  • Shadows

Spotify: N/A

AOTW 22, 2009: Tool – 10,000 Days

Filed under: AOTW, Music — Tags: , , — christer @ 5:14 pm

Another album from Tool on AOTW. This time it’s their 2006 album, 10,000 Days.

Favorite tracks:

  • Vicarious
  • Wings For Marie (Pt 1)
  • 10,000 Days (Wings Pt 2)

Spotify: N/A

July 14, 2009

AOTW 21, 2009: Bad Religion – Recipe For Hate

Filed under: AOTW, Music — Tags: , , — christer @ 7:10 pm

One of my alltime favorite bands revisit the AOTW award. This time it’s Recipe For Hate from 1993 by Bad Religion.

My favorite tracks are:

Spotify: http://open.spotify.com/album/6AXQoe0HIyZJ4TLlAmwG6d

AOTW 20, 2009: The Queers – Don’t Back Down

Filed under: AOTW, Music — Tags: , , — christer @ 7:02 pm

Some more punk rock on AOTW. This time it’s Don’t Back Down from 1996 by The Queers.

My favorite tracks:

Spotify: http://open.spotify.com/album/3MFo3MkpFL17xgKJG5tkR8

Older Posts »

Blog at WordPress.com.