System Reliability

Planned downtime and current issues.

Moderator: drgrussell

drgrussell
Site Admin
Posts: 426
Joined: Sat Feb 12, 2005 8:57 pm
Are you a robot or a human?: Human

System Reliability

Post by drgrussell » Tue Nov 11, 2008 1:09 pm

Well, since yesterday the system seems to be having some problems. I can see no real reason for these, so I am trying a few experiments. I am using the 10.0.5.* server for testing. My thoughts so far:
  • The problem seems to be that the main server loses contact with a subserver and marks it as lost. However, once lost, the subserver comes back very quickly.
  • One possibility is that the subserver daemon is just very busy. I have extended the timeout for waiting for subserver responses to 10 seconds (from 5 seconds); see the sketch after this list.
  • The main server also tries to contact subservers which I have not yet switched on. I have removed such entries.
  • Nmap could cause problems if it were to scan a virtual network subnet. I don't have black-hole rules for machines that are not configured, so packets for machines such as 10.0.5.100 would be routed to a subserver and then routed back to the main server in a loop until the TTL expired. To see if this is the problem, I have implemented a split-horizon rule in 10.0.5.100 to block (and log) this.
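Regarding the timeout change above, here is a minimal sketch of how the wait might be extended, assuming the main server talks to each subserver over an ordinary blocking TCP socket (the real control code is not shown in this thread):

[code]
/* Sketch only: extend the wait for a subserver reply from 5 s to 10 s.
 * Assumes an ordinary blocking TCP socket; the real control code is
 * not shown in this thread.                                           */
#include <sys/socket.h>
#include <sys/time.h>

static int set_reply_timeout(int sock, long seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    /* recv() on this socket now gives up (EAGAIN/EWOULDBLOCK) instead
     * of blocking forever when a subserver daemon is slow to answer.  */
    return setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* e.g. set_reply_timeout(sub_sock, 10);   previously 5 */
[/code]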
I will continue to monitor the situation and see if this server performs better than the others.

drgrussell
Site Admin
Posts: 426
Joined: Sat Feb 12, 2005 8:57 pm
Are you a robot or a human?: Human

Post by drgrussell » Tue Nov 11, 2008 3:51 pm

I have blocked VMs from sending packets out on anything other than ports 53 and 80 on the 64-bit servers, and rolled out the firewall changes to all 64-bit machines.

Still keeping my eye on things...

drgrussell
Site Admin
Posts: 426
Joined: Sat Feb 12, 2005 8:57 pm
Are you a robot or a human?: Human

Post by drgrussell » Tue Nov 11, 2008 8:51 pm

I think I have worked it out!

On the subservers I have written my own socket server (note 1). This is known as a "Listener" socket. It has a queue length for connections which have been ACKed but not yet processed. I had this set to 10, as in the olden days this was the maximum value possible. However, I notice that these days 128 is the norm. Running out of queue produces an error message (switching to SYN cookies) exactly like the ones I have been seeing. If I am right, the problem was that when I put in the new 64-bit servers I increased the number of virtual machines per server, which in turn increased the number of connections per second on each server, and that just pushed the queue over its limit.
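To make the queue idea concrete, here is a minimal sketch of a listener with an explicit backlog, assuming a plain BSD-socket TCP server; the port and error handling are hypothetical, not taken from the real subserver code:

[code]
/* Sketch only: a "Listener" socket and its connection backlog.        */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(5000);      /* hypothetical port */
    bind(lsock, (struct sockaddr *)&addr, sizeof(addr));

    /* The backlog is the queue of connections the kernel has already
     * completed (ACKed) but the program has not yet accept()ed.
     * It was 10 here; 128 (or SOMAXCONN) is the usual modern value.   */
    listen(lsock, 128);                      /* was listen(lsock, 10) */

    for (;;) {
        int conn = accept(lsock, NULL, NULL);
        if (conn < 0)
            continue;
        /* ... read the request and reply ... */
        close(conn);
    }
}
[/code]

Note that the backlog you ask for can also be capped by the kernel (net.core.somaxconn), so raising the value in the code may not be the whole story.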

When the servers are next idle I will restart them with the new queue length... 95% belief that this fixes the issue. Of course I won't know for sure until the next lab :cry:

Gordon.

Note 1 - Writing your own server for sockets is stupid. Never do this. Use a pre-written socket server for your own programs, like Apache/CGI or Apache/SOAP. People like the Apache developers have spent years working out how to handle connections reliably, and they give you ways to debug connections easily! Do as I say and not as I do...

drgrussell
Site Admin
Posts: 426
Joined: Sat Feb 12, 2005 8:57 pm
Are you a robot or a human?: Human

Post by drgrussell » Wed Nov 12, 2008 11:22 am

Having slept on it, I just cannot see how the short listen queue would produce any issues. Apparently when the queue is exhausted the system switches to SYN cookies, which seem fine to use and should not cause any problems. I have come back around to the idea that the network was just busy with people doing stupid things with nmap...

Just in case, I ran a pressure test on one of the new servers, starting and running 20 virtual machines in parallel and querying the server 20 times a second for their status (the normal query interval per VM is 5 seconds during booting and then every 60 seconds). The machine was not even overloaded... I managed to get a load average of 4.5 for a few seconds (it can go up to 8 before it is fully loaded). No problems. Even at this load, queries came back in less than 1 second.
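For anyone curious, the pressure test was roughly shaped like the sketch below; the host, port and query string are hypothetical stand-ins, not the real control protocol:

[code]
/* Sketch only: fire a status query at the server about 20 times a
 * second and time the round trip.  Host, port and query string are
 * hypothetical; the real test harness is not shown in this thread.  */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(5000);                   /* hypothetical port */
    inet_pton(AF_INET, "10.0.5.1", &srv.sin_addr);  /* hypothetical host */

    for (;;) {
        double start = now_seconds();

        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(s, (struct sockaddr *)&srv, sizeof(srv)) == 0) {
            char reply[256];
            write(s, "STATUS\n", 7);                /* hypothetical query */
            read(s, reply, sizeof(reply));
            printf("round trip: %.3f s\n", now_seconds() - start);
        }
        close(s);

        usleep(50 * 1000);                          /* roughly 20 Hz */
    }
}
[/code]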

I have put in more monitoring routines to gather information should the problem happen again. The only thing left to suspect is the OpenVPN tunnel itself. If the problem reoccurs I can bypass this tunnel for control messages within minutes.

I will keep my eyes open for the next few days, but hopefully this is the end of the problems.

Gordon.

drgrussell
Site Admin
Posts: 426
Joined: Sat Feb 12, 2005 8:57 pm
Are you a robot or a human?: Human

Memory leak

Post by drgrussell » Mon Nov 24, 2008 3:49 pm

The process which controls the virtual machines on each server seems to have a memory leak. It runs fine for about a week before slowing down more and more until the server becomes unreliable. I am 99% sure that this is the source of the problem.

The real solution is to fix the bug, but I just don't have time right now. I know which part of the code is causing the problem, but I don't remember enough of the surrounding code to make changes safely at this stage of the semester!

The quick solution is that I will reboot each server once or twice a week when it's quiet. I will also reboot them just before the assessment sessions.

Sorry for the problems this semester...

Gordon.

drgrussell
Site Admin
Posts: 426
Joined: Sat Feb 12, 2005 8:57 pm
Are you a robot or a human?: Human

Post by drgrussell » Sat Nov 29, 2008 9:57 pm

Yep, it looks like the issue was just a simple programming error by me. This led to a memory leak in a small part of the system, which grew to a CPU-crunching level after a few days of heavy use.
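For the curious, this kind of leak typically has the shape sketched below. This is purely illustrative C, not the actual control-process code; the structure and function names are made up:

[code]
/* Purely illustrative: how one missed free() on a per-request
 * allocation grows into a serious leak over days of heavy use.
 * NOT the actual control-process code.                          */
#include <stdlib.h>
#include <string.h>

struct vm_status {
    char name[64];
    int  running;
};

static struct vm_status *query_vm(const char *name)
{
    struct vm_status *s = malloc(sizeof(*s));   /* allocated per query */
    strncpy(s->name, name, sizeof(s->name) - 1);
    s->name[sizeof(s->name) - 1] = '\0';
    s->running = 1;                             /* pretend we asked the VM */
    return s;
}

void handle_status_request(const char *name)
{
    struct vm_status *s = query_vm(name);
    if (!s->running) {
        return;        /* BUG: this early return skips the free() below,
                          so every such request leaks one vm_status      */
    }
    /* ... report the status ... */
    free(s);           /* fix: free on every path out of the function */
}
[/code]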

I fixed the leak and rebooted the servers yesterday. So far the memory usage seems normal.

Just in time for the assessments too!

Good luck in the tests next week.

Gordon.

