System Reliability
Moderator: drgrussell

Site Admin
Posts: 426
Joined: Sat Feb 12, 2005 8:57 pm

System Reliability
Well, since yesterday the system seems to be having some problems. I can see no real reason for these, so I am trying a few experiments, using the 10.0.5.* server for testing. My thoughts so far:
- The problem seems to be that the main server loses contact with a subserver, and the subserver is marked as lost. However, once lost, the subserver comes back very quickly.
- One possibility is that the subserver daemon is just very busy. I have extended the timeout for waiting for subserver responses from 5 seconds to 10 seconds.
- The main server also tries to contact subservers which I have not yet switched on. I have removed such entries.
- Nmap could cause problems if it were to scan a virtual network subnet. I don't have black hole rules for machines which are not configured, so a machine such as 10.0.5.100 would be routed to a subserver and then back to the main server in a loop until the TTL expired. To see if this is the problem, I have implemented a split-horizon rule for 10.0.5.100 to block (and log) this.
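The subserver timeout change mentioned above can be sketched in Python. This is only an illustration of bounding both the connect and the read with one deadline; the function name, port, and one-line wire format are assumptions, not the real control protocol:

```python
import socket

def query_subserver(host, port, request, timeout=10.0):
    """Send a one-line status request and wait up to `timeout` seconds.

    The wire format (one request line, one reply line) is an assumption;
    the real control protocol is not shown in the post.
    """
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.settimeout(timeout)  # bound the read as well, not just the connect
        s.sendall(request.encode() + b"\n")
        return s.makefile("r").readline().strip()
```

If the subserver daemon is merely busy rather than dead, a longer timeout like this turns a false "lost subserver" into a slow-but-successful query.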
I think I have worked it out!
On the subservers I have written my own socket server (note 1). This is known as a "listener" socket. It has a queue for connections which have been ACKed but not yet processed. I had the queue length set to 10, as in the olden days that was the maximum value possible; however, I notice that these days 128 is the norm. Running out of queue produces an error message ("switching to SYN cookies") exactly like the ones I have been seeing. If I am right, the problem was that when I put in the new 64-bit servers I increased the number of virtual machines per server, which in turn increased the connections per second on each server and pushed the queue over its limit.
When the servers are next idle I will restart them with the new queue length... 95% belief that this fixes the issue. Of course I won't know for sure until the next lab.
Gordon.
Note 1 - Writing your own socket server is stupid. Never do this. Use a pre-written server for your own programs, such as apache/cgi or apache/soap. People like the Apache developers have spent years working out how to handle connections reliably, and they give you ways to debug connections easily! Do as I say and not as I do...
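The fix above is a one-argument change at the point where the listener is created. A minimal Python sketch of the idea (the real listener is a custom server, so the function name and port handling here are stand-ins):

```python
import socket

def make_listener(port, backlog=128):
    """Create a listening socket with an explicit accept-queue length.

    `backlog` is the number of fully established (ACKed) connections the
    kernel will hold before accept() drains them. 10 was the old ceiling;
    128 (or socket.SOMAXCONN) is the modern norm. When the queue overflows,
    the kernel falls back to SYN cookies or drops the handshake.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(backlog)   # the single value that was previously 10
    return srv
```

More virtual machines per server means more near-simultaneous connections, so a burst of 11+ handshakes between two accept() calls is enough to exhaust a backlog of 10.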
Having slept on it, I just cannot see how the short listen queue would produce any issues. Apparently when the queue is exhausted the system switches to SYN cookies, which seem fine to use and should not cause any problems. I have come back around to the idea that the network was just busy with people doing stupid things with nmap...
Just in case, I ran a stress test on one of the new servers, starting and running 20 virtual machines in parallel and querying the server 20 times a second for their status (normally each VM is queried every 5 seconds during booting and then every 60 seconds). The machine was not even overloaded... I managed to get a load of 4.5 for a few seconds (and it can go up to 8 before it is fully loaded). No problems. Even at this high load, queries came back in less than 1 second.
I have put in more monitoring routines to gather information should the problem happen again. The only remaining suspect is the OpenVPN tunnel itself. If the problem recurs I can bypass this tunnel for control messages within minutes.
I will keep my eyes open for the next few days, but hopefully this is the end of the problems.
Gordon.
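The stress test described above can be sketched roughly as follows. The query function here is a placeholder for the real status RPC, and the timings are the figures from the post, not measurements:

```python
import threading
import time

def query_vm(vm_id):
    """Stand-in for the real per-VM status query; replace with the RPC."""
    time.sleep(0.05)  # pretend the server answers in ~50 ms
    return "running"

def stress(n_vms=20, seconds=10, interval=1.0 / 20):
    """Fire one status query per `interval`, round-robin over the VMs,
    and record the slowest response seen (should stay under 1 second)."""
    worst = 0.0
    lock = threading.Lock()

    def one(vm):
        nonlocal worst
        t0 = time.monotonic()
        query_vm(vm)
        dt = time.monotonic() - t0
        with lock:
            worst = max(worst, dt)

    threads = []
    deadline = time.monotonic() + seconds
    vm = 0
    while time.monotonic() < deadline:
        t = threading.Thread(target=one, args=(vm % n_vms,))
        t.start()
        threads.append(t)
        vm += 1
        time.sleep(interval)  # 20 queries per second at the default interval
    for t in threads:
        t.join()
    return worst
```

At 20 queries per second this is 100x the steady-state query rate (one query per VM per 60 seconds across 20 VMs), which is why a sub-second worst case is a convincing negative result.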
Memory leak
The process which controls the virtual machines on each server seems to have a memory leak. It runs fine for about a week before slowing down more and more until the server becomes unreliable. I am 99% sure that this is the source of the problem.
The real solution is to fix the bug, but I just don't have time right now. I know which part of the code is causing the problem, but I don't remember the surrounding code well enough to make changes safely at this stage of the semester!
The quick solution is that I will reboot each server once or twice a week when it is quiet. I will reboot just before the assessment sessions too.
Sorry for the problems this semester...
Gordon.
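Until the bug is fixed, a leak like this can be caught early by watching the control process's resident memory. A Linux-only sketch (the monitoring here is my assumption; the 512 MiB threshold is made up and would need tuning to the real process):

```python
import os

def rss_mb(pid):
    """Resident set size of a process in MiB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # value is in kB
    return 0.0

def leaking(pid, limit_mb=512):
    """True once the control process has grown past the threshold,
    i.e. it is time to schedule the next quiet-hours restart."""
    return rss_mb(pid) > limit_mb
```

Run from cron, a check like this turns "reboot once or twice a week" into "restart only when the process has actually grown", and the logged RSS history also helps pin down which code path leaks.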
Yep, it looks like the issue was just a simple programming error by me. This led to a memory leak in a small part of the system, which grew to a CPU-crunching level within a few days of heavy use.
I fixed the leak and rebooted the servers yesterday. So far the memory usage seems normal.
Just in time for the assessments too!
Good luck in the tests next week.
Gordon.