Lag Wars - The Ongoing Saga

Topic author
Gobberwart
Developer
Posts: 3396
Joined: Wed Jun 20, 2007 12:41 am
Location: Melbourne, Australia

Lag Wars - The Ongoing Saga

Unread post by Gobberwart » Sun Oct 19, 2008 10:49 pm

An update to the ongoing lag issue.

I posted this update in-game today, reposting here for those who don't *cough* read the in-game updates.
Gobberwart wrote: Lag Wars Part IV: A New Hope. It is a period of civil war. Paradox spaceships, striking from... wait, wrong story. cough. ANYWAY, the logs I've been keeping have identified an issue that occurs at 10:10PM (game time) every day, resulting in about 6-7 minutes of hideous lag. I have confirmed this is nothing to do with Paradox, and have logged yet another request to the hosting provider that will hopefully save our people and restore freedom to the galaxy...
Mon, Oct 20, 2008 19:11 - Gobberwart
For those who are masochists, the post that I sent to our provider is also shown below. Sorry, it's a REALLY long post, especially with the log files, but gives an idea of what I'm trying to do to fix the problem.
Gobberwart wrote:It's been a while since I've contacted you regarding this issue, and things have definitely improved significantly over my early experience with HostV, but I've been monitoring it and there are still regular times when server load reaches an unacceptable level.

Sure, it's only for a few minutes each time, but it happens daily and people stop using my site when it happens. The nature of my site involves people logging in and remaining logged in, so a few minutes of "lag" every day is likely to cause my users a significant amount of grief over time and I'd like someone to find out what's going on and put a stop to it.

The following is part of a log I've been keeping of uptime output, and shows any time that server load exceeds 2.00 (anything over 1.5 results in noticeable lag on my site).

As you can clearly see from the logs, with a very few exceptions, there is something happening at approximately 07:40 to 07:46 GMT every day, and at approximately 12:01 GMT every couple of days that is resulting in significant, unacceptable load.

I should note that there is nothing of any kind running on my virtual server, other than the same things that run all day every day, so it's nothing I'm doing causing the issue.

Note: All times are GMT, so please convert to your timezone as appropriate.

Also note: We've already been through the "everything's fine, it must be your code" and "it's working properly now (several hours outside the times mentioned) so it must be fixed" conversations several times, and in each case it has been proven that there has been a problem with the host, and improvements have been made. Please ensure this is investigated properly. Refer to Harry if further historical information is required.

Log:

07:41:01 up 1 day, 2:34, 0 users, load average: 3.78, 1.54, 0.68
07:42:01 up 1 day, 2:35, 0 users, load average: 2.92, 1.77, 0.81
07:46:01 up 1 day, 2:39, 0 users, load average: 2.23, 1.73, 0.98
08:01:01 up 1 day, 3:21, 3 users, load average: 2.24, 0.95, 0.47
07:29:01 up 2 days, 2:22, 0 users, load average: 2.20, 1.30, 0.84
07:41:01 up 2 days, 2:34, 0 users, load average: 3.10, 1.53, 1.00
07:41:01 up 2 days, 3:01, 3 users, load average: 2.03, 0.81, 0.30
07:42:01 up 2 days, 2:35, 0 users, load average: 2.58, 1.70, 1.09
07:43:01 up 2 days, 2:36, 0 users, load average: 2.09, 1.73, 1.14
07:46:01 up 2 days, 2:39, 0 users, load average: 2.01, 1.82, 1.27
08:01:01 up 2 days, 2:54, 0 users, load average: 2.18, 1.41, 1.15
12:01:01 up 2 days, 4:03, 1 user, load average: 2.13, 0.68, 0.23
07:41:02 up 3 days, 2:34, 0 users, load average: 3.58, 1.22, 0.43
07:42:01 up 3 days, 2:35, 0 users, load average: 2.18, 1.25, 0.49
07:46:01 up 3 days, 23:48, 0 users, load average: 2.20, 1.26, 0.59
05:01:01 up 4 days, 21:03, 0 users, load average: 2.56, 1.51, 0.97
07:41:01 up 4 days, 2:34, 0 users, load average: 3.28, 1.59, 0.70
07:42:01 up 4 days, 2:35, 0 users, load average: 2.21, 1.61, 0.76
07:43:01 up 4 days, 2:36, 0 users, load average: 2.23, 1.69, 0.84
07:45:01 up 4 days, 2:38, 0 users, load average: 2.85, 1.99, 1.04
07:46:01 up 4 days, 2:39, 0 users, load average: 3.68, 2.49, 1.28
07:40:02 up 5 days, 2:33, 0 users, load average: 2.35, 1.18, 0.63
07:41:01 up 5 days, 2:34, 0 users, load average: 2.69, 1.55, 0.79
07:41:01 up 5 days, 23:43, 0 users, load average: 3.00, 0.99, 0.41
07:42:01 up 5 days, 23:44, 0 users, load average: 2.09, 1.14, 0.50
07:56:01 up 5 days, 2:49, 0 users, load average: 2.20, 1.10, 0.87
07:41:02 up 6 days, 2:34, 0 users, load average: 4.90, 2.01, 0.95
07:42:01 up 6 days, 2:35, 0 users, load average: 4.07, 2.35, 1.13
07:43:01 up 6 days, 2:36, 0 users, load average: 2.22, 2.14, 1.13
07:46:01 up 6 days, 2:39, 0 users, load average: 2.41, 2.10, 1.28
08:01:01 up 6 days, 3:21, 1 user, load average: 2.10, 0.88, 0.40
12:01:01 up 6 days, 4:03, 0 users, load average: 3.31, 1.08, 0.42
12:02:01 up 6 days, 4:04, 0 users, load average: 3.17, 1.49, 0.60
07:39:04 up 7 days, 2:32, 1 user, load average: 3.33, 1.16, 0.52
07:40:01 up 7 days, 2:33, 1 user, load average: 2.97, 1.48, 0.67
07:43:01 up 7 days, 2:36, 1 user, load average: 4.70, 4.18, 1.95
07:44:01 up 7 days, 2:37, 1 user, load average: 2.51, 3.66, 1.90
07:46:01 up 7 days, 2:39, 1 user, load average: 2.26, 3.14, 1.92
07:41:01 up 8 days, 2:34, 2 users, load average: 2.05, 1.05, 0.55
07:41:01 up 8 days, 3:01, 1 user, load average: 2.89, 1.08, 0.45
07:42:01 up 8 days, 2:35, 2 users, load average: 2.01, 1.24, 0.65
08:01:01 up 8 days, 2:54, 2 users, load average: 2.28, 1.26, 0.90
07:41:01 up 9 days, 2:34, 1 user, load average: 2.94, 1.32, 0.58
07:41:01 up 9 days, 3:01, 1 user, load average: 2.09, 0.69, 0.24
12:01:01 up 9 days, 7:21, 1 user, load average: 2.08, 0.59, 0.20
07:41:01 up 10 days, 2:34, 2 users, load average: 5.58, 2.18, 0.97
07:42:01 up 10 days, 2:35, 2 users, load average: 3.94, 2.44, 1.14
07:43:01 up 10 days, 2:36, 2 users, load average: 2.63, 2.31, 1.17
07:44:01 up 10 days, 2:37, 2 users, load average: 2.94, 2.45, 1.29
07:46:01 up 10 days, 2:39, 2 users, load average: 2.73, 2.49, 1.44
06:56:01 up 12 days, 1:49, 1 user, load average: 2.02, 1.09, 0.58
07:39:01 up 12 days, 8:22, 0 users, load average: 2.59, 0.85, 0.30
07:40:01 up 12 days, 2:33, 1 user, load average: 3.18, 1.32, 0.54
07:40:01 up 12 days, 8:23, 0 users, load average: 2.83, 1.26, 0.47
07:41:01 up 12 days, 2:34, 1 user, load average: 3.66, 1.87, 0.78
07:41:01 up 12 days, 3:01, 1 user, load average: 3.44, 1.57, 0.73
07:41:01 up 12 days, 8:24, 0 users, load average: 4.17, 2.02, 0.79
07:42:01 up 12 days, 2:35, 1 user, load average: 2.26, 1.80, 0.82
07:42:01 up 12 days, 3:02, 1 user, load average: 2.69, 1.71, 0.84
07:42:20 up 12 days, 8:25, 1 user, load average: 3.85, 2.42, 1.02
07:43:01 up 12 days, 8:26, 1 user, load average: 3.41, 2.51, 1.12
07:44:01 up 12 days, 8:27, 1 user, load average: 2.08, 2.31, 1.13
03:31:01 up 13 days, 4:14, 0 users, load average: 2.23, 1.33, 0.91
05:01:01 up 13 days, 5:44, 1 user, load average: 2.07, 0.76, 0.38
05:26:01 up 13 days, 46 min, 0 users, load average: 2.55, 1.70, 1.20
05:32:01 up 13 days, 52 min, 0 users, load average: 2.48, 1.79, 1.36
05:33:01 up 13 days, 53 min, 0 users, load average: 2.30, 1.83, 1.40
07:29:01 up 13 days, 2:49, 1 user, load average: 2.29, 1.00, 0.58
07:30:01 up 13 days, 2:50, 1 user, load average: 2.12, 1.22, 0.68
07:40:11 up 13 days, 8:23, 1 user, load average: 2.57, 1.02, 0.47
07:42:01 up 13 days, 8:25, 1 user, load average: 4.68, 2.60, 1.13
07:43:01 up 13 days, 8:26, 1 user, load average: 2.65, 2.41, 1.15
07:46:01 up 13 days, 3:06, 1 user, load average: 2.70, 1.73, 1.13
07:46:01 up 13 days, 8:29, 1 user, load average: 3.20, 2.47, 1.37
08:01:01 up 13 days, 8:44, 2 users, load average: 2.35, 1.34, 1.10
12:01:01 up 13 days, 12:44, 1 user, load average: 3.57, 1.42, 0.55
12:02:01 up 13 days, 12:45, 1 user, load average: 3.49, 1.80, 0.74
12:03:01 up 13 days, 12:46, 1 user, load average: 3.05, 1.99, 0.87
07:40:01 up 14 days, 3:00, 1 user, load average: 3.11, 1.60, 1.03
07:41:01 up 14 days, 3:01, 1 user, load average: 5.08, 2.49, 1.37
07:41:01 up 14 days, 8:24, 1 user, load average: 4.29, 1.84, 0.86
07:42:01 up 14 days, 8:25, 1 user, load average: 4.24, 2.25, 1.06
07:42:03 up 14 days, 3:02, 1 user, load average: 3.30, 2.45, 1.42
07:43:01 up 14 days, 3:03, 1 user, load average: 2.39, 2.36, 1.46
07:43:01 up 14 days, 8:26, 1 user, load average: 2.43, 2.13, 1.09
07:46:01 up 14 days, 3:06, 1 user, load average: 2.99, 2.38, 1.60
07:46:01 up 14 days, 8:29, 1 user, load average: 2.83, 1.99, 1.18
07:47:01 up 14 days, 3:07, 1 user, load average: 2.29, 2.34, 1.63
12:01:01 up 14 days, 12:44, 1 user, load average: 2.36, 0.87, 0.32
12:02:01 up 14 days, 12:45, 1 user, load average: 2.01, 1.07, 0.42
12:01:01 up 16 days, 12:44, 0 users, load average: 3.58, 1.14, 0.39
12:02:01 up 16 days, 12:45, 0 users, load average: 2.61, 1.31, 0.49
07:41:01 up 17 days, 8:24, 1 user, load average: 2.26, 0.77, 0.27
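For anyone who wants to tally a log like the one above, here is a minimal sketch in Python. It is not the script used to produce the log; the function name `spike_minutes`, the regular expression and the five sample lines (copied from the log) are illustrative assumptions. It keeps only lines whose 1-minute load average exceeds 2.00 and counts how often each HH:MM shows up.

```python
import re
from collections import Counter

# Captures the clock time (HH, MM) and the 1-minute load average
# from one line of uptime output.
UPTIME_RE = re.compile(r"^\s*(\d{2}):(\d{2}):\d{2} up .*load average: ([\d.]+),")


def spike_minutes(lines, threshold=2.00):
    """Count how often each HH:MM appears with a 1-minute load above threshold."""
    counts = Counter()
    for line in lines:
        m = UPTIME_RE.match(line)
        if m and float(m.group(3)) > threshold:
            counts[f"{m.group(1)}:{m.group(2)}"] += 1
    return counts


# A few lines copied from the log, used here only as sample input.
SAMPLE = [
    "07:41:01 up 1 day, 2:34, 0 users, load average: 3.78, 1.54, 0.68",
    "07:42:01 up 1 day, 2:35, 0 users, load average: 2.92, 1.77, 0.81",
    "07:41:01 up 2 days, 2:34, 0 users, load average: 3.10, 1.53, 1.00",
    "12:01:01 up 2 days, 4:03, 1 user, load average: 2.13, 0.68, 0.23",
    "07:41:02 up 3 days, 2:34, 0 users, load average: 3.58, 1.22, 0.43",
]

if __name__ == "__main__":
    # Most frequent spike times first.
    for minute, n in spike_minutes(SAMPLE).most_common():
        print(minute, n)
```

Run against the full log, a tally like this should make the 07:41 to 07:46 cluster and the occasional 12:01 entries stand out at a glance.
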
OK, that's that. There are still occasional irregular spikes, which I doubt we can do much about until we get a dedicated server. The database server also crashes a bit too often, although I strongly suspect that's related to the hosting provider trying to run "helpful" maintenance scripts all over the place, and that too should be fixed when we can finally afford our own dedicated server. Stay tuned for more really boring nerdy stuff.

Re: Known Issues

Unread post by Gobberwart » Mon Oct 20, 2008 9:40 pm

I've had a couple of responses from HostV which suggest that they're taking it seriously and will be around at the appropriate time tonight to (hopefully) figure out what's causing the problem. Let's see what happens.

Re: Known Issues

Unread post by Gobberwart » Mon Oct 20, 2008 11:45 pm

Just got the following from HostV:

--

I have closely monitored your server load and also main hardware node form 07:00GMT to 07:50GMT, and found that there is a load spike in the main hardware node also during the time.

We are going to have a detailed look in to the main hardware node to find the root cause and will update you after consulting with our Technical head.

--

This is probably going to take some time to resolve, but at least they're admitting there's a problem and taking some action to resolve it.

stroby
Forum Addict
Posts: 141
Joined: Thu Jul 03, 2008 11:13 am
Location: England

Re: Known Issues

Unread post by stroby » Wed Oct 22, 2008 10:04 am

Yay, this is a cliffhanger, so logically next we will get Lag Wars II

Re: Lag Wars - The Ongoing Saga

Unread post by Gobberwart » Thu Oct 23, 2008 1:12 pm

I decided to split this off from the known issues topic to keep that one simple.

Latest update: HostV closed the job off because I "hadn't replied within 24 hours". GRRRR. Apparently their automated system needs some work. I reopened the job and asked them to keep looking, because the lag happened at the same time again yesterday despite them making some minor changes to stuff. Granted, not as bad as it has been (max 2.73 load instead of the 4+ we've been seeing) but still too much and still during that specific time period.

Re: Lag Wars - The Ongoing Saga

Unread post by Gobberwart » Mon Nov 03, 2008 12:38 pm

Lag Wars Episode V - The Provider Strikes Back

It is a dark time for Paradox. Although the morale of the Host's support staff has been destroyed, HostV's troops have driven the lag from its hidden base and pursued it across a longer period of time...

It's been over a week since I last updated this thread, and I'd love to be able to report that it's because the problem is fixed. I can't.

There has been daily communication with HostV, basically consisting of:

HostV: We changed some things, let us know if the problem still occurs.
Me: Yes, it does.
HostV: OK, we'll try something else tomorrow.

Their changes HAVE made some difference, and as of yesterday, instead of getting a single daily load spike of 5-7 minutes at 07:40 GMT, we are now getting 3 load spikes of 3-5 minutes at 08:40, 08:50 and 09:00 GMT. Obviously the change to non-daylight saving time in the US has shuffled it by an hour, but whatever changes HostV have made have split the spikes up into smaller (and more annoying) chunks.
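To see why a clock change moves the spike by an hour in GMT, here is a small Python sketch. It assumes, purely for illustration, that the offending job runs at a fixed 03:40 local time on a server kept on US Eastern time; the thread never says which US timezone the host uses. US daylight saving ended on 2 November 2008, so the same local time maps to a different GMT time before and after that date.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the host's clock follows US Eastern time.
EDT = timezone(timedelta(hours=-4), "EDT")  # daylight saving time, until 2 Nov 2008
EST = timezone(timedelta(hours=-5), "EST")  # standard time, from 2 Nov 2008


def as_gmt(local_dt):
    """Return the GMT clock time (HH:MM) for a timezone-aware local datetime."""
    return local_dt.astimezone(timezone.utc).strftime("%H:%M")


# The same hypothetical 03:40 local job, before and after the clock change.
before = datetime(2008, 10, 20, 3, 40, tzinfo=EDT)
after = datetime(2008, 11, 3, 3, 40, tzinfo=EST)

print(as_gmt(before))  # 07:40
print(as_gmt(after))   # 08:40
```

Any job scheduled in local time on the host side would shift the same way, which fits the move from 07:40 to 08:40 GMT.
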

Yesterday there was a very lengthy lag occurrence that lasted from 06:00 GMT to 09:15 GMT, with a peak between 07:45 and 09:15 that made the entire site virtually unusable. That resulted in my sending a couple of vitriolic emails to HostV as well as placing a couple of posts on webhostingtalk.com.

As of 30 minutes ago, they are now talking about moving us to a different node with far fewer users on it, which is great but also means there will be some downtime on Paradox and is most likely only a temporary solution (i.e. until the node gets more users on it).

They assure me that the migration will go perfectly (shudder) and that everything will still work (egads) and that downtime will be minimal (eek), but I will not be at all surprised if something goes awry and needs to be fixed.

I will do my absolute best to minimise the downtime, and try to schedule it to happen right after intermission, but obviously I'm somewhat at their mercy. I'll keep you informed.

Re: Lag Wars - The Ongoing Saga

Unread post by Gobberwart » Mon Nov 03, 2008 3:16 pm

OK, I'm going to take this opportunity right now to inform you (see, I promised) that the migration is likely to be anything from a minor to a major disaster. I VERY much doubt it will go smoothly, because I have very little confidence in HostV to ensure that it does.

In relation to the previous post, I sent them an email detailing the best time for the migration (post-intermission tonight) and asking them to contact me via MSN Messenger to confirm and make sure we get it all synchronised etc. I also asked them in an even earlier email whether migration to a new node would result in an IP address change, because that will cause DNS issues and a potential 24-hour outage. They confirmed that it wouldn't. So far so good.

So a few minutes ago I received this email:

Hi stuart,

I'll try my best to contact you but if not we'll migrate you quickly and then your IP should change as well.

Of course, I've replied asking them to confirm the IP address situation and asked that they don't "do their best" to contact me. I've asked them not to migrate anything UNLESS they contact me.

But what I suspect will happen is as follows:

* At some arbitrary time, they'll just decide to turn off the production site and migrate it. Without contacting me.
* Worse still, they'll probably copy everything across to the new server BEFORE switching off the old one, because they'll have some bright idea that this will mean less downtime. In fact, it will mean that any changes made to the database between when they copy it and when it's turned off will be lost.
* The IP address will probably change, which means DNS will be stuffed, and the entire site (including forums, wiki, mail, the works) will be down for anything from a couple of hours to an entire day.
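
The copy-before-shutdown worry in the second point can be shown with a toy model in Python. Everything here is hypothetical: the "database" is just a dict, and `migrate`, `gold` and `turns` are made-up names for illustration, not anything from Paradox or HostV.

```python
def migrate(old_db, writes_after_copy, stop_old_first):
    """Toy migration of a dict-based 'database'.

    Returns (new_db, lost), where lost holds the values that exist on the
    old server at switch-off but never reached the new one.
    """
    old = dict(old_db)  # work on a copy so the caller's data is untouched
    if stop_old_first:
        # Old site is offline before the copy, so no writes arrive afterwards.
        return dict(old), {}
    # Copy taken while the old site is still live.
    new_db = dict(old)
    # Players keep writing to the OLD database until it is switched off.
    old.update(writes_after_copy)
    lost = {k: v for k, v in old.items() if new_db.get(k) != v}
    return new_db, lost


if __name__ == "__main__":
    new_db, lost = migrate({"gold": 100}, {"gold": 250, "turns": 3}, stop_old_first=False)
    print(new_db)  # {'gold': 100}
    print(lost)    # {'gold': 250, 'turns': 3}
```

Stopping the old site first costs some downtime, but in this model it is the only ordering where `lost` comes back empty.
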

I would LOVE to be proven wrong on this, but my confidence level is not at all high.

Re: Lag Wars - The Ongoing Saga

Unread post by stroby » Wed Nov 05, 2008 12:08 am

OK, just thought I'd say that at around 9:30am GMT I was kicked from the game with a maintenance message but have since been able to log back in. Was that it?

Re: Lag Wars - The Ongoing Saga

Unread post by Gobberwart » Wed Nov 05, 2008 12:11 am

No.. that's daily maintenance. ALWAYS happens at 09:30 GMT :D

Re: Lag Wars - The Ongoing Saga

Unread post by stroby » Wed Nov 05, 2008 12:18 am

Well, that's a silly time, some of us are awake then :P Actually, to be fair, if I could be asleep I would

Re: Lag Wars - The Ongoing Saga

Unread post by Gobberwart » Sat Nov 08, 2008 8:46 pm

Alrighty so it turns out that not only do the people at HostV appear incompetent, but they're also potentially liars. See, since the move, I reported a daily load spike at 05:45 GMT. The lag it caused wasn't major, but it was still worth reporting.

At 05:55 GMT yesterday I got a message from HostV to tell me that the problem should be fixed and to let them know. So I checked my logs... Odd, the logs stopped at 05:44. They're supposed to log every minute.

I checked further... No scheduled tasks at all had run since 05:44. Turns out both the cron service (scheduled tasks) and chkservd service (checks to see if other services are stopped) were turned off. Not by me.
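
Spotting where a once-a-minute log goes quiet is easy to automate. Here is a minimal Python sketch, assuming the log timestamps have already been pulled out as `YYYY-MM-DD HH:MM` strings. The helper name `find_gaps` and the sample timestamps are hypothetical; in particular the 07:15 resume time is made up for illustration, the real log only tells us the entries stopped at 05:44.

```python
from datetime import datetime, timedelta


def find_gaps(stamps, expected=timedelta(minutes=1)):
    """Return (last_seen, next_seen) pairs where consecutive log entries
    are further apart than the expected interval."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in stamps]
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > expected]


# Hypothetical per-minute timestamps; only the 05:44 stop comes from the post.
STAMPS = [
    "2008-11-08 05:42",
    "2008-11-08 05:43",
    "2008-11-08 05:44",
    "2008-11-08 07:15",
    "2008-11-08 07:16",
]

if __name__ == "__main__":
    for last_seen, next_seen in find_gaps(STAMPS):
        print("no entries between", last_seen, "and", next_seen)
```
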

So maybe they coincidentally just happened to stop at that time? Nope, logs show:

Nov 8 05:43:58 vps800 crond: crond shutdown succeeded
Nov 8 05:44:35 vps800 chkservd: chkservd shutdown succeeded


Which can ONLY mean that someone turned them off. Someone at HostV. So I turned the services back on and sent them an email asking why. You know, because Paradox will stop working properly with this stuff turned off.

Response:
We haven't stopped the cron service within box and i can confirm that its working fine in the server.

This, I'm afraid, is a bare-faced, blatant, asshat lie or a really stupid error. Makes me think all they did was check the current status (working, because I fixed it) and decided to make the rest up and/or lie about it. Either way, it's not true and demonstrates exactly why these guys are a bunch of mindless jerks who'll be the first against the wall when the revolution comes.

Makes me wonder, though, how many servers on that virtual hosting node got cron turned off, and how long it'll be before HostV turns it back on. And what else they get up to under the guise of "systems administration" when I'm not looking.

Re: Lag Wars - The Ongoing Saga

Unread post by Gobberwart » Sun Nov 09, 2008 2:42 pm

Turns out, HostV Joseph (who sent me the "We haven't stopped the cron service..." message) is not a liar; he appears to just be a bit thick*.

Response from HostV Harry:
Yes, I did stop cron and chkservd on all vps's on that particular day and then started it after removing the respective upcp and cpbackup cron jobs. But it was restarted on all vps's as well. Infact it was setup in such a way that after removing those entries, it would start chkservd and cron and check if cron is running. else start it again.

See, that's much better. Of course, his script didn't work, which is why cron was still down 90 minutes later, and I've had a discussion with him about how he can check/fix the problem on the other servers, but at least he's actually responded to the problem properly.

I'm now in contact with "Cirtex John" who, I believe, is the owner of the company, or at least has some sort of authority over these guys, who will follow up on the issue and possibly do all kinds of unspeakable things to Joseph as a result. Or not, I guess we'll see what happens.

*Just in case anyone from HostV/Cirtex is reading this, please note that I said Joseph *appears to be* a bit thick. Well, he does. Seriously.

Re: Lag Wars - The Ongoing Saga

Unread post by Gobberwart » Sun Nov 09, 2008 5:40 pm

Response received from Cirtex John:

Thanks Stuart, I'll talk to Joseph as well

Regards
John Xie


Poor Joseph.

For anyone considering starting a hosting company, may I suggest this is rule #1:

"Don't have systems administrators / ex-support people as customers. They KNOW what kind of shit you pull behind people's backs."

Re: Lag Wars - The Ongoing Saga

Unread post by stroby » Thu Nov 13, 2008 7:27 am

You made Joseph cry, I hope you're happy ;)

Re: Lag Wars - The Ongoing Saga

Unread post by Gobberwart » Thu Nov 13, 2008 10:32 am

I'm happy that lag o'clock appears to no longer be a daily event. Joseph = collateral damage :)