jstanden
04-29-2008, 12:41 PM
Around 1:30PM Pacific on Monday (April 28th 2008) one of our Cerb4 Hosted machines started writing garbage data all over its RAID drives. The corruption turned MySQL database files into directories filled with random fragments of the disk (website files, etc.). It was a complete mess.
We lost SSH access to the box around that time and initiated a remote reboot. When the machine came back up the primary partition was mounted read-only (which pretty much always means no good).
A full file system check is needed, but we decided to take the proactive step of moving all hosted sites from that machine to a standby server rather than letting the corruption get worse.
Our off-site backups didn't contain corruption, though a few of the on-server backups did. We do our weekly off-site backups early on Sunday mornings (when the traffic is lowest). In the worst case, we had a full database for each site that was missing only Monday morning's data.
As soon as we realized what was going on, we shut down the scheduled tasks on that machine so we wouldn't be losing your e-mail to a corrupt environment. This buffered up mail in your POP3 accounts. This is our standard procedure, which gives you the ability to access your critical e-mail directly from your POP3 mailbox (with a “leave a copy of mail on server” client).
The way Cerb4 works, the corrupt database tables also caused the helpdesk parsers to lock (as designed), and they started buffering downloaded e-mail to disk. This, combined with turning off scheduled tasks, should have prevented mail loss during the day.
We took a little longer than we would have liked in restoring some helpdesks this afternoon because we wanted to maximize the amount of recent data we were recovering, instead of just falling back to the backups when it wasn't always necessary (and would lose recoverable data). For some sites, the only corruption was on an infrequently changed table like 'worker' which we could restore independently. The majority of sites on that machine had no corruption at all (and the busiest sites were affected the most).
We currently have all sites and databases restored on a fresh server. We're importing the /storage directories which contain attachments and unparsed e-mail. These imports won't affect your ability to use your helpdesk, but you may not be able to open some older attachments until the process is done. We're establishing another standby server.
After things settle down a bit, we'll reflect on this 12-hour marathon and think about what we can do to get sites back in action more quickly after a total machine failure. The off-site backups will continue to play a big role. Cerb4 is “componentized” enough that our hosted sites are incredibly portable on our network, but we also need to have our standby machines already mirroring the DNS for a seamless changeover.
If you're still having trouble accessing your hosted site, it's likely just a DNS caching issue from the server migration. Make sure your *.cerb4.com site is resolving to the IP “72.232.216.42”.
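If you'd rather script that check than read the output of ping or nslookup by eye, here is a minimal Python sketch. The function name `resolves_to` and the injectable `lookup` parameter are just illustrative choices, not anything that ships with Cerb4; the only value taken from this post is the IP address.

```python
import socket

# New server address quoted in the post above.
EXPECTED_IP = "72.232.216.42"


def resolves_to(hostname, expected_ip=EXPECTED_IP, lookup=socket.gethostbyname_ex):
    """Return True if one of the IPv4 addresses for hostname is expected_ip.

    `lookup` defaults to the standard resolver call and can be swapped out
    (for example in a test) with any callable returning the same
    (name, aliases, addresses) tuple.
    """
    try:
        _name, _aliases, addresses = lookup(hostname)
    except OSError:
        # The name did not resolve at all (socket.gaierror and socket.herror
        # are both subclasses of OSError).
        return False
    return expected_ip in addresses
```

Call it with your own site name, e.g. `resolves_to("yoursite.cerb4.com")`, where "yoursite" is a placeholder. A False result most likely means your resolver is still serving the cached address of the old machine, which should clear once the old record expires.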
Thanks to everyone for your patience during your calls, e-mails and livehelp chats today. This isn't a situation we expect to go through very often (in fact, it was a complete nightmare), but it's at least slightly comforting to know our off-site backup process gave us day-old data as our worst-case option.