2017-07-01: Server Downtime / Database Restore / Lost Posts

Status
This old topic is closed. If you want to reopen this topic, contact a moderator using the "Report Post" button.
Switches things on and off again
Joined 2000
Paid Member
Tonight our database suffered data corruption that could not be repaired.

We attempted several different methods to salvage the data, so that nobody would lose the hard work they put into posting. Unfortunately, none of these methods was able to bring the database back into a stable state.

We then reviewed our options, which included two different forms of database backup: one from ~4 hours before the corruption and one from ~10 hours before. We chose the more recent backup (a text-based SQL backup), both because it was newer and because restoring from SQL rebuilds the database "from the ground up", with no possible corrupted vestiges lurking in the background that might cause another issue down the line. We then (slowly) restored everything from this SQL backup.

The good news is that the backup appears to have worked perfectly; the bad news is that we have lost ~4 hours of posts. We have since also upgraded our database server software to the latest patch level, which will hopefully eliminate whatever caused the occasional database hangs we've experienced over the past few months, as well as tonight's data corruption. It's also possible that the binary database files themselves contained problems created a long time ago that came to a head tonight, in which case restoring from SQL will let us start with a "clean slate". Hopefully this approach will lead to improved stability in the future.

I will use this chance to perform a review and do a few other upgrades that have been lagging for various reasons.

This post will be updated as more news arrives and as I watch to make sure things "settle in" ok.

It was great to see that the first of several layers of backups worked successfully in this emergency. However, I do extend my apologies to those affected who lost their posts, and I will do everything I can to ensure this doesn't happen again.
 
Just another Moderator
Joined 2003
Paid Member
Thanks Jason, four hours is not too bad really (24 hours is not uncommon in situations like this). I suspect it was not a lot of posts. I've sent you a couple of screenshots which show the front page from just after you got it back online the first time. There were only 32 posts in the three hours prior to that time.

Tony.
 
Just another Moderator
Joined 2003
Paid Member
Below is an almost complete list of what was lost, minus a few posts that were made between Jason posting the original outage message and the site going down again (which I didn't screenshot). I've checked, and the LED drivers post in the second attachment is still there, but the one above it is not.

Tony.
 

Attachments

  • lost_posts1.png (174.3 KB)
  • lost_posts2.png (53.8 KB)