News/2017-11-22 Server Failure

From Bibliotheca Anonoma

Latest revision as of 01:05, 26 November 2017

  1. There is no data loss of any kind.
    • However, the server (Server 2) that hosts Desuarchive's cdn2 full images and Archived.moe is unable to turn back on and needs to be completely replaced.
    • We believe that the server that originally ran Desuarchive/RebeccaBlackTech's database (Server 1), while active now, has the same motherboard and could be vulnerable to similar issues in the future.
    • Drives and SSDs are fine. We're backing them up to offsite storage and will reuse them in a brand new server.
    • (UPDATE: 2017-11-25) Maintenance succeeded on Server 1 and it does not exhibit any similar issues. It will be used for Archived.moe from now on.
    • See the lengthy technical report below for further details.
  2. We will have to obtain a brand new server, and use cloud services to host Desuarchive/RebeccaBlackTech until the replacement for Server 2 can be procured (around a few weeks or a month).
    • Public data service may be degraded until we manage to put it all together and deploy the new server in about a month. But we remain committed.
    • We will also continue to archive 4chan threads during the downtime from a private Asagi instance, so in the future it can be merged back in.
    • (UPDATE: 2017-11-25) We are considering options for procuring the new server and will most likely just get a new motherboard with the same components, as it is likely only the motherboard at fault due to BGA reflow issues from lead-free solder. See technical report below for more details. A new report will be produced when we have identified good options.
  3. We have also released 4chan archiver dumps to the Internet Archive regularly (as well as offsite tape backup) and will upload an updated dump for the current data as is.
  4. If you'd like to start a 4chan archiver of your own, just read our guide and use 4plebs' fork of FoolFuuka along with Asagi (a rough setup sketch follows this list).
    • FoolFuuka/Asagi is very RAM-hungry. Unless it is optimized, archival will continue to be as expensive and unsustainable as it is today.
    • A server supporting publicly viewable, thumbs-only archival of all boards requires 64GB of RAM, a decent CPU newer than Sandy Bridge, and at least 500GB of storage for thumbnails (enough to hold all the thumbnails released on the Internet Archive from Archive.moe, 4plebs, and the like).
    • For Desuarchive, 20-40TB of space is necessary just to hold its full images to date (not even counting those from the Archive.moe dump).
  5. We wrote a detailed article about the history of 4chan archivers and the Fuuka archiver itself, which provides context for how we ended up with this responsibility.
    • Don't be surprised if you face the same travails that we and our many predecessors have. But don't hesitate to ask us for assistance either, because we have long experience with setting up these systems.
    • If you have any questions about setting it up, contact us at irc.rizon.net #bibanon, our Discord server, our Matrix/Riot.im channel, or at [email protected].
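
For the curious, the general shape of such a deployment is sketched below. This is only an outline: the 4plebs fork's location, the paths, the heap size, and the config file name are assumptions here, so follow our guide for the real steps.

  # Fetch FoolFuuka and install its PHP dependencies with Composer.
  git clone https://github.com/FoolCode/FoolFuuka.git foolfuuka   # or 4plebs' fork of this repo
  cd foolfuuka
  composer install            # finish configuration in FoolFuuka's web installer afterwards

  # Asagi is the Java scraper that feeds the MySQL database FoolFuuka reads.
  # Its board list, storage paths, and DB credentials live in a JSON config.
  java -Xmx4G -jar asagi.jar  # config file (asagi.json here) is an assumption; give it plenty of heap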

How did we end up with this responsibility?

The Bibliotheca Anonoma is an organization of anons aiming to archive Internet Folklife, with a focus on 4chan and imageboard culture. We meet daily at irc.bibanon.org #bibanon. We have conducted our work since 2012, drawing inspiration from the Yotsuba Society, Shii's Wiki, and the ancient Wikichan.

The output of our work can be found in this wiki, as well as the GitHub wiki. It has primarily entailed discovering and archiving stories and projects that could only have been made on the Internet, archiving websites to the Internet Archive that would otherwise have been lost, and producing custom web scraper scripts to facilitate the process.

Unfortunately, the websites that archived 4chan regularly failed or lost interest. The worst failure was the Chanarchive in 2014, whereby the sysadmin disappeared and the server eventually failed with no one to save it. Nearly as damaging were the botched transfer of Archive.moe upon its creation and, worse, its failure in October 2015, after which only dumps of data up to June 2015 were released.

Our opinion is that a community deserves to tell its own history and achievements in its own words, and not simply be remembered in the records or claims of others.

Thus, since then, we worked to ensure that the dumps of successor archivers were backed up on tape or the Internet Archive, so that anons would never have to suffer the loss of the Chanarchive again.

Over time, we collected the most dumps; made contact with, assisted, and encouraged many archivers; and gained the most experience with how to utilize the aging, RAM-hungry, but reliable FoolFuuka/Asagi archiver. We've written more about the history of the Fuuka/FoolFuuka/Asagi archiver here:

http://archiveteam.org/index.php?title=4chan

http://archiveteam.org/index.php?title=4chan#History_of_the_Fuuka_Archiver

As you will read, many organizations have risen to the task and many have failed. We are simply the ones who have lasted to today, assisted archivers in their efforts, or at least ensured that their dumps live on, backed up on tape or the Internet Archive.

But hey, you too can run a 4chan archiver, and if you talk to us or read our documentation, we can help you out if you have any issues (irc.rizon.net #bibanon). We hope that future endeavors can last as long as we have and pass down their data just as we do. Just don't be surprised if you face the same travails that we and our predecessors have.

How did we manage to work together with the Archivers?

It was not too far a jump from helping each other back up, to helping each other with server issues, to helping each other host in a real colocated datacenter.

  • Desuarchive/Desustorage - The admin's site took up the mantle of Archive.moe quite quickly. Unfortunately, it grew too big for his Comcastic home lab to host. Thus he had to move.
    • We also eliminated its ads after setting up our new servers.
  • RebeccaBlackTech/rbt.asia - The admin ran a decent Dell server setup in Finland for years as the longest-lasting 4chan Fuuka archiver. Unfortunately, since early 2017 that hardware has no longer been able to handle the site's archival load: every upgrade has been taken, every drive slot has been used, the home connection has hit the maximum bandwidth it can handle, the original Fuuka code was prone to constant failure, and the admin has historically had to drop boards and offload images to keep load down.
    • RebeccaBlackTech 2.0 has the text, thumbs, and images from the original Fuuka archiver, but updated to function as a FoolFuuka frontend merged with Desuarchive's Asagi archiver for greater efficiency.
    • RebeccaBlackTech never has and will never have ads.
    • Images are no longer offloaded from RebeccaBlackTech as they used to be on the old server.
  • Archived.moe - Content fetched by archived.moe caused one of their first upstream providers to null-route the connection abruptly. They also had trouble with the expense of hosting on VPS services, since they couldn't get the configuration they wanted at the price they needed.

How were the servers set up?

The Bibliotheca Anonoma operates two servers in two different colocation areas.

It is a nonstandard configuration, but they were put together incrementally in this way since early on we did not have the funds to get a complete conventional server at the beginning. In any case, the servers worked fine for a whole year. Until now.

Server 1 is currently working (it has since been repaired successfully). We are nevertheless migrating Desuarchive/RebeccaBlackTech to a VPS service temporarily. Server 2 is not working, and thus archived.moe is offline for now; it will be moved to Server 1.

Server 1 (Still works... For now)

Note: As of November 25th, 2017, Server 1's maintenance was completed and the offending drive was removed. It will be used to host Archived.moe instead. Desuarchive and RebeccaBlackTech/rbt.asia were still moved off with database and thumbs to VPS services as a contingency measure (and will be moved to the new server replacing Server 2 in the future), but we do not foresee any further problems with Server 1.

Server 1 operates wiki.bibanon.org, Desuarchive, and rbt.asia. It was produced for the purpose of operating Desuarchive.

RebeccaBlackTech/rbt.asia moved in after anounyym1's Finland home lab could no longer cope with the growing storage and RAM needs of its archives, and it is effectively merged with Desuarchive.

It is configured with the motherboard and case of a stock ZT Systems 1U SR-00847, which holds 4 drives but is not hotswappable.

Server 2 (Inoperable)

Server 2 operates desuarchive's cdn2 full size image storage and archived.moe. It was produced as a joint venture with archived.moe in Summer 2017.

Desuarchive's full size image archive was originally hosted at their home lab, but the connection was just plain Comcastic. Content fetched by archived.moe caused one of their first upstream providers to null-route the connection abruptly. They also had trouble with the expense of hosting on VPS services, since they couldn't get the configuration they wanted at the price they needed.

It uses the same motherboard as server 1, but we used a 12 bay 2U Chenbro RM23212 case for this motherboard (ditching the old non-hotswappable ZT Systems case) to provide it with enough space for a 20TB ZFS RAID, as well as an LSI 9211-8i HBA flashed to IT mode to support them.
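
As a point of reference, once an IT-mode HBA passes the raw disks through, a pool of that shape is one command. This is a sketch only; the pool name, raidz level, and device IDs below are placeholders, not our exact layout.

  # 5x5TB drives in single-parity raidz gives roughly 20TB usable.
  zpool create -o ashift=12 tank raidz1 \
      /dev/disk/by-id/ata-TOSHIBA_HDWE150_AAAA \
      /dev/disk/by-id/ata-TOSHIBA_HDWE150_BBBB \
      /dev/disk/by-id/ata-TOSHIBA_HDWE150_CCCC \
      /dev/disk/by-id/ata-TOSHIBA_HDWE150_DDDD \
      /dev/disk/by-id/ata-TOSHIBA_HDWE150_EEEE
  zpool status tank    # verify that all five disks show ONLINE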

What happened?

Note: One experienced sysadmin skilled with scrapping servers states that it may be a BGA reflow issue, as 90% of motherboard failures are (exacerbated by the use of lead-free solder): cycles of heating and cooling, expansion and contraction, flex the weak joints. This may be why there was no issue while the server was running or during quick maintenance, but once the server had cooled down for a while the problem would be triggered. It is one of those problems where any minor movement of the motherboard can cause it to function or not function, leading to the characteristic issue of no clear culprit for the failure.
Note: The risk of this sort of motherboard failure is cumulative over the years, especially as servers remain on 24/7/365 and especially if they have cooling problems, which is why most corporations rotate their servers every 3 years (whereas we bought ours at... 4 years old, for a good deal). Although it might be fixable with a reflow station and a skilled scrapper, we might as well just buy a new motherboard or a new server entirely. On the flip side, everything else (the CPU, RAM, SSDs, hard drives, case, and backplane) should work fine, which might save us money. Though we will probably look for a new-in-box motherboard now.

While conducting routine maintenance to add an HBA that would allow 5x5TB (empty) Seagate drives to be supported, Server 2 failed to turn on when more than four drives were attached. At other times, it powered on with fans at full blast but failed to POST to VGA output, and the keyboard lights did not respond as they normally would. We finally attempted to get it to a reduced state that could at least serve Archived.moe normally, but once we moved the rig back to the rack it once again would not POST to VGA output and continue to boot.

We're about as bewildered as you are. As Murphy's Law states, everything that could go wrong, went wrong with the hardware in the server (but at least there was no data loss).

For the techs, nearly every problem described below was the first time it had ever happened in their experience. There is no clear culprit for the problems with Server 2: it could be the drives, the CPU, the backplanes, or the power supply, but they all work in isolation.

We fear that the same thing may affect Server 1 if it is ever turned off, so we did not undertake the chassis replacement or other routine maintenance on Server 1, leaving it in a reduced state. Instead, we are taking the time to move Desuarchive and RebeccaBlackTech to a VPS service temporarily so we can attempt the maintenance on Server 1.

Here is our maintenance log:

(First, some definitions of server states):

  • POST to VGA Output - The normal functional state of the server. The machine shows all BIOS boot information and leaves the user at BIOS setup if there are no bootable devices, or at GRUB for Proxmox's Debian. The Num Lock LED lights up when the key is pressed.
  • Powered on, but failed to POST to VGA Output - There is no discernible output over Ethernet or VGA, and the Num Lock LED on the keyboard does not light up when the key is pressed. We made sure to test two VGA screens to reduce the chance of false positives.
  • Fails to Power On - The power button causes the fans to twitch a little at first, but the server does not continue to power on.
  • Power Button Disconnected - The power button jumper cable is not connected. We ensured early on that the power button is not the source of any power problems.

Server 2 (Inoperable)

Server 2 had operated without incident in the Chenbro 2U RM23212 hotswappable chassis since September 2017. We powered it off during Thanksgiving Break 2017, aiming to attach a second HBA so we could move the 2.5" SSDs to it, and to add 5 more 5TB Seagate drives directly to the motherboard's SATA ports (using the case's included SAS-to-SATA cable: red, SAS backplane to SATA motherboard). The second HBA, an LSI 1068e, worked fine without SAS cables attached, but had to be flashed to IT mode.
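
For anyone repeating the IT-mode flash, the procedure on the SAS2008-based 9211-8i follows this general pattern. This is a sketch: the firmware filenames and controller index are assumptions from LSI's standard firmware package, and the older 1068e uses a different flashing utility.

  sas2flash -listall                          # note the controller index and the SAS address
  sas2flash -o -e 6 -c 0                      # erase the existing (IR) firmware; this also wipes the SAS address
  sas2flash -o -f 2118it.bin -c 0             # write the IT-mode firmware
  sas2flash -o -sasadd 500605bXXXXXXXXX -c 0  # restore the SAS address recorded above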

To prevent the possibility of accidentally flashing the LSI 9211-8i HBA (which already has IT-mode firmware) with the wrong firmware, we removed the card and detached it from the SAS cables.

However, we discovered that the motherboard itself had an internal SAS port, hidden in a corner. Thus, we tried an internal SAS-to-4-SATA cable (blue: SATA drives to SAS motherboard port) to see if it would read a drive; if so, there would be no need for the second HBA card, and we could connect the motherboard directly to one Chenbro SAS backplane row.

My Server Won't Turn Back On?

The server did not turn on, despite the green motherboard LEDs lighting up after we plugged the power supply into the wall. We tossed out the internal SAS-to-4-SATA cable (which the techs reasoned was not too likely to be the cause of the problem), but the same thing happened. We then found that the power button had detached from the motherboard (the wire is a simple jumper cable and may have slipped off when we ejected the LSI 9211-8i card), and consulted the schematics to place the power button back in the correct position.

Unfortunately, the server still would not power on. The power button was tested and functioned correctly on another motherboard, and we worked with a datacenter tech to test the power supply. We also manually shorted the connection to simulate the power button, to no avail. The power supply was determined to be in good health, though it may have had some issues with the 5V rail or the capacitors. We thus replaced the power supply with another 600W power supply the techs had on hand, but the machine still failed to power on, so we brought back the original power supply.

To begin our troubleshooting, we ejected all the existing drives (the 5x5TB Toshiba drives, the SSDs, and the 2x4TB Toshiba drives) and packed them up (keeping the data safe from any issues with the server), then attempted to turn the machine on in a minimum configuration to isolate the problem. The machine did turn on with fans at full blast, but failed to POST to VGA output.

We next tested the RAM sticks one by one, powering the server back on each time, to see whether any stick would prevent it from POSTing to VGA output. Thankfully, the server consistently powered on and POSTed to VGA output, reaching BIOS setup, with every RAM stick we tested.

My Server Can't Power Up More Than 4 Drives?

We then inserted the new 5x5TB Seagate drives (one in the bottom row, 4 in the middle row) to test whether the server was back to normal. Unfortunately, it failed to power on again, though this time the fans twitched a little, showing that the server responded to the power button but still failed to turn on.

We tested power-on with the Seagate drives one by one (each time reaching POST to VGA output and BIOS setup) and found that any time more than three 3.5" drives were attached to the middle backplane, the system would fail to power on. (The server would still power on with all the 2.5" drives attached.) This was also replicated with the bottom backplane, and with one drive on the bottom and two on the middle backplane. (The top backplane had to be connected to the second HBA card to be read, so we did not test it at this point.)

To the techs, there was no discernible reason why a server that had functioned fine for months with 7 drives on a 600W power supply would now fail to power on with any more than 4 drives. We decided to replace the power supply with the exact model from the Chenbro case we had planned to use on Server 1, giving up on the prospect of upgrading Server 1 with it.

My Server Has Backplane Problems?

Unfortunately, the problem was not resolved. A tech recommended that we try the top backplane instead of the middle one, so we moved the SAS cable from the middle backplane to the top backplane. This time, we could power up more than 5 drives using the top and bottom backplanes together, which gave us all hope that we could at least restore the server to normal.

We then inserted the 5x5TB Toshiba drives holding the 20TB of Desuarchive images back into the server, in the same positions as the 5x5TB Seagate drives. Unfortunately, the server now powered on but failed to POST to VGA output, taking us back to square one. (We've checked, and the data on the drives was not affected by this test.)

We next had the techs provide an Ablecom power supply that had only 1 rail and 400W but was of higher quality. We disconnected the motherboard from the rest of the case and the backplane to reduce the possibility of them affecting the test. Unfortunately, the server still failed to POST to VGA output, even after a CMOS reset, with a single stick of RAM and then with paired sticks.

My Server Can't Use a CPU?

Finally, we ejected CPU 2 and moved all the RAM to the slots of CPU 1, as one tech suspected that the memory controller of CPU 2 was causing issues. Lo and behold, the server powered up with POST to VGA output, reaching BIOS setup.

As a result, we aimed to restore the server to at least a minimum working configuration to operate Archived.moe alone, leaving only the 2x4TB Toshiba drives. This would not serve Desuarchive full images from the 5x5TB Toshiba drives. We found that even though we could hotswap those drives in without the server turning off, the server would not POST to VGA output with them attached, which would leave the data in a precarious position we refused to risk.

Although the (empty) Seagate drives worked on power-up, one of them actually turned out to be non-functional, and we will RMA it. We cannot provide the levels of storage and redundancy needed without all 5 drives, nor can we easily add a drive later, so we elected not to put in the Seagate drives.

My Server Is Inoperable?

And unfortunately, lo and behold, after racking the server in this configuration, it failed to POST to VGA output on the VGA monitors on the crash cart. We lugged the server back to the workbench and could not get it to POST to VGA output there either.

At this point we gave up after 7 hours of fruitless labor and took Server 2 and all related equipment home, and did not visit Server 1.

At home, we immediately began backing up the data off the RAIDs to offsite storage. We also found that Server 2 would not POST to VGA output under any circumstances: not with either CPU or any RAM configuration, and not even with the original ZT Systems power supply. The motherboard that originally came with the Chenbro case lacks CPUs, so we are better off buying something decent from the start.
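
Since the pools are ZFS, the offsite sync itself is just snapshot-and-send once the drives are imported on another machine. A sketch follows; the pool, dataset, snapshot, and host names are placeholders.

  zfs snapshot tank/images@2017-11-24    # freeze a point-in-time copy of the image dataset
  zfs send tank/images@2017-11-24 | ssh backup.example.org zfs receive -u backuppool/images
  # later runs ship only the delta between snapshots:
  zfs send -i @2017-11-24 tank/images@2017-12-01 | ssh backup.example.org zfs receive backuppool/images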

Conclusion

In conclusion, there were multiple lurking and inexplicable problems with the entire build of Server 2. Neither we nor the many techs who assisted us could pinpoint the source of the problem, as each component works in isolation but not as a system, for some godforsaken reason.

Thus, we will replace Server 2 with a brand new quality server that will be built right from the start. This process of replacing Server 2 may take up to a month, and will be done at our own expense.

No data was lost in this server failure. The 5x5TB Toshiba hard drives holding ~20TB of data read correctly and have been moved out of Server 2 to begin syncing to our offsite backups, safeguarding the data. We also have a private Asagi archiver that will keep scraping in the meantime.
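
Folding that interim scrape back in later is mostly a database merge, roughly like the sketch below. The database and table names follow Asagi's per-board convention but are assumptions here, and the --insert-ignore approach relies on the unique post index to drop rows both instances captured.

  # Dump the interim archiver's board tables without schema statements,
  # then load them into the main database, skipping duplicate posts.
  mysqldump --no-create-info --insert-ignore interim_asagi a a_threads a_images > interim_a.sql
  mysql desuarchive < interim_a.sql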

Server 1 (Still active... For now)

Note: As of November 25th, 2017, Server 1's maintenance was completed and the offending drive was removed. It will be used to host Archived.moe instead. Desuarchive and RebeccaBlackTech/rbt.asia were still moved off with database and thumbs to VPS services as a contingency measure (and will be moved to the new server replacing Server 2 in the future), but we do not foresee any further problems with Server 1.

Server 1 is still active and hosting Desuarchive and RebeccaBlackTech/rbt.asia in a reduced state. We will begin moving both off to a VPS service temporarily starting November 24th, and will likely move archived.moe here.

Around October 2017, one 5TB drive began to exhibit 8 bad sectors. Although the number of bad sectors did not increase and the data was otherwise recovered by ZFS, this made it crucial for us to eventually replace the drive and rebuild the RAID.

The following warning/error was logged by the smartd daemon:

Device: /dev/sdf [SAT], 8 Currently unreadable (pending) sectors

Device info:
TOSHIBA HDWE150, S/N:17NEKFH2F57D, WWN:5-000039-78bd82d2b, FW:FP2A, 5.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
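
That further investigation looks roughly like this (the device path comes from the log above):

  smartctl -a /dev/sdf | grep -i -e pending -e reallocated   # check the raw bad-sector counters
  smartctl -t long /dev/sdf                                  # start an extended offline self-test
  smartctl -l selftest /dev/sdf                              # read the result once the test finishes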

In a case with hotswappable bays, it would be simple to replace the drive without turning the server off. Unfortunately, the ZT Systems chassis that this motherboard came with did not include such conveniences.

Thus, during Thanksgiving Break 2017 we planned to replace the chassis with a Chenbro hotswappable case as was done with Server 2, replace the 5TB Toshiba drive with 8 bad sectors, and add an HBA to increase the number of drives Server 1 can handle.
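
With a hotswap chassis in place, the rebuild itself reduces to a single ZFS replace (a sketch; the pool name and device paths are placeholders):

  zpool offline tank /dev/sdf          # detach the failing 5TB drive from the pool
  # physically swap in the replacement drive, then:
  zpool replace tank /dev/sdf /dev/sdg
  zpool status tank                    # watch the resilver run to completion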

Unfortunately, due to Server 2's inability to power back on, we were unable to complete the change and left Server 1 as-is during Thanksgiving Break.

However, we may still attempt to replace the 5TB drive and move the 480GB SSD supporting archived.moe here. Hopefully, the issue in Server 2 will not affect Server 1. To facilitate this gambit, we will move Desuarchive and RebeccaBlackTech/rbt.asia to a VPS service so they will not be affected.