Hard Drive Test
It is recommended to stress test all hardware capabilities of the server before using it in production. This can take a day for the long SMART test, so do it as soon as you get the server.
For the most part, the RAM, the motherboard, and the CPU should be fine to use, even though they're desktop models; just run a self-test on everything.
The one thing you should be really worried about is the hard drives. Make sure to RAID them, make good backups, and alert support about any bad sectors or bad SMART reports.
I got one of the 31 EUR servers a few days ago. Both disks had a lot of errors. Ticket to Hetzner with smartctl output, they replaced them 1 hour later, downtime 30 minutes. All this on a Friday evening (those living in Germany know what I mean ;-)
You can run SMART tests to check the usage lifetime, and how healthy the hard drives are.
Warning: SMART tests are not an exact gauge of drive health. A rule of thumb is that if SMART finds errors, the hard drive may be dying; but even if it doesn't, the drive may still have unreported problems.
Run a short SMART test for a quick check. The self-test log also records how many hours the drive had been powered on when the test ran.
smartctl -t short /dev/sda
Run a longer SMART test to gauge the health of the drive.
smartctl -t long /dev/sda
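Self-tests run in the background on the drive, so the results have to be read back afterwards. The following is a minimal sketch, assuming smartctl prints its usual 10-column attribute table with the raw value in the last column; the helper name power_on_hours is made up for illustration, and some drives report this value in a different format.

```shell
# Reads `smartctl -A` output on stdin and prints the raw Power_On_Hours value.
# Assumes the usual 10-column attribute table, where the raw value is column 10.
power_on_hours() {
    awk '$2 == "Power_On_Hours" { print $10 }'
}

# Typical use once the self-test has finished (needs root and a real drive):
#   smartctl -l selftest /dev/sda          # list of past self-tests and results
#   smartctl -A /dev/sda | power_on_hours  # hours the drive has been powered on
```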
Understanding SMART Reports
- Reallocated Sector Count - sectors that went bad in the past and were replaced by spares; this may or may not have caused problems, since drives also replace weak sectors as a precaution before they ever cause trouble.
- Current Pending Sector - THE MOST DANGEROUS SMART attribute; this should ALWAYS BE ZERO or you have severe problems! This can be either weak electric charge with insufficient ECC correction ability -OR- it can be physical damage. Writing to this sector will solve the problem; if there was physical damage, it will be reallocated to a reserve sector and the Reallocated Sector Count raw value will increase.
- UDMA CRC Error Count - cabling errors; if this is higher than 1000 and increasing you have severe cabling problems; under 100 does not need to trigger any alarm. Technically this means the receiving end received a corrupted version of the data that was sent by the transmitter; the corruption was detected by CRC, so the data is NOT accepted and the request is sent again. Unless you see very high values or the count keeps increasing steadily, this usually is not a big issue.
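The three attributes above can be checked in one pass. This is only a sketch: the function name check_smart is hypothetical, the thresholds are the rules of thumb from this list rather than official limits, and it assumes the smartmontools attribute names (Reallocated_Sector_Ct, Current_Pending_Sector, UDMA_CRC_Error_Count) with a plain integer raw value in column 10.

```shell
# Reads `smartctl -A` output on stdin and prints one verdict line per attribute.
# Thresholds follow the rules of thumb above; they are not official limits.
check_smart() {
    awk '
        $2 == "Reallocated_Sector_Ct" {
            if ($10 + 0 > 0) print "NOTE: " $10 " reallocated sectors"
            else             print "OK: no reallocated sectors"
        }
        $2 == "Current_Pending_Sector" {
            if ($10 + 0 > 0) print "CRITICAL: " $10 " pending sectors"
            else             print "OK: no pending sectors"
        }
        $2 == "UDMA_CRC_Error_Count" {
            if      ($10 + 0 > 1000) print "WARN: " $10 " CRC errors, check cabling"
            else if ($10 + 0 >= 100) print "NOTE: " $10 " CRC errors, watch for growth"
            else                     print "OK: " $10 " CRC errors"
        }
    '
}

# Usage (needs root and a real drive):
#   smartctl -A /dev/sda | check_smart
```

A single reading cannot show whether the CRC count is still increasing, so compare it against an earlier run before blaming the cable.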
Fixing Current Pending Sectors
If you notice any Current Pending Sectors, be afraid. This means that some sectors currently in use have failed.
What you must do depends on whether you have important data on the drive or not:
- Important Data is on the Drive - Back up all the data you can to another hard drive; some of it may already be irrevocably damaged.
  - If some important data is corrupt, stop using the drive entirely and send it to a data recovery service.
- Nothing of Value was Lost - Back up any data you need to another hard drive.
  - Make sure you grab absolutely everything that may be necessary; we will wipe the drive in the next step.
If you don't have any data you need to recover, you can zero out the entire drive. When the bad sectors are written to, the hard drive controller will automatically disable those sectors, remap them to spare sectors, and move on (these are represented in the Reallocated Sector Count).
- Windows - use Western Digital's zero out system.
- Linux - Stackexchange - Zero Out bad blocks
- hdparm to figure out the block size:
sudo hdparm -I /dev/sdb | grep -i physical
- badblocks read/write test
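On Linux, the two steps above might be combined roughly as follows. This is a sketch under assumptions, not a tested procedure: the helper names physical_sector_size and badblocks_command are made up, the parser assumes hdparm -I prints a line like "Physical Sector size: 4096 bytes", and the badblocks command is only printed so it can be reviewed first, because the -w write test destroys all data on the drive.

```shell
# Reads `hdparm -I` output on stdin and prints the physical sector size in bytes.
# Assumes a line of the form "Physical Sector size:   4096 bytes".
physical_sector_size() {
    awk 'tolower($0) ~ /physical sector size/ { print $(NF - 1) }'
}

# Prints (does not run) a destructive badblocks write test for a device.
# -b sets the block size, -w overwrites every block, -s shows progress,
# -v reports errors verbosely.
badblocks_command() {
    dev="$1"; size="$2"
    echo "badblocks -b $size -wsv $dev"
}

# Usage (as root, on an unmounted drive with nothing left to save):
#   size=$(hdparm -I /dev/sdb | physical_sector_size)
#   badblocks_command /dev/sdb "$size"    # review, then run the printed command
```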
Once you're finished, run a SMART test again. There should be no more Current Pending Sectors, and the Reallocated Sector Count will be higher if any sectors had to be remapped. You should still be very cautious about using this drive, as bad sectors may spread.