News/2017-12-03 Building and Rebuilding: Difference between revisions
Antonizoon (talk | contribs) |
Revision as of 04:34, 4 December 2017
Desuarchive and RebeccaBlackTech/Rbt.asia now serve text and thumbs again with upgraded Sphinx search servers, allowing /int/ search and allowing Archive.moe text dumps from October 2015 to be imported. Archived.moe has also returned to full service but will have a short maintenance downtime in a few weeks to relocate to a new server, which we are putting together.
To restore the full image server (cdn2.desuarchive.org), the Bibliotheca Anonoma has purchased and successfully tested a new $300 ASRock EP2C602-4L/D16 motherboard to replace the Foxconn T2491601 motherboard that failed on Server 2. Since it is a Xeon E5-2600 LGA2011 and DDR3 compatible motherboard, the majority of our parts and hard drives were compatible and the configuration was tested to work. Depending on how fast the parts arrive, deployment could occur on 2017-12-09 or 2017-12-16.
To replace our underpowered power supply, we also need to purchase a $150-300 2U server-grade 600W power supply with at least 6 molex and 5 SATA power (to reuse our current 12-bay server chassis), or alternatively a $470 24-bay Supermicro server chassis with redundant 800W power supplies.
At this time, we can bear the cost of those parts ourselves, but we don't have much funding left for future upgrades or hard drive replacements. To date we've collectively put hundreds of hours of volunteer work and around $5000 into this equipment, not including monthly fees of $250-300, and have overcome many obstacles that would have vanquished our predecessors.
Notice that once Server 3 is deployed, there's enough room and resources to comfortably fit more expansions, such as providing more RAM for another sharded search server (even more RAM hungry than the Asagi scraper), the ability to archive many future boards of 4chan, or providing hard drives to import all full images from Archive.moe (dumps from Oct. 2015, no relation to Archived.moe) and other dumps.
Thus, if you would like to donate specifically to help with future upgrades or unexpected costs, drop in some funds to the following addresses:
- Stripe (click "Donate via Stripe"): https://desuarchive.org/_/articles/donate/
- Ethereum (and tokens): 0x6f2d0BcB4C6921f72122B436B0bf58F02c5F3656
- Bitcoin: 34mDFX8eURxgExeygobmPE5imf9V43WPGE (Fees are almost $3-6 with Bitcoin... Only send in large amounts, or exchange it for any of the other coins)
- Bitcoin Cash: 17YF3XBfgE2NAXRRCskJ2ZFnDrm7xAfCnT
- Litecoin: MLEV6qd4zaUDZgVPEfym5X4vAnDr7h4sXJ
- DASH: Xyba8qfcuLgcX1SYGmHBqt9qikiQYFp5gY
- ZCash: t1SouxYoqvtAUP22RbhVMPKNne6YUvdHECG
- Monero: 445Sxbv1LFMKyuxD5oZ4nmYdairWYaXY9We9aH1rnsAqYaFyg5GnnqqFXuH5YmQn418aDPsYhMPHyQbtauCgkvv7BqaqTt8
- Dogecoin: D9Euhvj4PZPSRLa8e57EQn6fj4fuupKVSm
Regardless of funding, the Bibliotheca Anonoma is also pursuing software improvements to Asagi or even an asynchronous replacement (work which benefits all Asagi-based 4chan archivers). Already, we have optimized our MySQL tables with TokuDB, and put together a sharded Sphinxsearch cluster that enables /int/ search and allowed us to import Archive.moe /a/ data from before October 2015.
Want to start your own 4chan archiver?
- If you'd like to start a 4chan archiver of your own, just read our guide and use 4plebs' fork of FoolFuuka along with Asagi.
- FoolFuuka/Asagi is very RAM hungry. Unless it is optimized, archival will continue to be as expensive and unsustainable as it is today.
- For a server that supports a publicly viewable thumbs only archival of all boards, it will require 64GB of RAM, a decent CPU after Sandy Bridge, and at least 500GB for thumbs (to hold all thumbs released on the Internet Archive from Archive.moe, 4plebs, or such). A decently configured 1U blade can cost $600, though you might scrape by with $300.
- For Desuarchive, 20-40TB of space is necessary just to hold its full images to date (not even counting those from the Archive.moe dump). We use a 12 bay hotswap hard drive backplane for this purpose, which goes for around $150-300.
- A server blade colocation or VPS connection with at least 100mbps and 10TB bandwidth in/out is required to serve a site like Desuarchive. This can cost around $100 a month.
- Although we've been amazed at how much a home connection can handle, most ISPs in the US simply won't cut it (like the Comcastic 1TB limit) and you won't be able to stream Netflix or torrent on the same connection in any case.
- Don't be surprised if you face the same travails that we and our many predecessors have. But don't hesitate to ask us for assistance either, because we have long experience with setting up these systems.
- If you have any questions about setting it up, contact us at irc.rizon.net #bibanon , our discord server, our Matrix/Riot.im channel or at [email protected] .
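To put the bandwidth figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a 30-day month and decimal units (1 TB = 10^12 bytes); the function names are our own and purely illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def average_mbps(terabytes_per_month: float) -> float:
    """Average sustained rate (megabits/s) needed to move this much data in a month."""
    bits = terabytes_per_month * 1e12 * 8  # decimal terabytes -> bits
    return bits / SECONDS_PER_MONTH / 1e6


def monthly_terabytes(mbps: float) -> float:
    """Data moved in a month if a link of this speed were saturated the whole time."""
    bytes_per_second = mbps * 1e6 / 8
    return bytes_per_second * SECONDS_PER_MONTH / 1e12


if __name__ == "__main__":
    print(f"10 TB/month averages out to {average_mbps(10):.1f} Mbps")
    print(f"A saturated 100 Mbps link moves {monthly_terabytes(100):.1f} TB/month")
    print(f"A 1 TB/month cap averages out to {average_mbps(1):.1f} Mbps")
```

Under these assumptions, a 10TB monthly allowance corresponds to roughly a third of a 100mbps link running around the clock, while a 1TB home connection cap allows only about a tenth of that traffic.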
Cost Breakdown
It costs everyone about $240 a month to host all these services, all paid out of pocket by Desuarchive and Archived.moe ($50 a month for them). (After the failure of Server 2, we deployed cloud services to host text and thumbs of Desuarchive and RebeccaBlackTech, making it about $300 a month for December.)
In total we have collectively put more than $5000 into the hardware: $1500 for our current 5x5TB 20TB ZFS and another $1500 for (our originally planned upgrade) an empty 5x5TB 20TB ZFS. We don't have much funding left for anything more.
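As a rough illustration of why 5x5TB drives give about 20TB, here is a minimal Python sketch. It assumes the array is a single-parity layout (such as RAID-Z1), which is our inference from the "5x5TB 20TB" figure rather than something stated above, and it ignores filesystem overhead and the TB/TiB difference. The function name is hypothetical.

```python
def raidz_usable_tb(drives: int, drive_tb: float, parity: int = 1) -> float:
    """Approximate usable capacity of a parity array, ignoring filesystem overhead.

    Assumption: `parity` drives' worth of space goes to redundancy
    (1 for a single-parity layout, 2 for double parity).
    """
    if not 0 < parity < drives:
        raise ValueError("parity must be at least 1 and less than the number of drives")
    return (drives - parity) * drive_tb


if __name__ == "__main__":
    print(raidz_usable_tb(5, 5))            # single parity: 4 data drives' worth
    print(raidz_usable_tb(5, 5, parity=2))  # double parity: 3 data drives' worth
```

With single parity the array survives one drive failure at the cost of one drive's capacity, which is the trade-off behind replacing disks promptly when bad sectors appear.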
- Colocation - A colocation is a datacenter where a single user can rent a space or a cabinet with a power plug and ethernet port to insert their own server blades into (thin ATX PC cases with large fans: essentially, a flat desktop).
- ($89/month) Colocation B: 1gbps. Contained Server 2, to be replaced with Server 3 or (if that doesn't work out) Server 1.
- ($99/month) Colocation A: 100mbps. Contains Server 1. We aim to discontinue this colocation whatever happens.
- VPS Servers - Virtual private servers that can be spun up or spun down anytime. The Archiver of Last Resort was running impressively, so we spun up a second node for Desuarchive and RebeccaBlackTech to run from.
- (39 euros/month) Archiver of Last Resort - A private FoolFuuka/Asagi archiver that has been scraping text and thumbs from 4chan since 2016. We obtain SQL dumps from this archiver in case of downtime.
- (39 euros/month) Desuarchive and RebeccaBlackTech - MySQL, Master Sphinxsearch, Asagi, and FoolFuuka live here.
- ($150/month) Multiple Sphinxsearch sharded servers for RBT & Desuarchive.
Server 3: Upgrade Options
This new combined server will be referred to as Server 3, and will be colocated at Colocation B which formerly hosted Server 2. Server 3 could be deployed on 2017-12-09 or 2017-12-16 (depending on when parts deliver), and once Server 3 is deployed all the full images are restored.
To construct Server 3 and restore full images, these are the parts we already purchased and tested to work.
- $300 ASRock EP2C602-4L/D16 - Works fine. Since we are pretty certain the issue was just with the motherboard we reused the majority of the components from Server 2 in this motherboard.
- It is pretty difficult to find a replacement for our $150 (ZT Systems/Chenbro) branded Foxconn motherboard, which, although it was a used server, came with a 1U case, 8x8GB = 64GB RAM (with a total of 16 slots), and two Xeon E5-2620 CPUs.
- We got an extended holiday warranty as insurance for the month, so even if we get a malfunctioning motherboard we can RMA it.
Power Supply Upgrade Options
We need to have a more reliable power supply before we can deploy Server 3 and restore full images. These are the upgrade options we can take.
- $150-300 - A 2U 600W Power Supply with at least 6 Molex plugs and 5 SATA power plugs.
- If we can find a replacement power supply that has these plugs and fits in a 2U server blade case, we won't have to pay extra for a whole new case and backplane; we can just reuse the old 2U (12 bay hotswap hard drive) server blade case from Server 2.
- If you happen to have any information regarding a power supply that is
- We cannot use Molex to SATA adapters: from experience, they have been a serious fire hazard even when well shielded.
- $470 - Supermicro 4U Chassis SC846TQ, 24 bay hard drive hotswap, redundant 800W power supply
- This is probably the most ideal option but it strains our funds considerably.
- We could try to resell both of the Chenbro 12 bay hotswap hard drive server blade cases for $150 each, which after shipping would get us about $200 back.
The Dilemma of Archival
The key systemic issue with 4chan archival for each of our predecessors has been the ever increasing resource demands of the FoolFuuka/Asagi scraper. Not to mention that 4chan itself is gradually growing in traffic, if not as quickly as in years past.
As a result, a popular 4chan archiver will quickly outgrow or overtax its hardware, and unless they make upgrades and gather donations to do so, they will collapse.
Historically, 4chan archivers have thus been faced with three stark choices. We have overcome each of them so far and will continue to do so in the future:
- Note: Archive.moe was an unrelated group that archived most 4chan boards, but closed in October 2015 and released dumps to the Internet Archive. Archive.moe has no relation to Archived.moe, which is a successor to 4ch.be that opened in 2016 and rehosts 4chan archiver dumps from the Internet Archive.
- Only archive thumbs. Most of the first 4chan archivers never archived full images in the first place, such as Easymodo and Green-oval. Thus, it takes just 2TB of hard drive space and a 480GB SSD to run thumbs only archivers like Archived.moe.
- However, full 4chan archivers such as our private Archive of Last Resort, 4tan (offline), and Archived.moe (which we run), will still face the intense RAM usage of the Asagi scraper (64GB of DDR3 RAM minimum).
- The Sphinx Search server also consumes a tremendous amount of resources. As such, we've recently deployed 4 shared Ivy Bridge Xeon servers to allow /int/ search to function, and also allow historical Archive.moe dumps from 2015 to be imported.
- Offload full images, don't archive them, or even prune them... Early on from 2009-2011 it was not economical for Fuuka archivers to save and display full images publicly. This situation has changed with the introduction of terabyte hard drives, but it still remains pretty expensive to host full images.
- Foolz.us (previous archiver of /a/ and most other boards) just plain deleted full images once they were out of space, and thus they are lost forever. In 2014 their successor Archive.moe (no relation to Archived.moe) screwed up the transfer of the full image backups, and by the time the team figured it out Foolz.us had deallocated and deleted that data forever.
- But at least RebeccaBlackTech, 4plebs, 4ch.be, and Nyafuu have historically offloaded full images to the Internet Archive. The Bibliotheca Anonoma also collects, retains, and uploads backups of other archivers.
- We aim to serve every full image we can on Desuarchive and Rbt.asia. We've put in $1500 to pay for 5x5TB hard drives, and have just replaced one with bad sectors (and thus rebuilt the ZFS array) for $150. We've also got another set of 5x5TB drives ready to be deployed in the new full image server, which can allow more Archive.moe dumps to be integrated.
- We have managed to keep the resource usage down by working hard to develop upgrades to the FoolFuuka and Asagi scrapers, from the use of TokuDB to Percona server, and sharded Sphinxsearch (which allows us to use multiple servers). We are also developing an asynchronous replacement for Asagi, as it was developed for a time before the 4chan API, and it shows significant RAM usage reductions.
- Quit, or be forced to by inevitable hardware failure. It is no coincidence that so many 4chan archivers have risen and fallen, often with total data loss. Server hardware has a rotation lifetime of 3 years, especially if it spends every second of its life online. Hard drives can fail unexpectedly, though the risk can be hedged with a RAID, if you have the money for it.
- Our full image server suffered a motherboard failure, but we've sunk in another $300 to purchase a motherboard and will probably purchase a new power supply or even 24 Bay case to go with it.
- Even if our public archivers are offline, we continually evade data loss with constant offsite backups, a private Archiver of Last Resort, and ZFS RAID arrays where we replace hard disks often. It's not cheap, but whatever happens this data will live on.
Desuarchive's Progress
Desuarchive began from humble roots two years ago as a private FoolFuuka archiver, but then took upon the responsibility of archiving most boards from Archive.moe (no relation to Archived.moe) in October 2015. Since then, as bandwidth needs have grown, Desuarchive has grown from a small Comcastic home lab to a VPS service to a colocated datacenter, and it also serves RebeccaBlackTech after it outgrew its home rig.
The Bibliotheca Anonoma continues to make planned hardware upgrades as usual, such as a case upgrade to support 12 more bays per server. Unfortunately, after Thanksgiving we have had to make a large unplanned purchase of $600 for motherboards and power supplies after the wholesale failure of the motherboard of Server 2.
We thus can return to service as before, but financial resources are running thin for any hardware upgrades (such as any future hard drive replacements), so donations are gratefully accepted.
In addition, we are also pursuing software improvements to Asagi or even an asynchronous replacement (work which benefits all Asagi-based 4chan archivers) and have already put together a sharded Sphinxsearch cluster to allow /int/ search and the import of Archive.moe /a/ data from before October 2015.
- FF upgraded - Feb 2015
- Current boards added, after Archive.moe (no relation to Archived.moe) published dumps and went offline - October 2015
- Sphinx server moved off to another local machine - November 2015
- Migrated over to MariaDB TokuDB - December 2015
- First host migration - December 2015
- Migrated to offsite cloud search - Mar 2016
- Desustorage domain name is held hostage by registrar - Jun 2016
- Migrated to server #2 - Mar 2017
- Upgraded to 4plebs' FoolFuuka fork - Mar 2017
- Integration of RebeccaBlackTech, due to overcapacity and end of life of the original server - Feb 2017
- Upgraded to better Asagi with Redis caching - Jun 2017
- Upgraded to better search networking - Sep 2017
- Migrated to Percona MySQL with TokuDB. Load decreased significantly - Nov 2017
- Server failure - Nov 22 2017
- Migrated to cloud server #3 - Nov 24 2017
- Finished archive.moe import of /a/ - Dec 2 2017
- Search server crashed - Dec 2017
- New sphinxsearch cluster created - Dec 4 2017
Future Developments
The Bibliotheca Anonoma is researching several initiatives that could significantly reduce the resources and cost needed to run a 4chan archiver, or any website with heavy traffic for that matter.
Sphinx Sharded Search Cluster
The Bibliotheca Anonoma has applied sharding to relieve the strain on Desuarchive's search servers. Before today, we could neither offer search on /int/ nor import the Archive.moe dumps from October 2015, since either would consume too many resources. The Sphinx server consumes a tremendous amount of computing power and RAM to conduct its queries, more than any single one of our nodes can handle.
But by combining the strength of 4 VPS servers each with 2 Xeon E5 CPUs and 14GB of RAM, we have managed to break this barrier.
Official Documentation: http://sphinxsearch.com/docs/latest/distributed.html
Example Config: https://github.com/sphinxsearch/sphinx/blob/master/sphinx.conf.in
The official documentation is very sparse on how to do it, but we will put together a guide sometime.
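Until that guide exists, here is a minimal Python sketch of what the master's side of such a setup can look like. It renders a Sphinx distributed index stanza using the `type = distributed` and `agent = host:port:index` directives from the official documentation linked above. The hostnames, index names, and shard count below are hypothetical placeholders, not our actual configuration; 9312 is the default searchd port.

```python
from typing import Iterable, Tuple

# (host, port, remote index name) for each shard
Agent = Tuple[str, int, str]


def distributed_index(name: str, agents: Iterable[Agent]) -> str:
    """Render a sphinx.conf stanza for a distributed index that fans queries out to agents."""
    lines = [f"index {name}", "{", "\ttype = distributed"]
    for host, port, remote_index in agents:
        # Each agent line points the master at one remote searchd instance and index.
        lines.append(f"\tagent = {host}:{port}:{remote_index}")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical shard layout: four search nodes, one index shard each.
    shards = [(f"shard{i}.example.org", 9312, f"int_shard{i}") for i in range(1, 5)]
    print(distributed_index("int_dist", shards))
```

The master node then searches `int_dist` as if it were a single index, while each agent only has to hold and query its own shard in RAM.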
IPFS-based Full Image Hosting
From our cost breakdown, it is clear that the expense of publicly hosting large amounts of full images is tremendous (currently 20-40TB at $15000, and that's not even including the Archive.moe full images from before October 2015). There also isn't much of a way for users to contribute the files they already have on their computers. Most of all, if the centralized servers are unable to handle the bandwidth or run out of financial resources, they shut down, and there's not much anyone else can do to bring the site back, even if the whole site may be in everyone's browser caches.
IPFS (and its blockchain subset, Filecoin) could possibly be the solution to all this mess, since anyone with their meager home connection could operate as a "seed" of the images. We would still need to operate our own big storage seed to prevent rarely accessed images from being lost, but we would be able to do it from our Comcastic home connection. https://ipfs.io/
The problem with IPFS right now is that if most users aren't using IPFS, there are no bandwidth savings and few seeds. So unless you are running the IPFS software on your own computer, we have to host an IPFS Content Delivery Network (CDN)-style gateway to serve users the files (e.g. https://ipfs.io, https://neocities.org : still need central servers to view the decentralized sites, creating a bottleneck). Thus, since we would have to operate our own IPFS gateways and storage "seeds", it would operate as a less effective CDN, so it is theoretically no better than Cloudflare in front of our existing storage servers.
The solution is IPFS.js, which allows a user's own browser to connect directly to p2p seeds. But it doesn't yet have DHT support, so it has trouble finding IPFS peers without a torrent-style tracker.
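To make the gateway bottleneck concrete, here is a minimal Python sketch of how a site would hand image links to users who don't run IPFS themselves. Every request still resolves to an HTTP gateway URL of the form `<gateway>/ipfs/<content hash>`. The second gateway hostname and the content hash below are hypothetical placeholders, and the function name is our own.

```python
from typing import List, Sequence


def gateway_urls(cid: str, gateways: Sequence[str], filename: str = "") -> List[str]:
    """Build HTTP gateway URLs for one IPFS content hash, in order of preference."""
    suffix = f"/{filename}" if filename else ""
    # Gateways expose content under the /ipfs/<hash> path prefix.
    return [f"{gw.rstrip('/')}/ipfs/{cid}{suffix}" for gw in gateways]


if __name__ == "__main__":
    # Hypothetical setup: the public ipfs.io gateway first, then a gateway we would host ourselves.
    gateways = ["https://ipfs.io", "https://gateway.example.org/"]
    for url in gateway_urls("QmExampleHashNotReal", gateways):
        print(url)
```

Whichever gateway appears in the link ends up carrying the traffic for that user, which is why this arrangement behaves like a CDN we would still have to pay for.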
But maybe one day IPFS.js will have DHT support. The best bet is to help them make it possible so the majority of users can rely on it. Track the progress of IPFS.js here: https://github.com/libp2p/js-libp2p-kad-dht/pull/1
Since Ethereum (a very successful decentralized blockchain computing network) already has the MetaMask browser addon, this should be possible.
Gladius: Blockchain-based Content Distribution Network (CDN)
Gladius is a blockchain-based, decentralized Content Distribution Network (CDN). It allows website operators that need to handle more traffic on the fly, from just the Slashdot Effect or an actual DDoS attack, to call upon a large peer-to-peer network of Gladius nodes to serve as their front proxy. The Gladius nodes essentially "mine" by offering their otherwise idle internet connections to handle the increased traffic, and are compensated with Gladius tokens.
It sounds like a pretty darn effective concept, and might be a better way to rent out your idle home internet connection than a riskier blockchain decentralized VPN service. Though ISPs might not like the sound of this regardless, so we are monitoring their progress.
Note that simply because it is a good idea does not mean the token value will go up. But if the token starts out worthless anyway, there's nowhere to go but up!