Ayase Imageboard Archival Standard (Ayase)
The Ayase Imageboard Archival Standard was produced by the Bibliotheca Anonoma to handle the ever-growing operations of Desuarchive and RebeccaBlackTech by completely discarding FoolFuuka/Asagi and starting over with new paradigms and modern languages and technologies. It is made up of:
- Operating System: CentOS/RHEL 8, or Ubuntu 18.04 under AppArmor, or Docker to be platform independent
- Database: PostgreSQL
- Scraper: Eve or Hayden (.NET C#)
- Middleware: Ayase (Python PyPy)
- Frontends: 4chan X, Clover, iPhone app
- All files are to be named by sha256sum and file extension. SHA-256 was chosen because hardware extensions for it are broadly available on Intel/AMD/ARM and because 8chan/vichan already use it.
- This mitigates the ongoing issue of md5sum collisions becoming ever easier for users to generate. However, migration t
- The sha256sum appears to remain trustworthy up to the dawn of quantum computing. It is chosen over sha256sums
- While there is a question of whether whoever generates the first sha256sum can be trusted, md5sums have the same issue, since md5sum collisions can be generated and injected into the archive at any time. The fact that a file has the same sha256sum and md5sum as it had at archival time should at least vouch that the archiver did not change it after it was scraped.
- Even without hardware acceleration, sha256sums will still be quicker to generate than files can be read from disk, even SSDs. Even if a mass conversion of filenames has to be done, it would only take 4 days to deal with 30TB (roughly 90 MB/s sustained), though SSD caching might be necessary.
- They are to be stored in double nested folders.
- Ayase requires time to be stored in PostgreSQL's `timestamp with time zone` type, which accepts timezone-qualified input (PostgreSQL converts it to UTC for storage).
- Only UTC should be used as the timezone for newly scraped data. The timezone support is not an excuse to store in other timezones.
- The timezone support is only meant for compatibility with prior Asagi data, which stores time as US time (maybe Eastern) due to its past HTML scraping. Future scrapes are strongly advised not to replicate this behavior; local time should be up to the frontend to determine.
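The file naming and UTC rules in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: the standard says files live in "double nested folders" but not how the folder names are derived, so using the first two bytes of the digest as the two levels is a guess, and the function names `storage_path` and `post_time_utc` are hypothetical. The Unix-seconds input mirrors the `time` field of the 4chan API.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import PurePosixPath


def storage_path(data: bytes, extension: str) -> PurePosixPath:
    """Name a file by its SHA-256 digest and place it two folders deep."""
    digest = hashlib.sha256(data).hexdigest()
    # Assumption: each folder level is one byte (two hex characters) of the digest.
    filename = f"{digest}.{extension.lstrip('.')}"
    return PurePosixPath(digest[0:2], digest[2:4], filename)


def post_time_utc(unix_seconds: int) -> datetime:
    """Turn a Unix timestamp into a timezone-aware UTC datetime for storage."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(storage_path(b"example image bytes", ".png"))
    print(post_time_utc(1_600_000_000).isoformat())
```

A real scraper would likely hash large files in chunks rather than load them fully into memory; the single `bytes` argument just keeps the sketch short.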
PostgreSQL JSONB Schema
If we GET JSON from the 4chan API and always serve the same JSON to the user, why deconstruct and reconstruct it into post-focused SQL records every time?
JSONB is different from text blobs of JSON too; it is more like a NoSQL database and can be indexed. PostgreSQL is a better NoSQL than MongoDB, as The Guardian has found.
Another thing is that maybe we shouldn't have separate tables for every board like Asagi currently does. If Reddit or 8chan's Infinity platform were being archived this way, per-board tables would be impractical to operate. While having a single table sounds like lunacy as well, PostgreSQL allows tables to be partitioned based on a single column, so an additional `board` column can be added.
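As a rough sketch of the single-table idea, the Python below builds the PostgreSQL DDL for one JSONB table list-partitioned on `board`. The table name `posts`, the columns `post_no` and `data`, the index name, and the helper `partition_ddl` are all assumptions made for illustration, not part of the standard. The statements rely on declarative partitioning, and a primary key on a partitioned table needs PostgreSQL 11 or later. Executing them against a database is left out.

```python
import re

# Parent table: one row per post, with the raw API object kept as JSONB.
# The partition key (board) must be part of the primary key.
PARENT_DDL = """
CREATE TABLE posts (
    board   text   NOT NULL,
    post_no bigint NOT NULL,
    data    jsonb  NOT NULL,
    PRIMARY KEY (board, post_no)
) PARTITION BY LIST (board);
"""

# GIN index so containment queries (@>) on the JSONB column can use an index.
INDEX_DDL = "CREATE INDEX posts_data_idx ON posts USING gin (data);"

# Example query: opening posts on one board ("resto" is 0 for an OP in the 4chan API).
EXAMPLE_QUERY = "SELECT data FROM posts WHERE board = 'g' AND data @> '{\"resto\": 0}';"


def partition_ddl(board: str) -> str:
    """Return the DDL that adds one partition for a single board."""
    # Only plain lowercase board names are accepted, so the name can be
    # interpolated into the statement without risk of SQL injection.
    if not re.fullmatch(r"[a-z0-9]+", board):
        raise ValueError(f"unexpected board name: {board!r}")
    return f"CREATE TABLE \"posts_{board}\" PARTITION OF posts FOR VALUES IN ('{board}');"


if __name__ == "__main__":
    print(PARENT_DDL)
    print(INDEX_DDL)
    for name in ("g", "a", "int"):
        print(partition_ddl(name))
```

Adding a board then means creating one more partition rather than a new set of tables, and queries that filter on `board` only touch the matching partition.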
PostgreSQL RBAC Row Permission System
Unlike most sites, on the 4chan archives ghostposts are anonymous, so the vast majority of users do not have accounts.
Essentially, the only users that have accounts on our archives are janitors, moderators, and admins. Row-level security could therefore be applied to the accounts and permissions table so that a user can only access their own row. That row then drives role-based permission policies restricting their read/write permissions on the rest of the database (janitors can only read and report, moderators can delete posts, admins can take full actions).
We don't exclude the possibility of having a larger userbase, perhaps with premium users able to issue GraphQL queries with fewer limits or access special ranges of data, but using PostgreSQL RBAC for that is not a bad idea either. While it might sound like lunacy to issue SQL accounts to users, it is better to secure at the PostgreSQL database level (where RBAC is very mature) than to give full permissions to a single user used by the API, which then haphazardly issues permissions on its own and introduces exploits, as is normally done.
For an example of RBAC used around PostgreSQL in production on Kubernetes, see KubeDB's guide: https://kubedb.com/docs/0.11.0/guides/postgres/quickstart/rbac/
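The role split described above could look roughly like the following. This is a sketch under assumptions, not the project's actual schema: the role names (`archive_janitor`, `archive_moderator`, `archive_admin`), the tables `posts`, `reports` and `accounts`, the `username` column, and the helper `rbac_statements` are all hypothetical. The SQL itself uses standard PostgreSQL `CREATE ROLE`, `GRANT`, `ENABLE ROW LEVEL SECURITY` and `CREATE POLICY`.

```python
# Privileges per role and table, following the split in the text:
# janitors read and report, moderators can also delete, admins get everything.
ROLE_PRIVILEGES = {
    "archive_janitor": {
        "posts": ["SELECT"],
        "reports": ["INSERT"],
        "accounts": ["SELECT"],
    },
    "archive_moderator": {
        "posts": ["SELECT", "DELETE"],
        "reports": ["SELECT", "INSERT"],
        "accounts": ["SELECT"],
    },
    "archive_admin": {
        "posts": ["ALL"],
        "reports": ["ALL"],
        "accounts": ["ALL"],
    },
}

# Row-level security: ordinary staff see only their own accounts row,
# while an extra permissive policy lets admins see every row.
ROW_POLICIES = [
    "ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;",
    "CREATE POLICY own_account ON accounts USING (username = current_user);",
    "CREATE POLICY admin_all ON accounts TO archive_admin USING (true);",
]


def rbac_statements() -> list:
    """Build the CREATE ROLE / GRANT statements followed by the row policies."""
    statements = []
    for role, tables in ROLE_PRIVILEGES.items():
        # Group roles cannot log in; individual staff logins would be made members.
        statements.append(f"CREATE ROLE {role} NOLOGIN;")
        for table, privileges in tables.items():
            statements.append(f"GRANT {', '.join(privileges)} ON {table} TO {role};")
    return statements + ROW_POLICIES


if __name__ == "__main__":
    print("\n".join(rbac_statements()))
```

With this layout the API connects as the staff member's own database role, so a bug in the API cannot grant more than PostgreSQL itself allows that role.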
A separate elastic search engine, kept in sync with but independent from the SQL server, will replace Sphinxsearch, which queries the MySQL DB.