Ayase

From Bibliotheca Anonoma

The Ayase Imageboard Archival Standard (Ayase) was produced by the Bibliotheca Anonoma to handle the ever-growing operations of Desuarchive and RebeccaBlackTech, by completely discarding FoolFuuka/Asagi and starting over with new paradigms, modern languages, and technologies. It is made up of the components listed below.

== Reference Implementation ==

* Operating System: CentOS/RHEL 8, Ubuntu 18.04 under AppArmor, or Docker to be platform independent
* Database: PostgreSQL
* Scraper: [https://github.com/bibanon/eve Eve] or [https://github.com/bbepis/Hayden Hayden] (.NET C#)
* Middleware: Ayase (Python PyPy)
* Frontends: 4chan X ([https://github.com/pleebe/4plebs-x 4plebs-x]), Clover, iPhone app


== Specifications ==

=== Full Images ===
 
The Futabilly engine, which archives 8chan Infinity-Next/Vichan, is able to scrape and store images by sha256sum because that engine uses SHA-256 as its checksumming algorithm. Unfortunately, 4chan uses md5sum, which is sufficient for its short-term storage needs, so we would have to do a conversion to sha256sum; but it is well worth it given the collision risks and the massive benefits of cross-board deduplication.
 
Note that it is only possible to get a sha256sum of full images when doing full image scraping. Thumbs-only scrapers will have to follow the procedure noted in the next section.


* All files are to be named by sha256sum and file extension. This was chosen for the broad availability of SHA hardware extensions in Intel/AMD/ARM CPUs and for its use by 8chan/vichan.
** This is to mitigate the ongoing issue of md5sum collisions becoming ever easier for users to generate. It does require calculating sha256sums and migrating existing filenames, but on powerful server hardware, especially with hardware acceleration, this should not be feared. More on md5 collisions as they relate to 4chan and its archives: [[Ayase/MD5_Collisions]]
** The trustworthiness of the sha256sum appears strong up to the dawn of quantum computing. By the point it is cracked, there will be more issues than mere collisions to worry about.
** While there is a question of whether the ones generating the first sha256sum can be trusted, md5sums have the same issue, as md5sum collisions can be generated and injected into the archive at any time. At least the fact that a file has the same sha256sum and md5sum at archival time vouches that the archiver did not change it after it was scraped.
** Even without hardware acceleration chips, sha256sums can still be generated faster than files can be read from disk, even SSDs. Even if a mass conversion of filenames has to be done, it would only take about 4 days to deal with 30TB (30 TB over 4 days is only about 87 MB/s of sustained hashing throughput), though SSD caching might be necessary.
* Image files are to be stored in double nested folders based on characters in the filename, as seen in Futabilly. This provides a near perfect random distribution where each folder is on average the same size (see the sketch after this list).
** This could allow partitioning of the storage between each alphanumeric range or, if necessary at huge sizes, onto separate servers.
** There could exist an nginx config on the server s1, where the scraper is located. The user starts by accessing s1, which checks whether the sha256sum is on local disk (perhaps checking a Redis index to avoid disk I/O).
** If it is not found, the nginx config redirects sha256sum ranges 0-5 to s2, a-g to s3, h-l to s4, etc. These ranges can be modified as data gets copied around or rebalanced.
* The sha256sum should be appended as an extra value to each post, right next to md5sum. There should then be an images table with a SERIAL primary key, the sha256sum as a unique key, and a foreign key for the datastore location that the specific image is currently in.
** While the sha256sum cannot necessarily be made into a foreign key in JSONB, finding it from the images table is just a matter of putting it back into the search query and letting the index do the rest.
** The SERIAL primary key is used even though the sha256sum is the real key (as seen from its unique constraint), because it also doubles as a record of when the file was scraped and added to the database. This is necessary for incremental backup systems brought straight over from Asagi (still to be made, though).
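A minimal Python sketch of the naming and double-nested layout described above (the helper names, the two-character nesting depth, and the /data/images root are illustrative assumptions, not part of the spec):

<pre>
import hashlib
import os
import shutil

def sha256_of_file(path: str) -> str:
    """Stream a file from disk and return its hex sha256sum."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def nested_path(root: str, digest: str, ext: str) -> str:
    """Double-nested folder layout: root/ab/cd/abcdef....ext"""
    return os.path.join(root, digest[0:2], digest[2:4], digest + ext)

def store_image(src: str, root: str = "/data/images") -> str:
    """Rename a scraped file to its sha256sum inside the nested tree."""
    digest = sha256_of_file(src)
    ext = os.path.splitext(src)[1].lower()
    dest = nested_path(root, digest, ext)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(src, dest)
    return dest
</pre>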
 
=== Thumbs Only Scraping ===
 
When thumbs-only scraping is used, it is not possible to record the sha256sum of the target image, so the sha256 key in posts is left undefined.
 
Nevertheless, a sha256sum for thumbnails should always be generated and stored under the key sha256t. This checksum is the thumbnail's own sha256sum, used as the thumbnail's filename, and is unrelated to the full image's filename.
 
Third-party 4chan clients wishing to add support for the 4chan archives would therefore use sha256t + .jpg as the thumb filename instead of tim + s.jpg (see the sketch below). To provide backwards compatibility, though, the FoolFuuka reference style desuarchive.org/thumb/1234/56/123456789s.jpg can be used to resolve to whatever the file is currently named, whether under the Asagi way or the sha256sum way. This allows seamless support while images get renamed and ensures that URLs never break.
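A short sketch of how a scraper or client might apply this; the function names are illustrative, and the dict keys mirror the 4chan API's tim field and the sha256t key described above:

<pre>
import hashlib

def thumb_checksum(thumb_bytes: bytes) -> str:
    # sha256t: the thumbnail's own sha256sum, unrelated to the full image's checksum
    return hashlib.sha256(thumb_bytes).hexdigest()

def thumb_filename(post: dict) -> str:
    # Prefer the sha256t-based name; fall back to the legacy Asagi tim + "s.jpg" style
    if post.get("sha256t"):
        return post["sha256t"] + ".jpg"
    return str(post["tim"]) + "s.jpg"
</pre>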


=== Time ===
* Ayase requires time to be stored as PostgreSQL timestamps, which also store timezones.
* Only UTC should be used as the timezone for newly scraped data. The timezone support is not an excuse to store in other timezones.
* The timezone support is ''only meant for compatibility purposes with prior Asagi data'', given that Asagi stores time as US time (maybe Eastern) due to its past HTML scraping.
* ''Future scrapes are strongly advised not to store any timezone other than UTC''; local time should be up to the frontend, or really the user's browser/app, to determine (see the sketch below).
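A minimal sketch of the intended behavior, assuming the scraper receives the 4chan API's UNIX time field; the conversion always produces a timezone-aware UTC datetime for PostgreSQL:

<pre>
from datetime import datetime, timezone

def post_timestamp(api_post: dict) -> datetime:
    # Convert the API's UNIX "time" field to a timezone-aware UTC datetime
    return datetime.fromtimestamp(api_post["time"], tz=timezone.utc)

# Example: {"time": 1567796400} -> 2019-09-06 19:00:00+00:00 (always UTC, never local time)
</pre>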
 
=== Posts ===
 
* In Asagi, posts are not made up of the raw data from the 4chan API, which includes HTML escapes. They are instead [https://github.com/bibanon/asagi/blob/master/src/main/java/net/easymodo/asagi/YotsubaAbstract.java#L90 stored unescaped after processing by <code>this.cleanSimple(text)</code>]. This may have been due to Asagi's historical use as an HTML scraper. Whether we should continue to replicate this is an open question, as it does not seem to cause major data loss and should still be compatible as output; but my opinion is that, especially as it will be put into JSON, we should escape it properly so that the JSON engine does not have to do it by itself (see the sketch below).
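To make the difference concrete, a small sketch of the two approaches; html.unescape only approximates what Asagi's cleanSimple does and is used here purely for illustration:

<pre>
import html

api_comment = "quote &gt;&gt;123456 &amp; some text"  # HTML-escaped, as delivered by the 4chan JSON API

# Asagi-style: unescape before storing (roughly what cleanSimple does)
asagi_stored = html.unescape(api_comment)   # 'quote >>123456 & some text'

# Proposed: store the API text as-is, so serving it back as JSON needs no extra processing
ayase_stored = api_comment
</pre>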


=== PostgreSQL JSONB Schema ===

If we GET JSON from the 4chan API and always serve the same JSON back to the user, why deconstruct and reconstruct it into post-focused SQL records every time?

JSONB is different from text blobs of JSON too; it is more like a NoSQL database and can be indexed. PostgreSQL is a better NoSQL than MongoDB, as The Guardian has found.

Another thing is that maybe we shouldn't have separate tables for every board like Asagi currently does. If Reddit or 8chan's Infinity platform were archived by this, it would be impractical to operate. While having a single table sounds like lunacy as well, PostgreSQL allows tables to be partitioned based on a single column, so an additional `board` column can be added.
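A rough sketch of what such a schema could look like, issued from Python (psycopg2) since the middleware is Python; it assumes PostgreSQL 11 or newer for indexes and ON CONFLICT on partitioned tables, and the table, column, and DSN names are illustrative, not a finalized schema:

<pre>
import json
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS posts (
    board text   NOT NULL,
    num   bigint NOT NULL,
    doc   jsonb  NOT NULL,   -- the post exactly as received from the 4chan API
    PRIMARY KEY (board, num)
) PARTITION BY LIST (board);

CREATE TABLE IF NOT EXISTS posts_g PARTITION OF posts FOR VALUES IN ('g');

-- GIN index so JSONB containment/search queries stay fast
CREATE INDEX IF NOT EXISTS posts_doc_idx ON posts USING gin (doc);
"""

def insert_post(conn, board, api_post):
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO posts (board, num, doc) VALUES (%s, %s, %s::jsonb) "
            "ON CONFLICT (board, num) DO UPDATE SET doc = EXCLUDED.doc",
            (board, api_post["no"], json.dumps(api_post)),
        )

conn = psycopg2.connect("dbname=ayase")  # hypothetical DSN
with conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
    insert_post(conn, "g", {"no": 123456, "time": 1567796400, "com": "example post"})
</pre>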
=== PostgreSQL RBAC Row Permission System ===

Unlike most sites, on the 4chan archives ghostposts are anonymous, so the vast majority of users do not have accounts.

Essentially, the only users that have accounts on our archives are janitors, moderators, and admins. Therefore, row-level access with RBAC could be issued so that on the accounts and permissions table, a user can only access their own row, which is then used for role-based permission policies restricting their read/write permissions on the rest of the database (janitors can only read and report, moderators can delete posts, admins can take full actions). A sketch of such policies follows below.

We don't exclude the possibility of having a larger userbase, perhaps with premium users able to issue GraphQL queries with fewer limits or access special ranges of data, and using PostgreSQL RBAC for that is not a bad idea either. While it might sound like lunacy to issue SQL accounts to users, it is better to secure at the PostgreSQL database level (where RBAC is very mature) than to give full permissions to a single account used by the API, which then haphazardly issues permissions on its own and introduces exploits, as is normally done.

Kubernetes uses PostgreSQL RBAC successfully in production as seen here: https://kubedb.com/docs/0.11.0/guides/postgres/quickstart/rbac/

Official Documentation:

* https://www.postgresql.org/docs/current/user-manag.html
* https://www.postgresql.org/docs/current/sql-grant.html
* https://www.postgresql.org/docs/current/ddl-rowsecurity.html
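A rough sketch of the idea using the GRANT and row-security features documented above, executed from Python for consistency with the rest of the stack; the role names, the staff and reports tables, and the policy are illustrative assumptions, and the grants assume the posts table sketched earlier:

<pre>
import psycopg2

RBAC_SQL = """
-- Staff accounts: each staff member may only see and edit their own row
CREATE TABLE IF NOT EXISTS staff (
    username text PRIMARY KEY,
    role     text NOT NULL CHECK (role IN ('janitor', 'moderator', 'admin'))
);
ALTER TABLE staff ENABLE ROW LEVEL SECURITY;
CREATE POLICY staff_own_row ON staff USING (username = current_user);

-- Hypothetical reports table that janitors write into
CREATE TABLE IF NOT EXISTS reports (
    id     bigserial PRIMARY KEY,
    board  text   NOT NULL,
    num    bigint NOT NULL,
    reason text
);

-- Role-based privileges on the rest of the database
CREATE ROLE janitor   NOLOGIN;
CREATE ROLE moderator NOLOGIN;
GRANT SELECT ON posts, reports TO janitor;      -- janitors: read and report
GRANT INSERT ON reports        TO janitor;
GRANT SELECT, DELETE ON posts  TO moderator;    -- moderators: may also delete posts
GRANT janitor TO moderator;                     -- moderators inherit janitor rights
"""

conn = psycopg2.connect("dbname=ayase")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute(RBAC_SQL)
</pre>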


=== Elasticsearch Engine ===

Elasticsearch is a NoSQL DB with a focus on search functionality. The existing archiver stack's search system, based on Sphinxsearch (running on the archive's MySQL DB), has frequently been a limiting factor for various 4chan archives (slowdowns, search disabled on certain boards, etc.). It is believed that moving the search system to Elasticsearch could alleviate some performance issues. It would also theoretically allow for horizontal scaling of search with increasing post volume, and for splitting archive functions across globally distributed hardware resources, among other benefits.
 
An ongoing effort towards a "greenfield"-style Elasticsearch-based archiving project is hosted in these repos for a  [https://gitgud.io/baystdev/chan-scraper-es-node scraper] and [https://gitgud.io/desuarchive/chansearch search stuff].
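As a rough illustration of the first option listed below (Elasticsearch used only to return matching post IDs, with the SQL DB remaining the source of truth), a sketch using plain HTTP against a local node; the index name and field names are illustrative assumptions:

<pre>
import requests

ES = "http://localhost:9200"   # hypothetical local Elasticsearch node

def search_post_ids(board, text, size=50):
    # Full-text search that returns only post numbers; the rows themselves are then fetched from SQL
    query = {
        "_source": False,   # do not return documents, only hit metadata
        "size": size,
        "query": {
            "bool": {
                "filter": [{"term": {"board": board}}],
                "must": [{"match": {"comment": text}}],
            }
        },
    }
    resp = requests.post(ES + "/posts/_search", json=query, timeout=10)
    resp.raise_for_status()
    return [int(hit["_id"]) for hit in resp.json()["hits"]["hits"]]

# e.g. search_post_ids("g", "install gentoo") -> [123456, 123789, ...]
</pre>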
 
==== Options ====
 
* Add an Elasticsearch search layer to the existing stack, replacing Sphinxsearch, but only use it to return matching post IDs
** May be the easiest option
** Would be business as usual for non-search DB operations
** Would require full in-memory duplication of contents in SQL DB and Elasticsearch (ES synced with MySQL)
** Retain original post data retention/backup scheme, durability
* Add an Elasticsearch search layer to the existing stack, replacing Sphinxsearch, and use it to return entire documents/posts
** Might be more intrusive than the above
** Would still be business as usual for non-search DB operations
** Would still require full duplication of contents in SQL DB and Elasticsearch on-disk, but post contents may not need to be held in-memory on the SQL side
** Retain original post data retention/backup scheme, durability
* Replace the existing SQL DB completely with Elasticsearch
** Possible long-term option, not the best to serve immediate needs
** Would require rewrites of any DB-facing component, not just search
** Needs new data retention/backup scheme
 
==== Plan ====
 
Determine which of the two "Elasticsearch to complement the existing SQL DB" options is more practical in the short term and develop a proof of concept. Run a modded FoolFuuka and unmodded Asagi on a couple of low-traffic boards with the new Elasticsearch layer. [Link to Kanban board etc.]
