Ayase Imageboard Archival Standard (Ayase)
The Ayase Imageboard Archival Standard was produced by the Bibliotheca Anonoma to handle the ever-growing operations of Desuarchive and RebeccaBlackTech by completely discarding FoolFuuka/Asagi and starting over with new paradigms, modern languages, and technologies.
Reference Implementation
Note that the Ayase schema itself is still in the drafting stages; the schema defined in Ena is the aim.
As of this writing, the namesake Ayase middleware is tuned for the Asagi schema, with the goal of quickly eliminating the need for FoolFuuka; support for the new Ayase schema can be added later.
- Operating System: Any Linux system on any architecture supported by Rust and Python.
- Database: PostgreSQL based - Tentatively, due to the end of support for TokuDB and the awful gotchas in MariaDB, we should consider PostgreSQL.
  - TimescaleDB/PostgreSQL - As TimescaleDB is a superset of PostgreSQL, for smaller scale deployments it is also fully PostgreSQL compatible.
  - CockroachDB - Seems better for horizontal scaling and decentralized resilience, but since the whole dataset is only about 1 TB, it might not yet be needed.
- Middleware/HTML Frontend: Ayase (https://github.com/bibanon/ayase) (Python, FastAPI, Jinja2 HTML templates) - An Ayase and Asagi schema compatible frontend system for viewing the databases created by these scrapers.
- Scraper:
  1. Ena (https://github.com/shiimizu/ena) (Rust) - The flagship scraper of the Ayase schema. Also has drop-in Asagi scraping capabilities.
  2. Hayden (https://github.com/bbepis/Hayden) (.NET C#) - An Asagi drop-in scraper, best used on Windows servers. It is able to operate at small scales.
- API Frontends: 4chan X (4plebs-x, https://github.com/pleebe/4plebs-x), Clover, iPhone apps
Specifications
IP Address Storage
MySQL uses an odd, counterintuitive format to store IP addresses more efficiently: it stores them as integers using the INET_ATON function, and the INET_NTOA function turns them back into the conventionally understood dotted notation.
PostgreSQL provides native types for storing IP addresses (inet and cidr) and transparently displays them in dotted or CIDR notation, so this issue never arises.
https://www.postgresql.org/docs/current/datatype-net-types.html
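A minimal sketch of the difference (the table and column names here are hypothetical, not part of any Ayase schema): the native inet type removes the INET_ATON/INET_NTOA round-trip entirely and gives subnet queries for free.
<pre>
-- PostgreSQL: store the address in a native inet column.
CREATE TABLE poster_ip (
    post_id bigint PRIMARY KEY,
    ip      inet   NOT NULL
);

-- Inserted and displayed in ordinary dotted notation, no conversion functions.
INSERT INTO poster_ip VALUES (1, '203.0.113.7');

-- Containment operators handle CIDR ranges directly.
SELECT post_id, ip
FROM poster_ip
WHERE ip << inet '203.0.113.0/24';
</pre>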
Definition of an Image
It is important to hash the right thing when we process images, as various encoding differences can give the pixel-perfect, exact same lossless bitmap a different hash.
Python PIL in particular therefore decodes an image to its pixel-perfect form with Image.open before we do any processing on it.
An image can also have various modes: greyscale, RGB, CMYK, and so on. To avoid all that and handle alpha transparency, we convert the bitmap with convert('RGBA').
So in Python PIL, we define an image as follows:
<pre>
from hashlib import sha3_256
from PIL import Image

# fn is the path to the scraped file; hashes is the per-file dict of checksums.
# Decode to a pixel-perfect bitmap, normalize the mode, then hash the raw pixels.
i = Image.open(fn)
hashes['sha3i'] = sha3_256(i.convert('RGBA').tobytes()).digest()
</pre>
Find an equivalent for other languages.
Full Images
The Futabilly engine, which archives 8chan Infinity-Next/Vichan, is able to scrape and store images by sha256sum because that engine uses it as its checksumming algorithm. Unfortunately, 4chan uses md5sum, which is sufficient for its short-term storage needs, so we would have to do a conversion to sha256sum, but it is very worth it given the collision risks and the massive benefits of cross-board deduplication.
Note that it is only possible to get a sha256sum of full images when doing full image scraping. Thumbs-only scrapers will have to follow the procedure noted in the next section.
- All files are to be named by sha256sum and file extension. This was chosen for the broad availability of hardware extensions in Intel/AMD/ARM for the purpose and its use by 8chan/vichan.
- This is to mitigate the ongoing issue of md5sum collisions becoming ever easier for users to generate. It does require calculating sha256sums and migrating filenames to them, but on powerful server hardware, especially with hardware acceleration, this should not be feared. More on md5 collisions as they relate to 4chan and its archives: Ayase/MD5_Collisions
- The trustworthiness of the sha256sum is strong up to the dawn of quantum computing it seems. By that point if it is cracked there will be more issues than mere collisions to worry about.
- While there is a question of whether the party generating the first sha256sum can be trusted, md5sums have the same issue, since md5 collisions can be generated and injected into the archive at any time. The fact that a file has the same sha256sum and md5sum at archival time at least vouches that the archiver did not change it after it was scraped.
- Even without hardware acceleration chips, sha256sums will still be quicker to generate than files can be read from disk, even SSDs. Even if a mass conversion of filenames has to be done, it would only take 4 days to deal with 30TB, though SSD caching might be necessary.
- Image files are to be stored in double nested folders based on characters in the filename, as seen in Futabilly. This therefore provides a near perfect random distribution where each folder is on average the same size.
- This could allow partitioning of the storage between each alphanumeric range, or if even necessary at huge sizes, single servers.
- There could exist an nginx config on the server s1, where the scraper is located. We first have the user start accessing s1, which checks whether the sha256sum is on local disk (maybe checking a Redis index to avoid disk I/O).
- If it is not found, then the nginx config redirects sha256sum ranges 0-5 to s2, a-g to s3, h-l to s4, etc. These ranges can be modified as they get copied around or rebalanced.
- sha256sum should be appended as an extra value to each post right next to md5sum. Then there should be an images table with a SERIAL primary key, sha256sum as unique key, and a foreign key for the datastore location that the specific image is currently in.
- While sha256sum cannot necessarily be made into a foreign key in JSONB, finding it from the images table is just a matter of putting it back into the search query and having the index do the rest.
- The SERIAL primary key is used even though sha256sum is the real key used as seen from its unique constraint, because it also doubles as a record of when the file was scraped and added to the database. This is necessary for incremental backup systems brought straight over from Asagi (still to be made though).
The question of radix directories should be considered. Since GitLab does 2 characters/2 characters without a problem even at large scale, that is at least a proven standard to follow. However, some archives prefer 2 characters/1 character, which creates fewer directories.
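A minimal sketch of the images table described above (all table and column names here are illustrative assumptions, since the Ayase schema is still being drafted):
<pre>
CREATE TABLE datastore (
    id   serial PRIMARY KEY,
    name text   NOT NULL UNIQUE   -- e.g. "s2", the server currently holding this range
);

CREATE TABLE image (
    id           serial  PRIMARY KEY,      -- doubles as insertion order for incremental backups
    sha256       bytea   NOT NULL UNIQUE,  -- stored on disk as e.g. ab/cd/abcd....png (2/2 radix directories)
    extension    text    NOT NULL,
    datastore_id integer NOT NULL REFERENCES datastore (id)
);

-- Each post row carries the sha256 next to the 4chan-provided md5; locating the
-- file is then a single indexed lookup ($1 is the hex sha256 taken from the post).
SELECT d.name, encode(i.sha256, 'hex') || '.' || i.extension AS filename
FROM image i
JOIN datastore d ON d.id = i.datastore_id
WHERE i.sha256 = decode($1, 'hex');
</pre>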
Thumbs Only Scraping
When thumbs-only scraping is used, it is not possible to record the sha256sum of the target image, so the sha256 key in posts is left undefined.
Nevertheless, a sha256sum for thumbnails should always be generated and stored under the key sha256t. This checksum refers to the thumbnail's own sha256sum filename and is unrelated to the full image's filename.
Third-party 4chan clients wishing to add support for the 4chan archives would therefore use sha256t + .jpg as the thumb filename instead of tim + s.jpg. To provide backwards compatibility though, the FoolFuuka reference style desuarchive.org/thumb/1234/56/123456789s.jpg can be used to resolve to whatever the file is currently named, whether under the Asagi way or the sha256sum way. This allows seamless support while images get renamed, and ensures that URLs never break.
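A sketch of how such a legacy URL might be resolved, assuming the single-table column names used later on this page (media_time holding 4chan's "tim" value and media_hash_thb holding the thumbnail sha256); these names are assumptions, not a fixed schema:
<pre>
-- Resolve /thumb/<board>/.../<tim>s.jpg to the current sha256t-based filename.
-- $1 is the board, $2 is the numeric "tim" part of the legacy filename.
SELECT encode(media_hash_thb, 'hex') || '.jpg' AS thumb_filename
FROM post
WHERE board = $1
  AND media_time = $2
  AND media_hash_thb IS NOT NULL
LIMIT 1;
</pre>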
Time
- Ayase requires time to be stored in PostgreSQL datetimes, which also store timezones.
- Only UTC should be used as the timezone for newly scraped data. The timezone support is not an excuse to store in other timezones.
- The timezone support is only meant for compatibility purposes with prior Asagi data, given that they store time as US time (maybe Eastern) due to their past HTML scraping.
- However, future scrapes are strongly advised not to store any timezone other than UTC; local time should be up to the frontend, or really the user's browser/app, to determine.
One issue with converting to PostgreSQL timezones is that the legal definition of time for a city is not as simple as a single fixed offset such as EST, as it sometimes becomes EDT for months and then switches back. Instead, the tz database zone America/New_York is used.
Another problem is that the conversion Asagi did is lossy: it does not store whether the unix timestamp refers to EST or EDT. This means that special scripting is needed to determine the proper offset based on whether the calendar date is legally a time change day: whenever the next post would have fallen back an hour, that indicates a daylight saving time change. Next, inference from the post number order context (less accurate, but only small edge cases) will be necessary for forward time changes, to ensure that the hour forward is not due to 4chan being down for a period.
Therefore any conversion will need to carefully consider this by checking the true legal offset for the tz zone America/New_York on the calendar date in question when storing a timezone.
https://github.com/eksopl/asagi/blob/e0afce64c925360f2ddd9eea83d4b8f2063cd49c/src/main/java/net/easymodo/asagi/YotsubaJSON.java#L195
Bakugo's Proposed and Tested Solution
Bakugo has distilled the logic for converting the Asagi New York timestamp into UTC with the following Python example, which is part of a loop that goes through all posts in the table to convert them.
<pre>
import datetime
import pytz

def convert_timestamp(a):
    # Asagi stores the New York wall-clock time as if it were UTC;
    # reinterpret it as New York time, then take the real epoch value.
    a = datetime.datetime.utcfromtimestamp(a)
    a = pytz.timezone("America/New_York").localize(a, is_dst=True)
    a = int(a.timestamp())
    return a
</pre>
Bakugo has an additional script to untangle the DST overlap, i.e. when a daylight saving change has occurred and clocks spring forward or fall back.
<pre>
timestamp = int(row["timestamp"])
timestamp = convert_timestamp(timestamp)

# did timestamp go back roughly one hour?
# if so, a timezone change might have happened
# but we need to make sure it's not a moved thread
if (
    (timelast - timestamp) > 60*10 and
    (timelast - timestamp) < 60*70
):
    skip = False

    tmp1 = datetime.datetime.utcfromtimestamp(timestamp)

    # can only happen in october or november
    # https://www.timeanddate.com/time/zone/usa/new-york
    if (
        tmp1.month < 10 or
        tmp1.month > 11
    ):
        skip = True

    # we already determined that this is a moved thread
    if int(row["thread_num"]) in moved:
        skip = True

    if not skip:
        if int(row["op"]):
            tmp = int(row["thread_num"])
            cursor.execute("SELECT num FROM {} WHERE thread_num = {} ORDER BY num ASC LIMIT 5".format(table_src, tmp))
            rows1 = cursor.fetchall()

            if len(rows1) < 5:
                skip = True

            continuous = True
            for row1 in rows1:
                if int(row1["num"]) != tmp:
                    continuous = False
                    break
                tmp = tmp + 1
                continue

            if continuous:
                skip = True
                moved.append(int(row["thread_num"]))
                # print("thread moved", row["num"])

    if not skip:
        timejump = timestamp
        timelast = timestamp
        print("timezone jump", row["num"], timestamp)
else:
    timelast = timestamp

# did a timezone change happen recently?
if (
    timejump and
    timestamp >= timejump and
    (timestamp - timejump) < (60*60*24)
):
    tmp1 = datetime.datetime.utcfromtimestamp(timejump)
    tmp2 = datetime.datetime.utcfromtimestamp(timestamp)

    # does the change hour correspond to this hour?
    if (
        tmp1.year == tmp2.year and
        tmp1.month == tmp2.month and
        tmp1.day == tmp2.day and
        tmp1.hour == tmp2.hour
    ):
        # advance timestamp by one hour to correct it
        timestamp = (timestamp + (60*60*1))
</pre>
Note that convert_timestamp above converts Asagi timestamps back to UTC, not the other way around, and the DST-overlap logic is part of a loop that goes through all posts in the table to convert them. The following function goes the opposite direction, turning a real UTC timestamp into the Asagi-style New York representation:
<pre>
import datetime
import pytz

def asagi_timestamp_conv(x):
    # take a real UTC unix timestamp...
    x = datetime.datetime.fromtimestamp(x, pytz.utc)
    # ...find the corresponding New York wall-clock time...
    x = x.astimezone(pytz.timezone("America/New_York"))
    # ...and relabel that wall-clock time as UTC, which is what Asagi stores
    x = x.replace(tzinfo=pytz.timezone("UTC"))
    x = int(x.timestamp())
    return x
</pre>
Posts
- In Asagi, posts are not stored as the raw data from the 4chan API, which includes HTML escapes. They are instead stored unescaped, after processing by the function this.cleanSimple(text) (https://github.com/bibanon/asagi/blob/master/src/main/java/net/easymodo/asagi/YotsubaAbstract.java#L90). Regexes are then applied to convert only certain HTML into BBCode tags.
This may have been due to its historical use as an HTML scraper.
Since cleaning HTML by regex is a lossy conversion, it needs to be decided whether to leave Asagi posts as-is, or at least convert the BBCode back into simulations of contemporary 4chan HTML.
- We should probably reverse all Asagi regexes that converted certain 4chan HTML to BBCode.
- Note that neofuuka scrapes also store the raw unprocessed 4chan HTML as "comment" in the exif field.
For all future posts, just store the comment verbatim from the 4chan API; don't even bother searching for unsafe HTML (4chan already processes this, and any removal of unsafe HTML should be handled case by case or as postprocessing by the frontend).
- Leave the exif table as-is. Have postprocessing convert it into JSON if necessary.
- The oekaki replay is just a file uploaded next to the image; it is not particularly hard to scrape.
The proposed Python code for converting BBCode back to its original 4chan HTML tags is as follows:
https://gist.github.com/bakugo/1545c39231d6ed10226b4600d7270681
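The gist above is the proposed converter. Purely as an illustration of the kind of regex mapping involved, here is a SQL sketch for the spoiler tag, assuming a post table with a comment column (the full BBCode-to-HTML correspondence should be taken from the gist, not from this example):
<pre>
-- Reverse [spoiler] BBCode back into 4chan-style <s> spoiler markup.
-- regexp_replace with the 'g' flag rewrites every occurrence in the comment;
-- the other tags handled by the gist would follow the same pattern.
UPDATE post
SET comment = regexp_replace(comment, '\[spoiler\](.*?)\[/spoiler\]', '<s>\1</s>', 'g')
WHERE comment ~ '\[spoiler\]';
</pre>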
PostgreSQL JSONB Schema
If we GET JSON from the 4chan API and always serve the same JSON to the user, why deconstruct and reconstruct it into post-focused SQL records every time?
JSONB is different from text blobs of JSON too; it's more like a NoSQL database and can be indexed. PostgreSQL is a better NoSQL than MongoDB, as The Guardian has found.
Another thing is that maybe we shouldn't have separate tables for every board like Asagi currently does. If Reddit or 8chan's Infinity platform was getting archived by this, it would be impractical to operate. While having a single table sounds like lunacy as well, PostgreSQL allows tables to be partitioned based on a single column, so an additional `board` column can be added.
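A minimal sketch of that idea (names are illustrative, not a finalized schema): one JSONB document per post, partitioned by a board column, with an index into the document itself.
<pre>
CREATE TABLE post_json (
    board text   NOT NULL,
    num   bigint NOT NULL,
    doc   jsonb  NOT NULL,          -- the post exactly as returned by the 4chan API
    PRIMARY KEY (board, num)
) PARTITION BY LIST (board);

CREATE TABLE post_json_a PARTITION OF post_json FOR VALUES IN ('a');
CREATE TABLE post_json_g PARTITION OF post_json FOR VALUES IN ('g');

-- GIN index for containment queries against the stored document.
CREATE INDEX post_json_doc_idx ON post_json USING gin (doc);

-- Example: find a post by its 4chan-reported MD5 straight from the JSON
-- ($1 is the base64 md5 string from the API).
SELECT doc
FROM post_json
WHERE board = 'g'
  AND doc @> jsonb_build_object('md5', $1::text);
</pre>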
PostgreSQL Optimizations
To improve performance, PostgreSQL may need a lot more tuning than MariaDB.
The basics, if you don't do anything else:
https://bun.uptrace.dev/postgres/performance-tuning.html
You also want to CLUSTER tables and VACUUM FULL manually every now and then, and you want to disable autovacuum when migrating data.
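A minimal sketch of those maintenance steps, assuming a post table with an index named post_time_posted_idx (both names are placeholders):
<pre>
-- Physically reorder the table along its most-queried index, then rebuild and reclaim space.
CLUSTER post USING post_time_posted_idx;
VACUUM FULL ANALYZE post;

-- During a bulk migration, keep autovacuum from fighting the import,
-- then re-enable it and refresh planner statistics afterwards.
ALTER TABLE post SET (autovacuum_enabled = false);
-- ... bulk COPY / INSERT here ...
ALTER TABLE post SET (autovacuum_enabled = true);
ANALYZE post;
</pre>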
Single Table without Side Tables or Triggers
One of the biggest drawbacks of the Asagi and FoolFuuka MySQL schemas was their heavy use of side tables that duplicate the data in the main table, and triggers to keep that data in sync. This means that a single INSERT query becomes ten or more INSERTs and UPDATEs, reducing performance immensely and increasing redundancy in data storage.
Bakugo proposes a schema that unifies the Asagi schema into a single table, and reduces the drawbacks of losing the side tables. He also renamed the fields to more intuitive names.
<pre>
CREATE TABLE `a` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `num` int(10) unsigned NOT NULL,
  `thread` int(10) unsigned NOT NULL,
  `opener` tinyint(3) unsigned NOT NULL,
  `time_posted` int(10) unsigned NOT NULL,
  `time_modified` int(10) unsigned NOT NULL,
  `time_deleted_post` int(10) unsigned DEFAULT NULL,
  `time_deleted_file` int(10) unsigned DEFAULT NULL,
  `time_archived` int(10) unsigned DEFAULT NULL,
  `time_bumped` int(10) unsigned DEFAULT NULL,
  `name` varchar(1000) DEFAULT NULL,
  `tripcode` varchar(50) DEFAULT NULL,
  `capcode` varchar(50) DEFAULT NULL,
  `posterid` varchar(50) DEFAULT NULL,
  `country` varchar(50) DEFAULT NULL,
  `email` varchar(1000) DEFAULT NULL,
  `subject` varchar(1000) DEFAULT NULL,
  `comment` text DEFAULT NULL,
  `media_time` bigint(20) unsigned DEFAULT NULL,
  `media_hash` varbinary(128) DEFAULT NULL,
  `media_hash_src` varbinary(128) DEFAULT NULL,
  `media_hash_thb` varbinary(128) DEFAULT NULL,
  `media_name` varchar(1000) DEFAULT NULL,
  `media_type` varchar(10) DEFAULT NULL,
  `media_size` int(10) unsigned DEFAULT NULL,
  `media_dims_src_w` smallint(5) unsigned DEFAULT NULL,
  `media_dims_src_h` smallint(5) unsigned DEFAULT NULL,
  `media_dims_thb_w` smallint(5) unsigned DEFAULT NULL,
  `media_dims_thb_h` smallint(5) unsigned DEFAULT NULL,
  `spoiler` tinyint(3) unsigned NOT NULL,
  `sticky` tinyint(3) unsigned DEFAULT NULL,
  `closed` tinyint(3) unsigned DEFAULT NULL,
  `posters` smallint(5) unsigned DEFAULT NULL,
  `replies` smallint(5) unsigned DEFAULT NULL,
  `hidden` tinyint(3) unsigned NOT NULL,
  `extra` text DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx__num` (`num`),
  UNIQUE KEY `idx__thread__num` (`thread`,`num`),
  UNIQUE KEY `idx__opener__num` (`opener`,`num`),
  UNIQUE KEY `idx__media_hash__num` (`media_hash`,`num`),
  UNIQUE KEY `idx__media_hash_src__num` (`media_hash_src`,`num`),
  KEY `idx__opener__time_bumped` (`opener`,`time_bumped`),
  KEY `idx__time_posted` (`time_posted`),
  KEY `idx__time_modified` (`time_modified`),
  KEY `idx__media_hash` (`media_hash`),
  KEY `idx__media_time` (`media_time`)
) ENGINE=TokuDB DEFAULT CHARSET=utf8mb4 COMPRESSION=TOKUDB_ZLIB
</pre>
Fields
- id - int, auto increment id, necessary for search
- num - int, post number
- thread - int, thread number (is never 0)
- opener - bool, if this is an OP post, same as (num = thread) but necessary for sorting
- time_posted - unix timestamp, time the post was made, provided by 4chan
- time_modified - unix timestamp, time the post was last modified, provided by scraper
- time_deleted_post - unix timestamp, time the post was deleted, provided by scraper
- time_deleted_file - unix timestamp, time the post's file was deleted, provided by scraper
- time_archived - unix timestamp, opener only, time the thread was archived, provided by 4chan usually
- time_bumped - unix timestamp, opener only, max time_posted of this thread's replies
- name - string, poster name
- tripcode - string, poster tripcode
- capcode - string, poster capcode
- posterid - string, poster unique id
- country - string, poster country (may include a prefix, such as "troll:nz")
- email - string, poster email (legacy field, new posts will never have it)
- subject - string, thread subject (old replies can also have subjects!)
- comment - string, post comment (original html with very slight cleanups)
- media_time - int, media timestamp, milliseconds (from 4chan)
- media_hash - binary, media src hash md5, provided by 4chan
- media_hash_src - binary, media src hash sha256/etc, calculated by scraper
- media_hash_thb - binary, media thb hash sha256/etc, calculated by scraper
- media_name - string, media original filename (no extension)
- media_type - string, media extension
- media_size - int, media filesize in bytes
- media_dims_src_w - int, media src width in pixels
- media_dims_src_h - int, media src height in pixels
- media_dims_thb_w - int, media thb width in pixels
- media_dims_thb_h - int, media thb height in pixels
- spoiler - bool, was post marked as spoiler (not inherently linked to media!)
- sticky - bool, opener only, was thread stickied
- closed - bool, opener only, was thread closed
- posters - int, max number of unique posters, provided by 4chan
- replies - int, opener only, total number of replies in this thread, counted by sql, doesn't include op post
- hidden - bool, if this post has been hidden from public view by archive admin
- extra - string, json of any extra fields required
Indexes
- idx__num - obvious
- idx__thread__num - used to get all replies of a thread
- idx__opener__num - used to get index threads sorted by number
- idx__opener__time_bumped - used to get index threads sorted by bump
- idx__time_posted - used to find/order posts by creation time
- idx__time_modified - used to find/order posts by modification time
- idx__media_hash - used to find posts by media_hash
- idx__media_time - used to find posts by media_time
- idx__media_hash__num - used to find first occurrence of media_hash for canonical filenames
- idx__media_hash_src__num - used to find first occurrence of media_hash_src for canonical filenames
Other Notes
The 'replies' column must be clamped to the u16 maximum when counting, since 64k+ replies are possible.
Do not store empty strings or zeros where a default value of NULL would work better. Mitsuba is currently saving 0s instead of NULLs. (Bakugo: "using nullable fields saves space right? Everyone can agree on that in most cases")
Also, adding more nullable fields, even for fields that are rarely used, has no performance downside. You usually just want to explicitly exclude the NULLs from the index.
For example, for media hashes, you don't want to index all of the NULL values, because they are never used in "view same" searching.
So you CREATE INDEX post_media_internal_hash_index ON post(media_internal_hash, post_number) WHERE media_internal_hash IS NOT NULL;
which means you are indexing roughly 1/5 of all posts instead of all of them.
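Putting the schema and the notes above together, a rough sketch of how the same single table might look in PostgreSQL (an illustrative translation only, not a finalized Ayase schema; it adds the board column discussed in the JSONB section and uses partial indexes for the media hashes):
<pre>
CREATE TABLE post (
    id                bigserial   PRIMARY KEY,   -- insertion order, useful for search and backups
    board             text        NOT NULL,
    num               bigint      NOT NULL,
    thread            bigint      NOT NULL,
    opener            boolean     NOT NULL,
    time_posted       timestamptz NOT NULL,      -- per the Time section; bigint epochs would also work
    time_modified     timestamptz NOT NULL,
    time_deleted_post timestamptz,
    time_deleted_file timestamptz,
    time_archived     timestamptz,
    time_bumped       timestamptz,
    name              text,
    tripcode          text,
    capcode           text,
    posterid          text,
    country           text,
    email             text,
    subject           text,
    comment           text,
    media_time        bigint,
    media_hash        bytea,                     -- md5 from 4chan
    media_hash_src    bytea,                     -- sha256 of the full image, from the scraper
    media_hash_thb    bytea,                     -- sha256 of the thumbnail, from the scraper
    media_name        text,
    media_type        text,
    media_size        integer,
    media_dims_src_w  smallint,
    media_dims_src_h  smallint,
    media_dims_thb_w  smallint,
    media_dims_thb_h  smallint,
    spoiler           boolean     NOT NULL DEFAULT false,
    sticky            boolean,
    closed            boolean,
    posters           smallint,
    replies           smallint,
    hidden            boolean     NOT NULL DEFAULT false,
    extra             jsonb,
    UNIQUE (board, num)
);

CREATE INDEX post_bump_idx ON post (board, opener, time_bumped);
CREATE INDEX post_time_idx ON post (board, time_posted);
-- partial indexes: skip the majority of posts that have no media at all
CREATE INDEX post_md5_idx    ON post (media_hash, num)     WHERE media_hash IS NOT NULL;
CREATE INDEX post_sha256_idx ON post (media_hash_src, num) WHERE media_hash_src IS NOT NULL;
</pre>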
Alternative PostgreSQL Based Vision by Chakatan
This alternative vision uses PostgreSQL and implements various preferences of Chakatan.
https://gist.github.com/oka-tan/f794a6ac464a09f3581d0ac530e37b92
Page Views
The views we use to display data to users from the SQL database or search engine make up the primary mechanism through which users interact with the system, with the API being, for the most part, a reflection of such views.
All anonymous users are authorized only to read the database. A separate API and authentication system is required for admins and moderators to edit the content of the SQL Database through approved mechanisms.
Pagination
FoolFuuka had abysmal offset-based pagination, which put more and more stress on the SQL database the further back anons went. However, without a powerful search server, on many archives this was the only way to browse further back without knowing the thread or post number.
- Pagination Methodology: Cursor Pagination - Although 4chan uses offset pagination to provide a set of handy page numbers, page numbers don't make sense on an archive, as they are a relative marker that changes every time a new thread is added, and offset pagination becomes more inefficient the further back the user goes. (A keyset query sketch follows this list.)
- Thread Sort Order (Default: Last Modified) - The thread sort order should be last modified by default, as it is on 4chan and FoolFuuka.
- An admin may elect to have the WebUI display sort by thread creation date instead. To avoid ambiguity, the API requires the sort order to be specified explicitly, whichever the default is.
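A minimal keyset ("cursor") pagination sketch, assuming the single-table draft above with a boolean opener column (illustrative names only):
<pre>
-- First page: the 20 most recently bumped threads on /a/.
SELECT num, time_bumped
FROM post
WHERE board = 'a' AND opener
ORDER BY time_bumped DESC, num DESC
LIMIT 20;

-- Next page: the client sends back the (time_bumped, num) of the last thread it saw
-- as the cursor ($1, $2); no OFFSET, so cost does not grow with depth.
SELECT num, time_bumped
FROM post
WHERE board = 'a' AND opener
  AND (time_bumped, num) < ($1, $2)
ORDER BY time_bumped DESC, num DESC
LIMIT 20;
</pre>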
Queries
Queries are methods for users to request arbitrary content that is outside of the page views, without resorting to the search server.
Post Number
There should be an option to find a specific post number whether the user knows the board it was on, or not (a frequent situation when looking up screencaps).
Once submitted:
- If a specific board was defined, it should send the user to https://archive.url/board_name/post/12345678
- If no specific board was defined (important for looking up a post number from a screencap), it should send the user to https://archive.url/post/12345678
- The results should be a post on all boards that matches the post number, of which there may be multiple results or only one.
Also, if the post number points to an OP post, it should direct the user to the thread URL instead of the post URL.
These URLs should be usable without having to enter them in the WebUI.
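A sketch of the underlying lookups, again assuming the single-table draft with a board column (hypothetical names):
<pre>
-- Board known: /board_name/post/12345678 ($1 is the board, $2 the post number).
SELECT board, thread, num, opener
FROM post
WHERE board = $1 AND num = $2;

-- Board unknown: /post/12345678 may return one match per board ($1 is the post number).
SELECT board, thread, num, opener
FROM post
WHERE num = $1;
</pre>
If the returned row has opener set, the frontend redirects to the thread URL rather than the post URL.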
Experimental Concepts
These are experimental concepts which are not yet part of the Ayase standard.
PostgreSQL RBAC Row Permission System
Unlike most sites, on the 4chan archives ghostposts are anonymous, so the vast majority of users do not have accounts.
Essentially, the only users that have accounts on our archives are janitors, moderators, and admins. Therefore, row level access with the RBAC could be issued so that on the accounts and permissions table, a user can only access their own row, which is then used as Role Based permission policies restricting their read/write permissions on the rest of the database (janitors can only read and report, moderators can delete posts, admins can take full actions).
We don't exclude the possibility of having a larger userbase, perhaps with premium users able to issue GraphQL queries with fewer limits or access special ranges of data, and using PostgreSQL RBAC for that is not a bad idea either. While it might sound like lunacy to issue SQL accounts to users, it is better to secure things at the PostgreSQL database level (where RBAC is very mature) than to give full permissions to a single account used by the API, which then haphazardly issues permissions on its own and introduces exploits, as is normally done.
Kubernetes uses PostgreSQL RBAC successfully in production as seen here: https://kubedb.com/docs/0.11.0/guides/postgres/quickstart/rbac/
Official Documentation:
https://www.postgresql.org/docs/current/user-manag.html
https://www.postgresql.org/docs/current/sql-grant.html
https://www.postgresql.org/docs/current/ddl-rowsecurity.html
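A minimal sketch of the idea using PostgreSQL roles and row-level security (all table, column, and role names here are hypothetical):
<pre>
-- Staff accounts: each logged-in role may read only its own row.
CREATE TABLE staff (
    username text PRIMARY KEY,      -- matches the PostgreSQL role name
    role     text NOT NULL CHECK (role IN ('janitor', 'moderator', 'admin'))
);
ALTER TABLE staff ENABLE ROW LEVEL SECURITY;
CREATE POLICY staff_self ON staff
    FOR SELECT USING (username = current_user);

-- Anonymous web traffic goes through a read-only role.
CREATE ROLE anon_web NOLOGIN;
GRANT SELECT ON post TO anon_web;

-- Moderators can read everything but may only change the "hidden" flag.
CREATE ROLE moderator NOLOGIN;
GRANT SELECT ON post TO moderator;
GRANT UPDATE (hidden) ON post TO moderator;
</pre>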
Elasticsearch Engine
Elasticsearch is a NoSQL DB with a focus on search functionality. The existing archiver stack search system, based on Sphinxsearch (running on the archive's MySQL DB), has frequently been a limiting factor for various 4chan archives (slowdowns, search disabled on certain boards etc.) It is believed that moving the search system to Elasticsearch could alleviate some performance issues. It would also theoretically allow for horizontal scaling of search with increasing post volume, and to split archive functions across globally-distributed hardware resources, among other benefits.
An ongoing effort towards a "greenfield"-style Elasticsearch-based archiving project is hosted in these repos for a scraper (https://gitgud.io/baystdev/chan-scraper-es-node) and search stuff (https://gitgud.io/desuarchive/chansearch).
Options
- Add an Elasticsearch search layer to the existing stack, replacing Sphinxsearch, but only use it to return matching post IDs
  - May be the easiest option
  - Would be business as usual for non-search DB operations
  - Would require full in-memory duplication of contents in SQL DB and Elasticsearch (ES synced with MySQL)
  - Retain original post data retention/backup scheme, durability
- Add an Elasticsearch search layer to the existing stack, replacing Sphinxsearch, and use it to return entire documents/posts
  - Might be more intrusive than the above
  - Would still be business as usual for non-search DB operations
  - Would still require full duplication of contents in SQL DB and Elasticsearch on-disk, but post contents may not need to be held in-memory on the SQL side
  - Retain original post data retention/backup scheme, durability
- Replace the existing SQL DB completely with Elasticsearch
  - Possible long-term option, not the best to serve immediate needs
  - Would require rewrites of any DB-facing component, not just search
  - Needs new data retention/backup scheme
Plan
Determine which of the two "Elasticsearch to complement existing SQL DB" options is more practical in the short term, and develop a proof of concept. Run a modded FoolFuuka and an unmodded Asagi on a couple of low-traffic boards with the new Elasticsearch layer. [Link to Kanban board etc.]