* Operating System: Any Linux system on any architecture supported by Rust and Python.
* Database: PostgreSQL based - Tentatively, due to the end of support for TokuDB and the awful gotchas in MariaDB, we should consider PostgreSQL.
** TimescaleDB/PostgreSQL - As TimescaleDB is a superset of PostgreSQL, for smaller scale deployments it is also fully PostgreSQL compatible.
** CockroachDB - Seems better for horizontal scaling and decentralized resilience, but as the whole dataset is only about 1 TB, it might not yet be needed.
* Middleware/HTML Frontend: [https://github.com/bibanon/ayase Ayase] (Python, FastAPI, Jinja2 HTML templates) - An Ayase and Asagi schema compatible frontend system for viewing the databases created by these scrapers.
* Scraper:
https://www.postgresql.org/docs/current/datatype-net-types.html
=== Definition of an Image ===
It is important to hash the right thing when we process images, as various encoding differences can give the exact same pixel-perfect lossless bitmap a different hash.

Python PIL therefore decodes an image to its pixel-perfect form with Image.open() before we do any processing on it.

An image can also be in various modes, such as greyscale, RGB, and CMYK. To avoid all of that and to handle alpha transparency, we convert the bitmap with convert('RGBA').

So in Python PIL, we define an image as follows:
<pre>
from hashlib import sha3_256
from PIL import Image

# Decode to a pixel-perfect RGBA bitmap before hashing, so that encoding
# differences do not give the exact same lossless image a different hash.
i = Image.open(fn)
hashes['sha3i'] = sha3_256(i.convert('RGBA').tobytes()).digest()
</pre>
Find an equivalent for other languages.
=== Full Images ===
** While sha256sum cannot necessarily be made into a foreign key in JSONB, finding it from the images table is just a matter of putting it back into the search query and having the index do the rest.
** The SERIAL primary key is used even though sha256sum is the real key (as seen from its unique constraint), because it also doubles as a record of when the file was scraped and added to the database. This is necessary for incremental backup systems brought straight over from Asagi (still to be made, though).
The question of radix directories should also be considered. Since GitLab uses a 2 character/2 character layout without a problem even at large scale, that is at least a proven standard to follow. However, some archives prefer 2 characters/1 character, which creates fewer directories. A sketch of the 2/2 layout follows below.
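A minimal Python sketch of the GitLab-style 2 character/2 character layout, assuming the full sha256 hex digest is used as the stored filename and an images/ base directory (both are assumptions, not part of the standard):

<pre>
import hashlib
import os

def radix_path(sha256_hex, base="images"):
    # 2 char / 2 char radix layout, e.g. images/ab/cd/abcdef....
    return os.path.join(base, sha256_hex[0:2], sha256_hex[2:4], sha256_hex)

# Example with the hash of an arbitrary byte string
digest = hashlib.sha256(b"example file contents").hexdigest()
print(radix_path(digest))
</pre>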
=== Thumbs Only Scraping ===
# advance timestamp by one hour to correct it
timestamp = (timestamp + (60*60*1))
</pre>
https://wiki.bibanon.org/Ayase#Bakugo.27s_Proposed_and_Tested_Solution
FYI, the function posted on the wiki page above converts the Asagi timestamps back to UTC, not the other way around. The logic on that wiki page is there to handle the DST overlap, and it is part of a loop that goes through all posts in the table to convert them:
<pre>
import datetime
import pytz

def asagi_timestamp_conv(x):
    # Converts an Asagi America/New_York timestamp back to UTC (see note above)
    x = datetime.datetime.utcfromtimestamp(x)
    x = x.astimezone(pytz.timezone("America/New_York"))
    x = x.replace(tzinfo=pytz.timezone("UTC"))
    x = int(x.timestamp())
    return x
</pre>
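As noted above, this conversion is run as part of a loop over every post in the table. A minimal sketch of such a loop, assuming psycopg2 (i.e. the data is already in PostgreSQL), an Asagi-style doc_id column, and a placeholder board_a table name and connection string:

<pre>
import psycopg2

conn = psycopg2.connect("dbname=archive")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute('SELECT doc_id, "timestamp" FROM board_a')
    rows = cur.fetchall()  # fine for a sketch; batch this for real tables
    for doc_id, ts in rows:
        cur.execute(
            'UPDATE board_a SET "timestamp" = %s WHERE doc_id = %s',
            (asagi_timestamp_conv(ts), doc_id),
        )
</pre>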
=== Posts ===
* In Asagi: posts are not made up of the raw data from the 4chan API, which includes HTML escapes. They are instead [https://github.com/bibanon/asagi/blob/master/src/main/java/net/easymodo/asagi/YotsubaAbstract.java#L90 stored unescaped after processing by the function <pre>this.cleanSimple(text);</pre>] Regexes are then done to process only certain HTML into BBCode tags.

This may have been due to its historical use as an HTML scraper.

Since cleaning HTML by regex is a lossy conversion, it needs to be decided whether to leave Asagi posts as is, or at least convert the BBCode back into simulations of contemporary 4chan HTML.

* We should probably reverse all Asagi regexes that converted certain 4chan HTML to BBCode.
* Note that Neofuuka scrapes also store the raw unprocessed 4chan HTML as "comment" in the exif field.

For all future posts, at least, just leave the text verbatim from the 4chan API; don't even bother searching for unsafe HTML (4chan processes this already, and any removal of unsafe HTML should be either a case by case decision or postprocessing by the frontend).

* Leave the exif table as is. Have postprocessing convert it into JSON if necessary.
* The oekaki replay is just a file that is uploaded next to the image; it's not particularly hard to scrape.

The proposed Python code for converting BBCode back to the original 4chan HTML tags is as follows:

https://gist.github.com/bakugo/1545c39231d6ed10226b4600d7270681
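For illustration only, a hedged sketch of the kind of regex reversal involved, using a single assumed mapping ([spoiler] back to 4chan's <code>&lt;s&gt;</code> spoiler tags); the authoritative set of tag pairs is the one in the gist above and in Asagi's own regexes:

<pre>
import re

# Hypothetical single mapping for illustration; see the gist for the full set.
SPOILER = re.compile(r"\[spoiler\](.*?)\[/spoiler\]", re.S)

def bbcode_to_4chan_html(comment):
    return SPOILER.sub(r"<s>\1</s>", comment)

print(bbcode_to_4chan_html("a [spoiler]hidden[/spoiler] word"))
</pre>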
=== PostgreSQL JSONB Schema ===
Another thing is that maybe we shouldn't have separate tables for every board like Asagi currently does. If Reddit or 8chan's Infinity platform were being archived by this, that would be impractical to operate. While having a single table sounds like lunacy as well, PostgreSQL allows tables to be partitioned based on a single column, so an additional `board` column can be added, as sketched below.
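A minimal sketch of that idea, assuming psycopg2, a hypothetical <code>posts</code> table, and a deliberately reduced column set (the real Ayase schema has far more columns):

<pre>
import psycopg2

conn = psycopg2.connect("dbname=archive")  # placeholder connection string
with conn, conn.cursor() as cur:
    # One posts table partitioned by board instead of one table per board
    cur.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            board     text    NOT NULL,
            num       bigint  NOT NULL,
            op        boolean NOT NULL,
            comment   text,
            PRIMARY KEY (board, num)
        ) PARTITION BY LIST (board);
    """)
    cur.execute("CREATE TABLE IF NOT EXISTS posts_g PARTITION OF posts FOR VALUES IN ('g');")
</pre>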
=== PostgreSQL Optimizations ===
To improve performance, PostgreSQL may need a lot more tuning than MariaDB.

The basics, if you don't do anything else:

https://bun.uptrace.dev/postgres/performance-tuning.html

You also want to CLUSTER tables and VACUUM FULL manually every now and then, and you want to disable autovacuum when migrating data. A sketch of these maintenance steps follows below.
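A minimal sketch of those maintenance steps, assuming psycopg2 and hypothetical <code>posts</code> table and index names; autocommit is enabled because VACUUM cannot run inside a transaction block:

<pre>
import psycopg2

conn = psycopg2.connect("dbname=archive")  # placeholder connection string
conn.autocommit = True
with conn.cursor() as cur:
    # Disable autovacuum on the table before a bulk migration
    cur.execute("ALTER TABLE posts SET (autovacuum_enabled = false);")
    # Periodic manual maintenance
    cur.execute("CLUSTER posts USING posts_pkey;")
    cur.execute("VACUUM FULL posts;")
    # Re-enable autovacuum afterwards
    cur.execute("ALTER TABLE posts SET (autovacuum_enabled = true);")
</pre>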
=== Single Table without Side Tables or Triggers ===
* idx__media_hash__num - used to find the first occurrence of media_hash for canonical filenames
* idx__media_hash_src__num - used to find the first occurrence of media_hash_src for canonical filenames
==== Other Notes ====
The 'replies' column must be clamped to the u16 maximum when counting, since 64k+ replies are possible.

Do not store empty strings or zeros where a default value of NULL would work better. Mitsuba is currently saving 0s instead of NULLs. (Bakugo: "using nullable fields saves space right? Everyone can agree on that in most cases")

Also, adding more nullable fields, even ones that are rarely used, has no performance downside. You usually just want to explicitly exclude them from the index.

So for example, for media_hashes, you don't want to index all of the NULL values, because they're never used in view-same searching.

So you: CREATE INDEX post_media_internal_hash_index ON post(media_internal_hash, post_number) WHERE media_internal_hash IS NOT NULL;

Which means you're indexing 1/5 of all posts instead of all of them.
=== Alternative PostgreSQL Based Vision by Chakatan ===

This alternative vision uses PostgreSQL and implements various preferences of Chakatan.

https://gist.github.com/oka-tan/f794a6ac464a09f3581d0ac530e37b92
== Page Views ==

The views we use to display data to users from the SQL database or search engine make up the primary mechanism by which users interact with the system, with the API being, for the most part, a reflection of such views.

All anonymous users are authorized only to read the database. A separate API and authentication system is required for admins and moderators to edit the content of the SQL database through approved mechanisms.
=== Pagination ===

FoolFuuka had abysmal offset based pagination, which put more and more stress on the SQL database the further back anons went. However, without a powerful search server, on many archives this was the only way to browse further back without knowing the thread or post number.

* Pagination Methodology: Cursor Pagination - Although 4chan uses offset pagination to provide a set of handy page numbers, page numbers don't make sense on an archive, as they are a relative marker that changes every time a new thread is added, and offset pagination becomes more inefficient the further back the user goes (see the sketch after this list).
* Thread Sort Order (Default: Last Modified) - The thread sort order should be last modified by default, as it is on 4chan and FoolFuuka.
** An admin may elect to have the WebUI sort by thread creation date instead. To ensure disambiguation, the API requires the search ordering method to be specified, whichever default is used.
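A minimal sketch of cursor (keyset) pagination in Python, assuming psycopg2 and a hypothetical <code>threads</code> table with <code>board</code>, <code>thread_num</code> and <code>last_modified</code> columns; the cursor is simply the last (last_modified, thread_num) pair the client has already seen:

<pre>
def thread_index_page(cur, board, cursor=None, page_size=20):
    # cursor is a (last_modified, thread_num) tuple from the previous page,
    # or None for the first page.
    if cursor is None:
        cur.execute(
            """SELECT thread_num, last_modified FROM threads
               WHERE board = %s
               ORDER BY last_modified DESC, thread_num DESC
               LIMIT %s""",
            (board, page_size),
        )
    else:
        cur.execute(
            """SELECT thread_num, last_modified FROM threads
               WHERE board = %s AND (last_modified, thread_num) < (%s, %s)
               ORDER BY last_modified DESC, thread_num DESC
               LIMIT %s""",
            (board, cursor[0], cursor[1], page_size),
        )
    return cur.fetchall()
</pre>

Unlike an OFFSET query, this stays cheap no matter how far back the user browses, because an index on (board, last_modified, thread_num) can be walked directly from the cursor position.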
== Queries ==

Queries are methods for users to request arbitrary content that is outside of the page views, without resorting to the search server.

=== Post Number ===

There should be an option to find a specific post number whether or not the user knows the board it was on (a frequent situation when looking up screencaps).

Once submitted:

# If a specific board was defined, it should send the user to https://archive.url/board_name/post/12345678
# If no specific board was defined (important for looking up a post number from a screencap), it should send the user to https://archive.url/post/12345678
#* The results should be the posts on all boards that match the post number, of which there may be multiple or only one.

Also, if the post number points to an OP post, it should direct the user to the thread URL instead of the post URL.

These URLs should be usable without having to enter them in the WebUI. A sketch of such routes is shown below.
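A minimal sketch of these routes in FastAPI (the framework Ayase already uses), assuming hypothetical <code>find_post</code> and <code>find_posts_all_boards</code> lookup helpers and the URL scheme described above:

<pre>
from fastapi import FastAPI, HTTPException
from fastapi.responses import RedirectResponse

app = FastAPI()

@app.get("/{board}/post/{num}")
async def board_post(board: str, num: int):
    post = await find_post(board, num)  # hypothetical DB lookup
    if post is None:
        raise HTTPException(status_code=404)
    if post["op"]:
        # OP posts redirect to the thread URL instead of the post URL
        return RedirectResponse(f"/{board}/thread/{num}")
    return post

@app.get("/post/{num}")
async def post_any_board(num: int):
    # May return multiple results, one per board that has this post number
    return await find_posts_all_boards(num)  # hypothetical DB lookup
</pre>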
== Experimental Concepts ==

These are experimental concepts which are not yet part of the Ayase standard.
=== PostgreSQL RBAC Row Permission System ===