Editing Ayase
From Bibliotheca Anonoma
Latest revision | Your text | ||
Line 10: | Line 10: | ||
* Operating System: Any Linux system on any architecture supported by Rust and Python. | * Operating System: Any Linux system on any architecture supported by Rust and Python. | ||
* Database: | * Database: TimescaleDB/PostgreSQL - As TimescaleDB is a superset of PostgreSQL, for smaller scale deployments it is also fully PostgreSQL compatible. | ||
* Middleware/HTML Frontend: [https://github.com/bibanon/ayase Ayase] (Python, FastAPI, Jinja2 HTML templates) - An Ayase and Asagi schema compatible frontend system for viewing the databases created by these scrapers. | * Middleware/HTML Frontend: [https://github.com/bibanon/ayase Ayase] (Python, FastAPI, Jinja2 HTML templates) - An Ayase and Asagi schema compatible frontend system for viewing the databases created by these scrapers. | ||
* Scraper: | * Scraper: | ||
Line 67: | Line 65: | ||
** While sha256sum cannot necessarily be made into a foreign key in JSONB, finding it from the images table is just a matter of putting it back into the search query and having the index do the rest. | ** While sha256sum cannot necessarily be made into a foreign key in JSONB, finding it from the images table is just a matter of putting it back into the search query and having the index do the rest. | ||
** The SERIAL primary key is used even though sha256sum is the real key, as seen from its unique constraint, because it also doubles as a record of when the file was scraped and added to the database. This is necessary for incremental backup systems brought straight over from Asagi (which are still to be made). | ** The SERIAL primary key is used even though sha256sum is the real key, as seen from its unique constraint, because it also doubles as a record of when the file was scraped and added to the database. This is necessary for incremental backup systems brought straight over from Asagi (which are still to be made). | ||
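The role of the serial key described above can be sketched as follows. This is only an illustration under stated assumptions: SQLite (from the Python standard library) stands in for PostgreSQL, `INTEGER PRIMARY KEY AUTOINCREMENT` stands in for SERIAL, and the table layout and the helper names `add_image` and `added_since` are hypothetical, not taken from Ayase.

```python
import sqlite3

# SQLite stands in for PostgreSQL; AUTOINCREMENT plays the role of SERIAL.
# The table layout and helper names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE images (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        sha256sum TEXT NOT NULL UNIQUE
    )
    """
)


def add_image(sha256sum):
    """Insert a hash if it is new and return its id either way."""
    conn.execute(
        "INSERT OR IGNORE INTO images (sha256sum) VALUES (?)", (sha256sum,)
    )
    row = conn.execute(
        "SELECT id FROM images WHERE sha256sum = ?", (sha256sum,)
    ).fetchone()
    return row[0]


def added_since(last_id):
    """Return hashes inserted after last_id, in insertion order."""
    rows = conn.execute(
        "SELECT sha256sum FROM images WHERE id > ? ORDER BY id", (last_id,)
    )
    return [r[0] for r in rows]


first = add_image("aa" * 32)
second = add_image("bb" * 32)
duplicate = add_image("aa" * 32)  # same hash, so the existing row is reused
checkpoint = second               # an incremental backup would remember this id
third = add_image("cc" * 32)
new_files = added_since(checkpoint)  # only files added after the checkpoint
```

The unique constraint on `sha256sum` does the deduplication, while the monotonically increasing `id` lets a backup job ask for "everything after the last id I saw" without a separate timestamp column.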
=== Thumbs Only Scraping === | === Thumbs Only Scraping === | ||
Line 203: | Line 199: | ||
=== Posts === | === Posts === | ||
* In Asagi: posts are not made up of the raw data from the 4chan API, which includes HTML escapes. They are instead [https://github.com/bibanon/asagi/blob/master/src/main/java/net/easymodo/asagi/YotsubaAbstract.java#L90 stored unescaped after processing by the function <pre>this.cleanSimple(text);</pre>] | * In Asagi: posts are not made up of the raw data from the 4chan API, which includes HTML escapes. They are instead [https://github.com/bibanon/asagi/blob/master/src/main/java/net/easymodo/asagi/YotsubaAbstract.java#L90 stored unescaped after processing by the function <pre>this.cleanSimple(text);</pre>] This may have been due to its historical use as an HTML scraper. Whether we should continue to replicate this is an open question as it does not seem to have major data loss and should still be compatible as output, but my opinion is that especially as it will be put into json, we should escape it properly so that the JSON engine does not have to do it by itself. | ||
This may have been due to its historical use as an HTML scraper. | |||
Since cleaning HTML by regex is a lossy conversion, it is best to leave existing Asagi posts the way they were cleaned, and to store all future posts verbatim from the 4chan API, with only simple security-related HTML tag removal if at all necessary (4chan already does some). | |||
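To illustrate why unescaping is hard to reverse, here is a minimal sketch using Python's standard `html` module. The sample string is made up for illustration, and `html.unescape` is only a rough stand-in for Asagi's `cleanSimple`, which may do more than entity unescaping.

```python
import html

# Made-up comment text in an escaped form, as an API might return it.
raw = "it&#039;s &gt;implying"

# Roughly what an unescaping cleaner would store.
unescaped = html.unescape(raw)

# An attempt to recover the escaped form afterwards.
re_escaped = html.escape(unescaped)

# The apostrophe comes back as a different entity, so the original
# bytes cannot be reproduced from the cleaned text alone.
round_trip_ok = (re_escaped == raw)
```

Here the readable text survives, but the exact original encoding does not, which is one argument for storing new posts verbatim and unescaping only at display time.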
=== PostgreSQL JSONB Schema === | === PostgreSQL JSONB Schema === | ||
Line 228: | Line 210: | ||
Another thing is that maybe we shouldn't have separate tables for every board like Asagi currently does. If Reddit or 8chan's Infinity platform was getting archived by this, it would be impractical to operate. While having a single table sounds like lunacy as well, PostgreSQL allows tables to be partitioned based on a single column, so an additional `board` column can be added. | Another thing is that maybe we shouldn't have separate tables for every board like Asagi currently does. If Reddit or 8chan's Infinity platform was getting archived by this, it would be impractical to operate. While having a single table sounds like lunacy as well, PostgreSQL allows tables to be partitioned based on a single column, so an additional `board` column can be added. | ||
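As a rough sketch of the single-table idea, the helper below generates PostgreSQL declarative-partitioning DDL with one partition per board. It assumes PostgreSQL 11 or later (for primary keys on partitioned tables); the table name `posts`, its columns, and the function `partition_ddl` are hypothetical and not part of any existing Ayase schema.

```python
import re


def partition_ddl(boards):
    """Build DDL for one parent table that is list-partitioned by board."""
    statements = [
        "CREATE TABLE posts (\n"
        "    board TEXT NOT NULL,\n"
        "    num BIGINT NOT NULL,\n"
        "    data JSONB,\n"
        "    PRIMARY KEY (board, num)\n"  # the key must include the partition column
        ") PARTITION BY LIST (board);"
    ]
    for board in boards:
        # Board names end up inside identifiers, so accept only a safe subset.
        if not re.fullmatch(r"[a-z0-9]+", board):
            raise ValueError(f"unexpected board name: {board!r}")
        statements.append(
            f"CREATE TABLE posts_{board} PARTITION OF posts "
            f"FOR VALUES IN ('{board}');"
        )
    return statements


ddl = partition_ddl(["g", "a"])
```

Queries then go against `posts` with a `WHERE board = ...` filter, and the planner can restrict the scan to the matching partition, so adding a board means adding a partition rather than a new set of tables.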
=== Single Table without Side Tables or Triggers === | === Single Table without Side Tables or Triggers === | ||
Line 348: | Line 320: | ||
* idx__media_hash__num - used to find first occurrence of media_hash for canonical filenames | * idx__media_hash__num - used to find first occurrence of media_hash for canonical filenames | ||
* idx__media_hash_src__num - used to find first occurrence of media_hash_src for canonical filenames | * idx__media_hash_src__num - used to find first occurrence of media_hash_src for canonical filenames | ||
==== Other Notes ==== | ==== Other Notes ==== | ||
'replies' column must be clamped to u16 max when counting, since 64k+ replies are possible | 'replies' column must be clamped to u16 max when counting, since 64k+ replies are possible | ||
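A minimal sketch of the clamping rule, assuming the counter is stored as an unsigned 16-bit value; the function name `clamp_replies` is hypothetical.

```python
U16_MAX = 0xFFFF  # 65535, the largest value an unsigned 16-bit column can hold


def clamp_replies(count):
    """Saturate a reply count at U16_MAX instead of letting it overflow."""
    if count < 0:
        raise ValueError("reply count cannot be negative")
    return min(count, U16_MAX)
```

For example, `clamp_replies(70000)` returns `65535`, while smaller counts pass through unchanged.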
=== PostgreSQL RBAC Row Permission System === | === PostgreSQL RBAC Row Permission System === |