Ex.ua

Everything on ex.ua is an object with an ID.

Objects can be threads or posts. Threads contain posts. Both live in the same ID namespace. I think the thread > post hierarchy does not recurse (as in, there's only the one top level). I am not 1000% sure, but I haven't yet seen anything that contradicts this assumption.
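
Roughly, the object model as I understand it so far looks like this (field names are just for illustration, nothing here is from the site itself):

```python
# Tiny sketch of the assumed object model: one shared ID space, threads
# holding a flat (non-recursive) list of post IDs.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExObject:
    object_id: int                                        # shared ID namespace for threads and posts
    is_thread: bool = False
    post_ids: List[int] = field(default_factory=list)     # only populated for threads
```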

By bruteforcing from 0 until you just keep getting 302 Moved's to / (for 5000+ consecutive IDs or so), you will catch all threads and all posts in threads. Even though this fetches the full text of all posts, you will also need to handle pagination in order to preserve the semantic structure of which posts are in which threads and in what order they appear. You get <link rel="next" id="browse_next" href="..."> pointing to the next page in a paginated thread; this entire node disappears on the last page of the thread. NOTE! Specify &per=200 (the max) to get the highest number of posts per page.
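
A minimal sketch of that loop, assuming the base URL, the 302-to-/ behaviour, and the browse_next markup exactly as described above (none of this has been verified against the live site; the exact Location header format in particular is a guess):

```python
# ID bruteforce plus thread pagination, per the notes above.
import re
import requests
from urllib.parse import urljoin

BASE = "http://www.ex.ua"     # assumed base URL
UKEY = "REPLACE_ME"           # login cookie value, see the caveat further down
GIVE_UP_AFTER = 5000          # consecutive dead IDs before assuming we're past the end

session = requests.Session()
session.cookies.set("ukey", UKEY)

# Attribute order assumed to be exactly as quoted above.
NEXT_RE = re.compile(r'<link rel="next" id="browse_next" href="([^"]+)"')

def fetch_object(object_id):
    """Fetch /<id> plus all paginated continuations; return the raw HTML pages."""
    pages = []
    url = "%s/%d?per=200" % (BASE, object_id)     # per=200 = max posts per page
    while url:
        resp = session.get(url, allow_redirects=False)
        if resp.status_code == 302 and resp.headers.get("Location") in ("/", BASE + "/"):
            return None                           # ID does not exist
        pages.append(resp.text)                   # keep the raw HTML, parse later
        m = NEXT_RE.search(resp.text)             # node disappears on the last page
        url = urljoin(url, m.group(1)) if m else None
    return pages

object_id, misses = 0, 0
while misses < GIVE_UP_AFTER:
    pages = fetch_object(object_id)
    misses = 0 if pages is not None else misses + 1
    # TODO: write pages to disk keyed by object_id
    object_id += 1
```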

Posts can also have files attached to them. Any files attached to a post are visible as XML metadata at ex.ua/r_view/{id} (rover.info works too; I'm just highlighting that this API call works on ex.ua and without a login). This API call only sees attached files, not forum comments or other information. I understand that the site has some HTML formatting options for posts; this API call drops those, only preserving newlines (as <br>s) and returning plaintext.
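
Something like this should do for pulling the attached-file metadata. Only the /r_view/{id} URL pattern comes from the above; the XML element and attribute names are guesses and will need adjusting against a real response:

```python
# Fetch and loosely mine the r_view XML for one post.
import xml.etree.ElementTree as ET
import requests

def fetch_r_view(session, object_id, base="http://www.ex.ua"):
    resp = session.get("%s/r_view/%d" % (base, object_id))
    resp.raise_for_status()
    return resp.text                      # store the raw XML as-is

def list_file_urls(xml_text):
    """Best-effort extraction of /get/... links from the r_view XML."""
    root = ET.fromstring(xml_text)
    urls = []
    for el in root.iter():                # walk every element, attribute names unknown
        href = el.get("href") or ""
        if "/get/" in href:
            urls.append(href)
    return urls
```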

The site also has a /view_comments/{id} system which allows comments to be attached to a post. This appears to be an entirely different system; the HTML is noticeably different, and the pagination is different. Detecting the last page is a bit trickier here and generally requires poking around for specific HTML.

Both pagination systems allow you to "overshoot" past the last page and will give you an empty page in that case. Detecting the empty page could be a reliable way to do pagination too.
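For /view_comments/ specifically, the overshoot strategy might look like the sketch below. The page parameter name ("p") and the emptiness check are pure placeholders; both need confirming against real pages:

```python
# Overshoot-based pagination: keep requesting pages until one comes back empty.
def fetch_comment_pages(session, object_id, base="http://www.ex.ua"):
    pages = []
    page = 0
    while True:
        url = "%s/view_comments/%d?p=%d" % (base, object_id, page)   # "p" is a guess
        html = session.get(url).text
        if "comment" not in html.lower():   # placeholder check for an empty page
            break
        pages.append(html)                  # keep the raw HTML, parse later
        page += 1
    return pages
```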

Files are the simplest, and have a /get/... URL. These can be found inside the r_view output. A /get/ link can be rewritten to a /torrent/ link, which may return a torrent file (this is not implemented for all files). Apparently there used to be a tracker.ex.ua. I'm not sure if the hashes in the torrent files are different from the md5sums the site delivers or if the torrent files return anything extra.
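
The /get/ to /torrent/ rewrite is just a string swap; the only subtlety is telling a real torrent apart from whatever the site returns when one isn't offered. A quick hedged sketch (the "starts with d" check is just the bencoded-dictionary prefix, the site's actual failure mode is unknown):

```python
def try_torrent(session, get_url):
    """Ask for the /torrent/ variant of a /get/ link; return None if it isn't offered."""
    resp = session.get(get_url.replace("/get/", "/torrent/", 1))
    # Bencoded torrent files always start with "d"; anything else is probably
    # an HTML error page or a redirect back to the post.
    if resp.ok and resp.content.startswith(b"d"):
        return resp.content
    return None
```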

There's also the /user/ namespace. This does not use IDs. I downloaded the Wayback Machine's (*partial*) scrape of ex.ua and grepped the entire thing for "/user/[A-Za-z0-9_-]". I got just under 100k results back. That list is likely missing 5-10% of users - a couple of people posted some tiny userlists, and those lists contained users not in the Wayback Machine scrape. The merged list is attached.

The r_view API trick and crawling the IDs will find you other user accounts (via realtime regex matching) - but you will need to maintain some kind of global queue of users to download, so that you fetch user profiles you don't already have as you find them. I do not think the user pages use pagination; each should be a single HTTP request.
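
The queue itself is simple. A sketch, with the username regex mirroring the grep pattern I used on the Wayback data (plus a "+" so it captures whole names); the /user/{name} URL shape is an assumption:

```python
# Global dedupe queue of user profiles to fetch.
import re
from collections import deque

USER_RE = re.compile(r'/user/([A-Za-z0-9_-]+)')

seen_users = set()
user_queue = deque()

def note_users(html):
    """Call this on every page we fetch; queue any usernames we haven't seen yet."""
    for name in USER_RE.findall(html):
        if name not in seen_users:
            seen_users.add(name)
            user_queue.append(name)

def drain_user_queue(session, base="http://www.ex.ua"):
    while user_queue:
        name = user_queue.popleft()
        html = session.get("%s/user/%s" % (base, name)).text
        note_users(html)      # profiles may link to more users
        # TODO: write html (and the avatar image) to disk
```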

THE ONE BIG POTENTIAL CAVEAT EMPTOR: Everything needs to be done with a login. The only cookie you need for this is "ukey". The potential issue here is that logged-in accounts are arguably easier to track and may have access ratelimits attached. I have no experience yet with what happens if rover.info is scraped at high speed with a login.
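
Wiring the cookie (and some crude self-throttling, since we don't know the ratelimits yet) into one session would look roughly like this; the cookie domain and the 1-second delay are placeholders:

```python
# One requests.Session carrying the "ukey" login cookie for every request.
import time
import requests

def make_session(ukey):
    s = requests.Session()
    s.cookies.set("ukey", ukey, domain=".ex.ua")   # domain is an assumption
    s.headers["User-Agent"] = "Mozilla/5.0"        # look like a normal browser
    return s

def polite_get(session, url, delay=1.0, **kwargs):
    resp = session.get(url, **kwargs)
    time.sleep(delay)          # crude throttle until real limits are known
    return resp
```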

Side notes - user pages have avatars on them that I'd want to fetch too, and r_view gives you a <picture ...> tag with the image associated with the post. In a lot of cases this is album art. If I have ACD, I definitely want to grab these. Also - images on the site accept a ?size parameter, e.g. ....jpg?1600 will get you the 1600px-wide version. Dropping the ?.... suffix will get you the full-resolution version (which is generally what you want with 100TB available).
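
Handling the sized vs. full-resolution image URLs is just query-string manipulation, assuming the behaviour described above:

```python
# Strip the ?<width> query for the original file, or add one for a sized variant.
from urllib.parse import urlsplit, urlunsplit

def full_resolution(url):
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def sized(url, width=1600):
    return "%s?%d" % (full_resolution(url), width)
```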


How I'd approach this:

I'm not experienced at HTML parsing. I don't know everything there is to know about this site, and I only have less than 15 days. I would personally go for an approach of simply OM NOM NOMing the HTML from the server, saving it as-is, and then batch-parsing everything post-scrape. That way we have an acid test, we can manually inspect the conversion, etc etc.
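
The "save it as-is, parse later" part could be as dumb as writing every response body to disk under a filename derived from its URL, so the post-scrape batch parser (and any manual inspection) works from exact copies. The directory layout here is just a suggestion:

```python
# Dump raw response bodies to disk, keyed by a hash of the URL.
import hashlib
import os

def save_raw(out_dir, url, body_bytes):
    os.makedirs(out_dir, exist_ok=True)
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(out_dir, digest + ".html")
    with open(path, "wb") as f:
        f.write(body_bytes)
    with open(os.path.join(out_dir, "index.txt"), "a", encoding="utf-8") as f:
        f.write("%s\t%s\n" % (digest, url))    # keep a URL -> file map
    return path
```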

This means that the only problems you need to solve NOW are pagination and userlists. That's it. I think this is viable to reason through and solve very quickly (I've been working to do this myself).

I am not sure of any other gotchas or things that would need to be done. If anybody sees anything that doesn't fit into the above please let me know.