Ayase/MD5 Collisions

In recent years, a variety of mechanisms for generating md5 collisions have become practical and well publicised. More recently still, practical methods have been found to apply these collisions to media files. A good demonstration of how broken md5 is can be found in animated "hashquines": animated GIFs that use md5 collisions to display their own md5 hash.

4chan and its archives depend on md5 to a certain extent for identifying unique media files. 4chan uses md5 in its spam detection process (and elsewhere), and Asagi-based archivers rely on the "uniqueness" property for deduplication.

= MD5 Vulnerabilities =

There are two forms of the MD5 collision exploit discovered so far: a "chosen-prefix" and an "identical-prefix" collision mechanism.
 * The "identical-prefix" style of exploit inserts "collision blocks" within otherwise identical files to generate md5 collisions. Files generated with this style of collision have been demonstrated to pass through 4chan's post-processing steps without alteration. 4chan and its archives are vulnerable at least to GIF md5 collisions, and probably to exploits crafted for other file formats as well. For more information on identical-prefix collisions, see this explanation/example and this discussion on hash collisions in various image formats. Here's an example of an archived collision, using some of corkami's example images. Both images show in the image search because Asagi deduplication is per-board.
 * The "chosen/fixed-prefix" exploit allows for an arbitrary pair of chosen files to be appended with "collision blocks" until they share the same md5. More info on this style of collision can be found here. This style of exploit can be countered on the backend by removing bytes past the media file trailer (a pattern signifying the end of the file), and it seems that this is part of the post-processing 4chan does on media upload.
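
The countermeasure described for chosen-prefix collisions, dropping everything after the file trailer, can be sketched for a single format. The example below is a minimal illustration for PNG only, where the IEND chunk plays the role of the trailer. `strip_after_png_trailer` is a hypothetical helper name, and nothing here is claimed to match 4chan's actual post-processing. It only walks the chunk list and does not verify CRCs.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def strip_after_png_trailer(data: bytes) -> bytes:
    """Return data truncated just after the IEND chunk (hypothetical helper).

    Raises ValueError if the signature is missing or no complete IEND
    chunk is found. Chunk CRCs are NOT checked in this sketch.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 12 + length
        if end > len(data):
            break  # truncated chunk
        if chunk_type == b"IEND":
            return data[:end]  # anything appended after the trailer is dropped
        pos = end
    raise ValueError("no complete IEND chunk found")
```

Because appended collision blocks change the bytes that are hashed, removing them means the two stored files no longer share an md5. This does not help against identical-prefix collisions, whose collision blocks sit inside the file.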

The existing types of md5 collision exploits are not known to pose a major risk to either the main site or its archives, because they can only be performed intentionally by an "attacker" who must generate and post both pieces of media. However, they do introduce a quirk which has a minor impact on the integrity of the archive:


 * Hiding images from archives: a user can post two md5-colliding images to the same board with a delay between them, and the second image will never be archived by Asagi-based archives. This is a consequence of Asagi's md5-based deduplication, which skips downloading an image if its md5 (taken from the md5 field of the 4chan API) is already present in its database. This exploit is somewhat concerning since it prevents 100% fidelity at the post level: the wrong image will be linked in the archive.
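
As a rough illustration of why the second image is skipped, the sketch below models a per-board deduplication check keyed on md5. The class and function names are hypothetical and this is not Asagi's actual code. It also assumes that the API's md5 field carries the base64-encoded raw digest.

```python
import base64
import hashlib


def api_style_md5(data: bytes) -> str:
    """Base64-encode the raw md5 digest (assumed format of the API md5 field)."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


class Md5Deduplicator:
    """Toy model of per-board, md5-keyed media deduplication."""

    def __init__(self):
        # (board, md5) -> number of the first post seen with that media hash
        self.seen = {}

    def should_download(self, board: str, post_no: int, md5: str) -> bool:
        """Return True only the first time an md5 is seen on a board."""
        key = (board, md5)
        if key in self.seen:
            # Same md5 already stored: the download is skipped, so a
            # colliding but visually different image is never fetched.
            return False
        self.seen[key] = post_no
        return True
```

With this model, two colliding files posted to the same board yield `True` for the first post and `False` for the second, so the second post ends up linked to the first file. The same md5 on a different board is downloaded again, which matches the per-board behaviour noted above.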

Neither of the above types of attack is a pre-image attack; a weaponizable pre-image attack would pose a much larger risk to the main site and archives. If semi-arbitrary images could be generated with the same md5 as an arbitrary image posted by another user, automod or mod systems relying on media hashes could be gamed to ban non-offending users or media. For example, a user could post an image, and another user could then generate an image with illegal or ban-worthy content that shares the first image's md5. The first user or media file could end up banned from the main site and/or archives, along with the offending user/media. Note that this scenario is completely theoretical, and even if a pre-image attack were to exist, it would also need to be very flexible to be weaponized in this way.

= Mitigations =

A way to eliminate the effects of md5 collisions on the archive side would be to replace usage of md5 hashes with a more robust type (probably SHA256). This would have the following consequences:


 * The archiver would need to download every image it encounters, instead of doing a deduplication check.
   * The current deduplication process is to only download if the main site API md5 field does not match the md5 of a previously downloaded image.
   * Since the main site API does not have an SHA256 field, the image would need to be downloaded and the SHA256 generated locally (with any associated performance penalty for the hashing).
   * As a result, the server's media download rate would need to go up a bit (increased archive download bandwidth and Cloudflare etc. rate exhaustion).
 * Importing image dumps would require reprocessing.
 * If the hash is to be used to implicitly address media files (one of several methods), a new field would need to be added to the archive DB/API.
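
The trade-off listed above can be sketched as a small content-addressed store keyed on a locally computed SHA256. This is a minimal sketch under stated assumptions: `Sha256MediaStore` and its fields are hypothetical names, not part of any existing archiver, and the caller is assumed to have already downloaded the full file, which is exactly the extra bandwidth cost described above.

```python
import hashlib


class Sha256MediaStore:
    """Toy content-addressed media store keyed by SHA256 (hypothetical)."""

    def __init__(self):
        self.blobs = {}       # sha256 hex digest -> media bytes
        self.post_media = {}  # post number -> sha256 hex digest

    def add(self, post_no: int, data: bytes) -> bool:
        """Link a post to its media; return True if the bytes were newly stored."""
        # The hash can only be computed after the download has completed.
        digest = hashlib.sha256(data).hexdigest()
        self.post_media[post_no] = digest
        if digest in self.blobs:
            return False  # true duplicate: keep a single stored copy
        self.blobs[digest] = data
        return True
```

Storage is still deduplicated, and two md5-colliding files are kept as separate entries because their SHA256 digests differ. The `post_media` mapping stands in for the new field that the archive DB/API would need.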

The gist is that it would add quite a bit of complexity to handle a small number of deliberately-introduced colliding images (perfect vs. 99.999999% integrity).

Alternatively, the archiver could do away with deduplication altogether, at the cost of far more storage and of redownloading identical images. This would be simpler, but it's not really an option since the archivers are very cost-sensitive.

Other workarounds have been proposed to evaluate media file uniqueness before downloading, i.e. using other fields in conjunction with the md5, but no valid solution has yet been found.


 * Using the filesize in conjunction with the md5 is no better than using just the md5, since the known collision attacks generate files with the same filesize.
 * Thumbnail hashes (which require a much smaller download than the full-size media for verification) do not guarantee uniqueness either, since colliding GIFs can have the same first frame, and the first frame determines the thumbnail.