Ayase/MD5 Collisions

From Bibliotheca Anonoma
Revision as of 23:04, 4 September 2019 by Baystdev (talk | contribs) (Adds mitigations section)

MD5 Vulnerabilities

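Two forms of MD5 collision exploit are known so far: a fixed-prefix and an unfixed-prefix collision mechanism. (As described below, these correspond to what the cryptographic literature usually calls chosen-prefix and identical-prefix collisions, respectively.)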

  • The "unfixed-prefix" style of exploit inserts "collision blocks" within otherwise identical files to generate MD5 collisions. Files generated with this style of collision have been demonstrated to pass through 4chan's post-processing steps without alteration. 4chan and its archives are vulnerable to at least GIF MD5 collisions, and probably to exploits crafted for other file formats as well. For more information on unfixed-prefix collisions, see [hashclash], [this exploitation of hashclash], and [this article] on MD5 collisions in image formats.
  • The "fixed-prefix" exploit allows "collision blocks" to be appended to an arbitrary pair of chosen files until they share the same MD5. More information on this style of collision can be found here: [1]. This style of exploit can be countered on the backend by removing bytes past the media file trailer (a pattern signifying the end of the file), and it seems that this is part of the post-processing 4chan does on media upload.
  • The existing types of MD5 collision exploits are not known to pose a major risk to either the main site or its archives, because they can only be performed intentionally by an "attacker" who must generate and post both pieces of media. However, they do introduce a quirk which has a minor impact on the integrity of the archive:
    • Hiding images from archives: a user can post two MD5-colliding images to the same board with a delay between them, and Asagi-based archives will never archive the second image. This is because Asagi deduplicates by MD5: it skips downloading an image if the hash reported in the 4chan API's md5 field is already present in its database. This exploit is somewhat concerning since it prevents 100% fidelity at the post level: [the wrong image will be linked in the archive].
  • Neither of the above types of attack is a pre-image attack; weaponizable pre-image attacks would pose a much larger risk to the main site and archives. If semi-arbitrary images could be generated with the same MD5 as another arbitrary image posted by another user (strictly speaking, a second pre-image attack), automod or mod systems relying on media hashes could be gamed to ban non-offending users or media. For example, one user could post an image, and another user could then generate an image with illegal or ban-worthy content sharing the same MD5 as the first user's image. The first user or media file could end up banned from the main site and/or archives, along with the offending user/media. Note that this scenario is completely theoretical, and even if a pre-image attack were to exist, it would also need to be very flexible to be weaponized in this way.
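
The trailer-stripping countermeasure mentioned above for fixed-prefix collisions can be sketched as follows. This is only an illustration under stated assumptions: PNG is used because its end-of-file marker (the IEND chunk) is easy to locate by walking the chunk list, the function name strip_after_png_trailer and the helper make_chunk are hypothetical, and nothing here is confirmed to match what 4chan actually does on upload.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def strip_after_png_trailer(data: bytes) -> bytes:
    """Return `data` truncated right after the IEND chunk.

    Each PNG chunk is: 4-byte big-endian length, 4-byte type,
    `length` bytes of payload, 4-byte CRC. CRCs are not verified here.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4  # chunk header + payload + CRC
        if end > len(data):
            raise ValueError("truncated chunk")
        if chunk_type == b"IEND":
            return data[:end]  # drop everything past the trailer
        pos = end
    raise ValueError("no IEND chunk found")


def make_chunk(chunk_type: bytes, payload: bytes) -> bytes:
    """Hypothetical helper: build one PNG chunk for the demo below."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)


# A structurally valid chunk stream (not a decodable image), followed by
# bytes standing in for appended collision blocks.
clean = PNG_SIGNATURE + make_chunk(b"IHDR", bytes(13)) + make_chunk(b"IEND", b"")
padded = clean + b"appended collision blocks would sit here"

assert strip_after_png_trailer(padded) == clean
```

Because the appended blocks are discarded, the two stored files no longer share an MD5. This does nothing against unfixed-prefix collisions, where the collision blocks sit inside the file rather than after the trailer.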

Mitigations

One way to eliminate the effects of MD5 collisions on the archive side would be to replace MD5 hashes with a more collision-resistant hash (probably SHA256). This would have the following consequences:

  • The archiver would need to download every image it encounters, instead of performing a deduplication check before downloading.
    • The current deduplication process is to only download if the main site API md5 field does not match the md5 of a previously downloaded image.
    • Since the main site API does not have an SHA256 field, the image would need to be downloaded and the SHA256 generated locally (with any associated performance penalty for the hashing).
    • As a result, the archive server's media download rate would increase, which means more download bandwidth and a greater chance of exhausting rate limits (Cloudflare etc.).
  • Importing image dumps would require reprocessing.
  • If the hash is to be used to implicitly address files (one of several methods), a new field would need to be added to the archive DB/API.
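
The difference between the two approaches can be sketched as below. This is a simplified model, not Asagi's actual code: the function names, the FILES table standing in for the remote media server, and the fetch helper are all hypothetical, and the identical md5 values are simulated by hand rather than produced by a real collision. The post fields tim and md5 follow the 4chan API.

```python
import hashlib

# Hypothetical stand-in for the remote media server: tim -> file bytes.
FILES = {
    1: b"first colliding image",
    2: b"second colliding image",
}

# Two posts whose API-reported md5 is identical (simulated collision).
POSTS = [
    {"tim": 1, "md5": "SIMULATED_COLLIDING_MD5"},
    {"tim": 2, "md5": "SIMULATED_COLLIDING_MD5"},
]


def fetch(post: dict) -> bytes:
    """Hypothetical download step; a real archiver would issue an HTTP request."""
    return FILES[post["tim"]]


def archive_with_md5_precheck(posts: list) -> dict:
    """Current behaviour: skip the download when the md5 has been seen."""
    stored = {}
    for post in posts:
        if post["md5"] in stored:
            continue  # treated as a duplicate, so never downloaded
        stored[post["md5"]] = fetch(post)
    return stored


def archive_with_sha256(posts: list) -> dict:
    """Proposed behaviour: always download, then deduplicate by local SHA256."""
    stored = {}
    for post in posts:
        data = fetch(post)  # one download per post, no pre-check possible
        digest = hashlib.sha256(data).hexdigest()
        stored.setdefault(digest, data)  # true duplicates are stored only once
    return stored


by_md5 = archive_with_md5_precheck(POSTS)
by_sha256 = archive_with_sha256(POSTS)

assert len(by_md5) == 1      # the second image is silently dropped
assert len(by_sha256) == 2   # both images are kept
```

The cost is visible in archive_with_sha256: every post triggers a download, even when the file turns out to be a genuine duplicate.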

Other workarounds have been proposed to evaluate media file uniqueness before downloading, e.g. using other fields in conjunction with the MD5, but no valid solution has yet been found.

  • Using the filesize in conjunction with the MD5 is no better than using just the MD5, since the known collision attacks generate files with the same filesize.
  • Thumbnail hashes (which require a much smaller download than the full-size media for verification) do not guarantee uniqueness either, since colliding GIFs can have the same first frame, and the first frame determines the thumbnail.
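
Why neither workaround helps can be shown with a small model. Everything here is an assumption made for illustration: the key functions are hypothetical, the md5 and fsize values are simulated by hand, and a GIF is modelled as a plain list of frame byte strings whose thumbnail depends only on the first frame.

```python
import hashlib


def md5_size_key(post: dict) -> tuple:
    """Candidate uniqueness key: API md5 combined with the file size."""
    return (post["md5"], post["fsize"])


def thumbnail_key(frames: list) -> str:
    """Candidate uniqueness key: hash of the thumbnail.

    Model assumption: the thumbnail is rendered from the first frame only.
    """
    return hashlib.sha256(frames[0]).hexdigest()


# Two colliding uploads: same md5, same size, same first frame,
# but different content in a later frame.
post_a = {"md5": "SIMULATED_COLLIDING_MD5", "fsize": 4096}
post_b = {"md5": "SIMULATED_COLLIDING_MD5", "fsize": 4096}
frames_a = [b"shared first frame", b"second frame of A"]
frames_b = [b"shared first frame", b"second frame of B"]

assert frames_a != frames_b                                  # the files differ
assert md5_size_key(post_a) == md5_size_key(post_b)          # key cannot tell
assert thumbnail_key(frames_a) == thumbnail_key(frames_b)    # neither can this
```

Both keys are built from values the attacker controls or that the collision leaves identical, so they cannot tell the two files apart.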