Last week at work, I encountered an interesting challenge. I was building REST API endpoints for a React component that let the user select an image from a large number of files on disk. To store which image the user had picked, I needed to give each one a unique identifier. Unfortunately, these image assets did not have any IDs (such as numbers) associated with them. The only thing I could use as an identifier was each file’s path and name.
Wishlist
It was pretty clear I was going to need to convert the file paths to short strings using some sort of hashing algorithm. The resulting strings would need to be:
- Short (we were going to be using these in URLs, possible many of them at once)
- Unique, with a low probability of collisions when used for around 1000 images
- URL-safe characters only
A collision is when two different input values accidentally result in the same output. In my case this could means that the wrong image would be served. The longer and more complex a hash gets, the less likely this is to occur.
URL-safe
Because we were going to be passing these IDs to a REST API, we needed something URL-safe. The paths and filenames used all sorts of characters that would cause issues. No good.
There are only 66 characters that you can use in a URL that have no reserved purpose.
- Numbers 0-9
- Letters a-z and A-Z
- The characters dash (“-“), underscore (“_”), dot (“.”) and tilde (“~”)
CRC32
Obviously, I attempted to find an existing algorithm. Using something that’s been around for a while and battle-tested is almost always a better idea than trying to come up with your own solution. Especially with things like cryptography and hashes.
I found a lot of posts recommending CRC32 for this. But one of those posts warned that it had a one percent chance of collisions when used with only around 9292 assets. I confirmed this using an online Birthday Problem calculator. A bit too close for comfort.
In PHP (by default), a CRC32 hash is returned as an 8-character string. Every character in this string represents four bits and has 16 possible values (0-9 or a-f). This is a common string representation of hex values. This gives us roughly 4.3 billion possible values (16^8).
Alphanumeric instead of hex
This got me thinking what would happen if I’d use all or most of the 66 URL-safe character. The same eight characters, but with 64 possible values each, would give us around 2.8e+14 possible values. 65,536 times more than with just hex.
If my math is correct, this would result in a 1% chance of collisions with 2.4 million assets (as opposed to just 9300 with CRC32). But the hash strings take up the same space. And the difference would increase with longer hash lengths.
Hash length | Values per char | Nr of possible values | 1% collision at x assets | Collission at 100k assets |
---|---|---|---|---|
8 | 16 | 4,294,967,296 | 9,292 | 68.78% |
8 | 64 | 2.81474976711e+14 | 2,378,621 | 0% |
10 | 16 | 1.09951162778e+12 | 148,664 | 0.45% |
10 | 64 | 1.15292150461e+18 | 152,231,721 | 0% |
An obvious tool for this is base64. It encodes data using just 64 of the most conservative characters in the ASCII set. Not all of those are URL-safe, but you can easily convert the three offending characters to something else.
My solution
/**
* Create a short, fairly unique, urlsafe hash for the input string.
*/
function generate_id( $input, $length = 8 ){
// Create a raw binary sha256 hash and base64 encode it.
$hash_base64 = base64_encode( hash( 'sha256', $input, true ) );
// Replace non-urlsafe chars to make the string urlsafe.
$hash_urlsafe = strtr( $hash_base64, '+/', '-_' );
// Trim base64 padding characters from the end.
$hash_urlsafe = rtrim( $hash_urlsafe, '=' );
// Shorten the string before returning.
return substr( $hash_urlsafe, 0, $length );
}
Code language: PHP (php)
In short, this is what the code above does:
- Use a well-known, strong hashing algorithm to generate a long binary hash (256 bits).
- Convert that hash to base64 instead of hex, giving us 64 possible values for each character. This gives us a string that’s 44 characters long. The normal output would be 64 characters, so this condenses the data by over 30%.
- Swap out the non-URL-safe characters for ones that are allowed.
- Remove the surplus equal signs that base64 encoding tends to append.
- Shorten the string to make it “portable”.
That last step does throw away over 80% of the original data. I’m aware that this significantly increases the chance of collisions. But please keep in mind that with a full sha256 hash, the odds of a collision happening are incredibly slim. As far as I’m aware, this has never happened. Not even with basically the entire Bitcoin network trying. In my project, I’m using a string length of 10, just to be extra sure.
Suggestions
The solution above is my attempt to build on existing solutions, and in just a couple of lines of code, create a hash containing more entropy per character. Please let me know if you have any suggestions to improve on this.