About once a year, I’m reminded of the fact that a lot of marketing SaaS and ad tech dresses up cryptographic hashes as a sort of privacy theater. This shows up frequently in product features for suppression lists with the general idea of uploading hashed values of email addresses or phone numbers to enable matching while preserving privacy. The problem is, hashing PII does not protect privacy.
The long and short of it is: hashing is only effective if the input data is unbounded. It’s why long and unpredictable passwords are necessary, even with a robust hashing function. PII is neither long nor unpredictable.
- You can download every baby name going back to 1880 from the Social Security Administration.
- Email addresses follow the format `something@something.something`.
- Social Security numbers are 9 digits, so there are at most 1 billion.
- North American phone numbers are 10 digits, so there are at most 10 billion.
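To make the bounded-input problem concrete, here's a minimal DuckDB sketch (the name list and the `@gmail.com` patterns are illustrative, not a real attack dictionary): enumerate candidate addresses from a few known first names and hash them, and matching a leaked hash becomes a simple equality check.

```sql
-- Illustrative sketch: build candidate emails from a short name list
-- (in practice, the full SSA baby-name data) and precompute their MD5s.
WITH names(first) AS (VALUES ('olivia'), ('liam'), ('emma'), ('noah')),
candidates AS (
    SELECT first || '@gmail.com' AS email FROM names
    UNION ALL
    -- a common real-world pattern: name plus a short numeric suffix
    SELECT first || i::VARCHAR || '@gmail.com' FROM names, range(100) t(i)
)
SELECT email, md5(email) AS hash FROM candidates;
```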
Despite this, marketing tools still shuffle around hashes of exactly this data. For example, here’s BambooHR:
> In order to better identify any shared customers we may have, we have decided to compare our customer lists encoded as MD5 Hashes. By encoding our respective customer lists in MD5 Hashes, we will be able to compare customer lists without disclosing any customer info (including customer name).
And platforms like UnsubCentral:
> Manually entering in phone numbers to a suppression tool is a waste of time and resources. Our tool can take plain text and compare it against MD5 or SHA hashed lists of phone numbers – simply throw in the data, and it will do the hard work for you.
Everyone is trying to do private set intersection, but doing this with hash-passing is trivially broken on modern consumer hardware. And you don’t even need special password cracking software to break it.
On a laptop, we can build a rainbow table of Parquet files for every North American phone number. We can abuse DuckDB as a hashing mill and generate every MD5 in the `5XX` area code block:
```sql
PRAGMA threads=8;
PRAGMA temp_directory='/tmp';
COPY (
    WITH gen AS (
        SELECT i::BIGINT AS number, md5(i::VARCHAR) AS hash
        FROM range(5000000000, 6000000000) t(i)
    )
    SELECT substr(hash, 1, 2) AS h2, (number % 8) AS shard, number, hash
    FROM gen
) TO 'out/md5nums'
(FORMAT 'parquet',
 PARTITION_BY (h2, shard),
 COMPRESSION 'zstd',
 ROW_GROUP_SIZE 250000);
```
What this does:
- Tell DuckDB to use up to 8 execution threads.
- For each generated integer `i` within 5,000,000,000-5,999,999,999, cast to text and compute its MD5. The generator is vectorized and parallelized, so it feeds the pipeline quickly.
- Take the first two hex chars (`00`-`ff`). This `h2` is a hash prefix for partitioning, so lookups by full hash only have to read one small directory.
- Shard each hash-prefix partition with modulo 8, which gives parallelism during writes and avoids large single files.
- Write Parquet files, organized as one directory per distinct `(h2, shard)`; this gives a structure like `out/md5nums/h2=ab/shard=3/part-*.parquet`.
- Use Zstandard compression inside Parquet (efficient for `number` but not `hash`).
Crunch:
```
~/ $ duckdb -c ".read md5gen.sql"
100% ▕██████████████████████████████████████▏ (00:40:41.41 elapsed)
```
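As a quick sanity check (a hypothetical query, not part of the original run), you can read the Hive-partitioned output back and confirm each `h2` prefix holds roughly 1/256th of the billion rows:

```sql
-- Hypothetical sanity check: each h2 partition should hold
-- roughly 1,000,000,000 / 256 ≈ 3.9 million rows.
SELECT h2, count(*) AS n
FROM read_parquet('out/md5nums/**/*.parquet', hive_partitioning = true)
GROUP BY h2
ORDER BY h2;
```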
And now, given the hash `af82af0fad119df159ce350db422d3b3`, we can reverse it back to its phone number:
```
~/ $ H=af82af0fad119df159ce350db422d3b3
~/ $ P=${H:0:2}
~/ $ duckdb -c \
> "SELECT number FROM 'out/md5nums/h2=${P}/**/*.parquet' WHERE hash='${H}' LIMIT 1;"
100% ▕██████████████████████████████████████▏ (00:00:02.84 elapsed)
┌────────────────┐
│     number     │
│     int64      │
├────────────────┤
│   5555550123   │
│ (5.56 billion) │
└────────────────┘
```
Here we do:
- `H=...` is the hash. `P=${H:0:2}` grabs the first two hex chars (`af`) to match the partitioned dataset.
- `FROM 'out/md5nums/h2=${P}/**/*.parquet'` allows DuckDB to read only files inside the partition directory whose name encodes `h2=${P}`. The `**/*.parquet` is a recursive glob, so it also walks into the `shard=...` subfolders instead of scanning the entire billion-row dataset.
- DuckDB vector-scans row groups within that one `h2` partition until it finds a match.
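Reversing one hash at a time is the slow path anyway; a suppression list is a batch job, and the same table handles that with a join. Here's a sketch, assuming a hypothetical `leaked_hashes.csv` with one MD5 per line:

```sql
-- Sketch of bulk reversal against the rainbow table. The file name and
-- single-column layout of leaked_hashes.csv are assumptions.
SELECT l.hash, t.number
FROM read_csv('leaked_hashes.csv',
              header = false,
              columns = {'hash': 'VARCHAR'}) AS l
JOIN read_parquet('out/md5nums/**/*.parquet', hive_partitioning = true) AS t
  ON t.h2 = substr(l.hash, 1, 2)  -- matching the partition column first
                                  -- narrows which files need scanning
 AND t.hash = l.hash;
```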
MD5 is incredibly broken, but this approach is hash-agnostic. The problem is the application on low-entropy input data, not the specific hashing algorithm.
File scanning here is just a proof of concept. For true query speed, you can trade file portability for a database index: once your Parquet files are built, load them into DuckDB as a persistent table and create an index on the `hash` column. Postgres with a B-tree index is also a great fit.
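Here's a minimal sketch of the DuckDB route (the database file and table names are illustrative, and building the index over a billion rows takes its own time and memory):

```sql
-- Run against a persistent database file, e.g.: duckdb md5nums.duckdb
CREATE TABLE md5nums AS
    SELECT number, hash
    FROM read_parquet('out/md5nums/**/*.parquet');

-- DuckDB builds an ART index; point lookups stop scanning files entirely.
CREATE INDEX idx_hash ON md5nums (hash);

SELECT number FROM md5nums
WHERE hash = 'af82af0fad119df159ce350db422d3b3';
```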
On my 2020 M1 MacBook Air, computing 1 billion `5XX` phone number hashes completes in about 40 minutes. There are about 6.3 billion valid North American phone numbers, so building a hash lookup table for all of them would take my little laptop just over 4 hours.
None of this is breaking news. In 2021, researchers hashed 118 billion phone numbers:
> The limited amount of possible mobile phone numbers combined with the rapid increase in affordable storage capacity makes it feasible to create key-value databases of phone numbers indexed by their hashes and then to perform constant-time lookups for each given hash value. We demonstrate this by using a high-performance cluster to create an in-memory database of all 118 billion possible mobile phone numbers
The thing is, you really don’t need a high-performance cluster anymore.