The Privacy Theater of Hashed PII

A 2020 MacBook Air can hash every North American phone number in four hours

Author: Matt Hodges

Published: October 19, 2025

About once a year, I’m reminded that a lot of marketing SaaS and ad tech dresses up cryptographic hashes as a sort of privacy theater. It shows up frequently in product features for suppression lists, where the general idea is to upload hashed email addresses or phone numbers to enable matching while preserving privacy. The problem is, hashing PII does not protect privacy.

The long and short of it is: hashing only protects a value if the input space is too large and unpredictable to enumerate. It’s why long and unpredictable passwords are necessary, even with a robust hashing function. PII is neither long nor unpredictable.
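To make this concrete, here’s a minimal sketch in DuckDB (the phone number is the illustrative one used later in this post). Because a hash is deterministic, anyone holding a “privacy-preserving” digest can recover the input by hashing every candidate and checking for a match:

-- hashes are deterministic: the digest is just a pseudonym for the input
SELECT md5('5555550123') AS digest;
-- with only ~6.3 billion possible phone numbers, an attacker can hash
-- them all and look for equality -- which is exactly what follows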

Despite this, marketing tools still shuffle around hashes of this data. For example, here’s BambooHR:

In order to better identify any shared customers we may have, we have decided to compare our customer lists encoded as MD5 Hashes. By encoding our respective customer lists in MD5 Hashes, we will be able to compare customer lists without disclosing any customer info (including customer name).

And platforms like UnsubCentral:

Manually entering in phone numbers to a suppression tool is a waste of time and resources. Our tool can take plain text and compare it against MD5 or SHA hashed lists of phone numbers – simply throw in the data, and it will do the hard work for you.

Everyone is trying to do private set intersection, but doing this with hash-passing is trivially broken on modern consumer hardware. And you don’t even need special password cracking software to break it.

On a laptop, we can build a precomputed lookup table (colloquially, a rainbow table) of Parquet files for every North American phone number. We can abuse DuckDB as a hashing mill, and generate every MD5 in the 5XX area code block:

PRAGMA threads=8;
PRAGMA temp_directory='/tmp';

COPY (
  WITH gen AS (
    SELECT i::BIGINT AS number, md5(i::VARCHAR) AS hash
    FROM range(5000000000, 6000000000) t(i)
  )
  SELECT substr(hash,1,2) AS h2, (number % 8) AS shard, number, hash
  FROM gen
) TO 'out/md5nums'
    (FORMAT 'parquet',
    PARTITION_BY (h2, shard),
    COMPRESSION 'zstd',
    ROW_GROUP_SIZE 250000);

What this does:

- Sets 8 worker threads and spills any intermediate state to /tmp.
- Generates every 10-digit number from 5000000000 through 5999999999 and computes the MD5 of its string form.
- Writes the results as zstd-compressed Parquet, partitioned by the first two hex characters of the hash (h2) so a lookup only has to touch 1/256th of the data, and by number % 8 (shard) to keep individual files a manageable size.

Crunch:

 ~/ $ duckdb -c ".read md5gen.sql"
100% ▕██████████████████████████████████████▏ (00:40:41.41 elapsed)

And now, with the hash af82af0fad119df159ce350db422d3b3, we can reverse it back to its phone number:

 ~/ $ H=af82af0fad119df159ce350db422d3b3
 ~/ $ P=${H:0:2}
 ~/ $ duckdb -c \
> "SELECT number FROM 'out/md5nums/h2=${P}/**/*.parquet' WHERE hash='${H}' LIMIT 1;"
100% ▕██████████████████████████████████████▏ (00:00:02.84 elapsed)
┌────────────────┐
│     number     │
│     int64      │
├────────────────┤
│   5555550123   │
│ (5.56 billion) │
└────────────────┘

Here we do:

- Take the first two hex characters of the target hash to select its h2 partition.
- Scan only the Parquet files under that partition for an exact match on the hash column.
- Recover the phone number in under three seconds.

Note

MD5 is incredibly broken, but this approach is hash-agnostic. The problem is the application on low-entropy input data, not the specific hashing algorithm.
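As a quick sketch of that point: DuckDB also ships a sha256() function, so making the lookup table SHA-256-based instead of MD5-based is a one-word change to the generator. The attack is unchanged:

-- same table build, different digest; enumeration still works
SELECT i::BIGINT AS number, sha256(i::VARCHAR) AS hash
FROM range(5000000000, 6000000000) t(i);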

File scanning here is just a proof of concept. For true query speed, you can trade file portability for a database index. Once your Parquet files are built, you can load them into DuckDB as a persistent table and create an index on the hash column. Or, Postgres with a B-tree index is also a great fit.
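A minimal sketch of that load-and-index step, assuming the Parquet layout above and a persistent database file (e.g. opened with duckdb md5nums.db):

-- load every shard into a persistent table, then index the hash column
CREATE TABLE md5nums AS
  SELECT number, hash FROM 'out/md5nums/**/*.parquet';
CREATE INDEX idx_hash ON md5nums (hash);

-- lookups become index probes instead of file scans
SELECT number FROM md5nums
WHERE hash = 'af82af0fad119df159ce350db422d3b3';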

On my 2020 M1 MacBook Air, computing 1 billion 5XX phone number hashes completes in about 40 minutes. There are about 6.3 billion valid North American phone numbers, so building a hash lookup table for all of them would take my little laptop just over 4 hours.
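The back-of-envelope math, assuming the 5XX throughput holds across the rest of the numbering plan:

-- ~40 minutes per billion hashes, times 6.3 billion numbers
SELECT 6.3 * 40 AS minutes, 6.3 * 40 / 60 AS hours;
-- 252 minutes, or about 4.2 hours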

None of this is breaking news. In 2021, researchers hashed 118 billion phone numbers:

The limited amount of possible mobile phone numbers combined with the rapid increase in affordable storage capacity makes it feasible to create key-value databases of phone numbers indexed by their hashes and then to perform constant-time lookups for each given hash value. We demonstrate this by using a high-performance cluster to create an in-memory database of all 118 billion possible mobile phone numbers

The thing is, you really don’t need a high-performance cluster anymore.