About once a year, I’m reminded of the fact that a lot of marketing SaaS and ad tech dresses up cryptographic hashes as a sort of privacy theater. This shows up frequently in product features for suppression lists with the general idea of uploading hashed values of email addresses or phone numbers to enable matching while preserving privacy. The problem is, hashing PII does not protect privacy.
The long and short of it is: hashing is only effective if the input data is unbounded. It’s why long and unpredictable passwords are necessary, even with a robust hashing function. PII is neither long nor unpredictable.
- You can download every baby name going back to 1880 from the Social Security Administration.
- Email addresses follow the format `something@something.something`.
- Social Security numbers are 9 digits, so there are at most 1 billion.
- North American phone numbers are 10 digits, so there are at most 10 billion.
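To make the bounded-input problem concrete, here's a minimal DuckDB sketch (the name list and the `@gmail.com` patterns are illustrative, not a real attack dictionary): enumerate candidate addresses from a few known first names and hash them, and matching a leaked hash becomes a simple equality check.

```sql
-- Illustrative sketch: build candidate emails from a short name list
-- (in practice, the full SSA baby-name data) and precompute their MD5s.
WITH names(first) AS (VALUES ('olivia'), ('liam'), ('emma'), ('noah')),
candidates AS (
    SELECT first || '@gmail.com' AS email FROM names
    UNION ALL
    -- a common real-world pattern: name plus a short numeric suffix
    SELECT first || i::VARCHAR || '@gmail.com' FROM names, range(100) t(i)
)
SELECT email, md5(email) AS hash FROM candidates;
```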
Despite this, marketing tools still shuffle around hashes of exactly this data. For example, here’s BambooHR:
> In order to better identify any shared customers we may have, we have decided to compare our customer lists encoded as MD5 Hashes. By encoding our respective customer lists in MD5 Hashes, we will be able to compare customer lists without disclosing any customer info (including customer name).
And platforms like UnsubCentral:
> Manually entering in phone numbers to a suppression tool is a waste of time and resources. Our tool can take plain text and compare it against MD5 or SHA hashed lists of phone numbers – simply throw in the data, and it will do the hard work for you.
Everyone is trying to do private set intersection, but doing this with hash-passing is trivially broken on modern consumer hardware. And you don’t even need special password cracking software to break it.
On a laptop, we can build a rainbow table of Parquet files for every North American phone number. We can abuse DuckDB as a hashing mill and generate every MD5 in the `5XX` area code block:
```sql
PRAGMA threads=8;
PRAGMA temp_directory='/tmp';
COPY (
    WITH gen AS (
        SELECT i::BIGINT AS number, md5(i::VARCHAR) AS hash
        FROM range(5000000000, 6000000000) t(i)
    )
    SELECT substr(hash, 1, 2) AS h2, (number % 8) AS shard, number, hash
    FROM gen
) TO 'out/md5nums'
(FORMAT 'parquet',
 PARTITION_BY (h2, shard),
 COMPRESSION 'zstd',
 ROW_GROUP_SIZE 250000);
```
What this does:
- Tell DuckDB to use up to 8 execution threads.
- For each generated integer `i` within 5,000,000,000-5,999,999,999, cast to text and compute its MD5. The generator is vectorized and parallelized, so it feeds the pipeline quickly.
- Take the first two hex chars (`00`-`ff`). This `h2` is a hash prefix for partitioning, so lookups by full hash only have to read one small directory.
- Shard each hash-prefix partition with modulo 8, which gives parallelism during writes and avoids large single files.
- Write Parquet files, organized as one directory per distinct `(h2, shard)`; this gives a structure like `out/md5nums/h2=ab/shard=3/part-*.parquet`.
- Use Zstandard compression inside Parquet (efficient for `number` but not `hash`).
Crunch:
```
~/ $ duckdb -c ".read md5gen.sql"
100% ▕██████████████████████████████████████▏ (00:40:41.41 elapsed)
```
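As a quick sanity check (a hypothetical query, not part of the original run), you can read the Hive-partitioned output back and confirm each `h2` prefix holds roughly 1/256th of the billion rows:

```sql
-- Hypothetical sanity check: each h2 partition should hold
-- roughly 1,000,000,000 / 256 ≈ 3.9 million rows.
SELECT h2, count(*) AS n
FROM read_parquet('out/md5nums/**/*.parquet', hive_partitioning = true)
GROUP BY h2
ORDER BY h2;
```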
And now, given the hash `af82af0fad119df159ce350db422d3b3`, we can reverse it back to its phone number:
```
~/ $ H=af82af0fad119df159ce350db422d3b3
~/ $ P=${H:0:2}
~/ $ duckdb -c \
> "SELECT number FROM 'out/md5nums/h2=${P}/**/*.parquet' WHERE hash='${H}' LIMIT 1;"
100% ▕██████████████████████████████████████▏ (00:00:02.84 elapsed)
┌────────────────┐
│     number     │
│     int64      │
├────────────────┤
│   5555550123   │
│ (5.56 billion) │
└────────────────┘
```
Here we do:
- `H=...` is the hash. `P=${H:0:2}` grabs the first two hex chars (`af`) to match the partitioned dataset.
- `FROM 'out/md5nums/h2=${P}/**/*.parquet'` allows DuckDB to read only files inside the partition directory whose name encodes `h2=${P}`. The `**/*.parquet` is a recursive glob, so it also walks into the `shard=...` subfolders instead of scanning the entire billion-row dataset.
- DuckDB vector-scans row groups within that one `h2` partition until it finds a match.
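Reversing one hash at a time is the slow path anyway; a suppression list is a batch job, and the same table handles that with a join. Here's a sketch, assuming a hypothetical `leaked_hashes.csv` with one MD5 per line:

```sql
-- Sketch of bulk reversal against the rainbow table. The file name and
-- single-column layout of leaked_hashes.csv are assumptions.
SELECT l.hash, t.number
FROM read_csv('leaked_hashes.csv',
              header = false,
              columns = {'hash': 'VARCHAR'}) AS l
JOIN read_parquet('out/md5nums/**/*.parquet', hive_partitioning = true) AS t
  ON t.h2 = substr(l.hash, 1, 2)  -- matching the partition column first
                                  -- narrows which files need scanning
 AND t.hash = l.hash;
```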
MD5 is incredibly broken, but this approach is hash-agnostic. The problem is the application on low-entropy input data, not the specific hashing algorithm.
File scanning here is just a proof of concept. For true query speed, you can trade file portability for a database index: once your Parquet files are built, load them into DuckDB as a persistent table and create an index on the `hash` column. Postgres with a B-tree index is also a great fit.
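Here's a minimal sketch of the DuckDB route (the database file and table names are illustrative, and building the index over a billion rows takes its own time and memory):

```sql
-- Run against a persistent database file, e.g.: duckdb md5nums.duckdb
CREATE TABLE md5nums AS
    SELECT number, hash
    FROM read_parquet('out/md5nums/**/*.parquet');

-- DuckDB builds an ART index; point lookups stop scanning files entirely.
CREATE INDEX idx_hash ON md5nums (hash);

SELECT number FROM md5nums
WHERE hash = 'af82af0fad119df159ce350db422d3b3';
```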
On my 2020 M1 MacBook Air, computing 1 billion `5XX` phone number hashes completes in about 40 minutes. There are about 6.3 billion valid North American phone numbers, so building a hash lookup table for all of them would take my little laptop just over 4 hours.
None of this is breaking news. In 2021, researchers hashed 118 billion phone numbers:
> The limited amount of possible mobile phone numbers combined with the rapid increase in affordable storage capacity makes it feasible to create key-value databases of phone numbers indexed by their hashes and then to perform constant-time lookups for each given hash value. We demonstrate this by using a high-performance cluster to create an in-memory database of all 118 billion possible mobile phone numbers
The thing is, you really don’t need a high-performance cluster anymore.