# The Privacy Theater of Hashed PII
Matt Hodges
2025-10-19

About once a year, I’m reminded of the fact that a lot of marketing SaaS
and ad tech dresses up cryptographic hashes as a sort of privacy
theater. This shows up frequently in product features for [suppression
lists](https://en.wikipedia.org/wiki/Suppression_list) with the general
idea of uploading hashed values of email addresses or phone numbers to
enable matching while preserving privacy. The problem is, [hashing PII
does not protect
privacy](https://www.johndcook.com/blog/2019/07/20/hashing-pii-does-not-protect-privacy/).

The long and short of it is: hashing is only effective if the input data
is unbounded. It’s why [long and unpredictable
passwords](https://pages.nist.gov/800-63-4/sp800-63b.html#appA) are
necessary, even with a robust hashing function. PII is neither long nor
unpredictable.

- You can download [every baby name going back to
  1880](https://www.ssa.gov/oact/babynames/names.zip) from the Social
  Security administration.
- Email addresses follow the format `something@something.something`.
- Social Security numbers are 9 digits, so there are at most 1
  billion.[^1]
- North American phone numbers are 10 digits, so there are at most 10
  billion.[^2]

Despite this, marketing tools still shuffle around PII hashes of this
data. For example, here’s
[BambooHR](https://partners.bamboohr.com/md5hash/):

> In order to better identify any shared customers we may have, we have
> decided to compare our customer lists encoded as MD5 Hashes. By
> encoding our respective customer lists in MD5 Hashes, we will be able
> to compare customer lists without disclosing any customer info
> (including customer name).

And platforms like
[UnsubCentral](https://www.unsubcentral.com/phone-number-suppression/):

> Manually entering in phone numbers to a suppression tool is a waste of
> time and resources. Our tool can take plain text and compare it
> against MD5 or SHA hashed lists of phone numbers – simply throw in the
> data, and it will do the hard work for you.

Everyone is trying to do [private set
intersection](https://en.wikipedia.org/wiki/Private_set_intersection),
but doing this with hash-passing is trivially broken on modern consumer
hardware. And you don’t even need special password cracking software to
break it.

On a laptop, we can build a [rainbow
table](https://blog.codinghorror.com/rainbow-hash-cracking/) of
[Parquet](https://parquet.apache.org/) files for every North American
phone number. We can abuse [DuckDB](https://duckdb.org/) as a hashing
mill, and generate every MD5 in the `5XX` area code block:

``` sql
PRAGMA threads=8;
PRAGMA temp_directory='/tmp';

COPY (
  WITH gen AS (
    SELECT i::BIGINT AS number, md5(i::VARCHAR) AS hash
    FROM range(5000000000, 6000000000) t(i)
  )
  SELECT substr(hash,1,2) AS h2, (number % 8) AS shard, number, hash
  FROM gen
) TO 'out/md5nums'
    (FORMAT 'parquet',
    PARTITION_BY (h2, shard),
    COMPRESSION 'zstd',
    ROW_GROUP_SIZE 250000);
```

What this does:

- Specify DuckDB to use up to 8 execution threads.
- For each generated integer `i` within `5,000,000,000` -
  `5,999,999,999`, cast to text and compute its MD5. The generator is
  vectorized and parallelized, so it feeds the pipeline quickly.
- Take the first two hex chars (`00` - `ff`). This `h2` is a **hash
  prefix** for partitioning so lookups by full hash only have to read
  one small directory.
- With modulo-8, sharding spreads each hash-prefix partition, which give
  parallelism during writes and avoids large single files.
- Write Parquet files, and organize output as directories per distinct
  `(h2, shard)`; this gives a structure like:
  `out/md5nums/h2=ab/shard=3/part-*.parquet`.
- Use [Zstandard](https://en.wikipedia.org/wiki/Zstd) compression inside
  Parquet (efficient for `number` but not `hash`).

*Crunch:*

``` sh
 ~/ $ duckdb -c ".read md5gen.sql"
100% ▕██████████████████████████████████████▏ (00:40:41.41 elapsed)
```

And now with the hash `af82af0fad119df159ce350db422d3b3` we can reverse
back to its phone number:

``` sh
 ~/ $ H=af82af0fad119df159ce350db422d3b3
 ~/ $ P=${H:0:2}
 ~/ $ duckdb -c \
> "SELECT number FROM 'out/md5nums/h2=${P}/**/*.parquet' WHERE hash='${H}' LIMIT 1;"
100% ▕██████████████████████████████████████▏ (00:00:02.84 elapsed)
┌────────────────┐
│     number     │
│     int64      │
├────────────────┤
│   5555550123   │
│ (5.56 billion) │
└────────────────┘
```

Here we do:

- `H=...` is the hash.
- `P=${H:0:2}` grabs the first two hex chars (`af`) to match the
  partitioned dataset.
- `FROM 'out/md5nums/h2=${P}/**/*.parquet'` allows DuckDB to read
  **only** files inside the partition directory whose name encodes
  `h2=${P}`. The `**/*.parquet` is a recursive glob so it also walks
  into the `shard=...` subfolders instead of scanning the entire
  billion-row dataset.
- DuckDB vector-scans row groups within that one `h2` partition until it
  finds a match.

> [!NOTE]
>
> MD5 is incredibly
> [broken](https://en.wikipedia.org/wiki/MD5#Security), but this
> approach is hash-agnostic. The problem is the application on
> low-entropy input data, not the specific hashing algorithm.

File scanning here is just a proof-of-concept. For true query speed you
can trade file portability for a database index. Once your Parquet files
are built, you can load them into DuckDB as a persistent table and
create an index on the `hash` column. Or, Postgres with a B-tree index
is also a great fit.

On my 2020 M1 MacBook Air, computing 1 billion `5XX` phone number hashes
completes in about **40 minutes**. [There are about 6.3 billion valid
North American phone
numbers](https://en.wikipedia.org/wiki/North_American_Numbering_Plan),
so building a hash lookup table for all of them would take my little
laptop just over 4 hours.

> [!IMPORTANT]
>
> [Salting](https://en.wikipedia.org/wiki/Salt_(cryptography)) does not
> solve this problem. With per-record random salts, identical
> identifiers hash differently across parties, so no deterministic
> equality join is possible. Publishing those salts still doesn’t enable
> a direct join: the only viable path is recover-then-join, which costs
> $\sim D \times (N + M)$. That’s operationally impractical for a
> bulk-matching service, but practically feasible for an adversary
> performing offline, per-record brute force over time. It also reveals
> non-matches, because recovery exposes plaintexts beyond the overlap.
> If parties instead share a global salt to keep hashes comparable, the
> domain is trivially enumerable and a lookup table can be rebuilt.

None of this is breaking news. In 2021, [researchers hashed 118 billion
phone
numbers](https://www.ndss-symposium.org/wp-content/uploads/ndss2021_1C-3_23159_paper.pdf):

> The limited amount of possible mobile phone numbers combined with the
> rapid increase in affordable storage capacity makes it feasible to
> create key-value databases of phone numbers indexed by their hashes
> and then to perform constant-time lookups for each given hash value.
> We demonstrate this by using a high-performance cluster to create an
> in-memory database of all 118 billion possible mobile phone numbers

The thing is, you really don’t need a high-performance cluster anymore.

[^1]: [It’s
    fewer.](https://www.ssa.gov/policy/docs/ssb/v69n2/v69n2p55.html)

[^2]: [It’s fewer.](https://www.nanpa.com/)
