How cosine similarity in embedding space can power high-quality normalization.
Author: Matt Hodges
Published: August 2, 2025
A common challenge in working with operational or CRM-style data is that you often find yourself dealing with user-entered free text. A recurring example comes when users fill out forms that ask for their job title and employer. This data might seem secondary, but for any organization trying to understand, segment, or personalize communication with its users, it’s incredibly valuable.
Of course, user-entered data is messy. One person types “nurse,” another “ER Nurse,” another “RN,” and yet another “home health nursing.” If you want to understand the composition of your user base, or build automated systems that adapt to it, you need to normalize that chaos into a finite and meaningful taxonomy. You wouldn’t want a dashboard full of job titles in SpongeBob casing, but realistically, you have to work with whatever comes through the form.
A data-forward organization might use this information for all kinds of purposes: tailoring outreach, prioritizing leads, enriching analytics, or even customizing onboarding flows. But none of that works unless the data is clean, consistent, and structured.
With language models, we can do better. We don’t need to predefine normalization rules or manually review each row. And we don’t even need much prior knowledge about our users to start.
But before we jump straight to the AI, let’s define our approach:
1. To normalize free-form text, there must be a finite set of target categories.
2. If we know nothing about users in advance, we need a reliable way to discover or define those categories.
3. We’re decidedly not using a chatbot, and we’re not relying on external APIs.
4. This isn’t a generative task; it’s about semantic understanding.
Let’s start at the top. There are an infinite number of values a user could enter for their job, and we want to reduce that to a finite set. So where do we get that set?
The Occupational Information Network (O*NET) maintains exactly such a resource. Developed under the sponsorship of the Department of Labor, O*NET offers rich datasets that describe skills, knowledge, tasks, and job titles. We’re interested in the Alternate Titles file, which maps occupation titles to alternate “lay” job titles. There’s a good chance many of our users enter these alternate titles, so we’ll want to include them.
The file includes columns of Department of Labor and Census identifiers, but we only need the few that focus on title. Let’s download it and take a look at a few examples:
import numpy as np
import pandas as pd

onet_df = pd.read_excel(
    "https://www.onetcenter.org/dl_files/database/db_29_3_excel/Alternate%20Titles.xlsx",
    usecols=['Title', 'Alternate Title', 'Short Title'],
).fillna("")

onet_df.sample(n=5, random_state=101)  # seed for reproducibility
       Title                                    Alternate Title                                     Short Title
25268  Cargo and Freight Agents                 Shipping Agent
30407  Helpers, Construction Trades, All Other  Maintenance Construction Helper
6045   Bioengineers and Biomedical Engineers    Biomedical Engineering Intern
18684  Occupational Therapy Aides               Rehabilitation Therapy Aide (Rehab Therapy Aide)    Rehab Therapy Aide
24086  Billing and Posting Clerks               Statement Services Representative (Statement S...   Statement Services Rep
So O*NET tells us that Cargo and Freight Agents might also go by Shipping Agent as an Alternate Title, and that Occupational Therapy Aides might also go by Rehab Therapy Aide as a Short Title.
We also see that there can be many rows of different Alternate Title and Short Title for the same Title:
onet_df[onet_df["Title"] == "Software Developers"]
      Title                Alternate Title                                     Short Title
4931  Software Developers  .NET Developer
4932  Software Developers  Android Developer
4933  Software Developers  AngularJS Developer
4934  Software Developers  Apache Hadoop Developer
4935  Software Developers  Application Architect
...   ...                  ...                                                 ...
5061  Software Developers  User Interface Designer
5062  Software Developers  Video Game Engineer
5063  Software Developers  Wide Area Network Engineer (WAN Engineer)           WAN Engineer
5064  Software Developers  Windows Software Engineer
5065  Software Developers  XML Developer (Extensible Markup Language Deve...   XML Developer

135 rows × 3 columns
There is one thing we do know ahead of time about our users: not all of them will be employed. The O*NET data set doesn’t provide a job title for not working, so let’s add our own:
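Something like this does the trick; the exact wording of the row (“Not Employed” / “Unemployed”) is a judgment call, not anything O*NET prescribes:

# Append a catch-all row for users who aren't working.
# The wording here is our own choice, not an O*NET value.
not_employed = pd.DataFrame(
    [{"Title": "Not Employed", "Alternate Title": "Unemployed", "Short Title": ""}]
)
onet_df = pd.concat([onet_df, not_employed], ignore_index=True)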
Now let’s merge these fields together. Since we’ll be leveraging a language model, we can take the liberties of language here; we don’t need clean many-to-many relationships. Just combine Title, Alternate Title, and when available, Short Title into one Long Title field with "aka" inline:
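A sketch of that merge, assuming the column names above; a simple row-wise string build is all we need:

def make_long_title(row):
    # Start with the canonical O*NET title.
    long_title = row["Title"]
    # Fold in the alternate title with a conversational "aka".
    if row["Alternate Title"]:
        long_title += f" aka {row['Alternate Title']}"
    # And the short title, when one exists.
    if row["Short Title"]:
        long_title += f" aka {row['Short Title']}"
    return long_title

onet_df["Long Title"] = onet_df.apply(make_long_title, axis=1)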
Great. We’ve satisfied parts one and two of our approach. We have a finite set of job titles, and we have a good understanding that the set is large but not exhaustive, and it combines multiple values for a job title. Let’s start modeling language.
For the language model, we’ll reach for JobBERT, a model trained specifically on job titles. We can’t use JobBERT out of the box, though; we’ll need to incorporate our O*NET dataset. Let’s pull it down and start building out our implementation of the model. To do this we’re going to leverage word embeddings against our Long Title values. If you’re unfamiliar with language model embeddings, Simon Willison has a fantastic overview that you should go read now. But the gist of it is: embeddings are how language models numerically encode meaning from language into a large vector. This is surprisingly powerful, and yields operations like:
emb('king') - emb('man') + emb('woman'), which returns a vector that is mathematically very close to emb('queen').
We’re going to use this “closeness” between vectors to reduce infinite free-form data to our finite Long Title data and then map it back to Title. The first thing to do is quite simple: calculate JobBERT embeddings on all of the values in our Long Title column:
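A sketch of that step with the sentence-transformers library. The model ID (TechWolf/JobBERT-v2) is an assumption about which JobBERT checkpoint to use; the normalization flag matters because it makes the dot-product shortcut below valid:

from sentence_transformers import SentenceTransformer

# Model ID is an assumption; any job-title-tuned sentence embedding model works.
model = SentenceTransformer("TechWolf/JobBERT-v2")

# normalize_embeddings=True gives unit-length vectors, which lets us use a
# plain dot product as cosine similarity later.
embeddings = model.encode(
    onet_df["Long Title"].tolist(),
    normalize_embeddings=True,
    show_progress_bar=True,
)
onet_df["embedding"] = list(embeddings)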
To a human reader, the embedding column is an indecipherable array of floats, but now we can do some cool things. Here are three rows from our data:
slice = onet_df[
    (onet_df["Long Title"] == "Software Developers aka Video Game Engineer")
    | (onet_df["Long Title"] == "Database Architects aka Information Architect")
    | (onet_df["Long Title"] == "Cargo and Freight Agents aka Shipping Agent")
]
slice[["Long Title", "embedding"]]
       Long Title                                      embedding
4806   Database Architects aka Information Architect   [0.020110216, 0.06301323, -0.029263753, -0.022...
5062   Software Developers aka Video Game Engineer     [0.041696787, 0.024444718, -0.053837907, 0.031...
25268  Cargo and Freight Agents aka Shipping Agent     [-0.034480397, -0.01120864, -0.005822623, -0.0...
In a vector space, you can evaluate how similar two vectors are by taking their cosine similarity: cos(a, b) = (a · b) / (‖a‖ ‖b‖). Since we normalized our vectors when we embedded them, the denominator in the cosine function becomes 1, so we can do this even more efficiently with just a dot product:
database_architect = slice.iloc[0]["embedding"]
software_developer = slice.iloc[1]["embedding"]
cargo_and_freight_agent = slice.iloc[2]["embedding"]

print(f"Software Developer vs Data Architect: {software_developer @ database_architect}")
print(f"Software Developer vs Cargo Agent: {software_developer @ cargo_and_freight_agent}")
Software Developer vs Data Architect: 0.2380954772233963
Software Developer vs Cargo Agent: 0.08600345253944397
Those numbers look perfectly reasonable: a modest overlap (≈ 0.24) between two tech roles and an almost-orthogonal relationship (≈ 0.09) to the Cargo Agent job.
Great. So now we have a mathematical way to compare the language of two job titles. And we’re not touching chatbots or third-party APIs at inference. The DataFrame is a self-contained model for semantically matching across the O*NET dataset. We’ve fully satisfied our approach!
Now to apply it to our problem. Instead of evaluating O*NET data against itself, we can use our embeddings to evaluate any free-form job title text a user might submit.
Let’s go get some real data to try it out! Since I work in political tech, I like to reach for campaign donor data.
Our friends over at ProPublica publish itemized ActBlue receipts by state. Because ActBlue is a conduit committee, these files include every transaction of any amount. That’s a lot of transactions! Let’s grab all of the ActBlue transactions from Texas for June 2024.
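A sketch of the load; the file path is hypothetical (download the Texas file from ProPublica first), and the occupation column name is an assumption about the file layout:

# Path is hypothetical; point it at the downloaded ProPublica CSV.
actblue_df = pd.read_csv(
    "actblue_tx_june_2024.csv",
    usecols=["occupation"],  # column name is an assumption about the file layout
).fillna("")

# Donors repeat, so dedupe the free-form values before embedding.
occupations = actblue_df["occupation"].str.strip()
occupations = occupations[occupations != ""].drop_duplicates().reset_index(drop=True)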
That occupation field came from donors and doesn’t perfectly match our modeled job titles. But we don’t need it to! Let’s use our model to calculate embeddings on these new values:
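This is the same encode call as before, reusing the already-loaded model and the occupations series from above; normalizing again keeps us in dot-product territory:

# Embed the donor-entered occupations with the same model and the same
# normalization so the two vector spaces are directly comparable.
donor_embeddings = model.encode(
    occupations.tolist(),
    normalize_embeddings=True,
    show_progress_bar=True,
)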
So now we have two sets of embeddings: our O*NET embeddings and our ActBlue donor embeddings. Just as before, we can calculate similarities between them. But unlike before, we need to calculate a lot of them. To find the best match we need to compare every O*NET embedding vector with every ActBlue embedding vector. That’s a lot of comparisons. The good news is that this is exactly what GPUs are good at, and a free-tier GPU in Google Colab can kick this out fast.
We convert our O*NET embedding column into a (n × d) tensor, where n is the number of rows and d is the vector length. Similarly, we convert the ActBlue embedding column into a (m × d) tensor where m is the number of ActBlue rows.
When pushing this to a GPU, it’s a little more art than science. We batch it, and picking an optimal batch size can take some trial and error. For every batch, we’ll calculate the dot product, and return the indices of the best similarities.
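Here’s a sketch of that batched comparison in PyTorch; the batch size of 1,024 is just a starting point for the trial and error mentioned above:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# (n × d): one unit-length row vector per O*NET Long Title.
onet_matrix = torch.tensor(np.stack(onet_df["embedding"].values), device=device)
# (m × d): one unit-length row vector per donor occupation.
donor_matrix = torch.tensor(donor_embeddings, device=device)

best_indices = []
batch_size = 1024  # tune for your GPU's memory

for start in range(0, donor_matrix.shape[0], batch_size):
    batch = donor_matrix[start : start + batch_size]
    # (batch × d) @ (d × n) -> (batch × n) matrix of cosine similarities.
    similarities = batch @ onet_matrix.T
    # Index of the most similar O*NET row for each donor occupation.
    best_indices.append(similarities.argmax(dim=1).cpu())

best_indices = torch.cat(best_indices)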
From there, we can map all the way back to our original O*NET Title column, as our normalized output:
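The mapping back is just positional indexing; a sketch, assuming the best_indices tensor from above:

# Map each best-match row index back to the canonical O*NET Title.
normalized = pd.DataFrame({
    "occupation": occupations,
    "normalized_title": onet_df["Title"].iloc[best_indices.numpy()].values,
})
normalized.sample(n=5)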
A way to visualize this is with principal component analysis. PCA computes new orthogonal axes, called principal components, that capture the most variation in the data. These directions are combinations of the original dimensions, chosen to reveal the biggest patterns and differences. By projecting each vector onto the first two principal components, we can plot everything in two dimensions while keeping as much of the original structure as possible.
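A sketch of that projection with scikit-learn and matplotlib, assuming the tensors from above:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Fit PCA on the O*NET vectors and project both sets onto the same two axes.
pca = PCA(n_components=2)
onet_2d = pca.fit_transform(onet_matrix.cpu().numpy())
donor_2d = pca.transform(donor_matrix.cpu().numpy())

plt.scatter(onet_2d[:, 0], onet_2d[:, 1], s=4, alpha=0.4, label="O*NET titles")
plt.scatter(donor_2d[:, 0], donor_2d[:, 1], s=4, alpha=0.4, label="Donor occupations")
plt.legend()
plt.show()

Spot-checking individual mappings is just as illuminating. Here are some normalizations from the donor data: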
code ninja → Computer Programmers
uber → Taxi Drivers
GM → General and Operations Managers
Professor, Artist → Art, Drama, and Music Teachers, Postsecondary
Postdoctoral Fellow → Clinical Research Coordinators
DARE Officer → Police and Sheriff's Patrol Officers
Senator → Legislators
Comms Director → Public Relations Managers
dermatology → Dermatologists
commodities trader → Securities, Commodities, and Financial Services Sales Agents
life insurance adjuster → Claims Adjusters, Examiners, and Investigators
lumberjack → Fallers
cpa → Accountants and Auditors
It’s not perfect. I don’t think I would have normalized Postdoctoral Fellow to Clinical Research Coordinators. This is where decisions we made in constructing the model start to show themselves. All those aka job title concatenations we did at the top affected how the language was modeled. Depending on your use case this could be fine; maybe you just need deterministic finite categories. You can play with the embedding calculation process to see how different strategies yield different results.
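The normalize() helper used below is a small wrapper; here’s a sketch of what it might look like, reusing the model and matrix we already built:

def normalize(text: str) -> str:
    # Embed the free-form text with the same model and normalization.
    vector = torch.tensor(
        model.encode([text], normalize_embeddings=True)[0], device=device
    )
    # Find the most similar O*NET row and return its canonical Title.
    best = (onet_matrix @ vector).argmax().item()
    return f"{text} → {onet_df['Title'].iloc[best]}"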
As for our nurses:
print(normalize("nurse"))
print(normalize("RN"))
print(normalize("MSN"))
print(normalize("ER Nurse"))
print(normalize("home health nursing"))
nurse → Registered Nurses
RN → Registered Nurses
MSN → Registered Nurses
ER Nurse → Registered Nurses
home health nursing → Registered Nurses
Language models are pretty good at modeling language!
🖤 Thank you for reading a personal blog. This post was written in a specific time and place. I reserve the right to learn new things and to change my mind. This site has no traffic analytics, social trackers, or ads. If you enjoyed this post, please consider sharing it however you like to share posts.