Using LLM Embeddings to Normalize User Data

How cosine similarity in embedding space can power high-quality normalization.

Author

Matt Hodges

Published

August 2, 2025

A common challenge in working with operational or CRM-style data is that you often find yourself dealing with user-entered free text. A recurring example is the form that asks users for their job title and employer. This data might seem secondary, but for any organization trying to understand, segment, or personalize communication with its users, it’s incredibly valuable.

Of course, user-entered data is messy. One person types “nurse,” another “ER Nurse,” another “RN,” and yet another “home health nursing.” If you want to understand the composition of your user base, or build automated systems that adapt to it, you need to normalize that chaos into a finite and meaningful taxonomy. You wouldn’t want a dashboard full of job titles in SpongeBob casing, but realistically, you have to work with whatever comes through the form.

A data-forward organization might use this information for all kinds of purposes: tailoring outreach, prioritizing leads, enriching analytics, or even customizing onboarding flows. But none of that works unless the data is clean, consistent, and structured.

With language models, we can do better. We don’t need to predefine normalization rules or manually review each row. And we don’t even need much prior knowledge about our users to start.

But before we jump straight to the AI, let’s define our approach:

  1. To normalize free-form text, there must be a finite set of target categories.
  2. If we know nothing about users in advance, we need a reliable way to discover or define those categories.
  3. We’re decidedly not using a chatbot, and we’re not relying on external APIs.
  4. This isn’t a generative task; it’s about semantic understanding.

Let’s start at the top. There are an infinite number of values a user could enter for their job, and we want to reduce that to a finite set. So where do we get that set?

The Occupational Information Network (O*NET) maintains exactly such a resource. Developed under the sponsorship of the Department of Labor, O*NET offers rich datasets that describe skills, knowledge, tasks, and job titles. We’re interested in the Alternate Titles file, which maps occupation titles to alternate “lay” job titles. There’s a good chance many of our users enter these alternate titles, so we’ll want to include them.

The file includes columns of Department of Labor and Census identifiers, but we only need the few that focus on title. Let’s download it and take a look at a few examples:

import numpy as np
import pandas as pd

onet_df = pd.read_excel(
    "https://www.onetcenter.org/dl_files/database/db_29_3_excel/Alternate%20Titles.xlsx",
    usecols=['Title', 'Alternate Title', 'Short Title'],
).fillna("")

onet_df.sample(n=5, random_state=101)  # seed for reproducibility
       Title                                    Alternate Title                                    Short Title
25268  Cargo and Freight Agents                 Shipping Agent
30407  Helpers, Construction Trades, All Other  Maintenance Construction Helper
 6045  Bioengineers and Biomedical Engineers    Biomedical Engineering Intern
18684  Occupational Therapy Aides               Rehabilitation Therapy Aide (Rehab Therapy Aide)   Rehab Therapy Aide
24086  Billing and Posting Clerks               Statement Services Representative (Statement S...  Statement Services Rep

So O*NET tells us that Cargo and Freight Agents might also go by Shipping Agent as an Alternate Title, and that Occupational Therapy Aides might also go by Rehab Therapy Aide as a Short Title.

We also see that there can be many rows of different Alternate Title and Short Title for the same Title:

onet_df[onet_df["Title"] == "Software Developers"]
      Title                Alternate Title                                    Short Title
4931  Software Developers  .NET Developer
4932  Software Developers  Android Developer
4933  Software Developers  AngularJS Developer
4934  Software Developers  Apache Hadoop Developer
4935  Software Developers  Application Architect
 ...  ...                  ...                                                ...
5061  Software Developers  User Interface Designer
5062  Software Developers  Video Game Engineer
5063  Software Developers  Wide Area Network Engineer (WAN Engineer)          WAN Engineer
5064  Software Developers  Windows Software Engineer
5065  Software Developers  XML Developer (Extensible Markup Language Deve...  XML Developer

135 rows × 3 columns

There is one thing we do know ahead of time about our users: not all of them will be employed. The O*NET data set doesn’t provide a job title for not working, so let’s add our own:

additions = pd.DataFrame(
    [
        {"Title": "Unemployed", "Alternate Title": "Not Employed"},
        {"Title": "Unemployed", "Alternate Title": "None"},
        {"Title": "Unemployed", "Alternate Title": "N/A"},
        {"Title": "Unemployed", "Alternate Title": "No Employment"},
        {"Title": "Unemployed", "Alternate Title": "Not Working"},
        {"Title": "Retired", "Alternate Title": "Retiree"},
    ]
)
onet_df = pd.concat([onet_df, additions], ignore_index=True).fillna("")
onet_df[(onet_df["Title"] == "Unemployed") | (onet_df["Title"] == "Retired")]
       Title       Alternate Title  Short Title
56560  Unemployed  Not Employed
56561  Unemployed  None
56562  Unemployed  N/A
56563  Unemployed  No Employment
56564  Unemployed  Not Working
56565  Retired     Retiree

Now let’s merge these fields together. Since we’ll be leveraging a language model, we can take some liberties with language here; we don’t need clean many-to-many relationships. Just combine Title, Alternate Title, and, when available, Short Title into one Long Title field with "aka" inline:

mask = onet_df["Short Title"].eq("")

onet_df["Long Title"] = np.where(
    mask,
    onet_df["Title"] + " aka " + onet_df["Alternate Title"],
    onet_df["Title"] + " aka " + onet_df["Alternate Title"] + " aka " + onet_df["Short Title"],
)

onet_df[onet_df["Title"] == "Software Developers"]
      Title                Alternate Title                                    Short Title    Long Title
4931  Software Developers  .NET Developer                                                    Software Developers aka .NET Developer
4932  Software Developers  Android Developer                                                 Software Developers aka Android Developer
4933  Software Developers  AngularJS Developer                                               Software Developers aka AngularJS Developer
4934  Software Developers  Apache Hadoop Developer                                           Software Developers aka Apache Hadoop Developer
4935  Software Developers  Application Architect                                             Software Developers aka Application Architect
 ...  ...                  ...                                                ...            ...
5061  Software Developers  User Interface Designer                                           Software Developers aka User Interface Designer
5062  Software Developers  Video Game Engineer                                               Software Developers aka Video Game Engineer
5063  Software Developers  Wide Area Network Engineer (WAN Engineer)          WAN Engineer   Software Developers aka Wide Area Network Engi...
5064  Software Developers  Windows Software Engineer                                         Software Developers aka Windows Software Engineer
5065  Software Developers  XML Developer (Extensible Markup Language Deve...  XML Developer  Software Developers aka XML Developer (Extensi...

135 rows × 4 columns

Great. We’ve satisfied parts one and two of our approach. We have a finite set of job titles, and we understand its shape: large but not exhaustive, with multiple values combined per title. Let’s start modeling language.

JobBERT-v2 is a sentence-transformers model fine-tuned from all-mpnet-base-v2 specifically for job title matching and similarity. Hey, that’s convenient!

We can’t use JobBERT out of the box; we’ll need to incorporate our O*NET dataset. Let’s pull it down and start building out our implementation of the model. To do this, we’re going to calculate embeddings for our Long Title values. If you’re unfamiliar with language model embeddings, Simon Willison has a fantastic overview that you should go read now. But the gist of it is: embeddings are how language models numerically encode meaning from language into a large vector. This is surprisingly powerful, and yields operations like:

emb('king') - emb('man') + emb('woman'), which returns a vector that is mathematically very close to emb('queen').

We’re going to use this “closeness” between vectors to reduce infinite free-form data to our finite Long Title data and then map it back to Title. The first thing to do is quite simple: calculate JobBERT embeddings on all of the values in our Long Title column:

from sentence_transformers import SentenceTransformer
import warnings
warnings.filterwarnings("ignore")

model = SentenceTransformer("TechWolf/JobBERT-v2")

onet_df["embedding"] = list(
    model.encode(
        onet_df["Long Title"].tolist(),
        normalize_embeddings=True,
        convert_to_numpy=True,
        show_progress_bar=False,
    )
)

onet_df[["Long Title", "embedding"]].sample(n=5, random_state=101)
       Long Title                                         embedding
25268  Cargo and Freight Agents aka Shipping Agent        [-0.034480397, -0.01120864, -0.005822623, -0.0...
30408  Helpers, Construction Trades, All Other aka Me...  [-0.07022176, -0.020068161, 0.0111531215, -0.0...
 6045  Bioengineers and Biomedical Engineers aka Biom...  [-0.030613927, -0.059320696, -0.01718829, -0.0...
18684  Occupational Therapy Aides aka Rehabilitation ...  [-0.014956265, -0.038792193, -0.00349255, 0.00...
24086  Billing and Posting Clerks aka Statement Servi...  [0.020981414, -0.033710796, 0.03225505, -0.011...

To a human reader, the embedding column is an indecipherable array of floats, but now we can do some cool things. Here are three rows from our data:

# "subset" avoids shadowing Python's built-in slice
subset = onet_df[
    (onet_df["Long Title"] == "Software Developers aka Video Game Engineer") |
    (onet_df["Long Title"] == "Database Architects aka Information Architect") |
    (onet_df["Long Title"] == "Cargo and Freight Agents aka Shipping Agent")
]
subset[["Long Title", "embedding"]]
       Long Title                                     embedding
 4806  Database Architects aka Information Architect  [0.020110216, 0.06301323, -0.029263753, -0.022...
 5062  Software Developers aka Video Game Engineer    [0.041696787, 0.024444718, -0.053837907, 0.031...
25268  Cargo and Freight Agents aka Shipping Agent    [-0.034480397, -0.01120864, -0.005822623, -0.0...

In a vector space, you can evaluate how similar two vectors are by taking their cosine similarity: the dot product of the two vectors divided by the product of their magnitudes. Since we normalized our vectors when we embedded them, every magnitude is 1, so the denominator drops out and we can do this even more efficiently with just a dot product:

database_architect = subset.iloc[0]["embedding"]
software_developer = subset.iloc[1]["embedding"]
cargo_and_freight_agent = subset.iloc[2]["embedding"]

print(f"Software Developer vs Data Architect: {software_developer @ database_architect}")
print(f"Software Developer vs Cargo Agent: {software_developer @ cargo_and_freight_agent}")
Software Developer vs Data Architect: 0.2380954772233963
Software Developer vs Cargo Agent: 0.08600345253944397

Those numbers look perfectly reasonable: a modest overlap (≈ 0.24) between two tech roles and an almost-orthogonal relationship (≈ 0.09) to the Cargo Agent job.
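As a quick sanity check, here’s a minimal sketch (reusing the vectors we just pulled out) confirming that the full cosine formula and the bare dot product agree on these unit-length embeddings:

# With unit-length embeddings, the cosine denominator is ~1,
# so the full formula and the plain dot product agree
cosine = (software_developer @ database_architect) / (
    np.linalg.norm(software_developer) * np.linalg.norm(database_architect)
)
print(np.isclose(cosine, software_developer @ database_architect))  # True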

Great. So now we have a mathematical way to compare the language of two job titles. And we’re not touching chatbots or third-party APIs at inference time. The DataFrame is a self-contained model for semantically matching against the O*NET dataset. We’ve fully satisfied our approach!
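A nice side effect of that self-containment: the embedding table only needs to be computed once. Here’s a minimal sketch of persisting it for later inference runs; the file names are placeholders, not anything prescribed:

# Save the embedding matrix and the title lookup separately so future runs
# don't need to re-encode the O*NET data (file names are placeholders)
np.save("onet_embeddings.npy", np.stack(onet_df["embedding"]))
onet_df.drop(columns=["embedding"]).to_csv("onet_titles.csv", index=False)

# Later, reload and reattach:
# onet_df = pd.read_csv("onet_titles.csv").fillna("")
# onet_df["embedding"] = list(np.load("onet_embeddings.npy"))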

Now to apply it to our problem. Instead of evaluating O*NET data against itself, we can use our embeddings to evaluate any free-form job title text a user might submit.

Let’s go get some real data to try it out! Since I work in political tech, I like to reach for campaign donor data.

Our friends over at ProPublica publish itemized ActBlue receipts by state. Because ActBlue is a conduit committee, these files include every transaction of any amount. That’s a lot of transactions! Let’s grab all of the ActBlue transactions from Texas for June 2024.

dtypes = {
    "flag_orgind": "string",
    "first_name": "string",
    "city": "string",
    "zip": "string",
    "amount": "float64",
    "occupation": "string",
}

abtx_df = pd.read_csv(
    "https://pp-projects-static.s3.amazonaws.com/itemizer/sa_1791562_tx.csv",
    usecols=dtypes.keys(),
    dtype=dtypes,
)
abtx_df = abtx_df[abtx_df["flag_orgind"] == "IND"]
abtx_df.drop(columns=["flag_orgind"], inplace=True)
abtx_df.dropna(inplace=True)

employed = lambda df: df["occupation"].ne("NOT EMPLOYED")
get_sample = lambda seed: abtx_df.loc[employed(abtx_df)].sample(n=10, random_state=seed)

get_sample(30330)  # seed for reproducibility
        first_name  city                zip    amount  occupation
139133  PAT         NORTH RICHLAND HIL  76180  9.0     RETIRED
  4248  HARRY       HOUSTON             77019  125.0   LAWYER
221073  THAO        HOUSTON             77083  2.5     HISTOLOGY TECHNICIAN
201669  PHILIP      SAN ANTONIO         78240  3.0     MANAGER
256487  MARIA       HOLLAND             76534  1.0     HEALTHCARE ADMINISTRATOR
  6330  ELLEN       BELLAIRE            77401  100.0   ARBITRATOR
125976  MICHELLE    FORT WORTH          76133  10.0    SERVER
 13659  ROSE        HOUSTON             77024  75.0    NURSING
138213  PETER       HOUSTON             77019  9.0     DATABASE ANALYST
 62939  ERIN        FRISCO              75035  25.0    NONPROFIT

That occupation field came from donors and doesn’t perfectly match our modeled job titles. But we don’t need it to! Let’s use our model to calculate embeddings on these new values:

abtx_df["embedding"] = list(
    model.encode(
        abtx_df["occupation"].tolist(),
        normalize_embeddings=True,
        convert_to_numpy=True,
        show_progress_bar=False,
    )
)

get_sample(30330)
        first_name  city                zip    amount  occupation                embedding
139133  PAT         NORTH RICHLAND HIL  76180  9.0     RETIRED                   [-0.022304475, 0.08798518, 0.008374137, 0.0150...
  4248  HARRY       HOUSTON             77019  125.0   LAWYER                    [0.0123939905, 0.06054912, 0.0046267705, -0.03...
221073  THAO        HOUSTON             77083  2.5     HISTOLOGY TECHNICIAN      [0.027493875, -0.079993084, 0.013278877, -0.00...
201669  PHILIP      SAN ANTONIO         78240  3.0     MANAGER                   [0.09953636, 0.07623968, 0.020005615, 0.001361...
256487  MARIA       HOLLAND             76534  1.0     HEALTHCARE ADMINISTRATOR  [0.036386397, 0.06352263, -0.0023324555, -0.02...
  6330  ELLEN       BELLAIRE            77401  100.0   ARBITRATOR                [0.05744064, 0.03044543, -0.011071598, 0.01120...
125976  MICHELLE    FORT WORTH          76133  10.0    SERVER                    [0.0222358, 0.025343752, -0.027377797, -0.0031...
 13659  ROSE        HOUSTON             77024  75.0    NURSING                   [0.008113306, 0.03859657, -0.014793488, -0.064...
138213  PETER       HOUSTON             77019  9.0     DATABASE ANALYST          [0.042439297, 0.08193651, -0.027909847, 0.0143...
 62939  ERIN        FRISCO              75035  25.0    NONPROFIT                 [-0.052989695, 0.19739158, 0.007059781, 0.0288...

So now we have two sets of embeddings: our O*NET embeddings and our ActBlue donor embeddings. Just as before, we can calculate similarities between them. But unlike before, we need to calculate a lot of them. To find the best match, we have to compare every O*NET embedding vector with every ActBlue embedding vector. That’s a lot of comparisons. The good news: this is exactly what GPUs are good at, and a free-tier GPU in Google Colab can kick this out fast.

We convert our O*NET embedding column into an (n × d) tensor, where n is the number of rows and d is the vector length. Similarly, we convert the ActBlue embedding column into an (m × d) tensor, where m is the number of ActBlue rows.

When pushing this to a GPU, it’s a little more art than science. We batch the work, and picking an optimal batch size can take some trial and error. For every batch, we calculate the dot products and keep the indices of the best similarities.

From there, we can map all the way back to our original O*NET Title column, as our normalized output:

import torch

# Use a GPU if one is available; otherwise fall back to CPU (slower, but it runs)
device = "cuda" if torch.cuda.is_available() else "cpu"
onet_t = torch.tensor(np.stack(onet_df.embedding), device=device)  # (n × d)
abtx_t = torch.tensor(np.stack(abtx_df.embedding), device=device)  # (m × d)

batch = 4096  # tune to fit GPU RAM
best = []

with torch.no_grad():
    for s in range(0, abtx_t.size(0), batch):
        sims = abtx_t[s:s+batch] @ onet_t.T  # (batch × n)
        best.append(sims.argmax(dim=1).cpu())

idx = torch.cat(best).numpy()

abtx_df["Normalized Occupation"] = onet_df.Title.iloc[idx].to_numpy()

get_sample(30330)[["first_name", "occupation", "Normalized Occupation"]]
        first_name  occupation                Normalized Occupation
139133  PAT         RETIRED                   Retired
  4248  HARRY       LAWYER                    Lawyers
221073  THAO        HISTOLOGY TECHNICIAN      Histology Technicians
201669  PHILIP      MANAGER                   Managers, All Other
256487  MARIA       HEALTHCARE ADMINISTRATOR  Medical and Health Services Managers
  6330  ELLEN       ARBITRATOR                Arbitrators, Mediators, and Conciliators
125976  MICHELLE    SERVER                    Food Servers, Nonrestaurant
 13659  ROSE        NURSING                   Registered Nurses
138213  PETER       DATABASE ANALYST          Database Administrators
 62939  ERIN        NONPROFIT                 Fundraisers
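And now that every donor row carries a normalized title, the composition questions from the top of this post become one-liners. For example:

# Composition of the donor base, by normalized occupation
abtx_df["Normalized Occupation"].value_counts().head(10)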

Another way to inspect the two embedding sets is to visualize them with principal component analysis (PCA). PCA computes new orthogonal axes, called principal components, that capture the most variation in the data. These directions are combinations of the original dimensions, chosen to reveal the biggest patterns and differences. By projecting each vector onto the first two principal components, we can plot everything in two dimensions while keeping as much of the original structure as possible:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_onet = np.vstack(onet_df["embedding"].to_numpy())
X_abtx = np.vstack(abtx_df["embedding"].to_numpy())
X_all  = np.vstack([X_onet, X_abtx])

pca = PCA(n_components=2)
proj = pca.fit_transform(X_all)
coords_onet = proj[: len(X_onet)]
coords_abtx = proj[len(X_onet) :]

plt.figure(figsize=(16, 9))
plt.scatter(coords_onet[:, 0], coords_onet[:, 1], alpha=0.6, label="O*NET job titles")
plt.scatter(coords_abtx[:, 0], coords_abtx[:, 1], alpha=0.6, label="Donor-entered occupations")
plt.title("PCA of Occupation Embeddings: O*NET vs Donor-entered")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.legend()
plt.tight_layout()
plt.show()

And back in our higher-dimensional space, we can calculate ad hoc similarities on any arbitrary job title someone might enter:

def normalize(job_title):
    ad_hoc = torch.tensor(
        model.encode(
            [job_title],
            normalize_embeddings=True,
            convert_to_numpy=True,
            show_progress_bar=False,
        ),
        device=device,
    )
    idx = int((ad_hoc @ onet_t.T).argmax(dim=1))
    return f'{job_title} → {onet_df.at[idx, "Title"]}'

print(normalize("code ninja"))
print(normalize("uber"))
print(normalize("GM"))
print(normalize("Professor, Artist"))
print(normalize("Postdoctoral Fellow"))
print(normalize("DARE Officer"))
print(normalize("Senator"))
print(normalize("Comms Director"))
print(normalize("dermatology"))
print(normalize("commodities trader"))
print(normalize("life insurance adjuster"))
print(normalize("lumberjack"))
print(normalize("cpa"))
code ninja → Computer Programmers
uber → Taxi Drivers
GM → General and Operations Managers
Professor, Artist → Art, Drama, and Music Teachers, Postsecondary
Postdoctoral Fellow → Clinical Research Coordinators
DARE Officer → Police and Sheriff's Patrol Officers
Senator → Legislators
Comms Director → Public Relations Managers
dermatology → Dermatologists
commodities trader → Securities, Commodities, and Financial Services Sales Agents
life insurance adjuster → Claims Adjusters, Examiners, and Investigators
lumberjack → Fallers
cpa → Accountants and Auditors

It’s not perfect. I don’t think I would have normalized Postdoctoral Fellow to Clinical Research Coordinators. This is where decisions we made in constructing the model start to show themselves. All those aka job title concatenations we did at the top affected how the language was modeled. Depending on your use case this could be fine; maybe you just need deterministic finite categories. You can play with the embedding calculation process to see how different strategies yield different results.
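When a match surprises you, it helps to look past the single best hit. Here’s a small debugging sketch (the top_matches helper is my own addition, built on the same tensors normalize uses) that returns the top k O*NET candidates for a free-form title:

# Debugging aid: return the k best O*NET candidates instead of only the argmax
def top_matches(job_title, k=5):
    ad_hoc = torch.tensor(
        model.encode(
            [job_title],
            normalize_embeddings=True,
            convert_to_numpy=True,
            show_progress_bar=False,
        ),
        device=device,
    )
    scores, idxs = torch.topk((ad_hoc @ onet_t.T).squeeze(0), k)
    return [
        (onet_df.at[int(i), "Long Title"], round(float(s), 3))
        for s, i in zip(scores.cpu(), idxs.cpu())
    ]

top_matches("Postdoctoral Fellow")

Seeing which Long Title concatenations crowd the runner-up spots makes it easier to judge whether the aka strategy is helping or hurting a given match.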

As for our nurses:

print(normalize("nurse"))
print(normalize("RN"))
print(normalize("MSN"))
print(normalize("ER Nurse"))
print(normalize("home health nursing"))
nurse → Registered Nurses
RN → Registered Nurses
MSN → Registered Nurses
ER Nurse → Registered Nurses
home health nursing → Registered Nurses

Language models are pretty good at modeling language!