PhD Thesis: NLP for Uncovering the Illicit Massage Industry
PhD thesis (Harvard, Sep 2023) applying NLP and finetuned BERT models to classify online massage parlor reviews, enabling geographic analysis of illicit activity to support counter-trafficking research.
PhD Thesis Defense — Harvard SEAS, September 2023
How can computer science be used to fight human trafficking? I use natural language processing (NLP) to monitor the United States illicit massage industry (IMI), a multi-billion dollar industry that offers not just therapeutic massages but also commercial sexual services. Employees of this industry are often immigrant women with few job opportunities, leaving them vulnerable to fraud, coercion, and other facets of human trafficking. Monitoring spatiotemporal trends in the IMI helps prevent trafficking. I build datasets from three publicly accessible websites (Google Places, Rubmaps, and AMPReviews) and apply NLP techniques such as bag-of-words and Word2Vec to derive insights into the labor pressures and language barriers that employees face, as well as the income, demographics, and societal pressures affecting sex buyers. I include a call to action for other researchers to make use of these datasets.
A subset of massage parlors host illicit activity alongside legitimate services. These locations accrue reviews on both mainstream platforms (Google Maps) and niche platforms (Rubmaps). By training a classifier on the more heavily labeled niche dataset and applying it to the mainstream dataset, researchers can estimate the geographic distribution and intensity of illicit activity, supporting counter-trafficking investigation and policy.
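The train-on-niche, predict-on-mainstream idea can be illustrated with a small sketch. This is not the thesis pipeline: it swaps in a bag-of-words naive Bayes classifier written with the standard library in place of a finetuned BERT model, and the review texts, the token `flagterm`, and the region names are synthetic placeholders invented for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Bag-of-words tokenization: lowercase and split on whitespace.
    return text.lower().split()

def train_naive_bayes(labeled_reviews):
    """Fit word and document counts from (text, label) pairs, label in {0, 1}."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in labeled_reviews:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return the label with the higher log-probability under the model."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids log(0).
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0

# Synthetic stand-in for the labeled niche-platform reviews (1 = flagged).
niche_labeled = [
    ("flagterm extra flagterm", 1),
    ("extra flagterm offered", 1),
    ("relaxing deep tissue massage", 0),
    ("friendly staff relaxing massage", 0),
]

# Synthetic stand-in for unlabeled mainstream reviews, tagged with a region.
mainstream_unlabeled = [
    ("Region A", "flagterm offered here"),
    ("Region A", "relaxing massage friendly staff"),
    ("Region B", "deep tissue massage"),
]

model = train_naive_bayes(niche_labeled)

# Aggregate predictions per region to get a share of flagged reviews.
predictions_by_region = defaultdict(list)
for region, text in mainstream_unlabeled:
    predictions_by_region[region].append(predict(model, text))

share_by_region = {
    region: sum(preds) / len(preds)
    for region, preds in predictions_by_region.items()
}
```

The per-region share is only one possible summary. A real analysis would also need to account for how review volume and vocabulary differ between the two platforms.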
Review-volume analysis across 2020 — illicit-activity review patterns shifted noticeably during COVID-19 shutdowns.
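A monthly review-volume series of this kind can be built by counting reviews per calendar month. The sketch below uses only the standard library rather than pandas, and the dates are hypothetical placeholders, not thesis data.

```python
from collections import Counter
from datetime import date

# Hypothetical review dates; a real dataset would supply these.
review_dates = [
    date(2019, 12, 30),
    date(2020, 1, 15),
    date(2020, 2, 3),
    date(2020, 2, 20),
    date(2020, 4, 9),
    date(2020, 7, 1),
    date(2020, 7, 18),
]

# Count reviews per month, keeping only those posted in 2020.
counts = Counter(d.month for d in review_dates if d.year == 2020)

# Fill in months with no reviews so the series has 12 entries (Jan to Dec).
monthly_volume = [counts.get(month, 0) for month in range(1, 13)]

# Month-over-month change, useful for spotting abrupt drops or rebounds.
monthly_change = [
    later - earlier
    for earlier, later in zip(monthly_volume, monthly_volume[1:])
]
```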
What I did: Working solo, I used Python and the SimpleTransformers library to finetune BERT models for this binary text-classification task. I used the standard data science stack (pandas, numpy, matplotlib, seaborn, scipy, sklearn) for exploratory analysis, including review-volume time series across 2020. Data from Heyrick Research, courtesy of IBM.
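A minimal sketch of what finetuning a BERT binary classifier with SimpleTransformers can look like is given below. The checkpoint name `bert-base-uncased`, the single training epoch, the CPU-only setting, and the helper names `to_training_rows` and `finetune_bert` are assumptions made for illustration; the thesis settings are not stated here. The heavy imports sit inside the function so that the data-preparation step runs without them.

```python
def to_training_rows(labeled_reviews):
    """Turn (text, label) pairs into [text, label] rows, dropping blank texts."""
    return [
        [text.strip(), int(label)]
        for text, label in labeled_reviews
        if text.strip()
    ]

def finetune_bert(rows, model_name="bert-base-uncased", epochs=1):
    """Finetune a BERT binary classifier on [text, label] rows (hypothetical helper)."""
    # Imported lazily: these are large third-party dependencies.
    import pandas as pd
    from simpletransformers.classification import (
        ClassificationArgs,
        ClassificationModel,
    )

    # SimpleTransformers expects "text" and "labels" columns.
    train_df = pd.DataFrame(rows, columns=["text", "labels"])

    # Assumed settings for the sketch, not the thesis configuration.
    args = ClassificationArgs(num_train_epochs=epochs, overwrite_output_dir=True)
    model = ClassificationModel(
        "bert", model_name, num_labels=2, args=args, use_cuda=False
    )
    model.train_model(train_df)
    return model

# Placeholder reviews; the blank entry is filtered out.
rows = to_training_rows([
    ("placeholder review text one", 1),
    ("   ", 0),
    ("placeholder review text two", 0),
])

# Training downloads the checkpoint and is slow on CPU, so it is left commented out:
# model = finetune_bert(rows)
# predictions, raw_outputs = model.predict(["an unseen placeholder review"])
```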




PhD advisor: Prof. Roberto Rigobon. Prior collaborations with Prof. Robert Platt (Northeastern) and Prof. Robert Howe (Harvard).