Hi! I'm Tom. I build complex machine reading systems.
I'm co-founder and CTO at Noetica AI,
and before that I did an ML/NLP PhD at Columbia.
At a high level, I'm interested in how machine reading
can help us make sense of vast oceans of text
so that we can be better informed and make smarter choices.
I'm the technical founder and CTO at Noetica AI, where we're building next-generation machine reading technology that helps stakeholders understand populations of complex contracts in minutes instead of hours and surfaces previously inaccessible insights.
We translate insights and capabilities from state-of-the-art machine learning and natural language processing research into a real system that empowers domain experts and organizations to be more effective in the legaltech and fintech markets.
We are always looking to add skilled and motivated full-stack and ML/NLP engineers to our team, so if that sounds interesting, feel free to reach out!
For my thesis, I researched semi-supervised methods for making information extraction models as annotation-efficient as possible, using principled statistical methods to apply subject-matter expertise as weak supervision and to exploit cheaper, biased data. More generally, I'm interested in all aspects of building automated systems that generate structured knowledge bases from text in atypical or low-resource situations. I was advised by Michael Collins and defended in October 2022.
Before that, I worked on a cutting-edge decision support system that used text mining to help NYC Department of Health epidemiologists track actionable indicators of foodborne illness in Yelp reviews. During this project I was advised by Luis Gravano and Daniel Hsu.
In a TACL 2023 paper, I proposed an approach for improving cross-lingual syntactic parsers in low-resource languages by regularizing many simple aspects of their behavior on target languages with differentiable descriptive statistics. Called Expected Statistic Regularization, this general method uses marginal statistics, entropies, and other low-order descriptive statistics to keep the model from making erratic, implausible errors on new, out-of-domain languages, a very common failure mode for pretrain-and-transfer approaches. It yields strong empirical results, and I'm excited to try it on other structured inference problems.
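To give a flavor of the idea, here is a minimal sketch (not the paper's implementation) of regularizing a model with a differentiable descriptive statistic: we compare the model's expected label marginal on a batch against a target marginal supplied as prior knowledge. All names are illustrative.

```python
import numpy as np

def statistic_regularizer(label_probs, target_marginal):
    """Squared-error penalty between the model's expected label usage
    and a target marginal (a simple descriptive statistic).

    label_probs: (num_tokens, num_labels) per-token label distributions.
    target_marginal: (num_labels,) expected label frequencies.
    Both names are hypothetical, for illustration only.
    """
    expected = label_probs.mean(axis=0)  # model's average label distribution
    return float(((expected - target_marginal) ** 2).sum())
```

In the paper this kind of penalty is differentiable, so it can be added directly to the training loss to discourage implausible behavior on an unlabeled target language.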
In a TACL 2021 paper, I proposed an approach for learning named entity recognition models when the data has incomplete, low-recall annotations. Called the Expected Entity Ratio, the method corrects for missing spans with a principled latent-variable approach, coupled with an additional loss term that guides the model toward the expected number of entities. The cool thing is that annotators can move fast and leave repetitive or long documents unfinished without a performance hit; in fact, we show that labeling documents this way can be more annotation-efficient when the annotation budget is modest.
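As a rough sketch of the extra loss term (assumed details, not the paper's exact formulation): penalize the model when its expected fraction of entity tokens drifts too far from a prior target ratio, with a small tolerance margin.

```python
def expected_entity_ratio_loss(entity_probs, target_ratio, margin=0.05):
    """Hinge-style penalty keeping the model's expected entity rate
    near a prior target (e.g. estimated from a small clean sample).

    entity_probs: list of per-token probabilities of being in an entity.
    target_ratio, margin: hypothetical names/values for illustration.
    """
    ratio = sum(entity_probs) / len(entity_probs)  # expected entity fraction
    gap = max(0.0, abs(ratio - target_ratio) - margin)  # slack inside margin
    return gap ** 2
```

Combined with the latent-variable treatment of unlabeled spans, a term like this keeps the model from collapsing to predicting too few entities when annotations are incomplete.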
Many people disclose incidents of foodborne illness on Yelp and Twitter. In collaboration with the NYC Department of Health and Mental Hygiene, we built a document classification system that mines both platforms for such reports to facilitate the targeted investigation of foodborne illness outbreaks.
Using a mixture of unsupervised document similarity methods, we recommend relevant posts in medical discussion forums to help users find the information they're seeking faster. We're currently developing a Chrome extension to put the algorithm into practice.
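One of the simplest unsupervised similarity methods in such a mixture is bag-of-words cosine similarity; here is a minimal sketch (illustrative only, the deployed system combines several methods and the function names are hypothetical).

```python
import math
from collections import Counter

def cosine_sim(doc_a, doc_b):
    """Cosine similarity between two documents under a bag-of-words model."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend(query, posts, k=3):
    """Return the k forum posts most similar to the user's query."""
    return sorted(posts, key=lambda p: cosine_sim(query, p), reverse=True)[:k]
```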