I'm the technical founder and CTO at Stealth, where we are building next-generation machine reading technology to help people understand complex documents in minutes instead of hours.
We translate insights and capabilities from state-of-the-art machine learning and natural language processing research into a real system that empowers domain experts and organizations to work more effectively in a valuable legaltech / fintech market.
We are venture-backed and actively seeking to add skilled and motivated frontend, backend, and ML/NLP engineers to our team, so if that sounds interesting, feel free to reach out!
For my thesis, I researched semi-supervised methods for making information extraction models as annotation-efficient as possible, using principled statistical methods to apply subject-matter expertise as weak supervision and to exploit cheaper, biased data. More broadly, I'm interested in all aspects of building automated systems that generate structured knowledge bases from text in atypical or low-resource situations. I was advised by Michael Collins and defended in October 2022.
Previously, I worked on a decision support system that uses text mining to help NYC Department of Health epidemiologists track actionable indicators of foodborne illness in Yelp reviews.
Accepted to TACL (currently a preprint), I proposed an approach for improving cross-lingual syntactic parsers in low-resource languages by regularizing simple aspects of their behavior on target languages with differentiable descriptive statistics. Called Expected Statistic Regularization, this general method uses low-order descriptive statistics such as tag marginals and entropies to keep the model from making erratic, implausible errors on new, out-of-domain languages, a very common failure mode for pretraining-plus-transfer approaches. It yields strong results, and I'm excited to try it on other structured inference problems.
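As a toy illustration of the idea, here is a single regularization term that penalizes the gap between a model's expected statistic on unlabeled target-language text and a prior target value. The function name, the squared-distance choice, and the NumPy setup are my own simplification for exposition, not the paper's exact formulation:

```python
import numpy as np

def expected_statistic_loss(tag_marginals, feature_indicator, target_value):
    """One Expected-Statistic-Regularization-style term (simplified sketch).

    tag_marginals: (n_tokens, n_tags) model marginal probabilities on
                   unlabeled target-language tokens.
    feature_indicator: (n_tags,) 0/1 vector selecting the tags whose
                   expected proportion we want to constrain (e.g. verbs).
    target_value: prior estimate of that proportion in the target language.
    """
    # Expected proportion of tokens carrying the selected tags, under the model.
    expected = (tag_marginals @ feature_indicator).mean()
    # Squared deviation from the prior; differentiable in the marginals,
    # so in a real parser it can be backpropagated through inference.
    return (expected - target_value) ** 2
```

Because the statistic is an expectation under the model's marginals rather than a hard count, the whole term stays differentiable and can simply be added to the training loss with a weight.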
Published in TACL 2021, I proposed an approach for learning Named Entity Recognition models when the data has incomplete, low-recall annotations. Called the Expected Entity Ratio, the method corrects for missing spans using a principled latent-variable approach, coupled with a loss term that guides the model toward the expected number of entities. The cool thing is that it lets annotators move fast and leave repetitive or long documents unfinished without suffering a performance hit. In fact, we show that labeling documents this way can actually be more annotation-efficient when the annotation budget is modest.
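To give a flavor of the extra loss term, here is a sketch of a penalty that pulls the model's expected fraction of entity tokens toward a prior ratio. The hinge-on-an-interval form, the names, and the margin value are my own illustrative choices, not the exact objective from the paper:

```python
import numpy as np

def expected_entity_ratio_penalty(entity_probs, rho, margin=0.05):
    """Expected-Entity-Ratio-style penalty (illustrative sketch).

    entity_probs: (n_tokens,) model probability that each token belongs
                  to some entity, marginalized over entity types.
    rho: prior estimate of the fraction of tokens inside entities.
    margin: tolerated deviation before the penalty turns on.
    """
    expected_ratio = entity_probs.mean()
    # Zero penalty while the expected ratio stays within
    # [rho - margin, rho + margin]; outside that interval, a linear
    # hinge pulls the model back toward the prior ratio.
    return max(0.0, abs(expected_ratio - rho) - margin)
```

Intuitively, when annotations are low-recall the model is tempted to predict too few entities; this term counteracts that bias without requiring any additional labels.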
Many people disclose incidents of foodborne illness on Yelp and Twitter. In collaboration with the NYC Department of Health and Mental Hygiene, we built a document classification system that mines Yelp and Twitter for reports of foodborne illness to facilitate the targeted investigation of outbreaks.
Using a mixture of unsupervised document similarity methods, we recommend relevant posts in medical discussion forums to help users find the information they're seeking faster. We're currently developing a Chrome extension to put the algorithm into practice.
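A minimal sketch of one unsupervised similarity method of the kind such a mixture might include, plain TF-IDF with cosine similarity. The function names, tokenization, and weighting here are illustrative stand-ins, not the production system:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple sparse TF-IDF vectors (dicts of token -> weight)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency per token
    for toks in tokenized:
        df.update(set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(query_idx, docs, k=1):
    """Return indices of the k posts most similar to docs[query_idx]."""
    vecs = tfidf_vectors(docs)
    scores = [(cosine(vecs[query_idx], vecs[j]), j)
              for j in range(len(docs)) if j != query_idx]
    return [j for _, j in sorted(scores, reverse=True)[:k]]
```

Being fully unsupervised, a recommender like this needs no labeled relevance data, which is what makes it practical for forum posts.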