Hi! I'm Tom. I build complex machine reading systems.
I'm a graduating PhD student from Columbia
and the technical co-founder of an AI startup.

At a high level, I'm interested in how machine reading
can help us make sense of vast oceans of text
so that we can be better informed and make smarter choices.

Startup


I'm the technical founder and CTO at Stealth, where we are building next-generation machine reading technology to help people understand complex documents in minutes instead of hours.

We are translating insights and capabilities from the state of the art in machine learning and natural language processing research into a real system that empowers domain experts and organizations to be more effective in a valuable legaltech / fintech market.

We are venture-backed and actively looking to add skilled and motivated frontend, backend, and ML/NLP engineers to our team, so if that sounds interesting, feel free to reach out!

Research


For my thesis, I researched semi-supervised methods for making information extraction models as annotation-efficient as possible, using principled statistical methods to apply subject-matter expertise as weak supervision and to make use of cheaper, biased data. More generally, I'm interested in all aspects of building automated systems that generate structured knowledge bases from text in atypical or low-resource situations. I'm advised by Michael Collins and defended in October 2022.

Previously, I worked on a decision support system that uses text mining to help NYC Department of Health epidemiologists track actionable indicators of foodborne illness in Yelp reviews.

Selected Projects


Expected Statistic Regularization

In this work, accepted to TACL (currently a preprint), I proposed an approach for improving cross-lingual syntax parsers in low-resource languages by regularizing many simple aspects of their behavior on target languages with differentiable descriptive statistics. Called Expected Statistic Regularization, this general method uses marginal statistics, entropies, and other low-order descriptive statistics to keep the model from making erratic and implausible errors on new, out-of-domain languages, a very common error mode for pretraining + transfer learning approaches. It yields impressive results, and I'm excited to try it out on other structured inference problems.

Partially Supervised NER

In this work, published in TACL in 2021, I proposed an approach for learning Named Entity Recognition models when the data has incomplete, low-recall annotations. Called the Expected Entity Ratio, the method corrects for missing spans using a principled latent-variable approach, coupled with an additional loss term that guides the model to predict the expected number of entities. The cool thing is that it enables annotators to move fast and leave repetitive or long documents unfinished without suffering a performance hit. In fact, we show that it can actually be more annotation-efficient to label documents this way when the annotation budget is modest.

Detecting Foodborne Illness on Social Media

Joint work with Lampros Flokas, Yogesh Garg, and Anna Lawson

Advised by Luis Gravano and Daniel Hsu

Many people disclose incidents of foodborne illness on Yelp and Twitter. In collaboration with the NYC Department of Health and Mental Hygiene, we built a document classification system that mines Yelp and Twitter for reports of foodborne illness to facilitate the targeted investigation of foodborne illness outbreaks.

The Posts Recommendation Algorithm for dExplorer

Joint work with Drashko Nakikj

Using a mixture of unsupervised document similarity methods, we recommend relevant posts in medical discussion forums to help users find the information they're seeking faster. We're currently developing a Chrome extension to put the algorithm into practice.

Publications


  • Thomas Effland and Michael Collins. Improving Low-Resource Cross-Lingual Parsing with Expected Statistic Regularization. Transactions of the Association for Computational Linguistics (TACL), preprint.
  • Thomas Effland and Michael Collins. Partially Supervised Named Entity Recognition via the Expected Entity Ratio Loss. Transactions of the Association for Computational Linguistics (TACL), December 2021.
  • T. Effland, Anna Lawson, Sharon Balter, Katelynn Devinney, Vasudha Reddy, Luis Gravano, Daniel Hsu. Discovering Foodborne Illness in Online Restaurant Reviews. Journal of the American Medical Informatics Association (JAMIA), Volume 25, Issue 12, 1 December 2018, Pages 1586-1592.
    Selected Press ->
  • T. Effland. Focused Retrieval of University Course Descriptions from Highly Variable Sources. In ACM Student Research Competition Grand Finals, 2015.
    First Place Award
  • J. Hartloff, M. Morse, B. Zhang, T. Effland, J. Cordaro, J. Schuler, S. Tulyakov, A. Rudra, V. Govindaraju. A Multiple Server Scheme for Fingerprint Fuzzy Vaults. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2015.
  • M. Morse, J. Hartloff, T. Effland, J. Schuler, J. Cordaro, S. Tulyakov, A. Rudra, V. Govindaraju. Secure Fingerprint Matching With Generic Local Structures. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2014.
  • T. Effland, M. Schneggenburger, J. Schuler, B. Zhang, J. Hartloff, J. Dobler, S. Tulyakov, A. Rudra, V. Govindaraju. Secure Fingerprint Hashes Using Subsets of Local Structures. In Proc. SPIE 9075-12, Biometric and Surveillance Technology for Human and Activity Identification XI, 90750D, 2014.

Resources


  • Material from my PhD candidacy exam on the intersection of deep learning and structured probabilistic models for NLP. You can find my overview and slides here
  • Slides from a talk I gave in a Deep Generative Models seminar on using conditional random fields in variational autoencoders
  • My defense slides
  • My dissertation