Hi! I'm Tom. I build complex machine reading systems.
I'm co-founder and CTO at Noetica AI,
and before that I did an ML/NLP PhD at Columbia.

At a high level, I'm interested in how machine reading
can help us make sense of vast oceans of text
so that we can be better informed and make smarter choices.

Noetica AI


I'm the technical founder and CTO at Noetica AI, where we are building next-generation machine reading technology to help stakeholders understand populations of complex contracts in minutes instead of hours and surface previously inaccessible insights.

We are translating insights and capabilities from the state of the art in machine learning and natural language processing research into a real system that empowers domain experts and organizations to be more effective in a valuable legaltech / fintech market.

We are always looking to add skilled and motivated full-stack and ML/NLP engineers to our team, so if that sounds interesting, feel free to reach out!

Research


For my thesis, I researched semi-supervised methods for making information extraction models as annotation-efficient as possible, using principled statistical methods to apply subject-matter expertise as weak supervision and to make use of cheaper, biased data. More generally, I'm interested in all aspects of building automated systems that generate structured knowledge bases from text in atypical or low-resource situations. I was advised by Michael Collins and defended in October 2022.

Before that, I worked on a cutting-edge decision support system that used text mining to help NYC Department of Health epidemiologists track actionable indicators of foodborne illness in Yelp reviews. During this project I was advised by Luis Gravano and Daniel Hsu.

Selected Projects


Expected Statistic Regularization

In work published in TACL 2023, I proposed an approach for improving cross-lingual syntax parsers in low-resource languages by regularizing many simple aspects of their behavior on target languages with differentiable descriptive statistics. Called Expected Statistic Regularization, this general method uses marginal statistics, entropies, and other low-order descriptive statistics to keep the model from making erratic and implausible errors on new, out-of-domain languages, a very common failure mode for pretraining + transfer learning approaches. It yields impressive results and I'm excited to try it out on other structured inference problems.
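To make the idea concrete, here is a minimal sketch, not the implementation from the paper. It assumes the statistic is the expected frequency of each label under the model's per-token distributions, and that the regularizer is a squared-error penalty against target frequencies. The function names, the choice of statistic, and the squared-error form are all illustrative assumptions.

```python
from typing import List, Sequence


def expected_label_frequencies(token_label_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-token label distributions to get expected label frequencies."""
    if not token_label_probs:
        raise ValueError("need at least one token")
    num_tokens = len(token_label_probs)
    num_labels = len(token_label_probs[0])
    return [
        sum(row[j] for row in token_label_probs) / num_tokens
        for j in range(num_labels)
    ]


def statistic_penalty(
    token_label_probs: Sequence[Sequence[float]],
    target_frequencies: Sequence[float],
) -> float:
    """Hypothetical regularizer: squared distance between expected and target statistics."""
    expected = expected_label_frequencies(token_label_probs)
    return sum((e - t) ** 2 for e, t in zip(expected, target_frequencies))


# Two tokens, two labels; the model over-predicts label 0 relative to the target.
probs = [[1.0, 0.0], [0.5, 0.5]]
penalty = statistic_penalty(probs, target_frequencies=[0.5, 0.5])
```

In a real training setup this quantity would be computed with a differentiable framework and added to the main loss on unlabeled target-language text; the plain-Python version above only shows the arithmetic.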

Partially Supervised NER

In work published in TACL 2021, I proposed an approach for learning Named Entity Recognition models when the data has incomplete, low-recall annotations. Called the Expected Entity Ratio, the method corrects for missing spans using a principled latent-variable approach, coupled with an additional loss term that guides the model toward predicting the expected number of entities. The cool thing is that it enables annotators to move fast and leave repetitive or long documents unfinished without suffering a performance hit. In fact, we show that it can actually be more annotation-efficient to label documents this way when the annotation budget is modest.
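As a rough illustration of the extra loss term, here is a simplified sketch, not the published formulation. It assumes the model exposes a per-token probability of being part of an entity, and that the penalty is a hinge on the gap between the expected entity ratio and a target ratio, with a tolerance margin. The function names, the margin default, and the hinge form are assumptions made for this example.

```python
from typing import Sequence


def expected_entity_ratio(token_entity_probs: Sequence[Sequence[float]]) -> float:
    """Expected fraction of tokens tagged as entities, over all sentences."""
    total_tokens = sum(len(sentence) for sentence in token_entity_probs)
    if total_tokens == 0:
        raise ValueError("need at least one token")
    expected_entities = sum(p for sentence in token_entity_probs for p in sentence)
    return expected_entities / total_tokens


def ratio_penalty(
    token_entity_probs: Sequence[Sequence[float]],
    target_ratio: float,
    margin: float = 0.05,
) -> float:
    """Hypothetical penalty: zero inside the margin, linear outside it."""
    ratio = expected_entity_ratio(token_entity_probs)
    return max(0.0, abs(ratio - target_ratio) - margin)


# Two sentences of two tokens each; the expected entity ratio is 0.25.
batch = [[0.9, 0.1], [0.0, 0.0]]
penalty = ratio_penalty(batch, target_ratio=0.15, margin=0.05)
```

The margin means the model is only pushed when its expected entity rate drifts clearly away from the target, which is useful when the target ratio itself is only a rough estimate.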

Detecting Foodborne Illness on Social Media

Joint work with Lampros Flokas, Yogesh Garg, and Anna Lawson

Advised by Luis Gravano and Daniel Hsu

Many people disclose incidents of foodborne illness on Yelp and Twitter. In collaboration with the NYC Department of Health and Mental Hygiene, we built a document classification system that mines Yelp and Twitter for reports of foodborne illness to facilitate the targeted investigation of foodborne illness outbreaks.

The Posts Recommendation Algorithm for dExplorer

Joint work with Drashko Nakikj

Using a mixture of unsupervised document similarity methods, we recommend relevant posts in medical discussion forums to help users find the information they're seeking faster. We're currently developing a Chrome extension to put the algorithm into practice.

Publications


  • Thomas Effland and Michael Collins. Improving Low-Resource Cross-Lingual Parsing with Expected Statistic Regularization. Transactions of the Association for Computational Linguistics (TACL), January 2023
  • Thomas Effland and Michael Collins. Partially Supervised Named Entity Recognition via the Expected Entity Ratio Loss. Transactions of the Association for Computational Linguistics (TACL), December 2021
  • T. Effland, Anna Lawson, Sharon Balter, Katelynn Devinney, Vasudha Reddy, Luis Gravano, Daniel Hsu. Discovering Foodborne Illness in Online Restaurant Reviews. Journal of the American Medical Informatics Association (JAMIA), Volume 25, Issue 12, 1 December 2018, Pages 1586–1592
    Selected Press ->
  • T. Effland. Focused Retrieval of University Course Descriptions from Highly Variable Sources. In ACM Student Research Competition Grand Finals, 2015.
    First Place Award
  • J. Hartloff, M. Morse, B. Zhang, T. Effland, J. Cordaro, J. Schuler, S. Tulyakov, A. Rudra, V. Govindaraju. A Multiple Server Scheme for Fingerprint Fuzzy Vaults. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2015.
  • M. Morse, J. Hartloff, T. Effland, J. Schuler, J. Cordaro, S. Tulyakov, A. Rudra, V. Govindaraju. Secure Fingerprint Matching With Generic Local Structures. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2014.
  • T. Effland, M. Schneggenburger, J. Schuler, B. Zhang, J. Hartloff, J. Dobler, S. Tulyakov, A. Rudra, V. Govindaraju. Secure Fingerprint Hashes Using Subsets of Local Structures. In Proc. SPIE 9075-12, Biometric and Surveillance Technology for Human and Activity Identification XI, 90750D, 2014.

Resources


  • Material from my PhD candidacy exam on the intersection of deep learning and structured probabilistic models for NLP. You can find my overview and slides here
  • Some slides I gave in a Deep Generative Models seminar on using conditional random fields in variational autoencoders
  • My defense slides
  • My dissertation