December 16, 2020
Developing and employing natural language processing (NLP) models in industry has become progressively more challenging as model complexity increases, data sets grow in size, and computational requirements rise. These hurdles limit many organizations’ ability to access and leverage NLP capabilities, putting their significant benefits out of reach.
To help overcome these hurdles, beginning in June 2019 Vector ran a collaborative project with industry sponsors and researchers to show companies how to recreate NLP models for deployment within their businesses. The Recreation of Large Scale Pre-Trained Language Models project (the NLP Project) familiarized participating industry sponsors with advanced NLP techniques, as well as the workflows for developing new methods that can achieve high performance while using relatively small data sets and widely accessible computing resources.
Whereas most NLP research collaborations are designed to produce state-of-the-art models with competitively low error rates, the objective of the NLP Project was to create a collaborative and scalable learning environment in which multiple companies could gain the hands-on experience necessary to create and scale NLP models whose primary objective is to produce business value. As such, the project involved 60 participants: 23 Vector researchers and staff with expertise in machine learning and NLP, along with 37 industry technical professionals from 16 Vector industry sponsor companies. The participants established 11 working groups, each of which developed and performed experiments relevant to existing industry needs. Additionally, at the beginning of the COVID-19 pandemic, a special interest group (SIG-Kaggle-COVID19) was established with the objective of developing question answering approaches that can help the medical community develop answers to high priority scientific questions.
To help other organizations build and deploy NLP models and gain value from the project's work, the Vector Institute, together with project participants, presented their findings and insights in a technical report and symposium:
- The NLP Project Technical Report – “Harnessing the Power of Natural Language Processing (NLP): A Vector Institute Industry Collaborative Project”
- The NLP Symposium, September 15-16, 2020 – a two-day virtual meeting featuring presentations and hands-on workshops delivered by the project participants and Vector researchers. Keynote speakers included He He, Assistant Professor, Computer Science and Data Science, New York University; Khalid Al-Kofahi, Senior Vice President and Head of AI Personal Investments, Fidelity; and Vector Faculty Members Jimmy Ba, Gennady Pekhimenko, and Frank Rudzicz.
Published work based on research from the NLP Project and presented at the NLP Symposium:
- Multi-node Bert-pretraining: Cost-efficient Approach
- Smart System to Generate and Validate Question Answer Pairs for COVID-19 Literature
- Multi-Node Training of Large Scale Language Models
- Non-Pharmaceutical Intervention Discovery with Topic Modeling
- Customizing Contextualized Language Models for Legal Document Reviews
- A Partial Replication of Language Representation in the Biomedical Domain, Evolution of Deep Learning Symposium, Poster Presentation
- An Experimental Evaluation of Transformer-based Language Models in the Biomedical Domain
- Harnessing the Power of NLP: A Vector Institute Industry Collaborative Project
- Kaggle CORD19 Dataset Challenge Submissions:
Taken together, industry participants benefited from the NLP Project by gaining experience with pre-training of large scale language models, attending expert lectures leading to effective knowledge transfer, accessing Vector’s scientific computing resources, establishing fruitful collaborations with other sponsor organizations, and using their domain expertise to accelerate the dissemination of scientific knowledge and help the medical community in the fight against COVID-19. Notably, insights gained in the NLP Project have informed programs and product development in some participating organizations.