How Thomson Reuters uses NLP to enable knowledge workers to make faster and more accurate business decisions
April 2, 2020
Hidden Injustice, a recent investigative report by Reuters, revealed how federal civil court rulings obscured the role that pharmaceutical companies played in the rising opioid epidemic. It was ground-breaking for what reporters uncovered, but also for how they uncovered it. They used machine learning and natural language processing (NLP) to review 3.2 million federal civil suits and over 90 million court actions to identify material filed under seal. They could then narrow their focus to search for instances in which “public health and safety information was kept secret without explanation.”
The series won first prize at the 2019 Philip Meyer Journalism Awards, which recognize works of “precision journalism, computer-assisted reporting, and social science research.”
For Thomson Reuters — the parent company of the news organization — AI techniques like machine learning and NLP are at the heart of what it does best: enabling legal, media, and tax & accounting professionals to find information, understand it, and use it to make decisions.
Khalid Al-Kofahi, who previously headed Thomson Reuters’ Center for AI and Cognitive Computing, says, “Generally speaking, knowledge workers such as attorneys and accountants essentially do three things. They have information needs so they engage in a journey of research and discovery. As they do that, they start analyzing it to understand it. Then at some stage they move to some sort of action or decision. We use AI technology to support all these activities.”
Thomson Reuters’ own journey of research and discovery included sponsoring the Vector Institute, which it did for three reasons: to stay on the front line of fundamental research, to support Canada’s AI ecosystem, and to develop approaches to common AI challenges through collaboration with other industry players.
One collaboration highlight is the Vector consortium project on NLP, a technique used to pursue “the holy grail” of AI: the fluent understanding of language. This project involves 25 industry participants that work with Vector researchers in workstreams focused on various NLP-related experiments. Thomson Reuters participated in a workstream to cost-effectively replicate BERT — bidirectional encoder representations from transformers — an advanced language representation model. Creating BERT requires a deep neural network to be pre-trained on a large body of unlabeled text — like that found on Wikipedia, Twitter, or a news site — to create a general model of how language works. This pre-trained BERT can then be fine-tuned for tasks like machine translation, sentiment analysis, and question answering in specific domains like law, health, and finance.
This utility often comes at quite a cost, though. Pre-training a BERT model typically requires days of processing on hardware that may be prohibitively expensive for most organizations to access. Fine-tuning a vendor’s pre-trained BERT model on specialized cloud-based processors like graphics processing units (GPUs) or tensor processing units (TPUs) is much less demanding, but still often requires significant time and expense.
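To make the pre-training step more concrete: BERT learns from unlabeled text largely by hiding some of the words and asking the network to predict them, an objective known as masked language modeling. The sketch below is only an illustration of that idea, not the code used in the Vector workstream or by Thomson Reuters. The function name `mask_tokens`, the toy sentence, and the tiny vocabulary are assumptions made for the example, and the per-word coin flip is a simplification of how production implementations choose which positions to mask.

```python
import random

MASK = "[MASK]"  # placeholder symbol the model sees instead of the hidden word


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list for masked language modeling (illustrative only).

    Returns (corrupted, labels). labels[i] holds the original token at every
    position selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; the model is not scored here
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # about 80% of selected positions are masked
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # about 10% get a random token
        # the remaining ~10% keep the original token unchanged
    return corrupted, labels


# Hypothetical toy input; real pre-training uses a very large text corpus.
sentence = "the court sealed the filing without explanation".split()
vocab = sorted(set(sentence))
corrupted, labels = mask_tokens(sentence, vocab, mask_prob=0.15, seed=1)
print(corrupted)
print(labels)
```

Because the labels come from the text itself, no human annotation is needed, which is why sources like Wikipedia or news archives can be used directly. The expense described above comes from repeating this prediction task over very large volumes of text with a deep network.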
“When you look at some of these language models, they require a huge amount of resources to build,” Al-Kofahi says. “Part of the challenge for us was: Can we train these models using more distributed architectures and figure out algorithms that can reduce the demand for many GPUs?” The first phase of experiments, run on Vector’s own GPU cluster, was promising.
According to Al-Kofahi, this consortium project is “a tide that lifts all boats,” since participants gain benefits without having to risk competitive edges. He explains, “This is an area where it makes a lot of sense for industry to collaborate because we are establishing solutions for horizontal problems: how to scale deep learning models. Then each one of us, once we figure out a solution to that problem, can take that and adapt these models.”
Al-Kofahi continues, “We took these learnings and adapted them to different domains. We have BERT for legal, BERT for tax, BERT for other domains as well, and we are now exploring how to incorporate some of these models for some of our products. This is a win-win situation.”
One product for which the results show potential is Westlaw, Thomson Reuters’ legal research service suite and the technology that enabled Reuters journalists to analyze millions of legal documents for Hidden Injustice. Westlaw will soon also play a key role in a much broader judicial arena: Thomson Reuters was recently chosen by the Administrative Office of the U.S. Courts to provide legal research tools to the Federal Judiciary, including the Supreme Court and federal public defenders.
These awards illustrate one of the potential benefits of pursuing new AI insights and staying close to the leading edge of AI research: the development of technology that increases justice and enhances access to it.
In 2017, the Legal Services Corporation, a non-profit established by Congress, released a report declaring that “A lack of available resources accounts for the vast majority of eligible civil legal problems that go unserved or underserved,” and that “insufficient resources account for between 85% and 97% of all unserved or underserved eligible problems.”
“How can we improve access?” Al-Kofahi says. “There are significant opportunities to use AI and machine learning to improve matter intake, provide resolutions that are aided by an arbitrator downstream, and so on. I think AI in that sense will transform the legal industry and how legal services are provided.”
He adds, “We are already part of the transformation.”
 Reuters Investigates. Hidden Injustice. How we did the data analysis. www.reuters.com/investigates/special-report/usa-courts-secrecy-how/
 Investigative Reporters & Editors. The Philip Meyer Awards. https://www.ire.org/awards/philip-meyer-awards/
 Legal Services Corporation. The Justice Gap: Measuring the Unmet Civil Legal Needs of Low-income Americans. 2017. Pg. 44