Thomson Reuters: GenAI Tool Tested By Stanford ‘Leverages Innovation in Casetext’ – Updated

Thomson Reuters (TR) has confirmed that the AI-Assisted Research tool that Stanford HAI tested for accuracy ‘leverages innovation in Casetext’. This matters because TR paid $650m for the company – and the study showed the genAI tool generated high levels of hallucinations. They have also finally answered other key questions put to them by Artificial Lawyer (see Q&A below).

This is what they said on Thursday evening (June 13):

‘AI-Assisted Research had already been in development before the acquisition of Casetext’s CoCounsel. It leverages innovation in Casetext using a “best of” approach across product development.

Then, this evening (Friday, June 14), TR also sent some additional comments, which this site has included here:

‘None of the technology in Casetext’s CoCounsel is part of AI-Assisted Research, so it was not tested or a component of the research from Stanford. As to the “best of” approach, that is part of the Thomson Reuters Generative AI Platform – a common development platform that enables Thomson Reuters to design, build, and deploy reusable GenAI components that become the building blocks for future skills and products.’

However, when this site spoke to TR about this tonight, they confirmed that: ‘Innovations from Casetext are in AI-AR through the Generative AI platform, however, CoCounsel is not.’ (I.e. the original Casetext CoCounsel from 2023.)

But, this may all be a moot point, as TR then went on to explain (AL’s italics) to this site that now: ‘There is no genAI case law product called CoCounsel anymore. CoCounsel as acquired [when TR bought Casetext] doesn’t exist any longer.’

‘And, CoCounsel Core [the new legal genAI assistant] does not include case law research. So if you want genAI case law research, then you have to use AI AR.’

And it was AI-AR that Stanford studied.

I.e. we are back to the beginning to some degree when it comes to genAI and case law research: if you want to use genAI case law research with Westlaw, then you have to use AI-AR, which, as TR has noted, ‘leverages innovation in Casetext’.

One further point that TR’s spokesperson stressed is that, as things currently stand, the Casetext team of engineers that came with the acquisition and TR’s engineers working on genAI are all blended together. So their output of genAI ideas and innovations, which flows through the Thomson Reuters Generative AI Platform, is blended together as well.

Why does this matter? The Stanford HAI study found that the AI-Assisted Research (AI-AR) tool for Westlaw had an accuracy rate of only 42% and an overall hallucination rate of 33% – see study. Meanwhile, TR told this site it reached ‘approximately 90%’ accuracy.

[ Image: the Stanford HAI study that started all of this. ]

TR completed its $650m acquisition of Casetext in August 2023, long before the Stanford tests were conducted in late spring of this year. So, understandably, the legal tech community wanted to know if Casetext had contributed in any way to the tool that was tested and that had – according to the Stanford HAI study – resulted in a low accuracy score. And the answer is: yes, it did, because ‘AI-Assisted Research…leverages innovation in Casetext using a “best of” approach across product development’.

And, as noted above: ‘There is no genAI case law product called CoCounsel anymore….So if you want genAI case law research, then you have to use AI AR.’

So, while it may well be correct to say that ‘CoCounsel’ for case law research as it arrived from Casetext is ‘not part’ of AI-AR, it’s very hard to entirely separate out all of Casetext’s innovations and the ideas its team fed into TR after the acquisition.

This site is glad that this has been answered and then further added to. Now, onto the main Q&A, which TR sent to this site on Thursday evening. This covers issues such as: ‘is 90% accuracy good enough?’, testing methodologies, benchmarking, and other key issues.

How is TR testing for accuracy? How is it testing for hallucinations? How different are the methodologies of TR and Stanford HAI?

At Thomson Reuters, we test and score the end-to-end solution across a variety of dimensions: helpfulness of the answers, comprehensiveness on the issue, relevance to the question asked, etc. Prior to releasing new products, and once they are released, we test rigorously with hundreds of real-world legal research questions, where two lawyers grade each result and a third, more senior lawyer resolves any disagreements in grading.

We also test with customers prior to release. In our testing, their feedback was extraordinarily positive.

As for differences in methodology, the biggest are that we only test with the type of real-world legal research questions our customers face every day, and that we look at the overall helpfulness of answers and the potential harm of inaccuracies.
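An aside from this site: to make the grading scheme TR describes concrete, here is a minimal, purely illustrative sketch of how a two-grader process with senior adjudication could be scored. This is not TR’s actual tooling; the class, the function names and the sample grades are all hypothetical.

```python
# Hypothetical sketch of a two-grader scheme with senior adjudication:
# two lawyers grade each answer, a senior lawyer breaks ties, and
# accuracy is the share of adjudicated passes.

from dataclasses import dataclass

@dataclass
class GradedAnswer:
    grader_1: bool               # first lawyer: is the answer acceptable?
    grader_2: bool               # second lawyer: is the answer acceptable?
    senior: bool | None = None   # senior lawyer's call, used only on disagreement

def adjudicate(g: GradedAnswer) -> bool:
    """Agreement stands; any disagreement is resolved by the senior lawyer."""
    if g.grader_1 == g.grader_2:
        return g.grader_1
    if g.senior is None:
        raise ValueError("a disagreement requires a senior adjudication")
    return g.senior

def accuracy(results: list[GradedAnswer]) -> float:
    """Accuracy = adjudicated passes / total questions graded."""
    finals = [adjudicate(r) for r in results]
    return sum(finals) / len(finals)

# Two unanimous grades (one pass, one fail) and one disagreement
# resolved as a pass by the senior lawyer:
sample = [
    GradedAnswer(True, True),
    GradedAnswer(False, False),
    GradedAnswer(True, False, senior=True),
]
print(f"accuracy: {accuracy(sample):.0%}")  # -> accuracy: 67%
```

The design point worth noting is that accuracy under such a scheme is defined per question, after adjudication, so a single disputed answer is only ever counted once.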

It would appear that TR believes Stanford HAI is wrong in its conclusions. Could you please comment on that? I.e. you welcome their study, but you suggest they are wrong.

While we agree with the spirit of the study, we were quite surprised when we saw the results of the Stanford research team’s review. Our testing shows approximately 90% accuracy when the solution is measured using the type of questions our customers ask. In partnership with our customers, we continue to improve our solutions as the technology evolves.

It’s clear that Casetext is a core part of the genAI products that were tested. Can you please explain exactly what ‘AI-Assisted Research within Westlaw Precision’ is and where Casetext (and CoCounsel) fits into this?

It provides relevant answers to research questions with links to trusted Westlaw authority. AI-Assisted Research had already been in development before the acquisition of Casetext’s CoCounsel.

We are taking a “best of” approach across product development to build on the expertise of our combined teams – generative AI experts, data scientists, software engineers, product developers and more – to deliver industry-leading technology and solutions that will help our customers accelerate legal research and drafting.

CoCounsel is the brand name for Thomson Reuters’ single GenAI assistant. It will become the one interface for our customers to access all our solutions. CoCounsel Core is the updated offering of what was Casetext’s CoCounsel.

[ And as noted above, they later said: ‘AI-Assisted Research had already been in development before the acquisition of Casetext’s CoCounsel. It leverages innovation in Casetext using a “best of” approach across product development.’ ]

Is 90% accuracy ‘good enough’? You say that AI is there to help lawyers find answers, not to replace them, but if a product is wrong, doesn’t this make matters worse?

Our thorough internal testing of AI-Assisted Research shows an accuracy rate of approximately 90% based on how our customers use it. We’ve shared on our website and in training that AI is an accelerant for thorough research, not a replacement, and we are very clear with customers that the product can produce inaccuracies. Hallucination rates quoted without closer examination imply entirely wrong answers, but in our testing, when we do see inaccuracies, they frequently appear alongside the right answer to the question. And even when inaccuracies do appear, by reviewing the primary law and using tools like KeyCite or statute annotations, they can typically be identified and mitigated quickly.

When used as designed, Westlaw’s AI-Assisted Research helps legal researchers do their work both faster and better, even with occasional errors, as these are not the kind of inaccuracies that would be harmful to practitioners running real-world legal research. We tell our customers the answers are generated by AI and need to be verified against the primary law; if there are any errors, that primary law (and further standard research checks: reviewing KeyCite flags, statute annotations, etc.) would reveal them.

We accept that 90% isn’t the end goal, and we’re continuing to work to get better each day. However, until the technology advances, there will always be the potential for inaccuracies when working with LLMs.
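A brief illustration from this site of the measurement point TR is making: whether an answer that contains both the right answer and a fabricated element counts as a ‘hallucination’ or as ‘accurate’ depends entirely on the metric definition. The sketch below uses invented data purely to show how the two definitions diverge; it reflects neither TR’s nor Stanford’s actual figures.

```python
# Invented data purely for illustration: each graded answer records whether
# it contained the right answer, and whether it also contained a fabrication.
answers = [
    {"has_right_answer": True,  "has_fabrication": False},
    {"has_right_answer": True,  "has_fabrication": True},   # right answer plus a bad citation
    {"has_right_answer": False, "has_fabrication": True},   # entirely wrong
    {"has_right_answer": True,  "has_fabrication": False},
]

# Strict definition: any fabricated element counts as a hallucination.
strict_rate = sum(a["has_fabrication"] for a in answers) / len(answers)

# Answer-level definition: count only answers missing the right answer entirely.
entirely_wrong_rate = sum(not a["has_right_answer"] for a in answers) / len(answers)

print(f"strict hallucination rate: {strict_rate:.0%}")         # -> 50%
print(f"entirely wrong rate:       {entirely_wrong_rate:.0%}") # -> 25%
```

That definitional gap is part of TR’s argument for why its ‘approximately 90%’ figure and Stanford’s numbers can diverge so sharply.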

The previous statement by Mike Dahn doesn’t really go into detail about hallucinations; instead it focuses generally on ‘accuracy’. Can TR remove hallucinations from its products, e.g. via RAG or other methods? And are hallucinations riskier for users than a product just being ‘inaccurate’?

We have traditionally tested for and defined hallucinations as fabrications by the system. We rarely see hallucinations (or fabrications) by that definition in Westlaw AI-Assisted Research, in part because our systems are closed-loop and only source answers from TR’s richly enhanced content. This content is supported by more than 1,600 attorneys who continue to annotate the law, create digests, and build the West Key Number System taxonomy to organize the law in a way that supports the AI to deliver the right answers.

But, while these models can be incredibly powerful for legal research, they all hallucinate, even when techniques such as Retrieval Augmented Generation (RAG) are used. However, neither hallucinations, which are a particular subset of inaccuracies, nor other types of inaccuracy carry much risk of harm when primary sources are reviewed as part of a standard research process.
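For readers unfamiliar with the technique, below is a minimal, purely illustrative sketch of the RAG pattern referred to: retrieve trusted passages first, then constrain the model to answer only from them. This is not TR’s implementation; the tiny corpus, the keyword-overlap scoring and the generate() stub are hypothetical stand-ins for a real search index and LLM call.

```python
# Minimal, illustrative sketch of Retrieval Augmented Generation (RAG):
# retrieve trusted passages, then ask the model to answer ONLY from them.
# Corpus entries are invented placeholder cases, not real authorities.

CORPUS = {
    "smith_v_jones": "Smith v. Jones (2019): a contract generally requires offer, acceptance and consideration.",
    "doe_v_roe": "Doe v. Roe (2021): electronic signatures can satisfy the statute of frauds.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank passages by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        CORPUS.values(),
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Constrain the model to the retrieved sources to reduce fabrication."""
    sources = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using ONLY the sources below. If they do not contain the "
        f"answer, say so.\n\nSources:\n{sources}\n\nQuestion: {question}"
    )

def generate(prompt: str) -> str:
    """Placeholder for a real LLM call. Retrieval narrows what the model
    sees, but does not guarantee it stays within the sources."""
    return f"[model answer grounded in a prompt of {len(prompt)} characters]"

question = "Do electronic signatures satisfy the statute of frauds?"
print(generate(build_prompt(question, retrieve(question))))
```

Even in this toy form, the point TR makes holds: retrieval constrains what the model is shown, but nothing in the pattern physically prevents the model from asserting something the sources do not support.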

Where do things go from here? Will TR be more open about how it performs tests, e.g. publish those tests? Will you work with Stanford again?

As mentioned in our recent blog, Thomson Reuters is committed to transparency, and to exploring the creation of a consortium of stakeholders to work together to develop and maintain industry-standard benchmarks across a range of legal use cases. And we are already making progress: talks are in the early stages, but we are hopeful we can find a way to work together with Stanford, amongst others, on this important opportunity.

And finally, do you believe there should be sector-wide standards for genAI tools in legal and if so, would TR be interested in helping to create these standards?

In short, yes. We are very supportive of efforts to test and benchmark legal research solutions. Thomson Reuters is in early discussions with legal tech firms, law firms, academic researchers – including Stanford – to form a consortium to develop benchmarking for AI solutions. By working together, we can develop and maintain state-of-the-art benchmarks across a range of legal use cases.

[ Note: LexisNexis also told this site yesterday that they are happy to work with others, including Thomson Reuters, to build benchmarks. Let’s see where this consortium idea gets to.]

Conclusion:

As noted before, the publicly listed tech and data company wants to draw a line under this and move on. And the responses, and then the additional responses this evening, do help. So, thanks to Thomson Reuters for those. But, more clarity from day one here would have helped a lot.