Screens Publicly Announces 97.5% GenAI Accuracy in Transparency Move

In a move aimed at bringing more transparency to the sector, legal tech company Screens, the sister business of contract-focused TermScout, has made public its genAI accuracy levels, stating that the platform achieved overall accuracy of 97.5%, with precision at 0.9857, recall at 0.9822, and F1 at 0.9840. In a comprehensive report the company sets out why it did this (see summary below) and explains in detail how it arrived at these results for Screens’ particular use case (see full report here).

Why does this matter?

Anyone reading about the legal tech market in recent weeks will have noted the Stanford University genAI accuracy study, as well as the debate that has followed, which has now headed in multiple directions: from insurance, to what counts as an acceptable error rate for human lawyers, to how human lawyers work with genAI tools.

Another key aspect has been the idea that the legal and legal tech sectors should create benchmarks and standards by which specific genAI use cases can be tested and publicly shown to have reached a certain level of accuracy, with the ‘workings out’ included to show how those results were achieved.

Such moves should build trust across the market and ensure that buyers don’t have to run extensive accuracy tests of their own. While larger law firms and inhouse teams have innovation or legal ops teams to handle such detailed testing, many firms don’t. And even those that do may not want to spend time testing accuracy; they would simply like to use a legal tech tool with some degree of confidence, which seems a reasonable expectation in any industry.

Now, you might say that focusing on accuracy slows things down. And perhaps it does – temporarily. But lawyers will find out one way or another, so it’s best to meet this head on, right now, and as publicly as possible.

This site would argue that the accuracy of genAI tools, and what level is acceptable given that human lawyers also make mistakes in their work, needs to be openly discussed and explored. Confidence in these tools and a clear understanding of what ‘good results’ look like are essential if we are to use genAI at scale, i.e. not just on edge cases such as having a discursive chat with GPT-4 about a new piece of legislation, but with LLMs used in multiple ways across transactional and disputes work, from the top to the bottom of the law firm production pyramid, across the entire inhouse team, and across the contract management and compliance groups as well.

GenAI has the potential to reframe the entire legal production model, but it will only happen if lawyers use it at scale. Confidence in legal genAI products is therefore key.

In short, light drives out doubt. From there we can then build the use of genAI tools more confidently into billable work among law firms and ALSPs, and across the work that inhouse counsel handle.

P.S. In terms of what Screens does, it is a contract AI solution that features community playbooks created by legal experts to kick off your own contract review, all in Word. See more below.

Screens Accuracy Evaluation Report

By Otto Hanson, Founder TermScout & Screens

On March 20 this year I attended one of Howard Chao’s renowned Legal Tech Round Table events in Bonny Doon, CA. About 40 leaders from the legal tech ecosystem gathered around a large table for an entire day to discuss the state of AI in the law.

One notable discussion centered on the difficulties of measuring AI accuracy and hallucinations in legal applications. The consensus in the room was that so much of the work we ask Generative AI to do in the legal domain is subjective (e.g. writing a brief, redlining a contract, or finding relevant case law), and measuring the accuracy of subjective tasks is itself subjective and fraught with challenges. Artificial Lawyer’s own article ‘We Need to Talk About GenAI Accuracy’, published last week, arrived at some of the same conclusions as that round table: that measuring accuracy is one of the “serious challenges” surrounding adoption.

Today the team at Screens.ai, a sister company of TermScout (an AI contract solutions provider serving customers such as IBM, Lenovo, and NetApp), published a report that lays out a framework for measuring accuracy on one very specific, common, and important legal task: identifying whether complex legal standards are met in a contract. This type of binary classification problem is objective, meaning we can easily set up experiments to measure accuracy. This also happens to be a foundational capability of the Screens product, which is primarily focused on using AI to execute contract playbooks (called “screens”).

In an experiment undertaken by Screens.ai, the team took contracts of varying types and had a team of human contract analysts carefully review them to check for compliance with certain standards from a set of open-sourced Community Screens, including those created by legal experts like Laura Frederick, Andrew Glickman, Fatima Khan, and Colin Levy. This “Evaluation Set” consisted of testing an average of 14 legal standards across 51 contracts for a total of 720 human-verified, provable data points. With this Evaluation Set, the Screens.ai team then asked its production AI stack, consisting of proprietary techniques and technology powered by commercial LLMs, to review the same contracts for compliance with these standards in order to measure accuracy.
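
The report focuses on the published results rather than the evaluation code itself, but the core of such an exercise is simply comparing the AI’s yes/no verdict on each (contract, standard) pair with the human-verified answer. Below is a minimal sketch in Python of how that comparison could be tallied; the data structures and field names are illustrative assumptions, not taken from Screens’ actual pipeline.

from dataclasses import dataclass

@dataclass
class DataPoint:
    # One human-verified data point: does this contract meet this standard?
    contract_id: str
    standard_id: str
    human_verdict: bool   # ground truth from the human contract analysts
    ai_verdict: bool      # verdict returned by the AI stack under test

def tally(evaluation_set: list[DataPoint]) -> dict[str, int]:
    """Count agreement and disagreement between AI and human verdicts."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for point in evaluation_set:
        if point.ai_verdict and point.human_verdict:
            counts["tp"] += 1   # AI says met, humans agree
        elif point.ai_verdict and not point.human_verdict:
            counts["fp"] += 1   # AI says met, humans say it is not
        elif not point.ai_verdict and point.human_verdict:
            counts["fn"] += 1   # AI says not met, humans say it is met
        else:
            counts["tn"] += 1   # AI says not met, humans agree
    return counts

Because every data point is a binary judgement with a human-verified answer, disagreements can be counted mechanically, which is what makes this particular task measurable in a way that redlining or drafting is not.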

The Screens.ai platform achieved overall accuracy of 97.5%. Precision clocked in at 0.9857, recall at 0.9822, and F1 at 0.9840. The report further details the methodology, including links to the screens with the exact standards that were used, and has a section detailing how the experiment can be replicated by third parties wishing to confirm the results of the study.
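
For readers less familiar with the terminology, accuracy, precision, recall, and F1 are the standard metrics for a binary classifier, computed from the counts of true/false positives and negatives. A short helper (again a sketch using the standard textbook definitions, not code from the report) makes the relationships explicit:

def metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard binary-classification metrics from a confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total                         # share of all verdicts that were correct
    precision = tp / (tp + fp)                           # of the standards flagged as met, how many truly were
    recall = tp / (tp + fn)                              # of the standards truly met, how many were caught
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

Treating ‘standard is met’ as the positive class for illustration, precision penalises the AI for claiming a standard is met when it is not, recall penalises it for missing standards that are in fact met, and F1 balances the two.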

The report also compared a number of different techniques and LLMs against the Evaluation Set to measure how these variables impacted AI performance. Here are some highlights from these additional aspects of the study:

1. A comparison without Screens’ proprietary technology and enhancements (e.g. using only off-the-shelf retrieval augmented generation and off-the-shelf LLMs) resulted in three times the number of AI errors.

2. A comparison without “AI Guidance” (nuanced, precise instructions included in the standard prompts) resulted in four times the number of AI errors.

[Chart: Screens data, June 2024.]

3. A comparison of nine commercial LLMs revealed that GPT-4o was the most accurate, followed by Claude 3.5 Sonnet (just released by Anthropic last week).

Readers should note that the experiment did not attempt to wade into the murky waters of measuring the accuracy of other, more subjective AI functions available on the Screens platform, such as redlining or summarizing contracts. Nonetheless, the report provides a framework for separating subjective tasks from objective ones and focusing accuracy measurement on the latter. It suggests that while AI may still be flawed in many ways, the technology shows tremendous promise and is capable of adding meaningful value in contract review processes today, in particular when working with vendors building sector-specific applications on top of commercial LLMs.

We’re delighted to share this report with the community in hopes of making a small contribution to this important subject, and we encourage anyone interested in these questions to read the full report here and reach out with any feedback.

[ So, there you go, the genAI debate takes one more step forward. It would be great if every legal tech company could do something similar, i.e. set out clearly what part of their product(s) was tested and how it was tested, provide some useful comparisons (if possible), and then show the scores.

Is it a perfect scenario? Would a worldwide legal tech genAI body that could independently test every tool’s accuracy be even better? For sure. But such a standards body doesn’t exist yet, so the best we can have right now is for companies to be as transparent as they can be under their own steam and so help to increase trust in this technology. And as mentioned, see the full report via the link above for far more detail about how they did this. ]

[P.S. If you would also like to publish your product’s accuracy scores, then AL would be very happy to share them, as long as they come with a similar level of supporting documentation and detail, at least until doing this becomes so normal that there is no longer a need for AL to highlight it. ]