Why We Need Standards For Legal GenAI

Imagine buying a car from a vendor where there are no standards or benchmarks for measuring and understanding its safety features, its speed, or its fuel consumption. Perhaps the showroom salesperson is even vague about the total price. Would you still buy it?

Maybe you have heard of the brand and just want one, come what may? Well, that’s your personal choice, albeit arguably a reckless one.

Now imagine you hold a position of responsibility in a law firm or in-house legal team, where part of your role is to bring on board new technology to support a multi-million, or even multi-billion-dollar, business.

In the same way, now imagine buying a generative AI product where there are no clear indications of its performance – at least none that can be verified without extensive testing by your firm, using metrics you’ve thrown together on your own. Nor are there shared standards for developing those benchmarks.

Would it not be simpler if the following were true:

  • That, depending on the tool’s use case, there are performance benchmarks that are commonly shared within the industry, perhaps even published openly. That way you can immediately get a sense of how a tool performs, as long as the vendor is transparent.
  • That the benchmarks are based upon shared standards, so that if you are looking for X level of accuracy, that metric is defined by specific standard measures that show this clearly, and that those standards are shared openly as well, so everyone is comparing like with like. Again, this will be use-case specific – see the AL article here that highlights how different use cases may need different accuracy criteria. (For a purely illustrative sketch of what such standard measures might look like in practice, see the example after this list.)
  • You can then, for each product or feature use case, get a sense of what a tool can do and make a judgment without having to run your own internal tests.
  • And you may still want to run your own tests, perhaps because you have the resources to do so, but even then, knowing the common benchmarks and the standard measuring systems will really speed things up. That is also good for the vendor, which can make a sale more quickly, or otherwise move on to another prospective buyer. Either way, less time is wasted.
  • Plus, for the vendors, aside from saving time, they have also made their lives simpler. No longer will they have to ‘wing it’ in sales meetings with lawyers and their innovation teams, crossing their fingers behind their backs and hoping the buyers will be satisfied with how the tool performs. Now, they can say: ‘For this use case, we have reached X accuracy, using the following standard metrics in accordance with industry norms. All the information is on our website, as is an explanation of the tests we did and how we did them.’ To which the buyer then says: ‘Thank you. You’ve just saved us a lot of money and proven you can be trusted to be transparent. This may be the beginning of a beautiful partnership.’
  • Note: part of this may involve asking yourself: is this more accurate than a human lawyer? That’s a valid question. However, it’s also a question that leads down a rabbit hole if pursued indefinitely, as all lawyers perform differently, and differently again across various tasks. That said, when seeking to be objective about a tool’s accuracy, it helps to keep in mind how your firm approaches lawyer accuracy and how it handles oversight to catch errors – something all legal teams and law firms do to varying degrees. As noted in previous AL articles, we should expect high levels of accuracy in certain use cases, and we cannot expect a law firm to re-check everything multiple times, especially the base layer of facts in a matter, as that would be uneconomical for the business.
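
To make the bullet points above more concrete, here is a minimal, purely illustrative sketch in Python of how a shared, use-case-specific benchmark might score a tool, using a hypothetical contract clause extraction task. The task, the metrics chosen, and every name in the code are assumptions for illustration, not any existing or proposed industry standard.

```python
# Purely illustrative: how a shared, use-case-specific benchmark might
# score a genAI tool against an agreed gold-standard test set. Every
# name, metric choice and data point here is a hypothetical example,
# not an existing or proposed industry standard.

def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision: of the clauses the tool flagged, how many were correct.
    Recall: of the clauses it should have flagged, how many it found."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical use case: clause extraction from one test contract.
tool_output = {"indemnity", "termination", "change of control"}
gold_answer = {"indemnity", "termination", "assignment"}

p, r = precision_recall(tool_output, gold_answer)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```

The point is not these particular metrics. It is that if every vendor reported against the same agreed measures and the same shared test set, buyers could compare tools like with like, without each firm building its own testing approach from scratch.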

What Does The Outcome Look Like?

Let’s say various parties come together to agree upon these benchmarks and standards. What would this look like, and what outcomes might we get? Here are four scenarios:

  • Level 1 – There is a very constructive conversation. Ideas and information are shared, parties agree to be transparent on the tests they conduct to help others develop their own benchmarks, and everyone comes away better for it. Some vendors also promise to be more transparent about their genAI tool performance. There is no formal outcome, in part because these are all individual businesses and getting a large-scale agreement is not easy, but people will keep in contact, and knowledge levels will be higher. Overall, a helpful improvement.
  • Level 2 – A formal agreement is made and the various parties agree: 1) on a set of benchmarks and the supporting standards for them; 2) to share information based on tests and use of the tools they have at their firm or in-house; and 3) the vendors buy into this too and commit to transparency, as this may well boost their sales. But, although an organised network (see below) may have helped to convene things, there is no ‘regulator’ or ‘standards body’ standing behind this, as such. It’s an ‘alliance of the willing’, kept going by individuals who wish to sustain it.
  • Level 3 – A body of some type either takes on this need as part of its formal role, or such a formal body is created, with responsibility for sustaining these standards and benchmarks, driving forward transparency and communicating new developments, all for the good of the industry. It could even act as a testing centre, examining products and then publishing the results. Many such bodies exist across the economy in various areas, but they need some level of support from market participants to function.
  • Level 4 – One other outcome, probably the least likely here, is that a truly formal body is created that becomes more like a regulator, or is in fact a regulator. Such a body would not just promote benchmarks and standards, but actively enforce them, with penalties for vendors who do not adhere to the rules – perhaps simply the withdrawal of a ‘kite mark’ or ‘quality standards badge’ that the body can award or take back each year.

LITIG’s Project

As announced on Friday morning, LITIG, a UK-based group of legal innovation and legal IT experts, mostly from law firms, has launched a project, with the support of Artificial Lawyer, to try to address the above challenges. John Craske, Chief Innovation & Knowledge Officer at international law firm CMS and a LITIG Board member, created the initiative and will help to host it. Fellow legal tech media site Legal IT Insider is also supporting the project.

If you would like to get involved, there is a form you can use to get in contact with LITIG; it can be found here. Note: the form closes this Wednesday, July 3rd.

Which of the four outcomes will be reached remains to be seen, and there is no reason why a less formal outcome may not evolve over time into something more permanent. Either way, we have to start somewhere, so let’s see where this goes. The alternative is for all of us to sit in isolation in our silos, while the vendors decide in a piecemeal way whether to be transparent or not – which is where we are now. And a special hat-tip to Screens for helping to lead on this recently – see here.

Last word: while there were discussions about holding comparative bake-offs for the first wave of NLP/ML tools, the potential of genAI to alter the legal world has raised the urgency of establishing shared understanding and transparency to a whole new level. Such an outcome may not happen, but it would certainly be in the interests of everyone in the legal sector if it did.

Richard Tromans, Founder, Artificial Lawyer, July 2024