Industry Analysis

AI Benchmarks: Between Marketing Bluster and Reality

Cathal Murtagh · Published on May 20, 2025

With each new AI release or system comes a string of benchmarks and evaluations. The company behind the release touts the state-of-the-art nature of its AI in a number of key areas, highlighting its superiority over competitors. It might also have industry partners demoing integration with another set of tools.

It all makes for a great press release, but businesses looking to avail of the new technology are left with little that isn’t marketing bluster.

In this article we will go over:

  1. How big firms game evaluations and over-promise with generic claims.
  2. The rush to build ‘State of the Art’ systems before management knows what it even wants.
  3. The future cost of pushing ahead with these half-baked deployments.
  4. The optimal way to plan out systems for your company.

Of primary concern to any firm looking at the AI hype should be the question: will this increase some bottom-line metric? Some firms have a number of other, high-minded ideas in mind, but these are misguided and far too short-term in outlook.

The key issue for firms focused on the bottom line is how they can tell, before investing heavily, whether something of this nature will prove beneficial. For this, the industry has thus far relied on self-reported benchmark evaluations. These originate from when the field was still more commonly referred to as Natural Language Processing (NLP) and automatic metrics were used to compare the efficacy of different methods on NLP tasks. With the advent of Transformers and the explosion that was ChatGPT, they have remained a staple of new model releases. Skim a release paper and you will find common abbreviations or acronyms like "MMLU", "GPQA" and my favourite, "HellaSwag". Each denotes a different area or task set in which the model is to demonstrate capability.
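To make the scoring concrete, here is a minimal sketch of how a multiple-choice benchmark of this kind is typically scored. The question is invented for this article rather than taken from any real benchmark, and `ask_model` is a placeholder for whichever model API you actually use.

```python
# Illustrative sketch of how a multiple-choice benchmark is typically scored.
# The item below is invented for this article; ask_model is a placeholder for
# whichever model API your stack actually uses.

def ask_model(question: str, choices: list[str]) -> int:
    """Placeholder model call: returns the index of the chosen answer."""
    return 0  # a real implementation would prompt the model and parse its choice

benchmark = [
    {
        "question": "Which financial statement reports assets, liabilities and equity?",
        "choices": ["Income statement", "Balance sheet", "Cash flow statement", "Notes to the accounts"],
        "answer": 1,
    },
    # ...a real benchmark contains hundreds or thousands of such items...
]

def accuracy(items: list[dict]) -> float:
    correct = sum(ask_model(i["question"], i["choices"]) == i["answer"] for i in items)
    return correct / len(items)

print(f"Accuracy: {accuracy(benchmark):.0%}")
```

The headline number in a release paper is usually nothing more than this: a percentage of items answered correctly, reported by the same firm that trained the model.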

This all sounds well and good, until we come to the first of two problems: the results are self-reported. Google, Meta, OpenAI and the rest of the big labs all juice their results. This is not a simple accusation; it is demonstrably true at various release points.

Note that these firms can run each and every test however many thousands of times they want and report only the highest value. The test set can leak into the training set, and they control every variable. When engineers and teams at these firms stake their careers on these results and face no real oversight, it is hard to believe the numbers would not be massaged in some way.
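To see how much selection alone can move a score, here is a toy simulation with made-up numbers: a model whose true accuracy never changes, evaluated once honestly and then as the best of 100 reruns.

```python
# Toy simulation (made-up numbers) of the selection effect described above:
# if a benchmark score varies run to run, reporting only the best of N runs
# inflates the headline number even though the underlying model is unchanged.

import random

random.seed(0)

def one_run(true_accuracy: float = 0.70, n_questions: int = 500) -> float:
    """Simulate one noisy benchmark run for a model with a fixed true accuracy."""
    return sum(random.random() < true_accuracy for _ in range(n_questions)) / n_questions

single = one_run()
best_of_100 = max(one_run() for _ in range(100))

print(f"One honest run:     {single:.1%}")
print(f"Best of 100 reruns: {best_of_100:.1%}")  # typically a few points higher
```

A few points of "improvement" obtained this way is often the entire gap between competing models on a leaderboard.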

What About Independent Evals?

If, as a firm looking at these models, you want something other than a gameable machine evaluation to determine the feasibility of such a model for your business, how would you go about it?

A few months ago, many people would have suggested looking at LMArena, a site that ranks models by users’ preferences between responses. Finally, a system that cannot be rigged by the tech giants, surely?

For a time, many users complained that models like Google’s older PaLM 2 were rated unfairly highly, with some positing that this was down to the structured, well-formatted nature of their outputs rather than the quality of the information. Meta took advantage of this phenomenon when it released the first of its recent Llama 4 series, adding Llama 4 Maverick to the site. It quickly became one of the top 5 models there. This was a huge event: here, at last, was an open-weight, relatively easy-to-run model that could compete with GPT-4o and Google Gemini 2.5. Only it was a lie.

The model on the leaderboard was a specially trained version of the publicly available one, fine-tuned on data from LMArena responses themselves. That data captured the style and type of answers users were more likely to rate highly. There were also rumours that the distinctive style of the responses would have allowed Meta personnel to recognise and up-vote the Llama model on the site themselves. LMArena subsequently took it down, and if you visit the LMArena Leaderboard now (click on the leaderboard tab) you will need to scroll down about 150 rating points, to rank 38 at the time of writing, before you find it: a far cry from the top 5.

Further reading on the problems with this kind of evaluation can be found in the excellent paper published by Cohere: The Leaderboard Illusion.

Let us return to the perspective of the interested firm, looking at the AI landscape and now wondering how it can trust anything these companies say.

Build Your Own Evaluation

You can’t. That is the short, simple, truthful answer. The only way for your firm to know if an AI system is worth pursuing is to create your own evaluation that it would have to pass.

By this I mean not just a simple textual or image benchmark, but taking a top-down strategic view and planning accordingly: building out your own requirements as an outcome-first system rather than relying on machine or human evals. Human evals can, however, be incorporated at the testing stage to judge individual components.

For example, suppose you are a professional services firm, say an accountancy, and your company is interested in expanding its capabilities in some area. You might want to evaluate the potential of AI systems for this purpose, either as an assistant or tool for auditors or to automate some of the work outright. Regardless of the exact purpose, the published benchmarks by themselves would provide little useful information on the feasibility of such a system.

Instead, you would need to deconstruct the purpose or goal into steps, actions and outcomes; a rough sketch of how these can be encoded follows the list. For instance:

  • Can existing models be built into a system that can reliably analyze REIT documents?
  • Can the same system be used to analyse corporate balance sheets, or is a separate one needed?
  • Could we then build a system that does the previous step in this workflow and then the next step?
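As a rough illustration of what deconstructing a goal into steps might look like in practice, the sketch below encodes each question above as a separate stage with its own test cases and pass criterion. The stage names, thresholds and the placeholder check are invented for this article, not a prescribed framework.

```python
# A hypothetical way to encode each workflow question as a testable stage.
# Stage names, thresholds and the placeholder check are invented for illustration.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvaluationStage:
    name: str                       # the workflow question being tested
    check: Callable[[dict], bool]   # True if the candidate system handled this case
    required_pass_rate: float       # minimum acceptable success rate for this stage

def reit_analysis_check(case: dict) -> bool:
    """Placeholder: feed a REIT document to the candidate system, compare its output."""
    return case.get("system_output") == case.get("expected")

stages = [
    EvaluationStage("Reliably analyse REIT documents", reit_analysis_check, 0.95),
    EvaluationStage("Analyse corporate balance sheets", reit_analysis_check, 0.95),
    # ...one stage per step of the workflow, each with its own labelled test cases...
]

def evaluate(stage: EvaluationStage, cases: list[dict]) -> bool:
    pass_rate = sum(stage.check(c) for c in cases) / len(cases)
    print(f"{stage.name}: {pass_rate:.0%} (required {stage.required_pass_rate:.0%})")
    return pass_rate >= stage.required_pass_rate
```

Each stage gets its own labelled test cases drawn from the firm’s real documents, and the next step in the workflow is only worth building once the previous one holds up.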

This is the key point I want to stress in this article, on top of the untrustworthiness of the industry’s own numbers: if your goal is narrowly defined, you will need to create a similarly narrow evaluation of your own, covering every possible point of failure in the system.

In the case of our accountancy firm, a very basic version of this might be the following, with a sketch of the corresponding checks after the list:

  • “Does the system reliably parse, analyse and structure input data with a success rate above X?”
  • “Does it do this for under a cost of €Y a job?”
  • “With all costs accounted for, does this ultimately make/save the company money?”
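Continuing the sketch, these three questions reduce to a handful of numbers per evaluation run. Everything below is a placeholder: X, Y, the baseline cost and the field names would all come from your own finance and engineering teams.

```python
# Hypothetical aggregation of the three checks above. The thresholds, baseline cost
# and field names are placeholders; X and Y must come from your own business case.

REQUIRED_SUCCESS_RATE = 0.95       # "X" in the first question
MAX_COST_PER_JOB_EUR = 0.50        # "Y" in the second question
BASELINE_COST_PER_JOB_EUR = 12.00  # what the job currently costs when done manually

def business_case(results: list[dict]) -> bool:
    """Each result: {'success': bool, 'cost_eur': float} for one processed job."""
    success_rate = sum(r["success"] for r in results) / len(results)
    cost_per_job = sum(r["cost_eur"] for r in results) / len(results)
    saving_per_job = BASELINE_COST_PER_JOB_EUR - cost_per_job

    print(f"Success rate: {success_rate:.1%} (required {REQUIRED_SUCCESS_RATE:.0%})")
    print(f"Cost per job: €{cost_per_job:.2f} (limit €{MAX_COST_PER_JOB_EUR:.2f})")
    print(f"Saving per job: €{saving_per_job:.2f}")

    return (success_rate >= REQUIRED_SUCCESS_RATE
            and cost_per_job <= MAX_COST_PER_JOB_EUR
            and saving_per_job > 0)

# Example: business_case([{"success": True, "cost_eur": 0.31}, {"success": False, "cost_eur": 0.45}])
```

The point is not the particular numbers but that the decision to proceed falls out of a check your own firm defined, not a leaderboard position.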

All metrics and requirements will need to be created by people with a clear grasp of the firm's goals and a strong understanding of the underlying technology.

Don't Follow the Hype

Is this not what firms are currently doing, you might ask? No. The biggest professional services firms are in a ‘build first, put out a press release to demonstrate our understanding of this area and thus generate more business’ mode. A quick search will turn up countless examples of Big Four firms or leading banks releasing their ‘legal’ or ‘finance’ AI system when in reality all they have done is fine-tune a Llama model on finance texts, bolt on a vector database and call it State of the Art. I will refrain from naming names in this article, but they are easily discoverable. If you want to rate the capabilities of a firm in the space, check whether they have done exactly what I have just described.

It is very easy to make a basic AI system that is out of date in 6 months and then rebuild it. It is far more involved to create a structure that generates real value for your firm and that can be improved, rather than rebuilt. Every firm is adding an AI function of some sort, but few if any are designed with bottom-line metrics in mind.

With all this seemingly downbeat rhetoric, you may ask: why bother? In the next article, we will go through why it is still worthwhile to explore the landscape and the appropriateness of such AI-based systems for your own firm.

The third article in this series will take an example job function imagined above and demonstrate how it should be translated into an AI-compatible workflow. Our initial focus will continue to be on professional services firms, but the concepts are transferable to other industries.

Tags:

AI Benchmarks, Evaluation, LLM, Strategy, Trust