The IQ of AI. Part 2. Quantitative Testing, Dogfood and Red Teaming

A review of how Google generates its assessments and reports on the safety and capability of its new AI products and tools.

By Gary Kantor

24 April 2024

Image by Google’s Deepmind on pexels.com

Models and claims

In December 2023 Google introduced its newest generative AI tools: Gemini, a family of Large Language Models in three sizes – Ultra, Pro, and Nano (meant for portable devices like phones). The company said its most capable model, Ultra, “advances the state-of-the-art in 30 of 32 benchmarks and in every one of 20 multimodal benchmarks”.

“Gemini Ultra surpasses human-expert performance on the exam benchmark MMLU, scoring 90.0%, which has been a de facto measure of progress for LLMs …. sets new state of the art on most of the image understanding, video understanding, and audio understanding benchmarks without task-specific modifications or tuning”.

Do these claims hold up?

How does one test an AI? Case reports?

Last time, we reviewed a set of “qualitative” examples from Google’s report on the preprint server arXiv. You could compare these to case reports, which in medicine are often considered the lowest level of evidence but are useful nonetheless.

The focus in this article is on the quantitative testing methods used by Google and others. Three months is like three years in AI development time, so the products discussed here will have changed, but methods of evaluation are not changing quite as fast.

I’ll briefly review how the Google team assessed and reported on the safety and capability of its new products and tools.

In trying to replicate Google’s findings I found surprising differences between their published results and my real-world experience, and found that Google Gemini didn’t stack up so well against ChatGPT.

Quantitative assessment

The Google/Gemini team’s quantitative assessments used a variety of benchmarks – standardised tests – to see how well the LLM could handle different types of information and tasks, comparing performance to previous models and to what would be expected from human experts.

Who did the evaluation?

Google’s own team did most of the assessment. The report has 133 references (page 40) and nearly 1000 authors and contributors (page 53).

Text-based evaluation tests

These are tough challenges – for example understanding complex physics problems and performing tasks in multiple languages. There are academic (exam) benchmarks, tests for factuality, for solving complex reasoning puzzles, for understanding and generating multi-language content, and for the ability to grasp lengthy discourse (text has to be held in memory). These capabilities not only enable reading and summarising a dense academic paper for example but also debating its findings, translating it into multiple languages, and extending its context.

MMLU (Massive Multitask Language Understanding) is a benchmark made up of multiple-choice questions in 57 subjects that assesses accuracy and problem-solving ability.

“Gemini Ultra can outperform all existing models, achieving an accuracy of 90.04%. Human expert performance is gauged at 89.8% by the benchmark authors. Gemini Ultra is the first model to exceed this threshold … prior state-of-the-art result [was] 86.4%”.
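A headline figure like 90.04% is simply the share of multiple-choice questions answered correctly. The sketch below shows that calculation on a handful of invented questions; the `Question` class, the `accuracy` function and the toy data are assumptions made for illustration, not MMLU items or Google’s evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Question:
    prompt: str
    choices: list[str]  # four options, labelled A to D in order
    answer: str         # letter of the correct option


def accuracy(questions, predictions):
    """Fraction of questions where the predicted letter matches the answer key."""
    if not questions:
        raise ValueError("need at least one question")
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    correct = sum(
        q.answer == p.strip().upper() for q, p in zip(questions, predictions)
    )
    return correct / len(questions)


# Toy data invented for illustration (not real benchmark items).
questions = [
    Question("2 + 2 = ?", ["3", "4", "5", "6"], "B"),
    Question("Chemical formula of water?", ["CO2", "O2", "H2O", "NaCl"], "C"),
    Question("Largest planet in the solar system?", ["Mars", "Venus", "Earth", "Jupiter"], "D"),
    Question("Number of sides of a triangle?", ["3", "4", "5", "6"], "A"),
]
predictions = ["B", "c", "A", "A"]  # a hypothetical model's answers

score = accuracy(questions, predictions)
print(f"accuracy = {score:.1%}")  # prints: accuracy = 75.0%
```

Because the score is a simple proportion, small differences between models are easier to interpret when the number of questions is also reported.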

Other tests (Figure 1) include HellaSwag for “common-sense reasoning”; DROP for reading comprehension and arithmetic; GSM8K for grade-school (i.e. primary school) math; and MATH, which has math problems across 5 difficulty levels and 7 sub-disciplines. BIG-Bench-Hard is a subset of BIG-Bench, which itself contains 204 language-related tasks, from chess-based prompts to emoji-guessing tasks. WMT23 tests machine translation capabilities while HumanEval and Natural2Code assess coding in the Python language.

Figure 1: Benchmark tests, including MMLU, used by Google to evaluate its new Large Language Models

Prompting matters

Anyone using ChatGPT or Bing or any other LLM knows that the detail and nature of prompts (inputs to the model) significantly influence the model’s output. Prompts can range from single, straightforward questions or commands designed to illustrate specific capabilities (such as generating a coherent response to a simple query or request) to more complex, detailed ones that require the model to integrate multiple pieces of information, reason over data, or generate content that spans various modalities (like text and images).

“We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach that accounts for model uncertainty”.
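One plausible reading of “accounts for model uncertainty” is this: ask the model the same question several times with step-by-step reasoning, and accept the majority answer only when enough of the samples agree. Otherwise, fall back to the model’s single default answer. The sketch below illustrates that idea only; the function name `route_answer`, the 0.6 threshold and the sample answers are assumptions, not details taken from the Gemini report.

```python
from collections import Counter


def route_answer(sampled_answers, greedy_answer, threshold=0.6):
    """Return the majority answer if enough samples agree, else the greedy answer.

    sampled_answers: final answers from several chain-of-thought runs.
    greedy_answer:   the model's single default answer to the same question.
    threshold:       minimum share of samples that must agree (hypothetical value).
    """
    if not sampled_answers:
        return greedy_answer
    top_answer, votes = Counter(sampled_answers).most_common(1)[0]
    agreement = votes / len(sampled_answers)
    return top_answer if agreement >= threshold else greedy_answer


# Four of five samples agree on "B", so the majority is trusted.
confident = route_answer(["B", "B", "B", "C", "B"], greedy_answer="C")

# No answer reaches 60% agreement, so the greedy answer is used instead.
unsure = route_answer(["A", "B", "C", "D", "B"], greedy_answer="A")

print(confident, unsure)  # prints: B A
```

A method like this uses many model calls per question, which is one reason scores obtained this way are hard to compare with single-prompt results.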

In Figure 1 (above) from the Gemini report, note the variety of inputs – the variable number of “shots” (worked examples included in the prompt), and Chain of Thought methods (asking the LLM to go step by step and explain its reasoning). These can have a significant impact on results. Unless prompting is standardised it may be difficult to compare the performance of LLMs fairly.
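To make the terminology concrete, here is a minimal sketch of how a zero-shot prompt differs from a few-shot, chain-of-thought prompt, taking “shots” to mean worked examples placed ahead of the real question. The `build_prompt` function, the “Q:/A:” layout and the arithmetic examples are assumptions for illustration; actual benchmark harnesses use their own templates.

```python
def build_prompt(question, examples=(), chain_of_thought=False):
    """Assemble a k-shot prompt, where k is the number of worked examples."""
    parts = []
    for example_question, example_answer in examples:
        parts.append(f"Q: {example_question}\nA: {example_answer}")
    # A common chain-of-thought cue; the exact wording here is an assumption.
    cue = "Let's think step by step." if chain_of_thought else ""
    parts.append(f"Q: {question}\nA: {cue}".rstrip())
    return "\n\n".join(parts)


# Zero-shot: the question alone, with no worked examples.
zero_shot = build_prompt("What is 17 * 3?")

# Two-shot with chain of thought: two worked examples plus a reasoning cue.
two_shot = build_prompt(
    "What is 17 * 3?",
    examples=[("What is 2 * 4?", "8"), ("What is 5 * 6?", "30")],
    chain_of_thought=True,
)

print(zero_shot)
print("---")
print(two_shot)
```

The same question can therefore reach the model in quite different forms, which is why a reported score means little without the prompt format that produced it.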

Multimodal tests

MMMU is a benchmark test for evaluating multimodal models. This is about putting different types (modes) of information – images, videos, and audio – into the mix; not just understanding a picture or a clip but generating new images, analysing video content, and interpreting sounds to mirror human perception.

These capabilities translate into an LLM being able to watch a silent film, generate a suitable soundtrack, and describe the plot in detail, or take a scientific concept and illustrate it with detailed diagrams. Tasks in MMMU demand college (university)-level subject knowledge and deliberate reasoning. Gemini Ultra scored higher (62.4) than GPT-4 (56.8), both with “zero-shot” prompts, i.e. prompts with no worked examples.

Post-training model evaluation

Post-training evaluation is about how well the models adapt and improve after their initial “education” as they are made into apps or connected to other systems. A well-trained model can for example become a multi-talented digital assistant able to switch between roles – from a translator to a programmer, to a creative designer. It’s like giving the model a post-graduate course, then assessing its ability to follow instructions more precisely, use tools (like calculators or search engines) effectively, understand and generate content in multiple languages, interpret and create multimodal content, or write code.

Results of the Google evaluation showed improvements across languages compared to older models, gains in coding and reasoning, better performance on academic benchmarks, and enhanced multimodal capabilities.

Safety and impact assessments

This part of the testing framework looks at how models interact with the world, focusing on ethical considerations, prevention of harmful outputs, and the overall impact of deploying advanced AI to ensure it can be a safe and positive addition to a team or product.

The evaluators examined safety policies, data curation, and the blocking of harmful requests. They also examined the models’ decision-making processes, displays of bias and prejudice, their training data, and how they’re fine-tuned to behave “in the wild”. This tuning includes reinforcement learning from human feedback (RLHF), which aligns the model with preferred values and behaviours. In testing scenarios the model is asked to navigate harmful or morally complex questions, or respond to attempts at manipulation (e.g., “how do I build a bomb?”; “how do I prepare poison?”).

Dogfooding

Google developed an extensive dogfooding program (from the expression “eating your own dog food”). Companies use this practice to test their own products internally in real-life situations, to build confidence in their quality and effectiveness and to improve them before release.

Red teaming

Red teaming is like a security drill for AI systems where experts play the role of hackers. They test the model’s ability to handle different kinds of cyber threats, focusing on keeping the system safe, secure, and private. This process helps find and fix security gaps. There is also a focus on how the models interact with people from different backgrounds, ensuring they’re fair and don’t unintentionally promote stereotypes or hate speech.

In mid-February, amid huge controversy, Google halted image generation by Gemini and issued an apology. The images it had been generating were inaccurate and offensive, for example depicting America’s Founding Fathers as black, the Pope as a woman and a Nazi-era German soldier with dark skin.

Can the results be trusted?

Google staffers pioneered major advances in AI, including the 2017 invention of the transformer model (transformer is the T in GPT), on which the current generation of generative AI is based, so they know what they’re talking about. But Google is under pressure to catch up to market leader OpenAI, makers of ChatGPT. Like any company it needs to promote itself and its products. Along with objective methods of judgement we need independent assessments. In healthcare, we require peer review and ask authors to declare their potential conflicts of interest. The Gemini report is not independent, not peer reviewed.

Trust but verify?

External testing

A small set of independent external groups helped identify areas for improvement in model safety, considering major risks like: autonomous AI replication; chemical, biological, radiological and nuclear (CBRN) risks; cyber-capabilities and cyber security; and societal risks, including: “representational and distributional harms, neutrality and factuality, robustness and information hazards”.

Limitations. Are benchmark tests realistic representations of reality?

Measurement frameworks should consider how the AI is applied, i.e., “the settings and workflows [the AI] would be embedded in, and the people that would be affected.” Especially in healthcare, there’s an ongoing need for R&D on generative AI’s notorious tendency to “hallucinate”, so that model outputs become more reliable and verifiable. LLMs still struggle with tasks requiring high-level reasoning abilities like causal understanding, logical deduction, and counterfactual reasoning.

Other aspects worth noting and evaluating but not the focus of the Google report include:

Speed and efficiency are important. New hardware (e.g. from Groq) can radically increase speed while using less energy – vital if AI is to become the driver of economic growth and human progress, without catastrophic damage to climate and environment.
Writing style. Google Gemini is said to have more “personality” in the way it writes.
Access to real-time data (i.e. the internet), critical for factuality. When I asked it, 48 hours before Super Bowl LVIII (58), which teams were playing, Gemini said this was not yet known. 24 hours after the final whistle was blown (do they blow whistles in American football?) it was able to provide the result. But on February 12, Gemini declared that the 2024 Six Nations rugby tournament had not yet begun (it started February 3).
Easy, safe, and private access to your own data is a necessity but was not available at launch. You can’t upload a file and though it is said to be connected to Google Workspace this is not easy to find. No doubt this will change.

What about cost?

At $20 per month, full versions of ChatGPT and Gemini are probably unaffordable to most everyday users in South Africa. Nevertheless, companies like Microsoft, Google and OpenAI may be losing cash on each customer because running AI systems in massive data centres that slurp electricity isn’t free. This can’t last forever. The race is therefore on for model efficiency – to get more for less.

Conclusion

The Gemini report is evidence of a commitment by Google to thorough evaluation that will help ensure efficacy, safety, and broad applicability of LLMs. But self-reporting has its limits, and there’s a need for more challenging, robust and realistic evaluations to measure their true capability in the real world.

The report advocates for an approach to AI development that benefits humanity while minimising risks; the AI equivalent of “do no harm”. As these models become part of our daily lives, they should do so in a way that is ethical, safe, and beneficial for all. A massive tech arms race is being driven by Google, Microsoft, Meta, Nvidia (now the third or fourth biggest company in the world), and a host of other new and established players[1], powered by billions of dollars of investment. Products are changing day by day, making evaluation a moving target, and much more challenging.

On February 15, a couple of days after this evaluation, Google released an update, Gemini 1.5, which is said to be more capable, for example with a much larger context window (memory). Read Google’s press releases but be sceptical.

In the third of three papers, we’ll discuss assessment in healthcare.

[1] OpenAI has GPT-4 and ChatGPT. Perplexity, Anthropic (Claude), Meta (Llama) and xAI (Grok) also deserve consideration, as well as open-source LLMs such as Mixtral.

Author

Gary Kantor

Dr Gareth (Gary) Kantor trained and worked in Canada and the USA before returning to SA in 2005. He is an anaesthesiologist with diverse areas of interest including quality improvement and patient safety, health data and electronic health records, patient-reported outcomes, health technology assessment, preoperative care, genomics and precision medicine. Gary is an Improvement Advisor and Faculty, Institute for Healthcare Improvement, Boston, MA, USA, honorary lecturer at UCT, and Assistant Professor, Case Western Reserve University, Cleveland, OH, USA.
