Abstract
Enterprises are actively seeking ways to integrate LLMs across their operations, with one key area being the analysis of operational data to uncover insights that impact core business KPIs. Developing AI-powered agents to perform this task demands a specialized benchmarking framework: in such a framework, each problem includes multiple datasets along with a description of the relevant KPI, and the solution to a given problem is expected to provide a set of insights supported by the data. Furthermore, to facilitate the development and evaluation of specialized insight-discovery agents, the framework is required to dynamically generate new benchmarking problems on demand.
In this article, we present a benchmarking framework for data-centric insight discovery, specifically designed to address these emerging needs. Compared with other ML/GenAI benchmarks, it offers:
- A focus on generating explainable insights rather than predictive models
- Provision of realistic, ground-truth insights (in natural language) for each problem
- The ability to dynamically generate new data-centric benchmark problems
To drive meaningful and continuous improvements in operational KPIs, Generative AI applications must be able to discover insights hidden within operational data. A critical component in developing agents capable of insight discovery is the establishment of robust evaluation and benchmarking frameworks for the task of discovering patterns impacting key metrics / outcomes.
When we refer to insights, we assume there is a well-defined KPI being measured and tracked; by insights we mean factors, expressed in natural language, that affect this KPI. For example, if the KPI is customer churn rate, an insight might be that customers who contacted support more than three times in the past month churn at twice the baseline rate.
It’s important to note that while insight discovery is clearly an important capability, embedded both inside domain-specific/use-case-specific solutions and within enterprise BI tools, there is no benchmark for assessing such capabilities (unlike the variety of benchmarks available for both predictive models and large language models).
Today, we are introducing the first-ever benchmark designed to evaluate insight discovery capabilities in AI agents. This includes both the benchmark itself and an accompanying evaluation framework with performance metrics. The benchmark is synthetically generated, enabling scalable expansion as more resources become available. We take a practical approach, focused on solving real-world problems. This is not a benchmark for predictive modeling: insights are statements in natural language and need to be evaluated as such. We do note that community assets, such as the predictive modeling benchmarks and problems shared publicly on Kaggle, can help anyone building an insight discovery benchmark. We will discuss what constitutes a good benchmark for insight discovery, how to evaluate agents for this task, and how to leverage a dynamically generated benchmark to auto-improve such agents.
Early results indicate that this framework not only provides a viable way to assess agents’ performance but can also be leveraged to autonomously improve their capabilities. In the sections below we drill down into the metric definitions and the results for the baseline agents.
This article refers to version zero (beta) of the framework, which we are finalizing these days. Next week, we will share the problem specifications for this version. Looking forward, we aim to publish version 1 of the benchmarking framework by the end of the year. At that time we will share the entire benchmark: the problem specifications, the problems’ datasets, and the evaluation framework. As part of this effort, we are organizing a workshop on LLM-generated problem specifications for insight discovery. Early collaborators will be invited to contribute to this exciting initiative.
Last week, as we were getting this benchmarking framework ready to share, OpenAI announced MLE-Bench, a benchmark for evaluating machine-learning agents on machine-learning-engineering tasks. This is yet another indication that tasks of this type (machine learning engineering, insight discovery, etc.) are of high interest to the community. You will find a paragraph on the synergies between these two benchmarks in the next section.
Below you will find the following sections:
In recent years, the development of large language models (LLMs) has been guided and assessed using a variety of benchmarks. Leading AI labs like OpenAI, Anthropic, Meta FAIR, and Google Gemini frequently reference several key benchmarks when announcing their models. These benchmarks are critical for measuring the capabilities and performance of their LLMs. However, they have notable gaps, especially concerning the complex requirements of insight discovery tasks.
MMLU (Massive Multitask Language Understanding): multiple-choice questions spanning dozens of subjects, from STEM to the humanities, used to measure broad knowledge and reasoning.
BBH (Big-Bench Hard): a subset of the most challenging BIG-Bench tasks, requiring multi-step reasoning.
HumanEval: hand-written Python programming problems used to evaluate code generation from natural-language descriptions.
DROP (Discrete Reasoning Over Paragraphs): a reading-comprehension benchmark requiring discrete reasoning, such as counting, sorting, and arithmetic, over text passages.
MLE-Bench (evaluating agents on ML engineering tasks):
A curated collection of 75 real-world problems from Kaggle, covering a wide variety of data formats (images, audio, text and tabular data). In addition to the benchmark, the authors discuss approaches for constructing ML agents and evaluate the performance of several candidates on their benchmark.
The paper “Self-Taught Evaluators” by Tianlu Wang et al. presents a method for training LLM evaluators without human-annotated data. Instead of relying on costly human judgments, the authors use an iterative self-training approach with synthetic data: starting from a base LLM, the model generates preference pairs and trains itself as a judge, iteratively improving performance. This approach matches or surpasses models trained with human data, achieving significant improvements on benchmarks like RewardBench (increasing scores from 75.4 to 88.3). It offers a scalable alternative for building LLM evaluators, which is particularly relevant when judging insights extracted from structured data.
While these benchmarks have undoubtedly propelled significant advancements in the capabilities of LLMs, they remain limited in scope and do not adequately capture the models’ ability to perform more specialized tasks, such as insight discovery in structured data. Traditional benchmarks typically focus on natural language understanding, generation, and reasoning over unstructured data sources like text and dialogue, but they fail to account for the nuanced challenges involved in extracting meaningful insights from structured datasets such as databases, spreadsheets, and tabular information. As a result, there is a growing need for new benchmarks that can evaluate how effectively LLMs can navigate, interpret, and synthesize information from structured sources, ultimately pushing the boundaries of what these models can achieve in data-driven environments.
To address these gaps, there is a pressing need for a dynamic benchmarking framework tailored to insight discovery. Such a framework should provide both the means to automatically generate insight discovery problems and the means to evaluate a given agent that claims to perform insight discovery tasks (or, in other words, to solve insight discovery problems).
In particular, a problem in the benchmark should have 4 components:
The benchmarking framework should therefore:
We are interested in exploring synergies with related benchmarking frameworks, and in particular in exploring how to leverage benchmarks based on real-world problems (such as MLE-Bench) to automatically generate a scalable benchmark with real-world-like properties.
This section will provide:
We start with the list of requirements for the benchmark, viewed as a collection of problems/datasets together with the evaluation framework.
To be explicit, an item in the benchmark would be a set of:
Where the problem specification consists of:
We expect the benchmark generation utility to be able to take as input a problem description, per the above, and:
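For concreteness, here is a minimal sketch of what such a problem description might look like as a Python structure. All field names below are our own assumptions, loosely modeled on the patients/lab-tests example shown later in this post; they are not the framework's actual schema.

```python
# A hypothetical problem-specification structure for the generation utility.
# Field names are assumptions, loosely based on the example later in this post.
from dataclasses import dataclass, field


@dataclass
class TableSpec:
    name: str                # e.g. "Patients", "Lab tests", "Lab tests info"
    columns: dict[str, str]  # column name -> human-readable description


@dataclass
class ProblemSpec:
    domain: str                          # short description of the business domain
    kpi: str                             # the KPI / target the insights should explain
    tables: list[TableSpec]              # schemas of the datasets to be generated
    ground_truth_insights: list[str] = field(default_factory=list)  # natural-language factors affecting the KPI
    train_test_split: float = 0.8        # fraction of the main table used for training


# Hypothetical instantiation (the KPI and columns here are illustrative only):
spec = ProblemSpec(
    domain="hospital operations",
    kpi="patient readmission",
    tables=[
        TableSpec("Patients", {"patient_id": "unique id", "age": "age in years"}),
        TableSpec("Lab tests", {"patient_id": "unique id", "test_code": "test identifier", "value": "result"}),
    ],
    ground_truth_insights=[
        "Patients with an abnormal result on a key lab test shortly before discharge are more likely to be readmitted.",
    ],
)
```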
A good benchmark should be rich & diverse, robust and realistic. What we mean by this is:
- Rich & Diverse:
- Robust:
- Realistic:
In order to assess agents attempting to solve the insight discovery task, we need an evaluation framework beyond the benchmark itself (the problems and associated data). This framework provides metrics that measure how well the insights discovered by an agent cover the ground-truth solution, quantify the predictive power of those insights, and verify proper use of the data (e.g., no temporal leaks). The evaluation of solutions is therefore based on these three metrics:
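To make the predictive-power metric mentioned above concrete, here is a minimal sketch (our illustration, not the framework's implementation): fit a simple classifier on the insight-derived feature columns of the train split and score it on the test split. The column names and the use of scikit-learn are assumptions.

```python
# A sketch of the "predictive power" idea: score how well the discovered insight
# columns (assumed to already be materialized as features) predict the KPI on the
# held-out test split. Illustrative only; column names are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def predictive_power(train: pd.DataFrame, test: pd.DataFrame,
                     insight_cols: list[str], target_col: str) -> float:
    """Fit a simple model on the insight features only and return the test AUC."""
    model = LogisticRegression(max_iter=1000)
    model.fit(train[insight_cols], train[target_col])
    scores = model.predict_proba(test[insight_cols])[:, 1]
    return roc_auc_score(test[target_col], scores)
```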
Here is an example of a problem specification and the data generated for it.
The task is therefore to take the input specification as above and generate the three data tables (Patients, Lab tests, Lab tests info), where the Patients table is also split into train and test. Part of the output is the data associated with the ground-truth insights; that is, we enrich the train and test tables with columns representing each of the insights/factors listed in #6 of the input.
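As an illustration of this enrichment step, the following sketch (our own, with assumed names) splits the main table and adds one column per ground-truth insight, given a mapping from insight names to feature-computing functions.

```python
# A sketch of the enrichment step described above: each ground-truth insight is
# materialized as an extra column on the train/test splits of the main table.
# The mapping from an insight to a computable column is assumed to be supplied
# alongside the insight; this is not the framework's actual code.
from typing import Callable
import pandas as pd


def split_and_enrich(patients: pd.DataFrame,
                     insight_features: dict[str, Callable[[pd.DataFrame], pd.Series]],
                     train_frac: float = 0.8,
                     seed: int = 0) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split the main table into train/test and add one column per ground-truth insight."""
    enriched = patients.copy()
    for name, feature_fn in insight_features.items():
        enriched[name] = feature_fn(enriched)  # e.g. "had_abnormal_test" -> boolean series
    train = enriched.sample(frac=train_frac, random_state=seed)
    test = enriched.drop(train.index)
    return train, test
```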
Architecture diagram of the benchmark generation system
Scope of the first version of the Benchmark
Beyond that:
Beta takeaways and plans for V1 completion:
Benchmark creation / data generation
The creation of the benchmark is composed of 3 steps:
It should be noted that we are not always able to generate data for each of the specified insights. Similarly, we sometimes fail to create a strong target correlation for each of the specified insights, so some of them end up only weakly correlated with the target.
Our (currently internal) project ProblemMaker automates this process. It has two components:
The result is a set of data problems for the insight discovery task.
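A minimal sketch of the kind of validation the preceding note implies: after the data is generated, check how strongly each materialized insight column correlates with the target and flag the weak ones. The threshold and column names are our assumptions, not ProblemMaker's actual logic.

```python
# A sketch of a post-generation sanity check: measure the correlation of each
# materialized insight column with the target and flag weakly correlated ones.
# Threshold and column names are assumptions.
import pandas as pd


def check_insight_correlations(df: pd.DataFrame, insight_cols: list[str],
                               target_col: str, min_abs_corr: float = 0.2) -> dict[str, float]:
    """Return |correlation| of each insight column with the target; callers can
    drop or regenerate insights that fall below `min_abs_corr`."""
    correlations = {}
    for col in insight_cols:
        corr = df[col].astype(float).corr(df[target_col].astype(float))
        correlations[col] = abs(corr) if pd.notna(corr) else 0.0
    weak = [c for c, v in correlations.items() if v < min_abs_corr]
    if weak:
        print(f"Weakly correlated insights (|corr| < {min_abs_corr}): {weak}")
    return correlations
```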
In order to demonstrate that the end-to-end process works, we qualified ~100 problems for V1 of the benchmark and ran the candidate agent prototypes on these problems.
We are planning to publish the actual benchmark data, as well as open-source the generation utility, towards the end of the year (see What to Expect at the end of this post).
Evaluation metrics - given a set of ground-truth insights and a set of agent-discovered insights (ADIs), we calculate:
Coverage - how well the set of insights discovered by a candidate agent covers the set of ground-truth insights:
Naturally, when comparing the evaluation scores of candidate agents, we enforce the same limit on the number of ADIs each candidate agent may propose.
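As a simplified illustration of a coverage computation, the sketch below matches each ground-truth insight to its most similar ADI. A crude token-overlap similarity stands in for the semantic or LLM-based matching a real evaluator would use, and the threshold is an arbitrary assumption.

```python
# A simplified coverage sketch: the fraction of ground-truth insights matched by
# at least one sufficiently similar agent-discovered insight (ADI). Token overlap
# is a stand-in for semantic/LLM-based matching; the threshold is arbitrary.
def _similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)  # Jaccard overlap of tokens


def coverage(ground_truth: list[str], adis: list[str], threshold: float = 0.4) -> float:
    """Fraction of ground-truth insights covered by at least one similar ADI."""
    matched = 0
    for gt in ground_truth:
        if adis and max(_similarity(gt, adi) for adi in adis) >= threshold:
            matched += 1
    return matched / len(ground_truth) if ground_truth else 0.0
```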
Finally, we created several LLM-centric agents tasked with discovering insights; the purpose of this step is to understand the inherent limitations LLM-powered agents may have in discovering insights.
In particular, we aim to demonstrate in the next project phases that the dynamic benchmarking capability can be leveraged to iteratively evolve robust insight discovery agents.
We experimented with three gpt-4o variants: gpt-4o-2024-05-13, gpt-4o-mini, and gpt-4o-2024-08-06.
This basic version had the following shortcomings:
2. We addressed these issues in a second agent version by adding a detailed “insight discovery” prompt, breaking the task into steps, and allowing the agent to send repeated queries to the LLM to address each of the steps. This agent still used gpt-4o-2024-05-13.
3. In the third version, we transitioned to gpt-4o-2024-08-06 and were able to simplify the prompt so the task is performed in a single step. We also realized that the code generated by the LLM did not always compile and sometimes produced runtime errors, so we iterated on the code generation with specific hints for fixing Python errors. The updated revision is:
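The retry-on-error pattern just described can be sketched as follows. This is our own illustration rather than the agent's actual code, and `generate_code` is a placeholder for whatever LLM call the agent makes.

```python
# An illustrative retry loop: execute LLM-generated analysis code and, on failure,
# feed the Python error back to the model as a hint for the next attempt.
# `generate_code` is a placeholder, not a real API.
import traceback
from typing import Callable


def run_with_error_hints(generate_code: Callable[[str], str], task_prompt: str,
                         max_attempts: int = 3) -> dict:
    """Return the namespace produced by the first successfully executed attempt."""
    prompt = task_prompt
    for attempt in range(1, max_attempts + 1):
        code = generate_code(prompt)
        namespace: dict = {}
        try:
            exec(code, namespace)  # run the generated analysis code
            return namespace
        except Exception:
            error = traceback.format_exc()
            # Append the concrete Python error as a hint for the next generation round.
            prompt = f"{task_prompt}\n\nThe previous code failed with:\n{error}\nPlease fix it."
    raise RuntimeError(f"Generated code failed after {max_attempts} attempts")
```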
Our key findings from our initial evaluations on the beta version include:
Reflecting on the challenges we met while developing this benchmarking framework, we have learned that the task of automatically synthesizing data problems for insight discovery is indeed not trivial.
Evaluation of insight-discovery agents against a benchmark is a simpler task. We should note one important challenge:
Finally, as a byproduct of developing baseline agents for the task, we also learned a few lessons:
By the end of the year, we are planning to release the benchmark V1 datasets publicly, and later in 2025 we will open-source the entire framework, both for generating the benchmark (problem specifications and datasets) and the evaluation of candidate agents on the benchmark.
We invite the community to collaborate on the first component of our initiative: generating diverse and meaningful problem specifications for benchmarking insight discovery capabilities. Contribute your own problem specifications, suggest enhancements, or help refine templates to ensure the benchmarks reflect real-world complexities. Your participation will help build a robust, community-driven repository that pushes the boundaries of AI capabilities in insight and pattern discovery. Contact us here.