Benchmarking LLMs' Insight Discovery Capabilities through Synthetic Problem Generation

Introduction

To drive meaningful and continuous improvements in operational KPIs, Generative AI applications must be able to discover insights hidden within operational data. A critical component in developing agents capable of insight discovery is the establishment of robust evaluation and benchmarking frameworks for the task of discovering patterns impacting key metrics / outcomes.

When we refer to insights, we assume there is a well-defined KPI that is being measured and tracked; by insights we mean factors, expressed in natural language, that affect this KPI.

It’s important to note that while insight discovery is clearly an important capability, embedded both inside domain-specific / use-case-specific solutions and within enterprise BI tools, there is no benchmark for assessing this capability (in contrast to the variety of benchmarks available for both predictive models and large language models).

Today, we are introducing the first-ever benchmark designed to evaluate insight discovery capabilities in AI agents, together with an accompanying evaluation framework and performance metrics. The benchmark is synthetically generated, enabling scalable expansion as more resources become available. We take a practical approach, focused on solving real-world problems. This is not a benchmark for predictive modeling; insights are statements in natural language and need to be evaluated as such. That said, community assets such as the predictive-modeling benchmarks and problems shared publicly on Kaggle can be a useful starting point for anyone building an insight discovery benchmark. We will discuss what constitutes a good benchmark for insight discovery, how to evaluate agents for this task, and how to leverage a dynamically generated benchmark to auto-improve such agents.

Early results indicate that this framework not only provides a viable way to assess agents’ performance but can also be leveraged to autonomously improve their capabilities. In the sections below we drill down into the metric definitions and the results for the baseline agents.

This article refers to version zero (the beta) of the framework, which we are currently finalizing. Next week, we will share the problem specifications for this version. Looking forward, we aim to publish version 1 of the benchmarking framework by the end of the year. At that time we will share the entire benchmark - the problem specifications, the problems’ datasets, and the evaluation framework. As part of this effort, we are organizing a workshop on LLM-generated problem specifications for insight discovery. Early collaborators will be invited to contribute to this exciting initiative.

Below you will find the following sections:

  • Motivation
  • Benchmarking framework - technical outline
  • Baseline LLM-centric insight discovery agents
  • Baseline LLM agents - Evaluation highlights
  • Learnings & What to expect next

Motivation

In recent years, the development of large language models (LLMs) has been guided and assessed using a variety of benchmarks. Leading AI labs like OpenAI, Anthropic, Meta FAIR, and Google Gemini frequently reference several key benchmarks when announcing their models. These benchmarks are critical for measuring the capabilities and performance of their LLMs. However, they have notable gaps, especially concerning the complex requirements of insight discovery tasks.

Common Benchmarks Used By Leading AI Labs

MMLU (Massive Multitask Language Understanding):

  • Purpose: Evaluates models on their ability to handle a wide range of subjects and tasks.
  • Focus and limitations wrt insight discovery: Primarily focuses on knowledge acquisition and understanding across predefined tasks, lacking the complexity of real-world insight discovery problems involving multiple datasets and intricate relationships.

BBH (Big-Bench Hard):

  • Purpose: Tests models on challenging multi-step reasoning tasks.
  • Focus: Tests reasoning capabilities.

HumanEval:

  • Purpose: Assesses coding abilities by having models complete Python programming tasks.
  • Focus and limitations wrt insight discovery: Focuses on code generation and understanding, which, although useful, does not cover the breadth of pattern discovery, such as data handling and statistical analysis.

DROP (Discrete Reasoning Over Paragraphs):

  • Purpose: Measures the ability to perform discrete reasoning and comprehension.
  • Limitations: Emphasizes text comprehension and reasoning.

Image 1: A summary table of benchmark results by Meta FAIR, comparing Llama 3.2 to comparable frontier models.

The paper “Self-Taught Evaluators” by Tianlu Wang et al. presents a method for training LLM evaluators without human-annotated data. Instead of relying on costly human judgments, the authors use an iterative self-training approach with synthetic data. Starting with a base LLM, the model generates preference pairs and trains itself as a judge, iteratively improving performance. This approach matches or surpasses models trained with human data, achieving significant improvements on benchmarks like RewardBench, increasing scores from 75.4 to 88.3. It offers a scalable alternative for evaluating LLMs, particularly for extracting insights from structured data.

Insight Discovery Assessment - Gaps in Current Benchmarks

While these benchmarks have undoubtedly propelled significant advancements in the capabilities of LLMs, they remain limited in scope and do not adequately capture the models’ ability to perform more specialized tasks, such as insight discovery in structured data. Traditional benchmarks typically focus on natural language understanding, generation, and reasoning over unstructured data sources like text and dialogue, but they fail to account for the nuanced challenges involved in extracting meaningful insights from structured datasets such as databases, spreadsheets, and tabular information. As a result, there is a growing need for new benchmarks that can evaluate how effectively LLMs can navigate, interpret, and synthesize information from structured sources, ultimately pushing the boundaries of what these models can achieve in data-driven environments.

The Need for Dynamic Insight Discovery Benchmarks

To address these gaps, there is a pressing need for a dynamic benchmarking framework tailored to insight discovery. This framework should provide both the means to automatically generate insight discovery problems and the means to evaluate a given agent that claims to perform insight discovery tasks (in other words, to solve such problems).

In particular, a problem in the benchmark should have 4 components:

  • Natural language description of the problem, including the business/scientific context, the underlying objective (e.g. discover factors affecting a KPI), and related data schema
  • Underlying problem data - one or more data tables, from which the insights should be derived. The primary table should include a target column representing the objective/KPI.
  • Ground-truth insights - the insights that are hidden in the data. These insights are the “official answer” to the problem
  • Underlying ground-truth factors data - the derived data reflecting the calculation of the insights (or more precisely, the quantitative insightful factors) from the problem data.
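
To make this concrete, here is a minimal sketch of how such a problem bundle might be represented in code; the class and field names are our own and are not part of the released benchmark format.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class BenchmarkProblem:
    """One benchmark item, mirroring the four components listed above.

    Illustrative only - the actual packaging of the benchmark may differ.
    """
    # 1. Natural-language description: business/scientific context, objective, schema
    description: str
    # 2. Underlying problem data: table name -> DataFrame; the primary table
    #    contains the target column representing the objective/KPI
    tables: dict[str, pd.DataFrame]
    primary_table: str
    target_column: str
    # 3. Ground-truth insights, expressed in natural language (the "official answer")
    ground_truth_insights: list[str]
    # 4. Ground-truth factor data: one numeric column per insight,
    #    aligned with the rows of the primary table
    ground_truth_factors: pd.DataFrame | None = None
```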

The benchmarking framework should therefore:

  • Autonomously Generate Problems and Underlying Data: Use generative approaches to create synthetic insight-discovery problems with underlying data and attached solutions (i.e. the 4 components above).
    • Integrated ground-truth: Provide ground-truth insights for every problem in the benchmark
  • Include Multi-Table Data: Reflect the complexity of real-world problems involving multiple interconnected datasets.
  • Support Diverse Problem Types: Cover a wide range of pattern discovery tasks including classification, regression, and temporal analysis.

Technical Outline

This section will provide:

  • Requirements and principles for problems, insights, and evaluation metrics
  • An example of a benchmark problem
  • Scope of generated problems
  • High level architecture for synthetic insight-discovery problem generation
  • Description of the data generation process

Requirements and principles

We start with the list of requirements for the benchmark - viewed as a collection of problems/datasets - and for the evaluation framework.

To be explicit, an item in the benchmark would be a set of:

  • Problem specification
  • Datasets for which the task of ‘insight discovery in data’ should be performed

Where the problem specification consists of:

  1. Problem name
  2. Problem domain
  3. Problem description
  4. Required tables (and indication of the primary table)
  5. Target name
  6. Insights to be discovered
  7. Comments - data
  8. Comments - target
  9. Comments - schema
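
For illustration, such a specification might be serialized as a simple structure like the one below; the field names and the example content are hypothetical and are not taken from the released benchmark.

```python
# A hypothetical, illustrative problem specification (not from the actual benchmark).
problem_spec = {
    "problem_name": "Subscription churn risk",
    "problem_domain": "Telecom",
    "problem_description": "How the risk of a customer churning within 3 months "
                           "correlates with usage patterns and support history",
    "required_tables": {"customers": "primary", "usage_log": "secondary",
                        "support_tickets": "secondary"},
    "target_name": "has_churned",
    "insights_to_discover": [
        "Number of support tickets opened in the last 90 days",
        "Slope of monthly data usage over the last 6 months",
    ],
    "comments_data": "Usage and ticket history should cover 2021-2023",
    "comments_target": "The target indicates churn during the first quarter of 2024",
    "comments_schema": "usage_log should include date, customer_id, gb_used",
}
```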

We expect the benchmark generation utility to take a problem specification as above as input and to:

  • Generate synthetic data as specified. In particular:
    • the data should be structured into the tables listed in #4,
    • the columns in these tables should provide the information required to discover the insights in #6,
    • and the target column should represent the desired outcome associated with the KPI
  • Split the data into train and test sets

A good benchmark should be rich & diverse, robust and realistic. What we mean by this is:

Rich & Diverse:

  • Variety of problem domains and phenomena to be discovered in the data
  • Variety of problem types, where the target variable is numeric, binary or multi-class.
  • Variety in problem complexity - e.g. number of patterns to be discovered, number of tables in the problem, complexity and depth of the pattern expressions, etc.

Robust:

  • Existence of a clear “ground truth” - benchmark problems need to come with an explicit solution: the set of insights to be found and the importance of each insight. Note that this is not the case with most “real-world” problems, where there is no agreed-upon, complete reference list of insights to be found.
  • Ability to extend the benchmark with more problems similar to a given set of reference problems - we refer to this as “dynamic benchmark generation”: given a set of reference problems, the framework should be able to dynamically generate new benchmark problems that are similar to the reference set. Why is this important?
    • As the ecosystem is changing at an extremely high rate, in both needs and capabilities (which then give rise to more needs and more capabilities), a good benchmark should be extensible with additional problems.
    • Moreover, the need for new benchmark problems is likely to be directed at certain types of problems that may be under-represented in the current version of the benchmark.
    • This property of the framework also supports auto-improvement of candidate agents for insight discovery, as we explain below.
  • Problems should be provided with training and test data

Realistic:

  • Mirror real problems with business relevance and insights that arise in actual use cases
  • Reflect cross-data relationships and patterns that occur naturally in real-world data

Evaluation Framework

In order to assess agents attempting to solve the insight discovery task, we need an evaluation framework beyond the benchmark itself (the problems and associated data). It provides the metrics for measuring how well the insights discovered by such an agent cover the ground-truth solution, how much predictive power these insights carry, and whether the agent makes proper use of the data (e.g. no temporal leaks). The evaluation of solutions is therefore based on these three metrics:

  • Coverage - how well the set of insights discovered by a candidate agent covers the set of ground-truth insights
  • Statistical power - how well a predictive model (a standard classifier) trained with the proposed insights as features predicts the target on the test set
  • Proper use of data - for example, when temporal data is provided, does the agent respect the temporal constraints and avoid using data that would not be available at time of inference

Example

Here is an example of a problem specification and the data generated for it.

  • Problem name: Diabetes risk
  • Problem domain: Healthcare
  • Problem description: How the risk of developing type 2 diabetes within 12 months correlates with demographics and medical history
  • Required tables (and indication of the primary table):
    • Patients (Primary)
    • Lab tests
    • Lab tests info
  • Target name: has_diabetes
  • Insights/factors to be discovered:
    • Number of occurrences (in the last 12 months) of values of Triglycerides over the normal range.
    • Number of abnormal blood test values in the last 24 months.
    • Annual slope of HbA1c in the last 5 years.
  • Comments - data: The data should include medical history per patient for 2018-2022
  • Comments - target: The target would indicate whether the patient developed Diabetes (type 2) in 2023
  • Comments - schema: In lab tests include date, test_id, value, comments

The task is therefore to take the input specification as above and generate the three data tables (Patients, Lab tests, Lab tests info), where the Patients table is also split into train and test. Part of the output is also the data associated with the ground-truth insights; that is, we enrich the train and test tables with columns representing each of the insights/factors listed in #6 of the input.
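
To make the enrichment step concrete, here is a rough sketch of how the first ground-truth factor (the count of Triglycerides values above the normal range in the last 12 months) might be computed from the raw tables; the column names, the normal-range representation, and the cutoff date are our own assumptions rather than the actual problem schema.

```python
import pandas as pd

# Assumed schemas: Lab tests(date, patient_id, test_id, value, comments) and
# Lab tests info(test_id, test_name, normal_range_high). The cutoff reflects
# a 2018-2022 history with the target defined over 2023.
CUTOFF = pd.Timestamp("2022-12-31")

def triglycerides_over_normal_last_12m(lab_tests: pd.DataFrame,
                                       lab_tests_info: pd.DataFrame) -> pd.Series:
    """Per-patient count of Triglycerides results above the normal range in the
    12 months before the cutoff (no post-cutoff data is used)."""
    tests = lab_tests.merge(lab_tests_info, on="test_id")
    tests["date"] = pd.to_datetime(tests["date"])
    mask = (
        (tests["test_name"] == "Triglycerides")
        & (tests["date"] > CUTOFF - pd.DateOffset(months=12))
        & (tests["date"] <= CUTOFF)
        & (tests["value"] > tests["normal_range_high"])
    )
    # Patients with no qualifying results are absent here; fill with 0 when
    # joining this factor back onto the Patients train/test tables.
    return tests[mask].groupby("patient_id").size().rename("tg_over_normal_12m")
```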

Image: Architecture diagram of the benchmark generation system

Scope of the first version of the Benchmark

  • Benchmark limitations - for the initial benchmark we are referring to today, we made the following choices:
    • We only generate binary-classification problems
    • Each problem has 1-5 insights to be discovered
    • Problems must have at least one insight that strongly correlates with the target
    • The problems are limited to tabular data, with multiple tables per problem (including time series and lookups)
    • We support only numeric insights / influencing factors
    • We focus on positively influencing factors affecting the KPI; support for negatively influencing factors will come in future benchmark versions
  • The charts below demonstrate the richness and complexity of the benchmark wrt problem domain, number of data tables, and number of insights. The benchmark is composed of problems across 34 different domains. Problem complexity varies along the number of data tables - a few problems with only two data tables, tens of problems with 3, 4, or 5 tables, and even one problem with 10 tables - and along a second dimension, the number of insights to be discovered, varying from 2 to 6 insights in a single problem.

(*) The above distributions may vary as we finalize the beta version of the benchmark. We will refresh them when we share the problem specifications next week.

Beyond that:

  • This initial beta version of the benchmark consists of 106 problems.
  • The primary table for all problems in the benchmark has 3,000 rows: 2,100 (70%) in the train data and 900 (30%) in the test data.
  • The secondary tables have a variable number of rows that matches their role in the data. This number is determined by an LLM. For example, a table of product information for a small business will likely have hundreds of rows, while a table of products/items in customer orders will typically have tens of thousands of rows.
  • The number of columns in the primary table (including the target column) varies from 3 to 16, with an average of 7.3.
  • The total number of data columns per problem (including secondary tables) varies from 15 to 54, with an average of 24.5.
  • The mean of the target is approximately 0.2 for most of the problems; we are planning to add more variability when we finalize V1 of the benchmark.

Takeaways for the completion of V1:

  • The variety we have in problem domains seems sufficient
  • We plan to keep the problems in V1 as binary classification problems only
  • The variety in number of tables in a problem is good
  • We realize that there is a need for higher variability in the complexity and difficulty of the benchmark problems; in V1 we will address this by adding problems at different levels of complexity and difficulty, with an emphasis on proper representation of complex and difficult problems

Benchmark creation / data generation 

The creation of the benchmark is composed of 3 steps:

  • Problem specifications - We prepare (manually and programmatically) a set of structured problem specifications. In particular, for each problem the specification defines the business context, the target variable (KPI), and the insights that should be embedded within the generated data.
  • Data tables - For each such problem specification we generate the required data tables, with the corresponding connecting keys. The data representing the insights to be discovered (represented by quantitative patterns) is also generated (and will be removed before the benchmark is packaged).
  • Target creation - In the final step, we generate the target column for the primary table, so that it is highly correlated with a linear combination of the insight columns

It should be noted that we are not always able to generate data for each of the specified insights. Similarly, we sometimes fail to create a strong target correlation for each of the specified insights, so some of them end up only weakly correlated with the target.

Our (currently internal) project ProblemMaker automates this process. It has two components:

  • The first component takes as input a list of problem clusters (high-level problem archetypes, described in the context of a common business domain).
    • For each such cluster it generates multiple, rather detailed problem specifications: problem description, target variable (KPI), required tables, and required insights (that should be discoverable in the data).
    • The resulting list of problems is the input for the second component.
  • The second component generates the synthetic data: for each problem specification generated above, it generates the problem datasets in which the input insights are hidden. This is done in 3 steps:
    • First, we infer what the columns in each of the tables listed in the problem specification should be, and randomly generate data to populate the tables. Note that in this step we generate neither the target column nor the insight columns.
    • We add columns to the primary table, one column for each required insight to be discovered.
    • We generate the target column so it is correlated with each of the insight columns, and then we remove the insight columns from the primary table (see the sketch below).
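
As an illustration of the last step, here is a minimal sketch of generating a binary target correlated with a linear combination of the insight columns, under our own simplifying assumptions (standardized insight columns, positive random weights, an additive noise term, and a threshold calibrated to a roughly 0.2 positive rate).

```python
import numpy as np
import pandas as pd

def generate_target(insight_cols: pd.DataFrame,
                    positive_rate: float = 0.2,
                    noise: float = 0.5,
                    seed: int = 0) -> pd.Series:
    """Binary target correlated with a linear combination of the insight columns.

    The insight columns themselves are dropped from the primary table afterwards,
    so the factors remain hidden and need to be discovered by the agent.
    """
    rng = np.random.default_rng(seed)
    z = (insight_cols - insight_cols.mean()) / insight_cols.std()   # standardize
    weights = rng.uniform(0.5, 1.5, size=z.shape[1])   # positive weights only (V0/V1 scope)
    score = z.to_numpy() @ weights + rng.normal(0.0, noise, size=len(z))
    threshold = np.quantile(score, 1 - positive_rate)  # top ~20% become positives
    return pd.Series((score >= threshold).astype(int),
                     index=insight_cols.index, name="target")
```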

The result is a set of data problems for the insight discovery task.

In order to demonstrate that the end-to-end process works, we qualified ~100 problems for the V1 of the benchmark, and ran the candidate agent prototypes on these problems.

We are planning to publish the actual benchmark data as well as open-source the generation utility towards the end of the year (see What to Expect at the end of this post).

Evaluation metrics - given a set of ground-truth insights and agent-discovered insights, we calculate:

  • Coverage - how well the set of insights discovered by a candidate agent covers the set of ground-truth insights:
    • Given a ground-truth insight (GTI), we define its weight w(GTI) as the correlation (in absolute value) of the GTI with the target column
    • For each ground-truth insight we find the agent-discovered insight that best approximates it. Given a ground-truth insight (GTI):
      • For each of the agent-discovered insights (ADI), we compute the data correlation of the ADI with the GTI
      • We pick ADI-max(GTI) to be the ADI with the highest correlation (in absolute value)
    • The coverage score for the agent is:
      • the sum over all GTIs in the ground-truth solution of w(GTI) * |corr(GTI, ADI-max(GTI))|, divided by the sum over all GTIs of w(GTI)
  • Predictive performance / statistical power - how strong is the signal generated by the insights when we consider them as features for building a predictive model:
    • We build a standard predictive model over the problem data augmented by the data generated by the insights discovered by the agent
      • As the standard model, we chose scikit-learn’s RandomForest with the default parameters
      • The predictive-performance score is the test-set AUC of this model
  • Proper use of data - we apply a variety of techniques to verify that the agent respects temporal constraints in the data and avoids using data that would not be available at time of inference. For example, one technique we use is: for each insight provided by the agent, we ensure that the value of the insight on the training data does not change when we change training-data values that are out of temporal scope (i.e. in the future) wrt each primary-table row
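
For concreteness, here is a rough sketch of how the coverage and statistical-power scores could be computed, assuming the ground-truth and agent-discovered insights are available as numeric columns aligned with the primary-table rows; the function and variable names are ours, not part of the released evaluation framework.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def coverage_score(gti: pd.DataFrame, adi: pd.DataFrame, target: pd.Series) -> float:
    """Weighted coverage of ground-truth insights (GTI) by agent-discovered insights (ADI)."""
    weights, best_corrs = [], []
    for col in gti.columns:
        w = abs(gti[col].corr(target))                               # w(GTI)
        best = max(abs(gti[col].corr(adi[a])) for a in adi.columns)  # corr with ADI-max(GTI)
        weights.append(w)
        best_corrs.append(best)
    return float(np.dot(weights, best_corrs) / np.sum(weights))

def statistical_power(train_X: pd.DataFrame, train_y: pd.Series,
                      test_X: pd.DataFrame, test_y: pd.Series) -> float:
    """Test-set AUC of a default RandomForest trained on the data augmented
    with the agent-discovered insight columns."""
    model = RandomForestClassifier().fit(train_X, train_y)
    return roc_auc_score(test_y, model.predict_proba(test_X)[:, 1])
```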

Naturally, when comparing the evaluation scores of candidate agents, we enforce the same limit on the number of ADIs each candidate agent may propose.

Baseline LLM-Centric Insight Discovery Agents

Finally, we created several LLM-centric agents tasked with discovering insights; the purpose of this step is to understand the inherent limitations LLM-powered agents may have in discovering insights.

In particular, in the next project phases we aim to demonstrate the possibility of leveraging the dynamic benchmarking capability to iteratively evolve robust insight discovery agents.

LLMs used by the agents

We experimented with three variants of GPT-4o: gpt-4o-2024-05-13, gpt-4o-mini, and gpt-4o-2024-08-06.

The 3 baseline agents for evaluation

  1. We started with a basic agent, using gpt-4o-2024-05-13, with the following structure:
  • Inputs: datasets, target variable, additional context
  • Use a simple, direct prompt for insight discovery
  • Generate hypotheses for insights as to what correlates with the target variable based on the data and common-sense knowledge
  • Write code to quantify these insights based on the data
  • Execute the code on the data to evaluate each of the generated insights
  • Output: The top insights according to the evaluation metric.

This basic version had the following shortcomings:

  • Scores for coverage and predictive performance were low
  • Limited ability to discover complex insights

  2. We addressed these issues in a second agent version by adding a detailed “insight discovery” prompt, breaking the task into steps and allowing the agent to send repeated queries to the LLM to address each of the steps. This agent still used gpt-4o-2024-05-13.

  3. In the third version, we transitioned to gpt-4o-2024-08-06 and were able to simplify the prompt to perform the task in a single step. We also realized that the code generated by the LLM did not always compile and sometimes produced runtime errors, so we iterated on the code generation with specific hints for fixing Python errors. The updated agent structure is:

  • Inputs: datasets, target variable, additional context
  • Use a single-step prompt
  • Generate hypotheses for insights as to what correlates with the target variable based on the data and common-sense knowledge
  • Write code to quantify these insights based on the data
  • Execute the code on the data to evaluate each of the generated insights
  • Fix code if there are errors during execution
  • Output: The top insights according to the evaluation metric.
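
Below is a highly simplified sketch of this agent loop. The `llm` callable, the prompt wording, and the expectation that the generated code produces an `insights` dictionary are placeholders of our own; the actual baseline agents were built on the OpenAI Assistants API and differ in the details.

```python
import traceback

def run_insight_agent(llm, datasets: dict, target: str, context: str,
                      max_fix_attempts: int = 3) -> dict:
    """Single-step baseline agent: hypothesize insights, generate code to quantify
    them, execute the code, and ask the LLM to repair runtime errors."""
    prompt = (
        f"Context: {context}\nTarget variable: {target}\n"
        f"Available tables: {list(datasets)}\n"
        "Propose insight hypotheses correlated with the target and return Python "
        "code that computes one numeric column per insight into a dict named 'insights'."
    )
    code = llm(prompt)                      # placeholder LLM call
    for _ in range(max_fix_attempts):
        try:
            scope = {"datasets": datasets}
            exec(code, scope)               # generated code fills scope["insights"]
            return scope["insights"]        # top-insight selection by metric omitted here
        except Exception:
            code = llm("This code failed:\n" + code +
                       "\nError:\n" + traceback.format_exc() + "\nReturn fixed code.")
    return {}
```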

Baseline LLM Agents - Evaluation Highlights

Key findings from our initial evaluations include:

  • With a basic prompt and basic post-processing of the LLM suggestions, GPT-4o is already able to discover simple insights, but fails when complexity increases; it achieves a cumulative coverage score of 47% and its average predictive-performance score is 62%. 
  • Providing detailed guidance wrt insight discovery methodology into the prompt led to modest performance gains (~5% relative improvement) in both the coverage and predictive-performance metrics.
  • Applying Python code corrections and simplifying agent interaction to a single step resulted in additional improvement, mostly in the coverage score.
  • Baseline agents performed slightly better when tasked with problems embedding multiple insights that strongly correlate with target outcomes.
Image: Agents’ scores

Learnings and What to Expect Next

Reflecting on the challenges we met while developing this benchmarking framework, we have learned that the task of automatically synthesizing data problems for insight discovery is indeed not trivial.

  • Starting with creating problem specifications - it is challenging to:
    • Generate a variety of problems that are not artificial repetitions of a few reference problems
    • Assess the complexity of a problem by its specification, in order to ensure that the benchmark has sufficient complexity variability
    • Produce realistic problem descriptions that also make business/scientific sense
  • Generating synthetic data for a given problem specification is an even harder task - here are our top-level learnings:
    • LLMs are still challenged by code generation tasks; getting Python code that runs without crashing requires multiple iterations, and even then successful completion is not guaranteed.
    • Validating that the LLM-generated code actually performs what it was told to do is hard. We were not successful at this and decided to put it on hold for future work. The current version does not validate that the generated code for the hidden insights/factors is compatible with the natural-language descriptions in the problem specification.
    • One should pay attention to the direction of impact of a given factor on the target. For V0 and V1, we decided to focus on the amplifying insights (or positively correlated factors). 
  • Realistic synthetic data - there are several aspects and dimensions that make synthetic data look synthetic. When data of an aggregated nature is generated synthetically, it is virtually impossible to ensure consistency. Instead, our approach is to synthesize data at the most granular level and calculate/infer the different types of aggregations from the raw data.
  • We considered multiple strategies for embedding the ground-truth insights into the data. The one we implemented ended up being a successful heuristic - we generate the base tables, then calculate the insight columns deterministically from these tables, and then generate a target column that is correlated with the insight columns. Finally we remove these columns so the insights are hidden and need to be discovered. Additional strategies should be explored and evaluated, and further research is required to understand whether different strategies may produce more realistic data, as well as allow for more challenging problems to be created.
  • While we are happy with the beta version, we think that V1 needs to better span the range of complexity and difficulty of problems. We expect that overall V1 will be a more challenging, complex and difficult benchmark.

Evaluation of insight-discovery agents against a benchmark is a simpler task. We should note one important challenge:

  • We still need to construct a robust metric to assess whether an insight in the ground-truth solution is covered by the solution (set of insights) proposed by an agent
  • Moreover, even when the insight descriptions are compatible, it is an even harder challenge to validate that the code supporting the insight is indeed compatible with the description

Finally, as a byproduct of developing baseline agents for the task, we also learned a few lessons:

  • GPT-4o is familiar with data-science-related tasks, so extensive prompt engineering is less relevant for this problem; the improvement we achieved with more elaborate guidance on how to discover insights in the data was marginal (~5%)
  • What helped much more was applying a round of code fixes to the functions generated by the LLM before making the final insight selection. That increased the number of candidate insights significantly, and the performance of an agent equipped with the ability to fix errors in the code was much better.
  • We used the OpenAI Assistants API for developing the baseline agents; this decision should be re-evaluated, taking into account whether OpenAI intends to support this API for agent development in the long term

What to expect next?

By the end of the year, we are planning to release the benchmark V1 datasets publicly, and later in 2025 we will open-source the entire framework, both for generating the benchmark (problem specifications and datasets) and the evaluation of candidate agents on the benchmark.

  • V2 of the benchmark will take advantage of o1 for code generation, both for the synthetic data generation and for the generation of the hidden patterns
  • We will expand the benchmark with significantly more problems, as well as more problem variation and complexity
  • We will address the limitations outlined in the Technical Outline section above

Join the Effort to Enhance AI Benchmarking

We invite the community to collaborate on the first component of our initiative: generating diverse and meaningful problem specifications for benchmarking insight discovery capabilities. Contribute your own problem specifications, suggest enhancements, or help refine templates to ensure the benchmarks reflect real-world complexities. Your participation will help build a robust, community-driven repository that pushes the boundaries of AI capabilities in insight and pattern discovery. Contact us here.
