Benchmarking LLMs' Insight Discovery Capabilities through Synthetic Problem Generation

Introduction

To drive meaningful and continuous improvements in operational KPIs, Generative AI applications must be able to discover insights hidden within operational data. A critical component in developing agents capable of insight discovery is the establishment of robust evaluation and benchmarking frameworks for the task of discovering patterns impacting key metrics / outcomes.

When we refer to insights, we assume there is a well-defined KPI that is being measured and tracked; by insights we mean factors, expressed in natural language, that affect this KPI.

It’s important to note that while insight discovery is clearly an important capability, embedded both inside domain-specific / use-case-specific solutions and within enterprise BI tools, there is no benchmark for assessing this capability (in contrast to the variety of benchmarks available for both predictive models and large language models).

Today, we are introducing the first-ever benchmark designed to evaluate insight discovery capabilities in AI agents, together with an accompanying evaluation framework and performance metrics. The benchmark is synthetically generated, enabling scalable expansion as more resources become available. We take a practical approach, focused on solving real-world problems. This is not a benchmark for predictive modeling; insights are statements in natural language and need to be evaluated as such. That said, community assets such as the predictive-modeling benchmarks and problems shared publicly on Kaggle can be a useful starting point for anyone building an insight discovery benchmark. We will discuss what constitutes a good benchmark for insight discovery, how to evaluate agents for this task, and how to leverage a dynamically generated benchmark to auto-improve such agents.

Early results indicate that this framework not only provides a viable way to assess agents’ performance but can also be leveraged to autonomously improve their capabilities. In the sections below we drill down into the metric definitions and the results for the baseline agents.

This article refers to version zero (the beta) of the framework, which we are currently finalizing. Next week, we will share the problem specifications for this version. Looking forward, we aim to publish version 1 of the benchmarking framework by the end of the year. At that time we will share the entire benchmark - the problem specifications, the problems’ datasets, and the evaluation framework. As part of this effort, we are organizing a workshop on LLM-generated problem specifications for insight discovery. Early collaborators will be invited to contribute to this exciting initiative.

Below you will find the following sections:

  • Motivation
  • Benchmarking framework - technical outline
  • Baseline LLM-centric insight discovery agents
  • Baseline LLM agents - Evaluation highlights
  • Learnings & What to expect next

Motivation

In recent years, the development of large language models (LLMs) has been guided and assessed using a variety of benchmarks. Leading AI labs like OpenAI, Anthropic, Meta FAIR, and Google Gemini frequently reference several key benchmarks when announcing their models. These benchmarks are critical for measuring the capabilities and performance of their LLMs. However, they have notable gaps, especially concerning the complex requirements of insight discovery tasks.

Common Benchmarks Used By Leading AI Labs

MMLU (Massive Multitask Language Understanding):

  • Purpose: Evaluates models on their ability to handle a wide range of subjects and tasks.
  • Focus and limitations wrt insight discovery: Primarily focuses on knowledge acquisition and understanding across predefined tasks, lacking the complexity of real-world insight discovery problems involving multiple datasets and intricate relationships.

BBH (Big-Bench Hard):

  • Purpose: Tests models on challenging multi-step reasoning tasks.
  • Focus: Tests reasoning capabilities.

HumanEval:

  • Purpose: Assesses coding abilities by having models complete Python programming tasks.
  • Focus and limitations wrt insight discovery: Focuses on code generation and understanding, which, although useful, does not cover the breadth of pattern discovery, such as data handling and statistical analysis.

DROP (Discrete Reasoning Over Paragraphs):

  • Purpose: Measures the ability to perform discrete reasoning and comprehension.
  • Limitations: Emphasizes text comprehension and reasoning.

Image 1: A summary table of benchmark results by Meta FAIR, comparing Llama 3.2 to comparable frontier models.

The paper “Self-Taught Evaluators” by Tianlu Wang et al. presents a method for training LLM evaluators without human-annotated data. Instead of relying on costly human judgments, the authors use an iterative self-training approach with synthetic data. Starting with a base LLM, the model generates preference pairs and trains itself as a judge, iteratively improving performance. This approach matches or surpasses models trained with human data, achieving significant improvements on benchmarks like RewardBench, increasing scores from 75.4 to 88.3. It offers a scalable alternative for evaluating LLMs, particularly for extracting insights from structured data.

Insight Discovery Assessment - Gaps in Current Benchmarks

While these benchmarks have undoubtedly propelled significant advancements in the capabilities of LLMs, they remain limited in scope and do not adequately capture the models’ ability to perform more specialized tasks, such as insight discovery in structured data. Traditional benchmarks typically focus on natural language understanding, generation, and reasoning over unstructured data sources like text and dialogue, but they fail to account for the nuanced challenges involved in extracting meaningful insights from structured datasets such as databases, spreadsheets, and tabular information. As a result, there is a growing need for new benchmarks that can evaluate how effectively LLMs can navigate, interpret, and synthesize information from structured sources, ultimately pushing the boundaries of what these models can achieve in data-driven environments.

The Need for Dynamic Insight Discovery Benchmarks

To address these gaps, there is a pressing need for a dynamic benchmarking framework tailored to insight discovery. This framework should provide both the means to automatically generate insight discovery problems and the means to evaluate a given agent that claims to perform insight discovery tasks (in other words, to solve such problems).

In particular, a problem in the benchmark should have 4 components:

  • Natural language description of the problem, including the business/scientific context, the underlying objective (e.g. discover factors affecting a KPI), and related data schema
  • Underlying problem data - one or more data tables, from which the insights should be derived. The primary table should include a target column representing the objective/KPI.
  • Ground-truth insights - the insights that are hidden in the data. These insights are the “official answer” to the problem
  • Underlying ground-truth factors data - the derived data reflecting the calculation of the insights (or more precisely, the quantitative insightful factors) from the problem data.
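
To make this concrete, here is a minimal sketch of how such a problem bundle might be represented in code; the class and field names are our own and are not part of the released benchmark format.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class BenchmarkProblem:
    """One benchmark item, mirroring the four components listed above.

    Illustrative only - the actual packaging of the benchmark may differ.
    """
    # 1. Natural-language description: business/scientific context, objective, schema
    description: str
    # 2. Underlying problem data: table name -> DataFrame; the primary table
    #    contains the target column representing the objective/KPI
    tables: dict[str, pd.DataFrame]
    primary_table: str
    target_column: str
    # 3. Ground-truth insights, expressed in natural language (the "official answer")
    ground_truth_insights: list[str]
    # 4. Ground-truth factor data: one numeric column per insight,
    #    aligned with the rows of the primary table
    ground_truth_factors: pd.DataFrame | None = None
```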

The benchmarking framework should therefore:

  • Autonomously Generate Problems and Underlying Data: Use generative approaches to create synthetic insight-discovery problems with underlying data and attached solutions (i.e. the 4 components above).
    • Integrated ground-truth: Provide ground-truth insights for every problem in the benchmark
  • Include Multi-Table Data: Reflect the complexity of real-world problems involving multiple interconnected datasets.
  • Support Diverse Problem Types: Cover a wide range of pattern discovery tasks including classification, regression, and temporal analysis.

Technical Outline

This section will provide:

  • Requirements and principles for problems, insights, and evaluation metrics
  • An example of a benchmark problem
  • Scope of generated problems
  • High level architecture for synthetic insight-discovery problem generation
  • Description of the data generation process

Requirements and principles

We start with the list of requirements for the benchmark - viewed as a collection of problems/datasets - and for the evaluation framework.

To be explicit, an item in the benchmark would be a set of:

  • Problem specification
  • Datasets for which the task of ‘insight discovery in data’ should be performed

Where the problem specification consists of:

  1. Problem name
  2. Problem domain
  3. Problem description
  4. Required tables (and indication of the primary table)
  5. Target name
  6. Insights to be discovered
  7. Comments - data
  8. Comments - target
  9. Comments - schema
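
For illustration, such a specification might be serialized as a simple structure like the one below; the field names and the example content are hypothetical and are not taken from the released benchmark.

```python
# A hypothetical, illustrative problem specification (not from the actual benchmark).
problem_spec = {
    "problem_name": "Subscription churn risk",
    "problem_domain": "Telecom",
    "problem_description": "How the risk of a customer churning within 3 months "
                           "correlates with usage patterns and support history",
    "required_tables": {"customers": "primary", "usage_log": "secondary",
                        "support_tickets": "secondary"},
    "target_name": "has_churned",
    "insights_to_discover": [
        "Number of support tickets opened in the last 90 days",
        "Slope of monthly data usage over the last 6 months",
    ],
    "comments_data": "Usage and ticket history should cover 2021-2023",
    "comments_target": "The target indicates churn during the first quarter of 2024",
    "comments_schema": "usage_log should include date, customer_id, gb_used",
}
```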

We expect the benchmark generation utility to take a problem specification as above as input and to:

  • Generate synthetic data as specified. In particular:
    • the data should be structured into the tables listed in #4,
    • the columns in these tables should provide the information required to discover the insights in #6,
    • and the target column should represent the desired outcome associated with the KPI
  • Split the data into train and test sets

A good benchmark should be rich & diverse, robust and realistic. What we mean by this is:

Rich & Diverse:

  • Variety of problem domains and phenomena to be discovered in the data
  • Variety of problem types, where the target variable is numeric, binary or multi-class.
  • Variety in problem complexity - e.g. number of patterns to be discovered, number of tables in the problem, complexity and depth of the pattern expressions, etc.

Robust:

  • Existence of a clear “ground truth” - benchmark problems need to come with an explicit solution: the set of insights to be found and the importance of each insight. Note that this is not the case with most “real-world” problems, where there is no agreed-upon, complete reference list of insights to be found.
  • Ability to extend the benchmark with more problems similar to a given set of reference problems - we refer to this as “dynamic benchmark generation”: given a set of reference problems, the framework should be able to dynamically generate new benchmark problems that are similar to the reference set. Why is this important?
    • As the ecosystem is changing at an extremely high rate, in both needs and capabilities (which then give rise to more needs and more capabilities), a good benchmark should be extensible with additional problems.
    • Moreover, the need for new benchmark problems is likely to be directed at certain types of problems that may be under-represented in the current version of the benchmark.
    • This property of the framework also supports auto-improvement of candidate agents for insight discovery, as we explain below.
  • Problems should be provided with training and test data

Realistic:

  • Mirror real problems with business relevance and insights that arise in actual use cases
  • Reflect cross-data relationships and patterns that occur naturally in real-world data

Evaluation Framework

In order to assess agents attempting to solve the insight discovery task, we need an evaluation framework beyond the benchmark itself (the problems and associated data). It provides the metrics for measuring how well the insights discovered by such an agent cover the ground-truth solution, how much predictive power these insights carry, and whether the agent makes proper use of the data (e.g. no temporal leaks). The evaluation of solutions is therefore based on these three metrics:

  • Coverage - how well the set of insights discovered by a candidate agent covers the set of ground-truth insights
  • Statistical power - how well a predictive model (a standard classifier) trained with the proposed insights as features predicts the target on the test set
  • Proper use of data - for example, when temporal data is provided, does the agent respect the temporal constraints and avoid using data that would not be available at time of inference

Example

Here is an example of a problem specification and the data generated for it.

  • Problem name: Diabetes risk
  • Problem domain: Healthcare
  • Problem description: How the risk of developing type 2 diabetes within 12 months correlates with demographics and medical history
  • Required tables (and indication of the primary table):
    • Patients (Primary)
    • Lab tests
    • Lab tests info
  • Target name: has_diabetes
  • Insights/factors to be discovered:
    • Number of occurrences (in the last 12 months) of values of Triglycerides over the normal range.
    • Number of abnormal blood test values in the last 24 months.
    • Annual slope of HbA1c in the last 5 years.
  • Comments - data: The data should include medical history per patient for 2018-2022
  • Comments - target: The target would indicate whether the patient developed Diabetes (type 2) in 2023
  • Comments - schema: In lab tests include date, test_id, value, comments

The task is therefore to take the input specification as above and generate the three data tables (Patients, Lab tests, Lab tests info), where the Patients table is also split into train and test. Part of the output is also the data associated with the ground-truth insights; that is, we enrich the train and test tables with columns representing each of the insights/factors listed in #6 of the input.
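
To make the enrichment step concrete, here is a rough sketch of how the first ground-truth factor (the count of Triglycerides values above the normal range in the last 12 months) might be computed from the raw tables; the column names, the normal-range representation, and the cutoff date are our own assumptions rather than the actual problem schema.

```python
import pandas as pd

# Assumed schemas: Lab tests(date, patient_id, test_id, value, comments) and
# Lab tests info(test_id, test_name, normal_range_high). The cutoff reflects
# a 2018-2022 history with the target defined over 2023.
CUTOFF = pd.Timestamp("2022-12-31")

def triglycerides_over_normal_last_12m(lab_tests: pd.DataFrame,
                                       lab_tests_info: pd.DataFrame) -> pd.Series:
    """Per-patient count of Triglycerides results above the normal range in the
    12 months before the cutoff (no post-cutoff data is used)."""
    tests = lab_tests.merge(lab_tests_info, on="test_id")
    tests["date"] = pd.to_datetime(tests["date"])
    mask = (
        (tests["test_name"] == "Triglycerides")
        & (tests["date"] > CUTOFF - pd.DateOffset(months=12))
        & (tests["date"] <= CUTOFF)
        & (tests["value"] > tests["normal_range_high"])
    )
    # Patients with no qualifying results are absent here; fill with 0 when
    # joining this factor back onto the Patients train/test tables.
    return tests[mask].groupby("patient_id").size().rename("tg_over_normal_12m")
```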

Image: Architecture diagram of the benchmark generation system

Scope of the first version of the Benchmark

  • Benchmark limitations - for the initial benchmark we are referring to today, we made the following choices:
    • We only generate binary-classification problems
    • Each problem has 1-5 insights to be discovered
    • Problems must have at least one insight that strongly correlates with the target
    • The problems are limited to tabular data, with multiple tables per problem (including time series and lookups)
    • We support only numeric insights / influencing factors
    • We focus on positively influencing factors affecting the KPI; support for negatively influencing factors will come in future benchmark versions
  • The charts below demonstrate the richness and complexity of the benchmark wrt problem domain, number of data tables, and number of insights. The benchmark is composed of problems across 34 different domains. Problem complexity varies along the number of data tables - a few problems with only two data tables, tens of problems with 3, 4, or 5 tables, and even one problem with 10 tables - and along a second dimension, the number of insights to be discovered, varying from 2 to 6 insights in a single problem.

(*) The above distributions may vary as we finalize the beta version of the benchmark. We will refresh them when we share the problem specifications next week.

Beyond that:

  • This initial beta version of the benchmark consists of 106 problems.
  • The primary table for all problems in the benchmark has 3,000 rows: 2,100 (70%) in the train data and 900 (30%) in the test data.
  • The secondary tables have a variable number of rows that matches their role in the data. This number is determined by an LLM. For example, a table of product information for a small business will likely have hundreds of rows, while a table of products/items in customer orders will typically have tens of thousands of rows.
  • The number of columns in the primary table (including the target column) varies from 3 to 16, with an average of 7.3.
  • The total number of data columns per problem (including secondary tables) varies from 15 to 54, with an average of 24.5.
  • The mean of the target is approximately 0.2 for most of the problems; we are planning to add more variability when we finalize V1 of the benchmark.

Takeaways for the completion of V1:

  • The variety we have in problem domains seems sufficient
  • We plan to keep the problems in V1 as binary classification problems only
  • The variety in number of tables in a problem is good
  • We realize that there is a need for higher variability in the complexity and difficulty of the benchmark problems; in V1 we will address this by adding problems at different levels of complexity and difficulty, with an emphasis on proper representation of complex and difficult problems

Benchmark creation / data generation 

The creation of the benchmark is composed of 3 steps:

  • Problem specifications - We prepare (manually and programmatically) a set of structured problem specifications. In particular, for each problem the specification defines the business context, the target variable (KPI), and the insights that should be embedded within the generated data.
  • Data tables - For each such problem specification we generate the required data tables, with the corresponding connecting keys. The data representing the insights to be discovered (represented by quantitative patterns) is also generated (and will be removed before the benchmark is packaged).
  • Target creation - In the final step, we generate the target column for the primary table, so that it is highly correlated with a linear combination of the insight columns

It should be noted that we are not always able to generate data for each of the specified insights. Similarly, we sometimes fail to create a strong target correlation for each of the specified insights, so some of them end up only weakly correlated with the target.

Our (currently internal) project ProblemMaker automates this process. It has two components:

  • The first component takes as input a list of problem clusters (high-level problem archetypes, described in the context of a common business domain).
    • For each such cluster it generates multiple, rather detailed problem specifications: problem description, target variable (KPI), required tables, and required insights (that should be discoverable in the data).
    • The resulting list of problems is the input for the second component.
  • The second component generates the synthetic data: for each problem specification generated above, it generates the problem datasets in which the input insights are hidden. This is done in 3 steps:
    • First, we infer what the columns in each of the tables listed in the problem specification should be, and randomly generate data to populate the tables. Note that in this step we generate neither the target column nor the insight columns.
    • We add columns to the primary table, one column for each required insight to be discovered.
    • We generate the target column so it is correlated with each of the insight columns, and then we remove the insight columns from the primary table (see the sketch below).
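
As an illustration of the last step, here is a minimal sketch of generating a binary target correlated with a linear combination of the insight columns, under our own simplifying assumptions (standardized insight columns, positive random weights, an additive noise term, and a threshold calibrated to a roughly 0.2 positive rate).

```python
import numpy as np
import pandas as pd

def generate_target(insight_cols: pd.DataFrame,
                    positive_rate: float = 0.2,
                    noise: float = 0.5,
                    seed: int = 0) -> pd.Series:
    """Binary target correlated with a linear combination of the insight columns.

    The insight columns themselves are dropped from the primary table afterwards,
    so the factors remain hidden and need to be discovered by the agent.
    """
    rng = np.random.default_rng(seed)
    z = (insight_cols - insight_cols.mean()) / insight_cols.std()   # standardize
    weights = rng.uniform(0.5, 1.5, size=z.shape[1])   # positive weights only (V0/V1 scope)
    score = z.to_numpy() @ weights + rng.normal(0.0, noise, size=len(z))
    threshold = np.quantile(score, 1 - positive_rate)  # top ~20% become positives
    return pd.Series((score >= threshold).astype(int),
                     index=insight_cols.index, name="target")
```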

The result is a set of data problems for the insight discovery task.

In order to demonstrate that the end-to-end process works, we qualified ~100 problems for the V1 of the benchmark, and ran the candidate agent prototypes on these problems.

We are planning to publish the actual benchmark data as well as open-source the generation utility towards the end of the year (see What to Expect at the end of this post).

Evaluation metrics - given a set of ground-truth insights and agent-discovered insights, we calculate:

  • Coverage - how well the set of insights discovered by a candidate agent covers the set of ground-truth insights:
    • Given a ground-truth insight (GTI), we define its weight w(GTI) as the correlation (in absolute value) of the GTI with the target column
    • For each ground-truth insight we find the agent-discovered insight that best approximates it. Given a ground-truth insight (GTI):
      • For each of the agent-discovered insights (ADI), we compute the data correlation of the ADI with the GTI
      • We pick ADI-max(GTI) to be the ADI with the highest correlation (in absolute value)
    • The coverage score for the agent is:
      • the sum over all GTIs in the ground-truth solution of w(GTI) * |corr(GTI, ADI-max(GTI))|, divided by the sum over all GTIs of w(GTI)
  • Predictive performance / statistical power - how strong is the signal generated by the insights when we consider them as features for building a predictive model:
    • We build a standard predictive model over the problem data augmented by the data generated by the insights discovered by the agent
      • As the standard model, we chose scikit-learn’s RandomForest with the default parameters
      • The predictive-performance score is the test-set AUC of this model
  • Proper use of data - we apply a variety of techniques to verify that the agent respects temporal constraints in the data and avoids using data that would not be available at time of inference. For example, one technique we use is: for each insight provided by the agent, we ensure that the value of the insight on the training data does not change when we change training-data values that are out of temporal scope (i.e. in the future) wrt each primary-table row
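
For concreteness, here is a rough sketch of how the coverage and statistical-power scores could be computed, assuming the ground-truth and agent-discovered insights are available as numeric columns aligned with the primary-table rows; the function and variable names are ours, not part of the released evaluation framework.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def coverage_score(gti: pd.DataFrame, adi: pd.DataFrame, target: pd.Series) -> float:
    """Weighted coverage of ground-truth insights (GTI) by agent-discovered insights (ADI)."""
    weights, best_corrs = [], []
    for col in gti.columns:
        w = abs(gti[col].corr(target))                               # w(GTI)
        best = max(abs(gti[col].corr(adi[a])) for a in adi.columns)  # corr with ADI-max(GTI)
        weights.append(w)
        best_corrs.append(best)
    return float(np.dot(weights, best_corrs) / np.sum(weights))

def statistical_power(train_X: pd.DataFrame, train_y: pd.Series,
                      test_X: pd.DataFrame, test_y: pd.Series) -> float:
    """Test-set AUC of a default RandomForest trained on the data augmented
    with the agent-discovered insight columns."""
    model = RandomForestClassifier().fit(train_X, train_y)
    return roc_auc_score(test_y, model.predict_proba(test_X)[:, 1])
```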

Naturally, when comparing the evaluation scores of candidate agents, we enforce the same limit on the number of ADIs each candidate agent may propose.

Baseline LLM-Centric Insight Discovery Agents

Finally, we created several LLM-centric agents tasked with discovering insights; the purpose of this step is to understand the inherent limitations LLM-powered agents may have in discovering insights.

In particular, in the next project phases we aim to demonstrate the possibility of leveraging the dynamic benchmarking capability to iteratively evolve robust insight discovery agents.

LLMs used by the agents

We experimented with three variants of GPT-4o: gpt-4o-2024-05-13, gpt-4o-mini, and gpt-4o-2024-08-06.

The 3 baseline agents for evaluation

  1. We started with a basic agent, using gpt-4o-2024-05-13, with the following structure:
  • Inputs: datasets, target variable, additional context
  • Use a simple, direct prompt for insight discovery
  • Generate hypotheses for insights as to what correlates with the target variable based on the data and common-sense knowledge
  • Write code to quantify these insights based on the data
  • Execute the code on the data to evaluate each of the generated insights
  • Output: The top insights according to the evaluation metric.

This basic version had the following shortcomings:

  • Scores for coverage and predictive performance were low
  • Limited ability to discover complex insights

  2. We addressed these issues in a second agent version by adding a detailed “insight discovery” prompt, breaking the task into steps and allowing the agent to send repeated queries to the LLM to address each of the steps. This agent still used gpt-4o-2024-05-13.

  3. In the third version, we transitioned to gpt-4o-2024-08-06 and were able to simplify the prompt to perform the task in a single step. We also realized that the code generated by the LLM did not always compile and sometimes produced runtime errors, so we iterated on the code generation with specific hints for fixing Python errors. The updated agent structure is:

  • Inputs: datasets, target variable, additional context
  • Use a single-step prompt
  • Generate hypotheses for insights as to what correlates with the target variable based on the data and common-sense knowledge
  • Write code to quantify these insights based on the data
  • Execute the code on the data to evaluate each of the generated insights
  • Fix code if there are errors during execution
  • Output: The top insights according to the evaluation metric.
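
Below is a highly simplified sketch of this agent loop. The `llm` callable, the prompt wording, and the expectation that the generated code produces an `insights` dictionary are placeholders of our own; the actual baseline agents were built on the OpenAI Assistants API and differ in the details.

```python
import traceback

def run_insight_agent(llm, datasets: dict, target: str, context: str,
                      max_fix_attempts: int = 3) -> dict:
    """Single-step baseline agent: hypothesize insights, generate code to quantify
    them, execute the code, and ask the LLM to repair runtime errors."""
    prompt = (
        f"Context: {context}\nTarget variable: {target}\n"
        f"Available tables: {list(datasets)}\n"
        "Propose insight hypotheses correlated with the target and return Python "
        "code that computes one numeric column per insight into a dict named 'insights'."
    )
    code = llm(prompt)                      # placeholder LLM call
    for _ in range(max_fix_attempts):
        try:
            scope = {"datasets": datasets}
            exec(code, scope)               # generated code fills scope["insights"]
            return scope["insights"]        # top-insight selection by metric omitted here
        except Exception:
            code = llm("This code failed:\n" + code +
                       "\nError:\n" + traceback.format_exc() + "\nReturn fixed code.")
    return {}
```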

Baseline LLM Agents - Evaluation Highlights

Key findings from our initial evaluations include:

  • With a basic prompt and basic post-processing of the LLM suggestions, GPT-4o is already able to discover simple insights, but fails when complexity increases; it achieves a cumulative coverage score of 47% and its average predictive-performance score is 62%. 
  • Providing detailed guidance wrt insight discovery methodology into the prompt led to modest performance gains (~5% relative improvement) in both the coverage and predictive-performance metrics.
  • Applying Python code corrections and simplifying agent interaction to a single step resulted in additional improvement, mostly in the coverage score.
  • Baseline agents performed slightly better when tasked with problems embedding multiple insights that strongly correlate with target outcomes.
Image: Agents’ scores

Learnings and What to Expect Next

Reflecting on the challenges we met while developing this benchmarking framework, we have learned that the task of automatically synthesizing data problems for insight discovery is indeed not trivial.

  • Starting with creating problem specifications - it is challenging to:
    • Generate a variety of problems that are not artificial repetitions of a few reference problems
    • Assess the complexity of a problem by its specification, in order to ensure that the benchmark has sufficient complexity variability
    • Produce realistic problem descriptions that also make business/scientific sense
  • Generating synthetic data for a given problem specification is an even harder task - here are our top-level learnings:
    • LLMs are still challenged by code generation tasks; getting Python code that runs without crashing requires multiple iterations, and even then successful completion is not guaranteed.
    • Validating that the LLM-generated code actually performs what it was told to do is hard. We were not successful at this and decided to put it on hold for future work. The current version does not validate that the generated code for the hidden insights/factors is compatible with the natural-language descriptions in the problem specification.
    • One should pay attention to the direction of impact of a given factor on the target. For V0 and V1, we decided to focus on the amplifying insights (or positively correlated factors). 
  • Realistic synthetic data - there are several aspects and dimensions that make synthetic data look synthetic. When data of an aggregated nature is generated synthetically, it is virtually impossible to ensure consistency. Instead, our approach is to synthesize data at the most granular level and calculate/infer the different types of aggregations from the raw data.
  • We considered multiple strategies for embedding the ground-truth insights into the data. The one we implemented ended up being a successful heuristic - we generate the base tables, then calculate the insight columns deterministically from these tables, and then generate a target column that is correlated with the insight columns. Finally we remove these columns so the insights are hidden and need to be discovered. Additional strategies should be explored and evaluated, and further research is required to understand whether different strategies may produce more realistic data, as well as allow for more challenging problems to be created.
  • While we are happy with the beta version, we think that V1 needs to better span the range of complexity and difficulty of problems. We expect that overall V1 will be a more challenging, complex and difficult benchmark.

Evaluation of insight-discovery agents against a benchmark is a simpler task. We should note one important challenge:

  • We still need to construct a robust metric to assess whether an insight in the ground-truth solution is covered by the solution (set of insights) proposed by an agent
  • Moreover, even when the insight descriptions are compatible, it is an even harder challenge to validate that the code supporting the insight is indeed compatible with the description

Finally, as a byproduct of developing baseline agents for the task, we also learned a few lessons:

  • GPT-4o is familiar with data-science-related tasks, so extensive prompt engineering is less relevant for this problem; the improvement we achieved with more elaborate guidance on how to discover insights in the data was marginal (~5%)
  • What helped much more was applying a round of code fixes to the functions generated by the LLM before making the final insight selection. That increased the number of candidate insights significantly, and the performance of an agent equipped with the ability to fix errors in the code was much better.
  • We used the OpenAI Assistants API for developing the baseline agents; this decision should be re-evaluated, taking into account whether OpenAI intends to support this API for agent development in the long term

What to expect next?

By the end of the year, we are planning to release the benchmark V1 datasets publicly, and later in 2025 we will open-source the entire framework, both for generating the benchmark (problem specifications and datasets) and the evaluation of candidate agents on the benchmark.

  • V2 of the benchmark will take advantage of o1 for code generation, both for the synthetic data generation and for the generation of the hidden patterns
  • We will expand the benchmark with significantly more problems, as well as more problem variation and complexity
  • We will address the limitations outlined in the Technical Outline section above

Join the Effort to Enhance AI Benchmarking

We invite the community to collaborate on the first component of our initiative: generating diverse and meaningful problem specifications for benchmarking insight discovery capabilities. Contribute your own problem specifications, suggest enhancements, or help refine templates to ensure the benchmarks reflect real-world complexities. Your participation will help build a robust, community-driven repository that pushes the boundaries of AI capabilities in insight and pattern discovery. Contact us here.
