Can LLMs realistically simulate survey responses?

Author: Federico Castelli
Date: 27-03-2026

In the modern world, data are digital gold for companies. Access to a large amount of information about the market and consumers allows firms to make better strategic decisions. This is especially true in fast-moving consumer goods markets, where the launch of new products is usually preceded by market research that is time-consuming and expensive. However, these financial and time costs do not weigh equally on all firms.

For small and medium-sized enterprises (SMEs), conducting market research might require resources that they simply do not have. This is a major issue, particularly in the Italian context, where around 99.9% of firms are SMEs. As a result, the growth potential of these businesses is significantly limited.

To address this problem, the study explores the use of Large Language Models (LLMs), a technology that has experienced remarkable growth in recent years. The main research question guiding the work is: “Can LLMs realistically simulate survey responses?”

Previous studies have already investigated similar topics, but none of them focused on developing a simple, low-cost, and easily replicable method. In other words, there was no existing framework specifically designed for data-poor environments, such as those faced by SMEs.

The research started by developing an operational method based on a modular prompt, with the goal of making it instantly replicable by any company. The prompt was composed of several elements:

  • A set of macroeconomic and demographic data obtained from publicly available databases, such as ISTAT.
  • A series of rules, combined with these data, aimed at ensuring both variability and internal coherence.
  • A clear description of the objective of the simulation.
  • A variable amount of real data taken directly from the survey itself; the specific use of these data is explained in the following section.
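The modular structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, field names, section labels, and placeholder texts are hypothetical and are not the prompt used in the study.

```python
from dataclasses import dataclass, field


@dataclass
class PromptModules:
    """Hypothetical container for the four prompt components."""
    context_data: str   # public macroeconomic/demographic figures (e.g. from ISTAT)
    rules: list[str]    # rules for variability and internal coherence
    objective: str      # what the simulation should produce
    real_examples: list[str] = field(default_factory=list)  # optional real survey data


def build_prompt(modules: PromptModules, n: int) -> str:
    """Assemble the modules into one prompt string asking for n respondents."""
    parts = [
        "CONTEXT DATA:\n" + modules.context_data,
        "RULES:\n" + "\n".join(f"- {rule}" for rule in modules.rules),
        "OBJECTIVE:\n" + modules.objective,
    ]
    # The real-data module is skipped when no real data is supplied.
    if modules.real_examples:
        parts.append("REAL RESPONSES:\n" + "\n".join(modules.real_examples))
    parts.append(f"Generate {n} simulated respondents, one per line.")
    return "\n\n".join(parts)


# Placeholder content only; none of these strings come from the study.
modules = PromptModules(
    context_data="(placeholder: age and income distribution of the target population)",
    rules=["Vary answers across respondents.",
           "Keep each respondent internally consistent."],
    objective="Simulate answers to a consumer survey about a new product.",
)
prompt = build_prompt(modules, n=50)
print(prompt)
```

Keeping each component in its own field means the amount of real data can be changed without touching the rest of the prompt.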

To maximize the insights obtained from the research, three main factors were varied.

The first factor was the quantity and quality of the data provided. Four stages with increasing levels of information were used. The first stage (“Blind”) provided no real data at all to the AI. It was followed by the “Semi-informed” and “Guided” stages, in which progressively more real data about the target population were introduced. Finally, the “Complete” stage included not only all available real data on the target (already present in the “Guided” stage), but also real data related to non-target respondents.

The second factor was the number of simulated observations generated, in this article referred to as “n”. For each informational stage, four different values of n were tested: 50, 150, 400, and 850.

The third factor concerned the informational context, meaning whether changing the chat session had a significant impact on the results.
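The three factors can be laid out as a grid of experimental conditions. The sketch below assumes that the third factor is coded as two conditions (same chat session versus a new one); the article does not state how many session conditions were run, so the labels and the resulting count are illustrative.

```python
from itertools import product

STAGES = ["Blind", "Semi-informed", "Guided", "Complete"]
SAMPLE_SIZES = [50, 150, 400, 850]
# Assumption: two session conditions; these labels are hypothetical.
SESSIONS = ["same_session", "new_session"]

conditions = [
    {"stage": stage, "n": n, "session": session}
    for stage, n, session in product(STAGES, SAMPLE_SIZES, SESSIONS)
]

for condition in conditions[:3]:
    print(condition)
print(len(conditions), "conditions in total")
```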

To ensure scientific rigor, the real dataset was split into two parts. One part, called the test dataset, was kept aside and used exclusively for statistical testing. The other part, the training dataset, was provided to the AI according to the four informational stages described above (“Blind”, “Semi-informed”, “Guided”, and “Complete”).
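A minimal sketch of such a hold-out split is shown below. The 50/50 proportion, the fixed seed, and the record format are assumptions made for illustration; the article does not report the actual split ratio.

```python
import random


def split_responses(responses, test_fraction=0.5, seed=42):
    """Shuffle respondents and return (training, test) lists.

    test_fraction and seed are illustrative defaults, not values from the study.
    """
    shuffled = list(responses)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for real survey respondents.
real_responses = [{"respondent_id": i} for i in range(100)]
training, test = split_responses(real_responses)
print(len(training), len(test))
```

Only the training part would ever be pasted into a prompt; the test part stays untouched until the statistical comparison.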

Four different statistical tests were used to evaluate the results. Three of them are specific tests applied to individual questions, depending on their type. The fourth is a global test that considers the entire dataset, allowing for a multivariate evaluation of the simulation.

The question-specific tests are the following:

  • Chi-square test, combined with Cramér’s V as an effect size measure. This test was used for categorical questions to assess whether there were significant differences between AI-generated responses and those provided by real respondents. Cramér’s V was used to make results independent of the number of generated responses, as increasing n automatically affects p-values in the Chi-square test.
  • Mann–Whitney U test, combined with two effect size measures (r and Cliff’s δ). This test was selected for ordinal questions, such as Likert scales. It assesses whether AI-generated responses could plausibly come from the same population as the real responses. In simple terms, it checks whether simulated answers fall within a reasonable range of real ones.
  • Kolmogorov–Smirnov (KS) test, combined with the Wasserstein distance, used for discrete numerical questions. These measures compare entire distributions: the KS statistic captures the largest gap between the two cumulative distribution curves, and the Wasserstein distance captures the overall distance between them. If both are minimal, the two datasets can be considered equivalent, meaning that the AI is able to respond realistically.
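The three question-specific checks can be sketched with NumPy and SciPy as follows. This is a hedged illustration, not the study's code: the function names and toy data are hypothetical, r is obtained from an approximate conversion of the two-sided p-value into a z score, and Cliff's δ is computed directly from all pairwise comparisons.

```python
import numpy as np
from scipy import stats


def chi_square_cramers_v(real, simulated):
    """Categorical question: chi-square test of homogeneity plus Cramér's V."""
    real, simulated = list(real), list(simulated)
    categories = sorted(set(real) | set(simulated))
    # 2 x k table: one row per source, one column per answer option.
    table = np.array([[real.count(c) for c in categories],
                      [simulated.count(c) for c in categories]])
    result = stats.chi2_contingency(table, correction=False)
    chi2, p_value = result[0], result[1]
    # With two rows, min(rows, cols) - 1 = 1, so V reduces to sqrt(chi2 / N).
    v = float(np.sqrt(chi2 / table.sum()))
    return p_value, v


def mann_whitney_effects(real, simulated):
    """Ordinal question: Mann-Whitney U test with r and Cliff's delta."""
    real, simulated = np.asarray(real), np.asarray(simulated)
    n1, n2 = len(real), len(simulated)
    p_value = stats.mannwhitneyu(real, simulated, alternative="two-sided").pvalue
    # r = |z| / sqrt(N), with |z| recovered from the p-value (an approximation).
    r = float(stats.norm.isf(p_value / 2) / np.sqrt(n1 + n2))
    # Cliff's delta: P(real > simulated) - P(real < simulated) over all pairs.
    diff = real[:, None] - simulated[None, :]
    delta = float(((diff > 0).sum() - (diff < 0).sum()) / (n1 * n2))
    return p_value, r, delta


def ks_wasserstein(real, simulated):
    """Discrete numerical question: KS statistic, p-value, Wasserstein distance."""
    ks = stats.ks_2samp(real, simulated)
    distance = stats.wasserstein_distance(real, simulated)
    return ks.statistic, ks.pvalue, float(distance)


# Toy data only: simulated answers are centred like the real ones but less spread out.
rng = np.random.default_rng(0)
real_likert = rng.integers(1, 6, size=200)   # values 1..5
sim_likert = rng.integers(2, 5, size=200)    # values 2..4
print(mann_whitney_effects(real_likert, sim_likert))
print(ks_wasserstein(real_likert, sim_likert))
print(chi_square_cramers_v(["yes"] * 60 + ["no"] * 40, ["yes"] * 75 + ["no"] * 25))
```

Reporting the effect size next to each p-value matters because, with large n, even small differences between real and simulated answers produce small p-values.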

For the global evaluation, a two-sample classifier test based on the Area Under the Curve (AUC) was used. The classifier’s only task is to distinguish between real and simulated responses. If it fails to do so, the generated responses can be considered indistinguishable from real ones.
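A classifier two-sample test of this kind can be sketched with scikit-learn. The article does not name the classifier, so the random forest, the 5-fold cross-validation, and the toy data below are assumptions chosen for illustration. An AUC near 0.5 means the classifier does no better than random guessing, while an AUC near 1 means the two datasets are easy to tell apart.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def classifier_two_sample_auc(real, simulated, seed=0):
    """Cross-validated AUC of a classifier trained to tell real rows from simulated ones."""
    X = np.vstack([real, simulated])
    # Label 0 = real respondent, label 1 = simulated respondent.
    y = np.concatenate([np.zeros(len(real), dtype=int),
                        np.ones(len(simulated), dtype=int)])
    # Assumption: a random forest; any reasonable classifier could be used here.
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    return float(scores.mean())


# Toy data only: four numeric "questions" per respondent.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(300, 4))
same_source = rng.normal(0.0, 1.0, size=(300, 4))   # drawn like the real data
too_uniform = rng.normal(0.0, 0.3, size=(300, 4))   # same centre, much less spread

auc_same = classifier_two_sample_auc(real, same_source)
auc_uniform = classifier_two_sample_auc(real, too_uniform)
print(f"same source: {auc_same:.2f}, too uniform: {auc_uniform:.2f}")
```

The second toy dataset matches the real one in its averages but not in its spread, which is enough for the classifier to separate the two. This shows how a simulation can look acceptable question by question and still fail a multivariate test.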

Results are presented following the order of the tests.

Regarding the Chi-square test, four main findings emerged. First, the quality and quantity of information provided to the AI proved to be the most important factor. The worst results were observed in the “Blind” stage, where the AI had no access to real data.

Second, providing too much real data reduced response variability. The “Semi-informed” stage showed the best balance between accuracy and variability. The “Complete” stage achieved higher accuracy but lower variability, while the “Guided” stage suffered from excessive constraints, leading to overly uniform responses. This suggests that AI models require diverse sources to generate varied responses, as happens in real-world data.

The third observation concerns the value of n. Increasing the number of simulated observations stabilizes the results, but this effect becomes marginal beyond 400 simulations.

Finally, maintaining a consistent informational context, by using the same chat session, improved coherence and stability.

The Mann–Whitney U test produced similar results, leading to the same general conclusions: the model generalizes well, information quality is the key driver, and n mainly stabilizes results. However, additional insights emerged depending on question complexity. Simpler questions converged more quickly to acceptable results, while more complex ones required higher values of n and more advanced informational stages. For ranking-type questions, only the “Complete” stage produced strong results, while increasing n alone did not improve performance.

The KS test and the Wasserstein distance largely confirmed the findings of the previous tests. One additional insight is that increasing sample size has a stronger impact on questions with shorter scales. Unlike in the other tests, changes in informational context had a more limited effect here.

The global test, however, yielded negative results. The binary classifier was almost always able to distinguish between real and simulated data. An AUC of 1 indicates perfect classification and an AUC of 0.5 corresponds to random guessing, so acceptable results would require values close to 0.5. In this study, results were almost always above 0.85, regardless of the conditions used. Discrete numerical questions were the most discriminative variables. Removing them improved results but did not bring AUC values down to 0.5. Moreover, excluding these questions would significantly limit the practical applicability of the approach. From a multivariate perspective, the simulated dataset therefore fails the indistinguishability test.

The overall verdict of the research is negative: ChatGPT is currently not able to generate fully human-like survey responses using prompt engineering alone. The main issue lies in limited response variability, which makes simulated answers easy to distinguish from real ones. However, at a univariate level, many results pass statistical tests. This suggests that AI can be used to simulate responses to individual questions, although it is recommended to generate a limited number of responses, as they may still introduce noise into real data.

Despite the negative overall outcome, the study provides important methodological contributions to the state of the art. The main one is the development of a replicable workflow for testing LLM-based simulations. The framework accounts for multiple adjustable factors (sample size, information quality and quantity, and informational context) while also controlling for variability. It combines question-specific univariate tests with a global multivariate test, offering a multi-level evaluation approach. Additionally, the study proposes a first alternative to the costly synthetic persona method. This approach does not require large datasets and is therefore more suitable for SMEs, particularly in the Italian context, where this field is still underexplored.

From an empirical perspective, the study shows that LLMs are good at reproducing central tendencies but struggle to capture the full range of real preferences. Discrete numerical questions and intragroup variability represent the most critical weaknesses. These issues result in a loss of distribution extremes and overly similar responses. Overall, the AI tends to oversimplify answers, reducing realism driven by individual preferences. At present, this type of approach can be used to obtain a high-level market overview, but the generated data should not be treated as fully reliable.

The managerial implications are significant. The key takeaway is that LLMs cannot replace real samples. However, they can still serve as a valuable support tool, for example as a preliminary hypothesis generator, or a tool to support exploratory decision-making.

Finally, it is important to acknowledge the study’s limitations and outline future research directions. First, only one AI model—ChatGPT—was tested. Using other models (such as Gemini or DeepSeek) could lead to different, potentially better results due to their underlying architectures. Second, the analysis focused on a single market. Future research should test different markets and populations to assess the robustness of the findings. The final limitation concerns the exclusive use of prompt engineering, chosen to keep the approach accessible to SMEs. Hybrid models combining light fine-tuning with controlled variance in synthetic datasets could generate more realistic responses, although developing such systems would likely be beyond the direct reach of SMEs. That said, offering these solutions as low-cost services could still make them accessible and valuable.

In conclusion, artificial intelligence is not yet capable of replacing reality, but it can serve as an excellent strategic tool during the exploratory phases of research. The work presented here should not be viewed merely as a theoretical exercise; rather, it warrants close attention, as it is only a matter of time before LLMs are capable of generating highly realistic summaries.
