Get Reliable LLM Results: Make It Write Scripts, Not Answers

June 30, 2026 | llm

Large language models are statistical tools. Every response they produce is sampled from a probability distribution over tokens, which means the same prompt can yield different answers, subtle hallucinations can slip in without warning, and nothing in the model verifies that an answer is correct. When you ask an LLM to perform a task directly, for example, to analyse a log file, summarize a dataset, or calculate some numbers, you are trusting a black box that has no intrinsic notion of truth.

There is a better pattern. Instead of asking the LLM to do the work, ask it to write a script that does the work. You can then inspect, test, and verify the script before executing it. Once verified, the script produces the same output for the same input every time: no sampling, no drift, no surprises.

This note explains the pattern, walks through a concrete log-analysis example, and discusses when to use it and when not to.

LLMs Give Different Answers to the Same Question

An LLM’s output is generated token by token, where each token is drawn from a probability distribution conditioned on the preceding tokens and the prompt. This design has three practical consequences:

Non-determinism: The same input can produce different outputs across runs, even with the same model and temperature setting. Setting the temperature to zero reduces the variation but does not always eliminate it.
No built-in verification: The model does not check its own work. It can produce confidently written answers that are numerically wrong, logically inconsistent, or factually fabricated.
No persistent state: The model itself retains nothing between requests. Re-asking the same question in a new session starts fresh, with no memory of previous attempts or corrections.

These properties make LLMs unreliable for tasks that require precise, repeatable, auditable results. Direct analysis of data, such as log files, CSVs, and report numbers, is particularly risky because the output looks plausible even when it is wrong.
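To make the non-determinism concrete, here is a minimal sketch of temperature sampling over a single next token. The scores in `logits` and the helper `sample_token` are invented for illustration; a real model works over a vocabulary of many thousands of tokens, but the mechanism is the same in spirit.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Draw one token from a softmax over `logits` scaled by `temperature` (> 0)."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token scores for "The log contains ___ errors".
logits = {"2": 2.0, "3": 1.5, "two": 1.0, "several": 0.2}

rng = random.Random()  # unseeded on purpose: each run may print a different list
answers = [sample_token(logits, temperature=1.0, rng=rng) for _ in range(10)]
print(answers)
```

Run it a few times and the list changes. The most likely token usually wins, but "3" shows up often enough that a count produced this way cannot be trusted without checking.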

The Script Pattern: An Intermediate Layer of Determinism

The script pattern addresses these problems by introducing a verifiable intermediate artifact. Instead of asking the LLM to produce the answer, you ask it to produce a program that computes the answer.

The workflow looks like this:

Describe the requirement: Tell the LLM what data you have and what you want to extract or compute.
Request a script: Ask the LLM to write a script (Python, bash, awk, or whatever fits the task) that performs the analysis.
Inspect and verify: Read the script, check the logic, run it on a small test input, and confirm it does what you expect.
Execute: Run the script on the real data. The output is deterministic and auditable.
Iterate if needed: If the output is not quite right, describe the issue and let the LLM revise the script. Each revision is also inspectable.

The key insight is that scripts are verifiable. You can read the code, trace the logic, test edge cases, and confirm correctness before trusting the output. Once verified, the script will produce the same result every time it runs on the same data. An LLM response, by contrast, might change on the next invocation even if the data is identical.

Concrete Example: Analysing Application Logs

Imagine you have a web server log file (access.log) with entries like this:

192.168.1.10 - - [28/Jun/2026:14:23:01 +0000] "GET /api/users HTTP/1.1" 200 1234
192.168.1.20 - - [28/Jun/2026:14:23:05 +0000] "POST /api/orders HTTP/1.1" 500 52
192.168.1.10 - - [28/Jun/2026:14:23:08 +0000] "GET /api/users HTTP/1.1" 200 1234
192.168.1.30 - - [28/Jun/2026:14:23:12 +0000] "GET /api/products HTTP/1.1" 404 28
192.168.1.20 - - [28/Jun/2026:14:23:15 +0000] "POST /api/orders HTTP/1.1" 500 52

You want to know: which endpoints return the most 5xx errors?

The Direct Approach (Problematic)

You could paste the log data into the LLM prompt and ask:

Analyse this log. Which endpoints have the most 5xx errors?

The LLM might count the lines, identify the two 500 responses for /api/orders, and produce an answer that looks correct. But there are several failure modes:

A large log may not fit in the model’s context window, and even when it does fit, accuracy tends to degrade as the input grows.
The model might miscount similar-looking entries.
The model might hallucinate entries that are not in the data.
Every re-run risks a different count.

If the log file has thousands of lines, you would need to truncate or summarize, introducing further opportunities for error.

The Script Approach (Reliable)

Instead, ask the LLM to write a script:

I have an Nginx access log at /var/log/nginx/access.log. Each line is in standard combined format. Write a Python script that reads the log, counts how many requests returned a 5xx status code per request path, and prints the results sorted by count descending. Include only paths that have at least one 5xx.

The LLM will produce something like this:

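#!/usr/bin/env python3
"""Count 5xx responses per request path in an access log."""
import re
import sys
from collections import Counter

LOG_PATH = '/var/log/nginx/access.log'

# Fields: client IP, identity, user, [timestamp], "METHOD PATH PROTOCOL", status.
# Everything after the status (size, referer, user agent) is ignored.
# The captured path includes any query string.
log_pattern = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" (\d{3}) '
)

error_counter = Counter()
skipped = 0  # lines that do not match the expected format

with open(LOG_PATH, 'r', encoding='utf-8', errors='replace') as f:
    for line in f:
        match = log_pattern.match(line)
        if not match:
            skipped += 1
            continue
        path, status = match.groups()
        if status.startswith('5'):
            error_counter[path] += 1

# Only paths with at least one 5xx are in the counter; most errors first.
for path, count in error_counter.most_common():
    print(f"{path}: {count}")

# Report unparseable lines so they are not silently dropped from the count.
if skipped:
    print(f"Skipped {skipped} unparseable lines", file=sys.stderr)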

Now you can:

Read the script and confirm the regex captures the right fields and the 5xx filter is correct.
Test on a small sample by creating a test file with known entries and verifying the script’s output matches expectations.
Run on the real file and trust the result, because you have verified the logic.
Re-run anytime. The script always produces the same output for the same input.
Reuse and adapt. Tweak the script for different log formats, different status codes, or different aggregation criteria.
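The "test on a small sample" step can be as simple as a few assertions. The sketch below wraps the same kind of parsing logic in a function, here called `count_5xx_by_path` (a name chosen for this example), and checks it against the five sample lines from earlier plus a few edge cases. It assumes the 5xx count is keyed by path only, as the prompt requested.

```python
import re
from collections import Counter

# Captures the request path and the three-digit status code.
LOG_PATTERN = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" (\d{3}) ')

def count_5xx_by_path(lines):
    """Count 5xx responses per request path in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group(2).startswith("5"):
            counts[match.group(1)] += 1
    return counts

SAMPLE = [
    '192.168.1.10 - - [28/Jun/2026:14:23:01 +0000] "GET /api/users HTTP/1.1" 200 1234',
    '192.168.1.20 - - [28/Jun/2026:14:23:05 +0000] "POST /api/orders HTTP/1.1" 500 52',
    '192.168.1.10 - - [28/Jun/2026:14:23:08 +0000] "GET /api/users HTTP/1.1" 200 1234',
    '192.168.1.30 - - [28/Jun/2026:14:23:12 +0000] "GET /api/products HTTP/1.1" 404 28',
    '192.168.1.20 - - [28/Jun/2026:14:23:15 +0000] "POST /api/orders HTTP/1.1" 500 52',
]

# Known input, known answer: two 500s, both on /api/orders.
assert dict(count_5xx_by_path(SAMPLE)) == {"/api/orders": 2}

# Edge cases: empty input, garbage lines, and a 200 whose size happens to be 500.
assert dict(count_5xx_by_path([])) == {}
assert dict(count_5xx_by_path(["not a log line"])) == {}
size_500 = '10.0.0.1 - - [28/Jun/2026:14:24:00 +0000] "GET /a HTTP/1.1" 200 500'
assert dict(count_5xx_by_path([size_500])) == {}

print("all checks passed")
```

The last case is the kind of thing worth testing on purpose: a sloppier regex could match the response size instead of the status code, and the result would still look plausible.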

If the output is not quite what you wanted, you tell the LLM what to adjust: “Also include the total count of requests per path alongside the 5xx count.” The LLM revises the script, you inspect the diff, and you are back in business with an update you can review and test just like the original.
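A revision along those lines might look like the sketch below. The function name `error_summary` and the output wording are assumptions for illustration; the change from the first version is a second counter that tracks all requests per path.

```python
import re
from collections import Counter

# Captures the request path and the three-digit status code.
LOG_PATTERN = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" (\d{3}) ')

def error_summary(lines):
    """Return (path, 5xx count, total count) rows, most 5xx errors first."""
    totals, errors = Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        path, status = match.groups()
        totals[path] += 1
        if status.startswith("5"):
            errors[path] += 1
    # Only paths with at least one 5xx ever appear in `errors`.
    return [(path, count, totals[path]) for path, count in errors.most_common()]

sample = [
    '192.168.1.10 - - [28/Jun/2026:14:23:01 +0000] "GET /api/users HTTP/1.1" 200 1234',
    '192.168.1.20 - - [28/Jun/2026:14:23:05 +0000] "POST /api/orders HTTP/1.1" 500 52',
    '192.168.1.10 - - [28/Jun/2026:14:23:08 +0000] "GET /api/users HTTP/1.1" 200 1234',
    '192.168.1.30 - - [28/Jun/2026:14:23:12 +0000] "GET /api/products HTTP/1.1" 404 28',
    '192.168.1.20 - - [28/Jun/2026:14:23:15 +0000] "POST /api/orders HTTP/1.1" 500 52',
]

for path, errs, total in error_summary(sample):
    print(f"{path}: {errs} of {total} requests returned 5xx")
# prints "/api/orders: 2 of 2 requests returned 5xx"
```

Because the change is small and local, the diff against the first version is short enough to review line by line.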

Why This Pattern Works

The script pattern succeeds because it exploits the LLM’s strengths while avoiding its weaknesses.

LLMs are good at generating plausible code. Code generation leans heavily on reproducing familiar patterns, which is what these models do best. Log parsing is a common task, so the model has likely seen a great deal of similar code during training and can usually produce a reasonable script.

LLMs are bad at reliable computation. Arithmetic, counting, filtering, and aggregation require exact operations that are fundamentally at odds with a sampling-based architecture. Asking the model to perform these operations on your data is asking for trouble.

The script pattern bridges the gap: the LLM generates the logic, which you verify, and then a deterministic runtime (Python, bash, awk) executes the computation. You get the best of both worlds: the LLM’s fluency with code patterns and the computer’s exactness with arithmetic.

When to Use the Script Pattern

The script pattern is a good fit when:

The task involves counting, filtering, sorting, or aggregating data. Any operation where an off-by-one error or a hallucinated row would break the result.
You need auditable results. If someone asks “how did you get that number?”, you can point to the script and say “run it yourself.”
The data is large. A script can stream a file of almost any size line by line. An LLM context window is finite, and a multi-gigabyte log will not fit into one.
You will repeat the analysis. A script can be re-run daily on new data. An LLM prompt must be re-submitted each time, with no guarantee of identical results.
Iteration is expected. Each revision to a script produces a diff that can be reviewed. Each revision to a prompt produces a black-box response that must be re-checked from scratch.
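One way to strengthen the audit trail is to record a fingerprint of the input next to the results, so anyone re-running the script can confirm they are looking at the same data. This is an optional addition, not part of the script above; `file_sha256` is a hypothetical helper name. It reads the file in chunks, so it also works on logs too large to hold in memory.

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Printing `file_sha256('/var/log/nginx/access.log')` alongside the error counts gives a reviewer two things to check: the script and the exact input it ran on.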

The pattern is less useful when:

The task requires subjective judgment. Summarizing a conversation, evaluating tone, or generating creative content are tasks where correctness is fuzzy and there is no exact computation to put in a script.
You need an answer right now and the data is trivial. If you have three numbers to add, typing them into a calculator is faster than asking for a script.
The environment cannot execute scripts. Some interfaces (mobile chat, restricted environments) do not allow running code.

Summary

LLMs are powerful tools for generating code, but unreliable tools for performing computation. The script pattern exploits this asymmetry: ask the LLM to write a script that does the work, inspect the script for correctness, and let the runtime execute it deterministically. For counting, aggregation, and other data analysis where exact correctness matters, the script pattern produces more reliable, auditable, and reusable results than asking the LLM directly.