The LLM Said It Found a Match. It Was Lying.
At Swimm, we rely heavily on deterministic methods, as well as LLMs, to generate functional specs. To make sure those specs are reliable, we also need to validate them.
But here’s the catch: our validator uses an LLM too.
So we need to validate the validator. Which means we need to know when LLMs lie, hallucinate, or simply… forget things. Spoiler: they do all three, in surprisingly consistent ways.
This is the story of how I built a semantic equivalence checker for specs, watched it fail spectacularly, and learned exactly when and why LLMs can’t be trusted—by forcing them to be more explicit about their reasoning.
From this exploration, I hope you’ll learn about the pitfalls of relying on LLMs for critical tasks, and how to design around their limitations.
Note: the tests for this post were performed using GPT-4.1.
Setting the Stage
At Swimm, we generate functional specs for complex applications based on their code.
Here’s an extremely simple example:
The application allows users to create an account by providing their username, password, and email address.
Once the account is created, users can log in using their credentials.
The username must be unique and contain only alphanumeric characters.
The password must be at least 8 characters long and include a mix of letters, numbers, and special characters.
The email address must be in a valid format (e.g., user@example.com).
To generate such specs automatically, we rely on a combination of static analysis, symbolic methods, and large language models (LLMs). We lean heavily on static analysis and deterministic methods for accuracy, but we also use LLMs to generate human-readable text that captures business logic and domain-specific terminology.
This creates a tension:
- Using symbolic/deterministic methods only: Can’t generate truly clear specs, especially for non-technical users
- Using LLMs heavily: Non-deterministic, with output that varies from run to run
Reliability is crucial for us, as our users need to be able to trust the generated specs. We thus need a way to measure our reliability.
Perhaps we should start by asking: what does it mean for a spec to be reliable?
A spec is reliable if it correctly describes the functionality of an application. That is, it includes all the functionality of the application, and no functionality that isn’t there.
To measure reliability, we need to compare each generated spec against its corresponding known-correct spec.
“True Spec” vs “Generated Spec”
The idea is this: we take real applications and create “true” functional specs. We invest significant effort in creating these specs, at least partially by hand.
We then use our method to automatically generate a spec for the same application. As a result we get:
- generatedSpec(Program1)
- trueSpec(Program1)
And we can ask if these two specs are semantically equivalent (let’s use ≈ to denote this):
generatedSpec(Program1) ≈ trueSpec(Program1)?
We can repeat this for Program1 – generate the spec multiple times and compare each to the manually created spec.
We can also do it for multiple programs…
Given a dataset of programs with corresponding trueSpecs, we can run this automatically and get a reliability score. If we achieve a very high score (hopefully close to 100%) across many programs, we can say we’re reliable.
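As a sketch of what this evaluation loop might look like (the generateSpec and isSemanticEquivalent functions and the program/spec shapes here are hypothetical placeholders, not our actual API):

```typescript
// Hypothetical harness for computing a reliability score.
// generateSpec and isSemanticEquivalent stand in for the real pipeline pieces.
declare function generateSpec(programSource: string): Promise<string>;
declare function isSemanticEquivalent(specA: string, specB: string): Promise<boolean>;

async function reliabilityScore(
  programs: { source: string; trueSpec: string }[],
  runsPerProgram = 3,
): Promise<number> {
  let passed = 0;
  let total = 0;

  for (const { source, trueSpec } of programs) {
    // Generate the spec several times per program to account for LLM non-determinism.
    for (let run = 0; run < runsPerProgram; run++) {
      const generatedSpec = await generateSpec(source);
      if (await isSemanticEquivalent(generatedSpec, trueSpec)) {
        passed++;
      }
      total++;
    }
  }

  // Fraction of runs whose generated spec was judged equivalent to the true spec.
  return passed / total;
}
```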
But this entire approach hinges on one critical capability: reliably determining semantic equivalence.
Validating the Semantic Equivalence Checker
So how do we build this semantic equivalence checker? More importantly, how do we know it works correctly?
Task: Given two textual functional descriptions (specA, specB), return true if and only if specA and specB are semantically equivalent.
How do we know it’s reliable?
We create a dataset of many pairs: specA1-specB1, specA2-specB2, etc. Sometimes they’re semantically equivalent despite differences in wording, ordering, and phrasing. Other times they’re semantically different for various small reasons.
And we make sure we get the expected result.
Note that this is not a trivial task. I’ll focus on it throughout the rest of this post.
Let’s Start Simple
A first step is creating a small dataset of pairs. Starting from the simple example above and shortening it for clarity:
**specA1 (baseline):** The username must be unique and contain only alphanumeric characters. The password must be at least 8 characters long and include a mix of letters, numbers, and special characters. The email address must be in a valid format (e.g., user@example.com).

We can now make modifications to specA1 such that the resulting spec is still semantically equivalent:

**specB1 – Changing the ordering:** The email address must be in a valid format (e.g., user@example.com). The username must be unique and contain only alphanumeric characters. The password must be at least 8 characters long and include a mix of letters, numbers, and special characters.

**specB2 – Changing the wording:** The email address must be a valid email address. The user name must contain only alphanumeric characters, and must be unique. The password must be at least 8 characters long and include a mix of special characters, letters, numbers.

**specB3 – Breaking down a sentence:** The email address must be a valid email address. The user name must contain only alphanumeric characters. The user name must be unique. The password must be at least 8 characters long and include a mix of special characters, letters, numbers.

We expect our isSemanticEquivalent function to return true for all pairs: specA1-specB1, specA1-specB2, specA1-specB3, and even specB1-specB2.

We also need pairs that are semantically not equivalent:

**specC1 – Removing a requirement:** The username must be unique and contain only alphanumeric characters. The email address must be in a valid format (e.g., user@example.com).

**specC2 – Adding a requirement:** The username must be unique and contain only alphanumeric characters. The password must be at least 8 characters long and include a mix of letters, numbers, and special characters. The email address must be in a valid format (e.g., user@example.com). The email address must end with ‘.com’ or ‘.org’.

**specC3 – Changing a value (8 characters → 6 characters):** The username must be unique and contain only alphanumeric characters. The password must be at least 6 characters long and include a mix of letters, numbers, and special characters. The email address must be in a valid format (e.g., user@example.com).
We expect our function to return false for pairs like specA1-specC1, specA1-specC2, and specA1-specC3.
Attempt #1: Direct Comparison
Let’s build something!
Test Cases
We start by building a simple test suite:
interface SpecComparisonTestCase {
name: string;
specA: string;
specB: string;
expected: boolean;
description: string;
}
const TEST_CASES: SpecComparisonTestCase[] = [
{
name: 'equivalent-different-wording',
specA: `
- The system must validate user input before processing
- Users should receive email notifications when orders are completed
- The application must support user authentication
- Search results should be displayed in ascending order by relevance`,
specB: `
- Input validation is required before any processing occurs
- Email alerts must be sent to users upon order completion
- User login functionality must be implemented
- Search outputs should be sorted by relevance in ascending order`,
expected: true,
description: 'Same requirements with different wording',
},
// ... more test cases
]
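Before looking at the prompt itself, here is a rough sketch of how such a suite can be driven. The isSemanticEquivalent function is the one we are about to build, and the repeat count mirrors the three runs per case reported below:

```typescript
// Minimal runner sketch: each test case is checked several times,
// since the LLM may answer differently across runs.
declare function isSemanticEquivalent(specA: string, specB: string): Promise<boolean>;

async function runSuite(cases: SpecComparisonTestCase[], runs = 3) {
  let passed = 0;
  let failed = 0;

  for (const testCase of cases) {
    for (let i = 0; i < runs; i++) {
      const actual = await isSemanticEquivalent(testCase.specA, testCase.specB);
      if (actual === testCase.expected) {
        passed++;
      } else {
        failed++;
        console.log(`FAIL [run ${i + 1}] ${testCase.name}: expected ${testCase.expected}, got ${actual}`);
      }
    }
  }

  console.log(`${passed} passed, ${failed} failed out of ${passed + failed} checks`);
}
```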
The Prompt
For the first attempt, we use a straightforward prompt:
You are an expert product manager. Compare the following two product specifications
and determine if they are semantically equivalent.
Two specifications are semantically equivalent if they describe the same functional
requirements, even if they use different wording, ordering, or phrasing.
Return a JSON object: { equivalent: boolean, explanation: string }
Specification A:
${specA}
Specification B:
${specB}
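Wired up, the whole checker is little more than a single LLM call plus JSON parsing. Here is a minimal sketch, assuming a generic callLLM(prompt) helper as a stand-in for whatever client library you use:

```typescript
// Hypothetical LLM client: send a prompt, get the raw text completion back.
declare function callLLM(prompt: string): Promise<string>;

async function isSemanticEquivalent(specA: string, specB: string): Promise<boolean> {
  const prompt = `You are an expert product manager. Compare the following two product specifications
and determine if they are semantically equivalent.

Two specifications are semantically equivalent if they describe the same functional
requirements, even if they use different wording, ordering, or phrasing.

Return a JSON object: { equivalent: boolean, explanation: string }

Specification A:
${specA}

Specification B:
${specB}`;

  const raw = await callLLM(prompt);
  const result = JSON.parse(raw) as { equivalent: boolean; explanation: string };
  return result.equivalent;
}
```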
Results
On a dataset of 39 test cases run 3 times each (117 checks total):
- ✅ 114 passed
- ❌ 3 failed consistently
Not bad! 97.4% accuracy. But those 3 failures were interesting—and consistent.
All three failing cases had the same pattern: Spec A included an additional requirement that Spec B did not have, and the LLM missed it.
For example, in one case, Spec A included this requirement:
- **CC32, CC33**:
  - Apply only when months in force > 12
  - CC32: Apply tiered percentages from T123
  - CC33: Apply commission rate from T234
Spec B didn’t include this at all. Somehow, the LLM missed this difference.
Interestingly, when the specs were short (< 10 lines), the LLM never missed such differences. It only happened with long specs.
Attempt #2: Decomposition
If the LLM can’t reliably detect missing requirements in long specs, perhaps we can help it by breaking down the task.
I decided to split the comparison into two smaller tasks:
- Extract requirements: Ask the LLM to create a list of all requirements in Spec A, then in Spec B. Each requirement gets an ID.
- Cross-check requirements: For every requirement in Spec A, find its equivalent in Spec B. For every requirement in Spec B, find its equivalent in Spec A.
Step 1: Extracting Requirements
The prompt:
You are an expert product manager. Extract all distinct requirements, rules,
and constraints from the following product specification.
Respond with a JSON array of objects: { requirement: string }.
Be thorough and do not miss any requirements.
Spoiler alert: The LLM did a great job here. It extracted all requirements, even from long specs, reliably. Yes, even with this simple prompt.
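Here is a rough sketch of that extraction step, again assuming a hypothetical callLLM helper and with the spec simply appended to the prompt. One way to give each requirement an ID is to assign them locally after parsing, which is how the cross-checking step can reference them:

```typescript
declare function callLLM(prompt: string): Promise<string>;

interface ExtractedRequirement {
  id: number;
  requirement: string;
}

async function extractRequirements(spec: string): Promise<ExtractedRequirement[]> {
  const prompt = `You are an expert product manager. Extract all distinct requirements, rules,
and constraints from the following product specification.

Respond with a JSON array of objects: { requirement: string }.
Be thorough and do not miss any requirements.

Specification:
${spec}`;

  const raw = await callLLM(prompt);
  const items = JSON.parse(raw) as { requirement: string }[];

  // Assign stable IDs locally so the cross-checking step can reference them.
  return items.map((item, index) => ({ id: index + 1, requirement: item.requirement }));
}
```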
Step 2: Cross-Checking Requirements
We take the responses from step 1 and feed them into a prompt:
You are an expert product manager. For each requirement in the provided list
(with id), check if it is present (semantically equivalent) in the given
product specification.
For each, respond with: { id: number, explanation: string, matched: boolean }
Return a JSON array of these objects, one for each id provided.
**Semantic Equivalence Rules**: Requirements are semantically equivalent if...
[rules omitted for brevity]
I should note that the LLM would sometimes “forget” to mention some IDs, so we had to remind it: “You missed requirements #32, #44, please reply with them.”
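That reminder loop is mundane but necessary. Here is a rough sketch of how it can look, where checkRequirements is a hypothetical wrapper around the cross-checking prompt above:

```typescript
interface CrossCheckResult {
  id: number;
  explanation: string;
  matched: boolean;
}

// Hypothetical wrapper around the cross-checking prompt: given requirement IDs
// (and their text) plus the other spec, returns one result object per ID.
declare function checkRequirements(
  ids: number[],
  requirements: Map<number, string>,
  otherSpec: string,
): Promise<CrossCheckResult[]>;

async function crossCheckWithRetries(
  requirements: Map<number, string>,
  otherSpec: string,
  maxRetries = 2,
): Promise<CrossCheckResult[]> {
  const results = new Map<number, CrossCheckResult>();
  let pending = [...requirements.keys()];

  for (let attempt = 0; attempt <= maxRetries && pending.length > 0; attempt++) {
    // The first pass asks about everything; later passes are the
    // "you missed requirements #32, #44" reminders.
    const batch = await checkRequirements(pending, requirements, otherSpec);
    for (const result of batch) {
      results.set(result.id, result);
    }
    pending = pending.filter((id) => !results.has(id));
  }

  return [...results.values()];
}
```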
Results 🥁
On the same 39 test cases × 3 runs:
- ✅ 54 passed
- ❌ 63 failed
WOW, that’s a much worse result!
Suddenly, pairs that were correctly reported as “equivalent” in Attempt #1 were now reported as “not equivalent.”
Consider this explanation from the LLM:
Cross-checking requirements from B in A:
Requirement #1: The program receives a file named ‘INPUT.DAT’ as input.
Explanation: Specification A does not mention the input file must be named ‘INPUT.DAT’;
it only describes the input as a set of records.
The specific file name requirement is missing.
Matched: false
Whether the input filename is INPUT.DAT or not is debatable. Interestingly, before we asked the LLM to break down requirements and cross-compare, it didn’t consider this a semantic difference.
But here’s the truly problematic case:
Cross-checking requirements from A in B:
…
Requirement #42: CC32, CC33: Apply only after 12 months in force.
CC32: Apply tiered percentages from T123.
CC33: Apply commission rate from T234
Explanation: Specification B states CC32, CC33: Apply only after 12 months in force.
CC32: Use T123 tiered percentages. CC33: Use T234 commission rate.
This matches the requirement.
Matched: true
…
That’s almost outrageous! The LLM literally copied the requirement from Spec A and claimed it exists in Spec B, when it doesn’t. When cross-checking from B to A, this requirement doesn’t appear at all.
I double-checked my test case to make sure I hadn’t missed it. Nope! It was all LLM hallucination.
Attempt #3: Adding Structure
Since the requirement extraction worked well, we’ll keep it. But the cross-checking function needs work.
Consider the format we used:
Array<{
id: number;
explanation: string;
matched: boolean;
}>
The problems:
- Vague: You can write anything under “explanation”
- No evidence required: Just a boolean, with no pointer to the actual matching text
To solve problem #1, let’s add a verdict field with specific categories:
Array<{
  id: number;
  explanation: string;
  verdict:
    | 'completely missing from the other spec'
    | 'different values'
    | 'different logic'
    | 'explicitly stated in spec A, implicitly in spec B'
    | 'semantically equivalent';
  matched: boolean;
}>
This solved some of the issues that Attempt #2 created, but not all of them. The LLM still consistently claimed that Spec B included an equivalent requirement when it didn’t.
Attempt #4: Forcing Evidence
To finally solve this, I added two more fields—matchingIds and statementInOtherSpec:
Array<{
  id: number;
  matchingIds: number[];
  statementInOtherSpec: string;
  explanation: string;
  verdict:
    | 'completely missing from the other spec'
    | 'different values'
    | 'different logic'
    | 'explicitly stated in spec A, implicitly in spec B'
    | 'semantically equivalent';
  matched: boolean;
}>
I also restructured the response so the LLM had to generate matchingIds and statementInOtherSpec before the matched boolean. This meant the LLM had to find evidence first before declaring a match, rather than declaring a match and then justifying it.
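In practice this comes down to how the response format is spelled out to the model. The instruction below is an illustrative sketch of that idea, not our production prompt:

```typescript
// The field order in the instruction is deliberate: the evidence fields come
// before the verdict and the boolean, so the model must commit to them first.
const RESPONSE_FORMAT_INSTRUCTION = `
For each requirement, respond with a JSON object whose fields appear in exactly this order:
1. "id": the requirement id being checked
2. "matchingIds": ids of the requirements in the other spec that match (empty array if none)
3. "statementInOtherSpec": the exact statement from the other spec that matches, or "N/A"
4. "explanation": a short justification grounded in the statement above
5. "verdict": one of "completely missing from the other spec", "different values",
   "different logic", "explicitly stated in spec A, implicitly in spec B",
   "semantically equivalent"
6. "matched": true only if the evidence above supports a semantic match

Return a JSON array of these objects, one per requirement id provided.
`;
```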
The transformation was remarkable. Instead of:
Requirement #42: CC32, CC33: Apply only after 12 months in force…
Explanation: Specification B states CC32, CC33: Apply only after 12 months…
Matched: true
We now get:
Requirement #42: CC32, CC33: Apply only after 12 months in force…
Matching Ids: []
Statement in Other Spec: N/A
Explanation: This requirement is not present in Specification B.
Matched: false
The LLM no longer hallucinated matches.
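With per-requirement results this explicit, the final equivalence decision becomes a plain deterministic check: every requirement in A must match something in B, and vice versa. A minimal sketch, assuming the cross-check results from both directions are already in hand:

```typescript
// Shape of one cross-check result in the Attempt #4 format.
interface StructuredCheckResult {
  id: number;
  matchingIds: number[];
  statementInOtherSpec: string;
  explanation: string;
  verdict: string;
  matched: boolean;
}

// The specs are equivalent only if nothing is missing or mismatched in either direction.
function specsAreEquivalent(
  aCheckedAgainstB: StructuredCheckResult[],
  bCheckedAgainstA: StructuredCheckResult[],
): boolean {
  const allMatched = (results: StructuredCheckResult[]) =>
    results.every((r) => r.matched && r.matchingIds.length > 0);

  return allMatched(aCheckedAgainstB) && allMatched(bCheckedAgainstA);
}
```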
Stepping Back: When Did the LLM Lie? What Can It Teach Us?
Looking back at our four attempts, a pattern emerges about when and how LLMs fail.
The LLM didn’t technically lie—it optimized for the wrong objective.
This is not something I can prove, but rather a hunch. In Attempt #1, when asked “are these semantically equivalent?”, the LLM had maximum freedom to interpret the question. It missed differences in long specs not because it couldn’t see them, but because the task was too vague. It had an escape hatch: “close enough.”
In Attempt #2, something worse happened. By breaking down the task into extraction and matching, we introduced a new failure mode. The LLM could now “complete the pattern” from its training data: when asked if requirement X exists in Spec B, it would generate text that sounded like it found it—even fabricating the matching statement. This is classic hallucination territory: the LLM knows what a “yes, I found it” response should look like and delivers that pattern, facts be damned.
The breakthrough came in Attempts #3 and #4 when we forced explicitness:
- Adding specific verdict categories removed wiggle room
- Requiring matchingIds meant the LLM had to commit to specific requirements
- Demanding statementInOtherSpec forced it to ground its reasoning in actual text
- Ordering matters: Generating evidence before the verdict prevented post-hoc rationalization
The Key Insight: LLMs Don’t Lie When Cornered with Specificity
When the LLM could say “matched: true” without providing evidence, it happily did so. When forced to produce the actual matching text and IDs, it suddenly became honest: “Matching Ids: [], Statement in Other Spec: N/A.”
This teaches us something fundamental about working with LLMs:
- Vague outputs enable hallucination: Boolean flags and free-text explanations give LLMs room to pattern-match their way to plausible-sounding lies
- Structured outputs enforce honesty: Requiring specific, verifiable outputs (IDs, quoted text) creates accountability that can even be checked programmatically (see the sketch after this list)
- Long context is dangerous without structure: The LLM never missed requirements in short specs, only long ones—until we forced structured decomposition
- Order of reasoning matters: Making the LLM produce evidence before conclusions prevents rationalization
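That accountability check can be as simple as the sketch below. It is illustrative only (a fuzzier text match would likely be needed in practice): if the model claims a match, the cited IDs must exist among the extracted requirements, and the quoted statement must be traceable to the other spec.

```typescript
// Sanity checks on the model's own evidence. Field names follow the Attempt #4
// response format; the exact-substring check is a deliberate simplification.
interface CheckedRequirement {
  id: number;
  matchingIds: number[];
  statementInOtherSpec: string;
  matched: boolean;
}

function verifyEvidence(
  results: CheckedRequirement[],
  otherSpecRequirementIds: Set<number>,
  otherSpecText: string,
): string[] {
  const problems: string[] = [];

  for (const result of results) {
    if (!result.matched) continue;

    // Every cited ID must refer to a requirement that was actually extracted.
    for (const matchId of result.matchingIds) {
      if (!otherSpecRequirementIds.has(matchId)) {
        problems.push(`Requirement #${result.id} cites non-existent id ${matchId}`);
      }
    }

    // The quoted statement should be traceable to the other spec's text
    // (a fuzzy match would be more forgiving in practice).
    const quote = result.statementInOtherSpec.trim().toLowerCase();
    if (quote === 'n/a' || !otherSpecText.toLowerCase().includes(quote)) {
      problems.push(`Requirement #${result.id} quotes text not found in the other spec`);
    }
  }

  return problems;
}
```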
The Broader Lesson
The quality of your output is determined not just by your prompt, but by the shape of the response you demand. Make it easy to hallucinate, and hallucination is what you’ll get.
For anyone building reliability-critical systems with LLMs:
- Don’t ask for binary judgments without evidence
- Structure outputs to require grounding in verifiable facts
- Force the model to commit to specifics before making conclusions
- Test extensively, especially on longer, more complex inputs
At Swimm, this work now powers our production spec validation system. The semantic equivalence checker uses the structured approach from Attempt #4, validating thousands of generated specs with the reliability our users need.
And perhaps most importantly: we’ve learned that trusting an LLM doesn’t mean accepting its first answer. It means designing interfaces that make lying harder than telling the truth.
This work is part of our trust program for the Application Understanding Platform at Swimm, an ongoing effort to build reliable, trustworthy AI-generated documentation. If you’re interested in how we’re using LLMs to understand and document complex codebases, check out swimm.io.