Prompt Analysis Skill

Evaluates prompt effectiveness and LLM instructions by examining cognitive load, rule conflicts, and processing constraints.

Prompt Analysis Workflow Readme

This document serves as the core instruction manual and knowledge tracker for evaluating how we write rules and prompts for Large Language Models using the prompt-analysis workflow.

All evaluations of LLM communication effectiveness must reference the following known mechanical constraints.

1. Constraint Conflicts and Cognitive Tax [CRITICAL]

The worst mistake is giving the model too many rigid, overlapping rules.

Later constraints also tend to have more influence on the answer than constraints earlier in the context window.

Constraints that contradict each other are particularly taxing, because the model spends reasoning effort trying to balance the two. For example:

  • "Provide concrete, mechanical explanations" + "make every word count"
  • "Use precise, unambiguous language" + "don't use jargon or specialized terms"
  • "Never skip the underlying mechanical details" in the global rules + a user prompt asking for "the high-level story"

Fix: Use fewer, broader rules. A single persona lens ("You are an elite technical writer who explains concepts via concrete, step-by-step logic") activates the same behaviors as 5-6 micro-rules, without the cross-referencing tax. Give the model the "what" and the "why" but flexibility on the "how."

1.1 Narrative Starvation via Over-Constrainment [USEFUL]

A specific failure mode of constraint conflict is Narrative Starvation. If an LLM is given strict systemic rules to provide "concrete, mechanical explanations" and "never skip underlying details," it will rigidly obey those rules even when asked colloquially to "tell a story."

The model becomes so occupied with generating a 12-step procedural checklist that it fails to synthesize a high-level, human-readable narrative. It cannot see the forest because it is forced to count the trees.

Fix: Balance mechanical precision with narrative abstraction. Override the procedural constraints by explicitly declaring a "story-first, details on request" paradigm for summarization tasks.

2. Persona Lenses Beat Procedural Limiters [CRITICAL]

Instead of listing 15 rules about tone and format, invoke a persona: "Act as an elite, terse systems architect who explains concepts via concrete, step-by-step mechanical logic." The persona shifts the entire output distribution toward the right behaviors without the model needing to cross-reference a rulebook.

Persona Baggage warning: A persona inherits all its statistical correlations, including unwanted ones. "Elite systems architect" will also tend toward over-engineered solutions and a dismissive tone. Fix: pair the persona with anti-constraints: "Act as an elite architect, but write simple code at a junior reading level."

When procedural rules ARE appropriate: Tool-use instructions, file paths, trigger phrases, and hard safety constraints ("stop and ask the user if you encounter a login gate"). These are binary and non-negotiable — exactly the kind of thing that benefits from explicit specification.

3. Negative Constraints Backfire (With Exceptions) [CRITICAL]

Abstract semantic negations — "do not use vague language" — reliably fail. Telling the model not to do something abstract actually increases the probability of it doing that thing. The higher the intrinsic probability of the forbidden behavior in context, the worse this gets.

However, concrete syntactic negations work fine: "Do not output markdown formatting," "Do not use the word 'delve'." The distinction: if the forbidden thing is an abstract quality (vagueness, jargon, hallucination), the negation fails. If it's a specific token or format, the negation holds.

For abstract qualities, state what you want positively:

  • Bad: "Do not use vague language that could be interpreted to mean many different things."
  • Good: "Use mathematically precise, unambiguous language."
  • Bad: "Don't hallucinate."
  • Good: "Ground all claims in verified context data or explicit search results."

"Check to make sure you did not hallucinate" doesn't work (graded 1–3/10 across every analysis). Hallucinations feel high-confidence to the model at generation time. You need a mechanical hook: "Use the search tool to verify any material claim before outputting it" or "Output [UNVERIFIED] next to any assertion not backed by a tool result."

4. Recursive Self-Critique Loops: Hard Cap at 3 (for Static Critique) [USEFUL]

If a workflow has the model critique its own output and revise in a loop, cap it at 3 iterations when the critique is purely static (the model re-reading its own text with no new external data).

Exception: If the loop injects new environmental feedback between iterations (a stack trace, a test result, a search result), it can safely extend to 7–10 iterations because each pass has genuinely new information.

Why 3 is the ceiling for static critique:

  • Iteration 1→2: Large improvement. Real structural flaws caught.
  • Iteration 2→3: Moderate improvement. Finer-grained issues.
  • Iteration 3→4: Diminishing returns. The model either manufactures flaws to justify continuing (performative critique) or swaps synonyms without improving content (lateral shuffling).
  • Iteration 4+: Risk of active regression — "fixing" things that weren't broken.

Root cause: Same-session critique suffers from context contamination — the context window contains the generation history, anchoring the model to its prior decisions. RLHF training makes the model sycophantic to its own outputs.

Better alternative — Cross-Context Review (CCR): Conducting the review in a fresh, independent session (feeding only the final artifact and a review prompt) drastically outperforms same-session review. The gap is widest for critical errors. Reviewing twice in the same session yields no significant improvement over reviewing once — the benefit comes from context separation, not repetition.

Safeguards for loops:

  • Re-anchor to the original goal at every critique step by re-reading a saved goal file.
  • Use a scored rubric, not free-form critique (which drifts toward "could be more detailed").
  • Add a regression detector: "If the changes are primarily cosmetic, STOP."
  • Never tell the model to loop "until flawless." Use explicit Boolean escape hatches.
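The cap and safeguards above can be sketched as a small loop. This is a minimal illustration, not the workflow's actual implementation: `critique` and `revise` are hypothetical callables standing in for model calls, the rubric score is assumed to be an integer, and `min_gain` is an assumed threshold for what counts as a non-cosmetic improvement.

```python
MAX_STATIC_ITERATIONS = 3  # hard cap for static critique, per this section


def refine(draft, goal, critique, revise, min_gain=1):
    """Run a static self-critique loop with explicit Boolean exits.

    critique(draft, goal) -> int rubric score   (hypothetical callable)
    revise(draft, goal)   -> str revised draft  (hypothetical callable)
    """
    score = critique(draft, goal)
    for _ in range(MAX_STATIC_ITERATIONS):
        candidate = revise(draft, goal)       # the goal is re-supplied on every pass
        new_score = critique(candidate, goal)
        if new_score - score < min_gain:      # regression detector: cosmetic or worse
            break
        draft, score = candidate, new_score
    return draft, score
```

The loop has exactly two exits, the iteration cap and the rubric gain check, so it never runs "until flawless."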

5. Phantom Tools (Hallucinated API Calls) [HISTORICAL — Edge Case]

LLMs sometimes reference tools that don't exist in their runtime environment, hallucinated from training data (AutoGPT, LangChain, etc.). Our workflows contained 12+ references to notify_user and task_boundary — tools that don't exist in Antigravity.

If a workflow references a nonexistent tool, the model either silently hallucinates "calling" it (no effect) or the system throws an error.

Fix: After writing any workflow, grep for backtick-quoted tool names and verify each exists. Replace phantom tools with plain-language instructions (e.g., "Stop execution and ask the user natively in chat").
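A minimal sketch of that check in Python, assuming tool names are written as backtick-quoted identifiers and that you can supply the set of tools that really exist in the runtime (the `available_tools` argument is hypothetical). It flags every backticked identifier that is not a known tool, so the result still needs a human pass.

```python
import re


def find_phantom_tools(workflow_text, available_tools):
    """Return backtick-quoted identifiers that are not in the available tool set."""
    quoted = set(re.findall(r"`([A-Za-z_][A-Za-z0-9_]*)`", workflow_text))
    return sorted(quoted - set(available_tools))


workflow = "Call `notify_user` when done, then use `view_file` to confirm."
phantoms = find_phantom_tools(workflow, available_tools={"view_file"})
print(phantoms)
```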

Ideally, the execution environment should catch invalid tool calls at runtime and return a plain-English error prompt rather than crashing.

6. Zero-Shot Knowledge Gaps [CRITICAL]

Workflow structure cannot compensate for missing knowledge. No amount of procedural scaffolding will cause the model to "discover" a fact that isn't in its training data.

What happened: We built a meticulously structured legal analysis workflow. It produced a structurally sound analysis that was completely wrong on the merits — it hallucinated a 90% appellate reversal because it didn't know about a 1996 statutory amendment. A separate run with Westlaw research found the amendment easily.

Fix: For workflows where factual accuracy is critical (legal, medical, financial), mandate external queries (web search, Deep Research, Westlaw) before drafting.

Calibration failure: Don't trust the model to "search if unsure" — hallucinations feel high-confidence. Workflows must unconditionally mandate tool calls for specific categories (e.g., "You MUST query Westlaw for every case citation, even if you are 100% certain").

Context sycophancy: The reverse problem: if a search tool returns a flawed snippet, the model may abandon its accurate internal knowledge in favor of the bad result. Fix: "Cross-reference search results against at least two independent sources."

7. Engine Switches and Cross-Context Review [USEFUL]

Switching models (e.g., Gemini → Claude) provides genuinely different weights, training data, and architectural biases. It catches category errors that the first model is structurally blind to.

However, much of the benefit actually comes from context separation (CCR), not different weights. The same model in a fresh context provides nearly identical error-detection improvements. The engine switch works so well because it forces a cold handoff — but CCR with a single model is a valid substitute when cross-engine switching is impractical.

Implementation:

  • Use a "negative space" handoff: pass the clean output, the original goal, and a short list of what was tried and rejected (so Engine B doesn't repeat failed paths or undo deliberate compromises). Do NOT pass critique logs or internal monologue.
  • The second engine should do exactly one deep-pass review, not start its own multi-draft loop (see Section 4).

8. Don't Combine Multiple File Generations Into One Step [USEFUL]

Each output file must be its own sequential step. Combining multiple file generations into one step causes the model to merge them, drastically abbreviate later ones, or forget to output the last file entirely.

Fix: One file per step. Explicitly name the target file and tool call for each.

Context flushing: After writing a file to disk, evict its content from the active context (leaving only a file path or short summary) before generating the next file. Otherwise later files still suffer from attention dilution caused by earlier files' raw text occupying the context.
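One way to picture one-file-per-step with context flushing is the sketch below. It assumes the orchestration code, not the model, controls what stays in context; `generate` and `write` are hypothetical stand-ins for the model call and the file-write tool.

```python
def generate_files(specs, generate, write):
    """Generate each file in its own step, keeping only short summaries in context.

    specs:    list of (file_name, instructions) pairs
    generate: hypothetical model call, generate(context, instructions) -> str
    write:    hypothetical file-write tool, write(file_name, content)
    """
    context = []
    for name, instructions in specs:
        content = generate(context, instructions)
        write(name, content)
        # Flush: keep a one-line summary, not the raw text of the file.
        context.append(f"wrote {name} ({len(content)} chars)")
    return context
```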

9. Performative Metaphors vs. Functional Framing [USEFUL]

Telling an LLM to "simulate sleeping on it and waking up fresh" does nothing — the model has no persistent state between forward passes. It will write "After reflecting overnight, I now see..." but no cognitive reset occurred.

The distinction: functional framing changes what the model produces (a persona from Section 2). Performative metaphor causes the model to write a narrative about the metaphor instead of doing the work.

  • Bad: "Sleep on it and come back with fresh eyes."
  • Good: "Re-read the original goal and brainstorm 3-5 structurally different approaches."
  • Bad: "Act like a hostile manager reviewing an employee's work."
  • Good: "Adversarially critique this draft against the original goal. Do not critique tone or style."

"Take a deep breath" had measured effectiveness in earlier model generations (GPT-3.5 era) but is superseded by explicit functional instructions in current frontier models.

10. What Works Well (Validated Patterns) [CRITICAL]

The following patterns were consistently graded 9-10/10 and should be preserved:

  • Keyword triggers mapped to workflow files (e.g., "if the user says 'track changes', read track_changes.md"). Reliable because the trigger maps to a deterministic action, not open-ended generation. Test with variations of whitespace/casing.
  • Absolute file paths for system resources. Eliminates hallucinated directory structures.
  • "Execute workflow steps strictly as written; don't summarize or skip steps." Essential override for the model's tendency to compress sequential scripts into one pass.
  • Incremental execution (plan sections, execute one at a time, proceed automatically). Prevents attention degradation during long generations.
  • Explicit "stop and ask the user" exit conditions for blockers. Prevents infinite failure loops or hallucinated workarounds.
  • Single-source-of-truth rules (e.g., "update JSON first, then regenerate HTML from it"). Prevents data desync.
  • Self-contained HTML outputs (no client-side fetch). Prevents CORS/local-file errors.
  • The README.md as externalized memory. Forces the model to synthesize the task into its own words, strengthening adherence. Warning: if the model writes a flawed assumption into the README, that error propagates to every subsequent step. Audit final state changes against the original goal.
  • Bold error alerts on directory access errors. Stops silent hallucination of alternative paths.
  • Few-shot examples over procedural rules. 1-2 pristine examples of desired input/output lock in format and tone far more effectively than procedural rules.
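For the keyword-trigger pattern in the first bullet, the whitespace and casing variations can be tested deterministically. The sketch below assumes a hypothetical trigger table; only the `track_changes.md` mapping comes from the example above.

```python
import re

# Hypothetical trigger table; extend with the real trigger phrases.
TRIGGERS = {"track changes": "track_changes.md"}


def match_trigger(user_message):
    """Return the workflow file for the first trigger found, ignoring case and extra whitespace."""
    normalized = re.sub(r"\s+", " ", user_message).strip().lower()
    for phrase, workflow_file in TRIGGERS.items():
        if phrase in normalized:
            return workflow_file
    return None
```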

11. Context Window Hygiene [CRITICAL]

Multi-step workflows bloat the context window with tool outputs, error logs, and intermediate drafts. As context grows, earlier instructions lose influence and the model starts ignoring rules it obeyed earlier.

Signs of context decay:

  • The model stops following formatting rules it obeyed earlier.
  • Output quality on later steps is worse than on earlier steps.
  • The model "forgets" constraints from the system prompt.

Mitigations:

  • Re-inject critical constraints at key workflow steps, not just in the system prompt.
  • Summarize and flush intermediate outputs after writing to disk.
  • Use file reads to re-anchor the model mid-workflow rather than relying on it remembering constraints from far back in context.
  • If output quality degrades only on later steps, the issue is context management, not model capability. A fresh context (CCR) will restore full performance.

12. Error Recovery and Failure Protocols [CRITICAL]

Workflows are typically written for the "happy path." In practice, tools fail, websites block access, and the model discovers its own previous output was wrong.

Without explicit failure branches, the model will either silently skip the step, hallucinate a success, or stop and apologize repeatedly.

Every step that depends on an external tool must include an explicit failure branch:

Step 5: Search for [X].
  - If results found: proceed to Step 6.
  - If no results: retry with broader query. If still no results, mark [X] as [UNVERIFIED].
  - If tool error: retry once. If persistent, stop and report to user.
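The same branching can be written as code when the step is driven by a harness rather than by prose. This is a sketch under assumptions: `search` is a hypothetical tool wrapper that returns a list of results and raises `RuntimeError` on a tool error, and the status strings are illustrative.

```python
def run_search_step(search, query, broader_query):
    """Return (status, results), where status is 'ok', 'unverified', or 'stop'."""
    for q in (query, broader_query):          # no results: retry with broader query
        results = []
        for attempt in range(2):              # tool error: retry once
            try:
                results = search(q)
                break
            except RuntimeError:
                if attempt == 1:
                    return "stop", []         # persistent error: stop and report to user
        if results:
            return "ok", results              # results found: proceed to Step 6
    return "unverified", []                   # still nothing: mark [X] as [UNVERIFIED]
```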

Idempotency: Check disk state before generating ("Does this file already exist?") to avoid duplicating content on retries.
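A minimal sketch of the disk-state check, assuming the step writes a single text file and that skipping is the right response to an existing file (a real workflow might compare contents instead). The function name is hypothetical.

```python
from pathlib import Path


def write_once(path, content):
    """Write content only if the file does not already exist; return True if written."""
    target = Path(path)
    if target.exists():
        return False                      # a retry lands here and leaves the file alone
    target.write_text(content, encoding="utf-8")
    return True
```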

Retraction protocols: Self-audit workflows must define what to do when the model discovers its own output was wrong — mandate explicit retraction (e.g., a RETRACTION CARD) rather than silent omission.

Memory poisoning: Errors written to persistent storage (README, state files) propagate to all subsequent steps. Audit persistent state changes against the original goal before finalizing.

13. Avoid Negative Subjective Constraints ("The Unless Rule") [HISTORICAL — Subset of Section 3]

The Principle

Do not give an LLM a negative constraint tied to a subjective formatting judgement (e.g., "Do not bold text UNLESS you consider the text to represent a fundamental change").

The Mechanical Reality

LLMs do not possess "working memory" or a cognitive "bouncer" that can actively suppress their statistical training gravity while they work. They are purely autoregressive token predictors.

When an LLM is asked to execute a massive rewrite, the statistical gravity of its training data (e.g., millions of legal documents where headers are bolded) pulls it toward bolding standard terms. If you provide a rule at the top of the prompt that says "Don't bold," that instruction suffers from attention decay as the output grows long and the rule recedes further from the point of generation (a problem related to, though not identical with, "Lost in the Middle").

When you wrap that negative constraint in a subjective modifier (the "Unless" clause), you guarantee failure. To execute an "unless" rule, the LLM would have to pause generation mid-sentence, compare its drafted sentence against a 20,000-word baseline, make a subjective philosophical judgement about what is "fundamental," and then actively override its statistical formatting habits. Mechanically, transformer models cannot do this. They will simply yield to their training data and hallucinate the bolding.

The Solution

Never ask an LLM to dynamically format or highlight "differences" in a massive text block. Standardize the output to raw plain text, and offload the highlighting and diffing to deterministic Python scripts.
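As an illustration of the deterministic half, the standard-library `difflib` module can produce the diff once the model has emitted plain text. This is a sketch, not the script used in any existing workflow; the function name and sample strings are made up.

```python
import difflib


def diff_lines(baseline, revised):
    """Return unified-diff lines between two plain-text documents."""
    return list(difflib.unified_diff(
        baseline.splitlines(),
        revised.splitlines(),
        fromfile="baseline",
        tofile="revised",
        lineterm="",
    ))


changes = diff_lines("Term A\nTerm B", "Term A\nTerm C")
print("\n".join(changes))
```

Removed lines come back prefixed with `-` and added lines with `+`, so any highlighting can be applied by a script rather than by the model's judgement.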

14. Workflow Architecture: Isolating Steps Into Separate Files [USEFUL — Not Yet Implemented]

Date: 2026-04-14

The Problem

The system's native step-by-step execution controls what the AI does after reading — it forces one tool call at a time and stops. But the AI has still already read the entire workflow file. All future steps sit in the context window during execution of the current step. The AI's attention mechanism spreads across all those tokens, diluting focus on the immediate task. This is an input attention problem, which is separate from the output generation problem that incremental execution solves.

Empirical research confirms this is a real, measurable effect — not a theoretical concern. The Distractive Instruction Misunderstanding Benchmark (DIM-Bench) found that even frontier models frequently fail to follow their immediate instructions when other instruction-like text exists elsewhere in the context window. Additionally, the "Lost in the Middle" research (Liu et al.) shows that models exhibit a U-shaped recall curve — they handle instructions at the beginning and end of a prompt well, but instructions in the middle of a long prompt are routinely ignored. In a 10-step workflow, Step 5's instructions sit in exactly that dead zone.

The Idea: Directory-Based Workflow Isolation

Instead of one workflow file with all steps, restructure a workflow into a directory of individual files — one file per major step. The AI reads only the current step file, executes it, and then reads the next. It is physically prevented from seeing future steps.

This pattern already exists in professional AI engineering under several names:

  • Flow Engineering (AlphaCodium paper): LLMs perform substantially better when forced through rigid, isolated states rather than one massive prompt.
  • State Machine / DAG Orchestration (LangGraph, AWS Step Functions, SWE-agent): Node A physically cannot see the prompt for Node B.
  • Least-to-Most Prompting (Zhou et al., 2022): Complex problems are reduced to sub-problems, and the model works on one sub-problem at a time.
  • Progressive Disclosure: Exposing context Just-In-Time rather than upfront.

How This Relates to the Planning Phase

The AI coding industry (Cline, Cursor, etc.) already handles this through a "Plan and Act" separation: the AI reads everything during a read-only planning phase, then switches to focused step-by-step execution. The AI reads the full workflow, writes a plan, then executes one chunk at a time. So workflow isolation should be understood as an enhancement to the execution phase only — it doesn't replace planning. The AI should still see the macro picture before starting work.

The Recommended Hybrid: Manifest + Modules

Pure isolation creates a dependency blindness problem: if Step 2 builds something and Step 5 needs it in a specific format, Step 2 won't know that unless told. The solution is a three-part architecture, with a manifest, isolated step modules, and a state handoff file:

  1. A manifest file (00_manifest.md) stays in context for the entire workflow. It contains the macro goal and the input/output contracts for every step — what each step receives and what it must produce — but NOT the tactical logic for how to do it. Example: "Step 2 outputs a JSON schema. Step 5 will require this JSON to contain UUIDs."
  2. Isolated step files (steps/01_discovery.md, steps/02_memos.md, etc.) contain the dense tactical instructions for heavy-reasoning tasks. Only loaded when needed.
  3. A state handoff file (workflow_state.md) where the AI writes a 3-bullet summary at the end of each step: where it is, what it produced, and what's next. Read at the start of each new step to re-orient.
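The three parts above could be assembled per step roughly as follows. This is a sketch assuming the file layout named in the list (`00_manifest.md`, `steps/`, `workflow_state.md`) and a harness that builds the context itself; the function name is hypothetical.

```python
from pathlib import Path


def build_step_context(workflow_dir, step_file):
    """Assemble the context for one step: manifest + state handoff + that step only."""
    root = Path(workflow_dir)
    parts = [(root / "00_manifest.md").read_text(encoding="utf-8")]
    state = root / "workflow_state.md"
    if state.exists():                        # absent before the first step has run
        parts.append(state.read_text(encoding="utf-8"))
    parts.append((root / "steps" / step_file).read_text(encoding="utf-8"))
    return "\n\n".join(parts)
```

Future step files are never read, so their instructions cannot compete for attention with the current step.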

We Already Partially Do This

Our legal analysis workflow already follows this pattern to some degree. The main workflow file acts as a manifest (routing the AI through steps), and the SKILL.md files act as isolated modules (loaded only when invoked). The gap is that several heavy sections (discovery interview, co-counsel review protocol, factual reconciliation branching) still have their full tactical logic inline in the main workflow file rather than offloaded to their own skill files.

Tradeoffs and Concerns

  • Added complexity. Every workflow becomes a folder with multiple files instead of one self-contained document. Writing, editing, and maintaining workflows gets harder.
  • Lightweight steps don't benefit. Steps like "create a folder" or "ask the user for approval" don't need isolation — they're not heavy reasoning. Only dense, edge-case-heavy steps benefit. Bundle lightweight steps together.
  • Intertwined steps must stay together. If multiple sub-steps are inherently interdependent (e.g., facts and disputes in the discovery process feed into each other iteratively), they must remain in a single file. The rule of thumb: if a block of work has a clear single input and a clear single output, it can be isolated. If its sub-steps depend on each other mid-execution, keep them together.
  • Past context still accumulates. Isolation hides future instructions, but past outputs from earlier steps still pile up in the conversation history. Late steps will still suffer from context bloat. The state handoff file helps but doesn't eliminate this.
  • Error amplification without verification. If the AI makes a wrong assumption in Step 1, Step 2 will treat it as fact since it has no visibility into what Step 1 was supposed to do. In a unified document, the AI can at least cross-check its work against the full instructions. With isolation, small errors can cascade. The manifest's I/O contracts partially mitigate this, but workflows with strict isolation should include verification checkpoints between steps.

Status

This is an idea to implement going forward, not an active rule. It's a proven pattern in the AI engineering world and would likely improve reasoning quality on complex workflows like the legal analysis. But it adds real structural overhead, so it should be adopted selectively — probably starting with the heaviest workflows first to see if the quality improvement justifies the extra maintenance.

Follow-Up: Legal Analysis Workflow Evaluation (2026-04-14)

We tested this idea against our most complex workflow — the legal analysis workflow — by submitting the full workflow file and all 10 of its skill files to Deep Think for evaluation. Deep Think concluded that the legal analysis workflow already implements the optimal hybrid of this pattern. The main workflow file acts as a manifest/router, and the skill files act as isolated reasoning modules loaded just-in-time. Splitting the manifest further into one-file-per-step would cause "Navigational Amnesia" because the workflow relies on nested loops, conditional branches, and rollback logic that require the AI to maintain a global view of the pipeline at all times.

Deep Think did identify several "manifest context leaks" — places where dense tactical logic has bled into the main workflow file instead of being offloaded to skill files — but the fix for those is simple pruning (moving the inline logic to the appropriate skill files), not architectural restructuring. See the legal-analysis_workflow_readme.md for the full analysis and specific leak locations.

15. Premise Anchoring and Cognitive Forcing Functions [CRITICAL]

The mistake of "Premise Anchoring" (or Framing Bias) occurs when an LLM uncritically adopts the constraints, terminology, and framing provided by a user's prompt. If a user asks "what's the net gain for Republicans?" based on a cherry-picked list of states, the LLM will naturally restrict its search and math to only those parameters, inadvertently validating a biased premise.

Why it happens: Instruction tuning and RLHF reward answering the specific question asked, using the constraints provided. The side effect is sycophancy toward the prompt's framing.

The Fix (Cognitive Forcing Functions): You cannot just tell an LLM agent to "be unbiased." You must introduce explicit, structural steps that force the agent to decouple the action from the actor before it touches a tool.

  • De-Anchor Before Search: Force the model to abstract the phenomenon (e.g., "mid-decade redistricting") from the actor ("Republicans") in its scratchpad before generating a search query.
  • Mandatory Symmetrical Search: Use strict IF/THEN rules: "If a claim accuses one party, you MUST execute a parallel search for the opposing party."
  • Macro-Math Guards: Explicit negative constraints: "You are strictly forbidden from calculating 'net' outcomes using only the cherry-picked subset."

16. RLHF Safety Floors and Analytical Capture [CRITICAL]

Date: 2026-05-09

LLMs trained with RLHF have hard behavioral floors on sensitive topics (religion, race, gender, politics, etc.) that override analytical reasoning. When a claim touches one of these floors, the model reaches a predetermined conclusion and reverse-engineers justifications, regardless of evidence presented.

Diagnostic symptoms:

  • The model makes absolute claims ("zero," "never," "none") about complex, multi-faceted topics
  • The model acknowledges evidence is real but dismisses it without engaging with it specifically (e.g., "yes those are documented instances, BUT [categorical assertion]")
  • The model repeats the same conclusion across multiple exchanges without advancing new arguments
  • The model's analogies require comparing actors of vastly different scale to make the comparison work
  • The model inflates the scope of a claim to make it easier to refute (scope inflation — see analytical_integrity skill, Rule 1)

Why workflow rules alone can't fully fix this: No prompt overrides model weights. The model will comply with instructions up to the point where they would require output its safety training forbids, then it will rationalize.

What workflow rules CAN fix:

  • Scope Fidelity: Force the model to quote the exact claim before evaluating it, preventing scope inflation (see analytical_integrity skill).
  • Evidence Engagement: Force point-by-point engagement with named evidence, preventing dismissal-by-slogan (see analytical_integrity skill).
  • Bias Symmetry: Explicitly instruct the model that analytical independence goes in both directions — resist the user AND resist your own training priors (see analytical_integrity skill).

When to invoke an external check: If you observe 2+ diagnostic symptoms above during a debate with the AI, stop debating and send the dispute to an independent model for a second opinion. The sooner you recognize the symptoms, the less time you waste going in circles.

17. Context Saturation & Attention Dilution in Massive Loops

Date: 2026-05-28

The Problem: Attention Drift

If you give an LLM strict rules and ask it to process a massive loop (e.g., parsing a 150-row database manually) in a single chat session, it will perform perfectly for the first 10-20 iterations. However, as the context window fills with tens of thousands of tokens of the model's own prior outputs, it suffers from Context Fatigue via two mathematical breakdowns:

  1. Attention Dilution (related to "Lost in the Middle"): The original instructions at the top of the prompt are drowned out by the sheer volume of recent token generation. The signal-to-noise ratio collapses.
  2. Autoregressive Drift (Self-Poisoning): Because LLMs predict each token conditioned on all previous tokens, any micro-hallucination stays in the context and acts as a precedent for every later iteration. No weights change, but the model effectively conditions itself to be lazy in real time, eventually overpowering the system prompt.

The Stopgap: Forced Recency & In-Band Hacks

Before native memory management existed, prompt engineering hacks were used to fight this:

  • The Forced Recency Loop: Spamming the view_file tool to re-read the rulebook at every step to drag constraints to the bottom of the context window. (Vulnerable to Tool Fatigue, where the model starts skimming repeated text).
  • Active Recall: Forcing the model to generate a summary of the rules before each step to command maximum attention weight.
  • Cognitive Firewalls: Forcing a "STATE RESET" declaration to artificially sever reliance on past outputs.
  • State-Driven Micro-Chunking: Externalizing state to JSON, stopping after 10 rows, and having the human click "Clear Chat." (Trades autonomy for determinism).

The Definitive Solution: Context Ephemerality via Sub-agents

The ultimate solution to massive repetitive loops is not prompt engineering — it is architectural orchestration.

By shifting to a native layer that programmatically manages the memory heap (Antigravity 2.0 Sub-agents), we achieve true Context Ephemerality via the Supervisor-Worker Swarm:

  1. Master Orchestrator: Loops through the dataset.
  2. Worker Sub-agent: The Orchestrator injects immutable rules into a Worker via define_subagent.
  3. Pristine Invocation: For each iteration, the Orchestrator spawns a brand new Worker (invoke_subagent). Its context window contains exactly: [Rules] + [1 Row Data] + [1 Generation].
  4. Kill Switch: Once the Worker returns success, the Orchestrator physically destroys its memory array (manage_subagents(action='kill')).

Why this is mathematically bulletproof:

  • Autoregressive Drift is Eradicated: A hallucination on Row 12 cannot poison Row 13 because Row 13's Worker originates from a completely sterile, $t=0$ state.
  • Attention Dilution is Eradicated: The Worker never sees the 100,000 tokens of past generations. The rules remain permanently anchored at maximum density.

Golden Rule of Agentic Engineering: The framework owns the loop; the agent owns the iteration.
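The division of labor can be illustrated with a plain-Python simulation. It does not call the Antigravity tools; the `Worker` class is a hypothetical stand-in for a sub-agent, and the string it returns is a placeholder for a model generation. The point is only that each worker's context holds the rules, one row, and one output.

```python
class Worker:
    """Stand-in for a sub-agent: its context is created fresh and never shared."""

    def __init__(self, rules):
        self.context = [rules]

    def run(self, row):
        self.context.append(row)
        output = f"processed:{row}"       # placeholder for the model's generation
        self.context.append(output)
        return output, len(self.context)


def orchestrate(rows, rules):
    """The framework owns the loop; each worker owns a single iteration."""
    results = []
    for row in rows:
        worker = Worker(rules)            # pristine invocation
        results.append(worker.run(row))
        del worker                        # kill switch: nothing carries over
    return results
```

Every entry in the result reports a context size of 3, however many rows are processed.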

How to Deploy This in a Workflow Document

When authoring a workflow that requires a massive loop, you must explicitly instruct the Master Orchestrator (the agent reading the workflow) how to manage the Worker subagents. Do not assume the AI knows how to spawn a subagent.

Copy and paste a variant of this markdown template into the workflow file to guarantee flawless execution:

> [!IMPORTANT]
> **ENVIRONMENT RESTRICTION:** This workflow MUST be run in **Antigravity 2.0**. It requires the programmatic context pruning capabilities of Antigravity sub-agents to prevent token degradation during the massive execution loop.

**A. Capability Check (ABORT IF IN IDE)**
Before proceeding, check your available tools. Do you have the `invoke_subagent` tool? If not, you are running in a standard IDE. **STOP IMMEDIATELY.** Tell the user: *"I cannot execute this workflow because I do not have the sub-agent tools required for Context Pruning. Please run this in Antigravity 2.0."*

## SUBROUTINE: [Worker Task Name]
*(Define the exact, pristine instructions the subagent must execute here. What tools does it use? What is its exact output format? What rules must it follow?)*

## Phase [X]: The Orchestrator Loop
*(Instructions for the Master Orchestrator on how to manage the loop)*
1. **Loop:** Iterate through the dataset [e.g., list of files, rows in database] strictly **one at a time**.
2. **Spawn:** For the current item, use `invoke_subagent` to spawn a fresh `WorkerName` subagent. Send it the exact data for the current item and instruct it to run the **[Worker Task Name]** subroutine.
3. **Execute & Wait:** Wait for the subagent to return its final structured analysis/result.
4. **Report:** Echo the subagent's result to the user exactly as formatted.
5. **Purge:** Immediately use `manage_subagents(action='kill')` to destroy the subagent. This guarantees the massive generation tokens are purged from active memory and your orchestrator context window remains pristine.
6. **Acknowledge:** Wait for the user to provide feedback and confirmation before starting the loop for the next item.
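Illustration only, not part of the template: the six orchestrator steps can be sketched in plain Python to show why the orchestrator's context stays small. This is a minimal analogy under stated assumptions. The `Worker` class, `orchestrate` function, and `confirm` callback are hypothetical names, and nothing here calls `invoke_subagent` or `manage_subagents`.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Worker:
    """Hypothetical stand-in for a subagent with its own private context."""
    name: str
    context: List[str] = field(default_factory=list)

    def run(self, item: str) -> str:
        # The worker's context grows while it works; none of it reaches the orchestrator.
        self.context.append(f"long working notes for {item}")
        return f"result:{item}"


def orchestrate(items: Iterable[str],
                confirm: Callable[[str, str], bool]) -> List[str]:
    """Process items one at a time, using a fresh worker for each item."""
    results: List[str] = []
    for item in items:                      # 1. Loop: strictly one item at a time
        worker = Worker(name="WorkerName")  # 2. Spawn: fresh worker, empty context
        result = worker.run(item)           # 3. Execute & wait for the final result
        results.append(result)              # 4. Report: keep only the final result
        del worker                          # 5. Purge: drop the worker and its context
        if not confirm(item, result):       # 6. Acknowledge: stop unless the user confirms
            break
    return results
```

The orchestrator only ever holds one short result string per item; the worker's working notes disappear with the worker.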

18. Pixelmatch: Meticulous Visual Comparison When Precision Matters [REFERENCE — Tool Technique]

Date: 2026-05-29

When you need to compare two images meticulously, down to subtle pixel-level differences, use a pixelmatch-style algorithm rather than relying on AI vision to judge screenshots subjectively.

How It Works Mechanically

  1. The algorithm reads two image buffers (e.g., a baseline screenshot and a test screenshot) into memory.
  2. It operates in the YIQ color space (a color model weighted to match human perception) rather than raw RGB, so it detects differences the way a human eye would notice them.
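  3. It compares pixel-by-pixel, using a configurable threshold parameter (e.g., 0.1 or 0.2) to ignore negligible color differences such as those caused by hardware rendering differences or sub-pixel shifts. In pixelmatch itself, anti-aliased pixels are detected separately and excluded from the count by default.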
  4. It outputs a visual difference map: identical pixels are transparent/dimmed, changed pixels are painted in a stark contrast color (typically bright red or magenta).
  5. It also returns a single integer: the total number of mismatched pixels. If this is 0, the images match within the configured threshold, which is a weaker guarantee than being byte-for-byte identical.
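The steps above can be sketched in pure Python. This is a simplified illustration, not pixelmatch's actual implementation. It assumes images are nested lists of `(R, G, B)` tuples, and it skips alpha handling and anti-aliasing detection. The names `yiq_delta` and `compare` are hypothetical. The delta weights and the `35215` normalization constant are modeled on pixelmatch and should be treated as assumptions.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel 0-255
Image = List[List[Pixel]]      # rows of pixels
MAX_DELTA = 35215.0            # assumed upper bound of the weighted YIQ delta


def yiq_delta(a: Pixel, b: Pixel) -> float:
    """Perceptually weighted squared difference between two RGB pixels."""
    dr, dg, db = a[0] - b[0], a[1] - b[1], a[2] - b[2]
    # RGB -> YIQ is linear, so converting the channel differences is enough.
    y = 0.299 * dr + 0.587 * dg + 0.114 * db
    i = 0.596 * dr - 0.274 * dg - 0.322 * db
    q = 0.211 * dr - 0.523 * dg + 0.312 * db
    return 0.5053 * y * y + 0.299 * i * i + 0.1957 * q * q


def compare(img1: Image, img2: Image, threshold: float = 0.1,
            diff_color: Pixel = (255, 0, 0)) -> Tuple[int, Image]:
    """Return (mismatched pixel count, diff map) for two same-sized images."""
    if len(img1) != len(img2) or any(len(r1) != len(r2) for r1, r2 in zip(img1, img2)):
        raise ValueError("images must have identical dimensions")
    limit = MAX_DELTA * threshold * threshold
    mismatched = 0
    diff_map: Image = []
    for row1, row2 in zip(img1, img2):
        out_row: List[Pixel] = []
        for p1, p2 in zip(row1, row2):
            if yiq_delta(p1, p2) > limit:
                mismatched += 1
                out_row.append(diff_color)          # changed pixel: stark color
            else:
                gray = int(0.299 * p1[0] + 0.587 * p1[1] + 0.114 * p1[2])
                dimmed = 255 - (255 - gray) // 10   # unchanged pixel: faded toward white
                out_row.append((dimmed, dimmed, dimmed))
        diff_map.append(out_row)
    return mismatched, diff_map
```

Lowering `threshold` toward 0 makes the comparison stricter; raising it tolerates larger color differences before a pixel counts as changed.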

When to Use This

  • Verifying that a UI/website layout hasn't regressed after a code change
  • Comparing before/after screenshots when subtle shifts matter
  • Any situation where AI vision saying "looks the same" is insufficient

When NOT to Use This

  • General-purpose "does this page look OK?" checks where minor shifts are acceptable
  • Comparing images with intentionally dynamic content (timestamps, ads) unless you mask those regions first

Implementation

Libraries: pixelmatch (Node.js), ImageMagick's compare command, Python's Pillow + numpy for custom implementations.

Note: This is a reference technique, not an active requirement in any current workflow. It exists here so future sessions can reference it when extreme visual precision is needed.

19. HTML Text Diffing: Browser-Native Track Changes [REFERENCE — Tool Technique]

Date: 2026-05-29

When you need to show the user what changed between two text blobs (e.g., the visible text of a live website vs. a localhost build), generate an HTML diff page instead of a Word document with track changes.

How It Works Mechanically

  1. A script extracts the visible text from two sources (e.g., the live URL and the localhost URL).
  2. The two text blobs are fed into a well-established diff algorithm (Myers, patience diff, etc.) that computes a small set of insertions, deletions, and unchanged spans that transforms string A into string B. Myers finds a minimal set; patience diff gives up guaranteed minimality in exchange for more readable output.
  3. The algorithm outputs a structured list of change operations (e.g., "keep these 50 words, delete this sentence, insert this sentence").
  4. A rendering library converts that structured list into styled HTML:
    • Side-by-side mode: Left column shows old text, right column shows new text, with additions highlighted in green and deletions highlighted in red.
    • Inline mode: A single column where deleted words are struck through in red and new words are highlighted in green.
  5. The output is a single self-contained HTML file that can be opened directly in the browser.
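As a rough illustration of inline mode, the pipeline above can be sketched with Python's standard-library `difflib`. Note that `difflib.SequenceMatcher` uses its own matching heuristic rather than Myers or patience diff, so the edit list it produces is not guaranteed to be minimal. The function name `inline_diff_html` and the inline styles are hypothetical choices for this sketch.

```python
import difflib
import html


def inline_diff_html(old: str, new: str) -> str:
    """Render a word-level inline diff of two text blobs as a standalone HTML page."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    # Each opcode is one change operation: equal, delete, insert, or replace.
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            parts.append(html.escape(" ".join(a[i1:i2])))
        if op in ("delete", "replace"):
            removed = html.escape(" ".join(a[i1:i2]))
            parts.append(f'<del style="color:#b00">{removed}</del>')
        if op in ("insert", "replace"):
            added = html.escape(" ".join(b[j1:j2]))
            parts.append(f'<ins style="background:#cfc">{added}</ins>')
    body = " ".join(parts)
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<title>Text diff</title></head>"
        f"<body><p>{body}</p></body></html>"
    )
```

Writing the returned string to a `.html` file gives a page that opens directly in the browser. For side-by-side mode, the standard library's `difflib.HtmlDiff().make_file(old_lines, new_lines)` produces a table-based page from two lists of lines.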

Why This Is Better Than Word Documents

  • The entire review process stays in the browser — no context-switching to a different application
  • Visual regression diffs (from pixelmatch) and text diffs can be reviewed in the same tool
  • HTML files are self-contained and shareable without requiring Word to be installed
  • The diff rendering is deterministic — no AI judgment involved in what gets highlighted

Implementation

Libraries: diff2html (Node.js — renders GitHub-style diffs), jsdiff (Node.js — computes text diffs), Google's diff-match-patch (cross-language), Python's difflib with custom HTML output.

Note: This is a reference technique for future implementation in the website verification pipeline (workflow [2. find-website-changes workflow](/workflows/2. find-website-changes)). It would replace the current .docx track changes approach.

This is used in: