Why I Separated Memory from Reasoning in My Tax Advisory AI — and Why It Was the Right Call

Adarsh Godiyal — Sat, 06 Jun 2026 17:08:36 GMT

Most AI systems that touch financial data eventually fail the same way: the LLM hallucinates a number it was never given, and someone files the wrong return. I wanted to build something that simply could not do that, even if the prompt was ambiguous or the client history was thin. That constraint shaped every architectural decision in CAI — Chartered Accountant Intelligence.

What CAI Actually Does

Chartered Accountants in India manage a brutal cognitive load. A single CA often handles dozens of clients across multiple Assessment Years, each with their own salary slips, GST notices, deduction declarations, capital gains schedules, and scrutiny intimations from the Income Tax Department. Every new client interaction requires excavating old Form 16s, cross-referencing prior filings, and mentally reconstructing the client’s financial picture from scratch.

CAI is built to carry that context instead. It’s a multi-agent system where each agent is specialized: one routes intent, one synthesizes advisory responses, one parses uploaded PDFs, one tracks notice deadlines, one computes year-over-year deltas, and one flags anomalies. All of them are anchored to a persistent memory store powered by Vectorize Hindsight — a purpose-built agent memory system that provides vector-backed recall across sessions.

The frontend is a React/Vite SPA with a split-screen layout: on one side, a conversational advisory chat panel where CAs ask questions and upload documents; on the other, a Memory Audit View showing the indexed fact graph for the active client, including confidence scores and Assessment Year tags. The backend is FastAPI, and every agent execution is traced end-to-end in LangSmith.

Image: The CAI dashboard. Left: the advisory chat interface with multi-agent routing tags visible beneath each response. Right: the Memory Audit View showing client facts indexed by namespace, confidence score, and Assessment Year.

The Core Problem: Stateless LLMs in a Stateful Domain

Before committing to the architecture, I had to be precise about what “context” actually means in tax work. It’s not only conversation history; it’s structured financial facts that must survive across years, not merely within a session. When a CA asks “what was Ramesh’s effective tax rate in AY2023-24 compared to this year?”, the model needs to recall gross income figures, deductions claimed, and tax paid for two different Assessment Years and compare them without inventing either number.

Standard approaches fall apart here:

In-context stuffing: You could try including the full client history in the system prompt every time. For a three-year client with multiple documents, that’s thousands of tokens on every single query, most of which are irrelevant to the question at hand.
RAG over a general vector store: You get retrieval, but you lose namespace isolation. Client A’s tax history should never surface in a retrieval call for Client B. You also lose the structured metadata — Assessment Year, confidence, document source — that makes a retrieved fact trustworthy rather than just plausible.
Conversation memory APIs: These record what was said, not what was verified. A CA telling the model “Ramesh’s gross salary was ₹15 lakh” in one session and “₹14 lakh” in another creates a contradiction the model has no way to resolve.

What I needed was a dedicated memory layer that could store structured, verified facts, tag them with domain-specific metadata, filter recalls by namespace, and — critically — expose staleness so the model could flag uncertainty rather than paper over it.

That’s what Hindsight’s persistent agent memory architecture gives you.

The Architecture: Reasoning Layer and Memory Layer Are Decoupled

The single most important design decision in CAI is that the LLM has no direct access to the database. It only sees what Hindsight retrieves.

┌───────────────────────┐         ┌───────────────────────┐
│    REASONING LAYER    │         │     MEMORY LAYER      │
│   (Groq / Llama-3)    │         │ (Vectorize Hindsight) │
│                       │         │                       │
│ • Intent Routing      │◄───────►│ • tax_history         │
│ • Advisory Synthesis  │         │ • notices             │
│ • Anomaly Detection   │         │ • deductions          │
└───────────────────────┘         │ • income              │
                                  │ • preferences         │
                                  └───────────────────────┘

Image: Full system architecture. The FastAPI backend is the only component that talks to both layers. LLM agents never query the vector store directly; they receive retrieved context as plain text injected into their prompts.

Every client fact lives in Hindsight under a hierarchical namespace key:

client:{client_id}:{namespace}:{record_id}
# e.g. client:abcri1234d:tax_history:ay2024-25
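
To make the key layout concrete, here is a minimal sketch of a helper that builds and parses keys in this format. The `MemoryKey` class and its validation rules are hypothetical illustrations, not part of CAI or of Hindsight; only the `client:{client_id}:{namespace}:{record_id}` layout and the five namespace names come from this post.

```python
from dataclasses import dataclass

# The five namespaces listed in the architecture diagram.
NAMESPACES = {"tax_history", "notices", "deductions", "income", "preferences"}


@dataclass(frozen=True)
class MemoryKey:
    """Hypothetical helper for client:{client_id}:{namespace}:{record_id} keys."""

    client_id: str
    namespace: str
    record_id: str

    def __post_init__(self) -> None:
        if self.namespace not in NAMESPACES:
            raise ValueError(f"unknown namespace: {self.namespace!r}")
        for part in (self.client_id, self.record_id):
            # A colon inside a part would make the key ambiguous to parse.
            if not part or ":" in part:
                raise ValueError(f"invalid key part: {part!r}")

    def __str__(self) -> str:
        return f"client:{self.client_id}:{self.namespace}:{self.record_id}"

    @classmethod
    def parse(cls, key: str) -> "MemoryKey":
        parts = key.split(":")
        if len(parts) != 4 or parts[0] != "client":
            raise ValueError(f"malformed key: {key!r}")
        return cls(parts[1], parts[2], parts[3])


key = MemoryKey("abcri1234d", "tax_history", "ay2024-25")
print(key)  # client:abcri1234d:tax_history:ay2024-25
```

Rejecting unknown namespaces at construction time is one way to keep a typo from silently creating a new, unqueried bucket of facts.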

When a new Form 16 arrives, the Document Extraction Agent runs PyMuPDF over the PDF, applies compliance-specific regex patterns to pull out gross salary, TDS, and PAN, then writes the result to Hindsight via aretain():

python

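# From document.py, after regex extraction.
# Runs inside an async function; gross, tds, pan, ay, key, client_id,
# and the hindsight client are assumed to be defined by the surrounding code.
import json

fact_str = (
    f"Form 16 uploaded. Gross salary verified at ₹{gross:,}. "
    f"Total TDS: ₹{tds:,}. PAN: {pan}. AY: {ay}."
)
# Store the fact as a JSON payload under the client's tax_history namespace.
content = json.dumps({"key": key, "value": {"fact": fact_str}})
await hindsight.aretain(
    content=content,
    namespace=f"client:{client_id}:tax_history",
    tags=[client_id, "conf_95"],  # conf_95: fact taken from a verified upload
)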

The conf_95 tag signals retrieval quality downstream. If the Advisory Agent receives facts tagged with a lower confidence, or facts whose stored timestamp shows the memory is more than nine months old, it is instructed to include an explicit uncertainty warning in the response rather than state numbers with false authority.
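
The confidence and staleness checks can be sketched roughly as follows. This is an illustration under assumptions, not CAI's actual code: the function names, the 270-day approximation of nine months, and the threshold of 95 are all hypothetical, and the timestamp is assumed to arrive as a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

STALE_AFTER = timedelta(days=270)  # assumption: "nine months" as 270 days
MIN_CONFIDENCE = 95                # assumption: anything below conf_95 is "lower"


def confidence_from_tags(tags: List[str]) -> Optional[int]:
    """Return the score encoded in a conf_NN tag, or None if no such tag exists."""
    for tag in tags:
        if tag.startswith("conf_") and tag[5:].isdigit():
            return int(tag[5:])
    return None


def uncertainty_warnings(
    tags: List[str], stored_at: datetime, now: Optional[datetime] = None
) -> List[str]:
    """Collect the reasons a retrieved fact should carry a warning."""
    now = now or datetime.now(timezone.utc)
    warnings = []
    conf = confidence_from_tags(tags)
    if conf is None:
        warnings.append("no confidence tag")
    elif conf < MIN_CONFIDENCE:
        warnings.append(f"low confidence ({conf})")
    if now - stored_at > STALE_AFTER:
        warnings.append("memory older than nine months")
    return warnings
```

An empty list means the fact can be quoted as-is; anything else would be passed to the Advisory Agent so the warning appears alongside the number.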

How the Orchestrator Routes Intent

The Orchestrator Agent is the entry point for every query. It runs llama-3.3-70b-versatile at temperature zero, which keeps routing as close to deterministic as the model allows and leaves no room for creativity. Its only job is to read the query and a 500-character “mental model” snapshot of the client from Hindsight, then emit XML that conforms to a rigid routing schema.

python

# From orchestrator.py — the prompt schema instruction
"""
Output ONLY valid XML:

  tax_query|notice|anomaly|advisory|yoy|document|general
  memory,advisory
  high|normal|low
  tax_history,notices,deductions,income,preferences

"""

The mental model snapshot is important. Rather than making the Orchestrator completely blind to client context before routing, Hindsight provides a 500-character synthetic summary — enough to distinguish “this client has open GST notices” from “this client has capital gains — route to the YoY agent” without dumping the full history into every routing call.
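
For illustration, routing output of this kind could be validated with the standard library before anything downstream runs. The element names used here (`route`, `intent`, `agents`, `priority`, `context_needed`) are placeholders I made up for the sketch and should not be read as CAI's real schema; only the allowed values are taken from the prompt above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

INTENTS = {"tax_query", "notice", "anomaly", "advisory", "yoy", "document", "general"}
PRIORITIES = {"high", "normal", "low"}
NAMESPACES = {"tax_history", "notices", "deductions", "income", "preferences"}


@dataclass
class Route:
    intent: str
    agents: list
    priority: str
    context_needed: list


def _csv(text):
    """Split a comma-separated element body into trimmed, non-empty items."""
    return [part.strip() for part in (text or "").split(",") if part.strip()]


def parse_route(raw: str) -> Route:
    # Hypothetical tag names; malformed XML raises ET.ParseError here.
    root = ET.fromstring(raw.strip())
    intent = (root.findtext("intent") or "").strip()
    priority = (root.findtext("priority") or "").strip()
    context = _csv(root.findtext("context_needed"))
    if intent not in INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    unknown = set(context) - NAMESPACES
    if unknown:
        raise ValueError(f"unknown namespaces: {sorted(unknown)}")
    return Route(intent, _csv(root.findtext("agents")), priority, context)
```

Validating the values as well as the syntax means a well-formed reply with an unexpected intent fails as loudly as broken XML does.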

Once the XML is parsed, the backend fires concurrent arecall() calls for each required namespace:

python

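# From main.py: parallel namespace retrieval (runs inside an async handler)
import asyncio

# One recall per namespace the Orchestrator asked for.
recall_tasks = [
    hindsight.arecall(
        query=user_query,
        namespace=f"client:{client_id}:{ns}",
        top_k=5
    )
    for ns in context_needed_namespaces
]
# gather() runs the recalls concurrently and returns results in task order.
results = await asyncio.gather(*recall_tasks)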

This is where the vector-backed agent memory approach pays off concretely. The recall isn’t keyword search; it’s semantic. A query about “Section 80D health insurance deductions” can retrieve the right fact even if it was stored as “medical insurance premium for family, ₹25,000”, because the embeddings capture meaning rather than text overlap.

Zero-Hallucination Guardrails in the Advisory Agent

The Advisory Agent is the only one that generates the final text the CA sees. Its system prompt contains an explicit, hard rule:

If memory_context is empty for a given query, do not infer or estimate. State: “No verified data found for this client on this topic.”

This is not a soft guideline. The prompt is structured so that the model receives the retrieved memory blocks under a labeled [VERIFIED MEMORY] section and the current-session document extractions under [DOCUMENT CONTEXT]. It is explicitly instructed that numbers outside these two sections cannot appear in the response.
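
A rough sketch of how the two labeled sections might be assembled is shown below, together with a post-generation check that flags numbers absent from the supplied context. The section labels come from this post; the function names are hypothetical, and the number check is an extra safeguard I am adding for illustration, not something the post describes CAI as doing.

```python
import re

# Matches integers and decimals, including comma-grouped forms like 15,00,000.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def build_advisory_context(memory_facts, document_facts):
    """Render retrieved facts under the two labeled sections."""

    def section(label, facts):
        body = "\n".join(f"- {fact}" for fact in facts) if facts else "(empty)"
        return f"[{label}]\n{body}"

    return "\n\n".join(
        [
            section("VERIFIED MEMORY", memory_facts),
            section("DOCUMENT CONTEXT", document_facts),
        ]
    )


def numbers_in(text):
    """Return the set of numbers in text, with grouping commas removed."""
    return {match.group().replace(",", "") for match in _NUMBER.finditer(text)}


def unsupported_numbers(response, context):
    """List numbers that appear in the response but nowhere in the context."""
    return sorted(numbers_in(response) - numbers_in(context))


context = build_advisory_context(
    ["Gross salary verified at ₹15,00,000. AY: 2024-25."], []
)
draft = "Gross salary was ₹15,00,000 and TDS was ₹1,20,000."
print(unsupported_numbers(draft, context))  # ['120000']
```

The check is deliberately conservative: it would also flag a section number or year the model introduced on its own, which in a compliance setting is arguably the safer failure mode.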

In practice, this makes the system behave differently from a general-purpose chatbot in a very specific way: it will sometimes say “I don’t know” and that’s by design. A CA would rather see a clear “No prior data on Section 148 notice for this client” than a plausible-sounding but fabricated notice date.

The Anomaly Agent follows a similarly strict output discipline: it either emits FLAG: | SEVERITY: or a bare CLEAR. No narrative, no hedging. Because the pipeline terminates early on CLEAR, the full anomaly analysis branch is skipped, saving tokens and latency on clean datasets.
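
A two-shape output like this is simple to parse strictly. The sketch below assumes the flag line carries a free-text description after `FLAG:` and a single-word level after `SEVERITY:`; that layout, the `AnomalyResult` type, and the sample input are my assumptions for illustration.

```python
import re
from typing import NamedTuple, Optional


class AnomalyResult(NamedTuple):
    flagged: bool
    description: Optional[str] = None
    severity: Optional[str] = None


# Assumed layout: "FLAG: <description> | SEVERITY: <level>"
_FLAG = re.compile(r"^FLAG:\s*(?P<desc>.+?)\s*\|\s*SEVERITY:\s*(?P<sev>\S+)\s*$")


def parse_anomaly(output: str) -> AnomalyResult:
    text = output.strip()
    if text == "CLEAR":
        # Nothing to analyze further; the caller can skip the anomaly branch.
        return AnomalyResult(flagged=False)
    match = _FLAG.match(text)
    if match is None:
        # Anything other than the two allowed shapes is treated as an error.
        raise ValueError(f"unexpected anomaly output: {text!r}")
    return AnomalyResult(True, match["desc"], match["sev"].lower())


# Made-up sample input, for illustration only.
print(parse_anomaly("FLAG: TDS figure differs from prior filing | SEVERITY: HIGH"))
```

Raising on any third shape keeps a chatty or hedged model reply from being mistaken for a clean result.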

Lessons Learned

1. Memory is a schema problem before it’s an embeddings problem. The temptation with vector stores is to throw everything in and let retrieval figure it out. That approach collapses quickly when facts from different years or different clients start bleeding into each other. Designing the namespace schema (client:{id}:{type}:{record}) before writing a single retain() call saved a lot of pain.

2. Deterministic routing is worth the rigidity. Using temperature zero and a strict XML output format for the Orchestrator Agent means malformed routing output surfaces as a parse error rather than as a silent semantic confusion. When something breaks, you get a clear stack trace, not a mysteriously wrong downstream response.

3. Structured confidence metadata is what separates memory from history. The difference between a fact tagged conf_95 from a verified Form 16 upload and a fact inferred from a chat message is enormous in a compliance context. Hindsight’s tagging system makes this distinction first-class. The Advisory Agent can — and does — treat them differently.

4. Staleness detection should be a first-class output, not an afterthought. I added the nine-month timestamp check after realizing that a CA acting on a year-old TDS figure without knowing it was stale was arguably worse than getting no answer at all. Hindsight’s vector timestamps made this a two-line implementation rather than a separate data pipeline.

5. Exponential backoff at the LLM call site isn’t optional. With six agents potentially firing in one request, any one of them hitting a rate limit becomes a user-facing error without retry logic. The groq_call_with_retry wrapper (max three retries, 1/2/4-second backoff) turned what would have been intermittent 500s into transparent, self-healing pauses.
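
A retry wrapper with that schedule can be sketched in a few lines. This is not the actual groq_call_with_retry; the `RateLimitError` class is a stand-in for whatever exception the LLM client raises on a rate limit, and the injectable `sleep` parameter is an assumption added so the schedule can be tested without waiting.

```python
import asyncio


class RateLimitError(Exception):
    """Stand-in for the rate-limit exception raised by the LLM client."""


async def call_with_retry(make_call, max_retries=3, base_delay=1.0, sleep=asyncio.sleep):
    """Await make_call(), retrying on rate limits with 1s, 2s, 4s pauses."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted; let the caller see the error
            await sleep(base_delay * 2 ** attempt)


async def demo():
    attempts = {"n": 0}

    async def flaky():
        # Fails twice with a rate limit, then succeeds.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise RateLimitError()
        return "ok"

    async def no_wait(_seconds):
        return None

    return await call_with_retry(flaky, sleep=no_wait)


print(asyncio.run(demo()))  # ok
```

Only the rate-limit exception is retried here; other failures propagate immediately, since retrying a malformed request would just repeat the same error three more times.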

The broader principle CAI demonstrates is that persistent agent memory isn’t a nice-to-have for domains where facts have legal weight — it’s load-bearing infrastructure. Tax advisory, medical records, legal case management: any system that needs to be right about specific numbers across time and clients needs a memory layer with real structure, not just a longer context window.

The CAI source is on GitHub if you want to dig into the agent implementations, the Hindsight namespace schema, or the LangSmith tracing setup. The Hindsight documentation covers the retain() / recall() API surface in depth if you’re thinking about applying the same memory pattern to your own domain.
