The Architecture Behind Every Enterprise AI System

24 March 2026 · 11 min read

Most enterprise AI failures are not model failures. They are pipeline failures. The model is fine. The context fed to it is not. Understanding the architecture that governs how enterprise data becomes model context is the difference between a demo and a production system.

The diagram below maps that architecture precisely. Four layers. One feedback loop. Every reliable enterprise AI deployment, whether it runs on GPT, Claude, or Llama, sits on top of this structure or a close variant of it. The details change. The shape does not.

This article walks through each layer, explains what it does, and translates the engineering decisions into the business and operational consequences they produce.

Figure: Enterprise AI context pipeline, four-layer architecture. From raw enterprise data to model response, with a continuous feedback loop.

Source layer (where enterprise knowledge lives): structured data (DBs, warehouses); unstructured docs (PDFs, wikis, emails); real-time signals (events, APIs, streams); memory store (history, user state).
MCP, the Model Context Protocol (standardised connectivity between sources and pipeline): MCP servers (per source); tools + resources (callable actions); auth + audit trail (full traceability); policy sampling (compliant access).
Infrastructure layer (transform raw data into precise, policy-compliant context): retrieve + rank (RAG, vector search); filter + govern (PII removal, access control); compress (summarise, truncate); assemble (prompt templates). Context window allocation, governed at pipeline level: system prompt, retrieved context, user turn.
Application layer (run, evaluate, and improve the context pipeline over time): model router (GPT, Claude, Llama); eval + scoring (faithfulness, recall); observability (traces, cost, latency); feedback loop (RLHF, fine-tuning; improves sources via MCP).

Layer 1 — the source layer

The source layer is where enterprise knowledge actually lives. It is not a single database. It is a heterogeneous estate of four fundamentally different data types, each with different update frequencies, access patterns, and trust characteristics.

Structured data — databases, data warehouses, data marts — is the most queryable but the most constrained. Every record has a defined schema, a clear owner, and usually a governance policy attached. It is the easiest to connect, but the least flexible: you can retrieve exactly what you ask for, nothing more.

Unstructured documents — PDFs, wikis, email threads, SharePoint folders — contain the institutional memory that never makes it into a formal system. This is where the real knowledge lives in most organisations. It is also where the quality problems concentrate: outdated versions, contradictory guidance, uncontrolled access.

Real-time signals — events, API responses, streaming data — represent what is happening now. For AI systems making operational decisions — trade execution, fraud detection, customer service — the freshness of this signal determines the utility of the response. Stale context produces confident wrong answers.
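The freshness point can be made concrete with a small guard. This is a minimal sketch, assuming each signal carries an observation timestamp and each use case declares a maximum tolerated age; the `Signal` type, the `is_fresh` helper, and the thresholds are illustrative and not taken from any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Signal:
    """A real-time reading plus the moment it was observed (hypothetical shape)."""
    name: str
    value: float
    observed_at: datetime


# Illustrative tolerances per use case; real values depend on the business process.
MAX_AGE = {
    "fraud_detection": timedelta(seconds=5),
    "customer_service": timedelta(minutes=15),
}


def is_fresh(signal: Signal, use_case: str, now: datetime) -> bool:
    """Return True when the signal is recent enough for the given use case."""
    return now - signal.observed_at <= MAX_AGE[use_case]


now = datetime(2026, 3, 24, 12, 0, 0, tzinfo=timezone.utc)
balance = Signal("account_balance", 1200.0, now - timedelta(seconds=30))
```

The same 30-second-old reading is acceptable for a customer service answer and unusable for a fraud decision, which is why the tolerance belongs to the use case rather than to the source.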

The memory store (conversation history, user state, session context) is what gives AI systems the ability to be genuinely useful over time rather than starting from scratch on every query. It is the least mature of the four source types in most enterprise deployments and the one that requires the most careful design around retention, privacy, and access control.

The quality of your AI system's output is bounded by the quality of the data in this layer. No amount of infrastructure engineering rescues a source layer that is corrupted, stale, or ungoverned.

Layer 1 — Source

What can go wrong here

Duplicate records across systems. Conflicting versions of the same document. Beneficial ownership linkages missing from corporate master data. Identity fields that have never been remediated. These are not AI problems — they are data quality problems that the AI system will faithfully propagate into every response it generates.
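One of these problems, conflicting versions of the same document, is cheap to detect before it reaches the pipeline. The sketch below assumes each document already carries a shared logical identifier (`doc_id`) across systems, which is often the hard part in practice; the field names and sample data are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_conflicts(documents):
    """Group documents by logical ID and report IDs whose copies differ in content."""
    hashes_by_id = defaultdict(set)
    for doc in documents:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        hashes_by_id[doc["doc_id"]].add(digest)
    return sorted(doc_id for doc_id, hashes in hashes_by_id.items() if len(hashes) > 1)


docs = [
    {"doc_id": "kyc-policy", "system": "wiki", "text": "Review every 12 months."},
    {"doc_id": "kyc-policy", "system": "sharepoint", "text": "Review every 24 months."},
    {"doc_id": "fx-limits", "system": "wiki", "text": "Limit is 5m."},
    {"doc_id": "fx-limits", "system": "sharepoint", "text": "Limit is 5m."},
]

conflicts = find_conflicts(docs)
```

Here only `kyc-policy` is flagged, because its two copies disagree. Exact hashing catches divergent copies but not paraphrased contradictions, which need human or semantic review.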


Layer 2 — MCP, the connectivity standard

The Model Context Protocol is the layer most organisations skip or build ad hoc — and then spend years retrofitting. Its purpose is deceptively simple: standardise how every data source connects to the AI pipeline so that connectivity, authentication, and audit are not reinvented per source.

Without MCP or an equivalent governed connectivity layer, every source-to-pipeline connection becomes a bespoke integration. You end up with twelve different authentication patterns, no consistent audit trail, and a pipeline that is effectively ungovernable at scale. That is fine for a prototype. It is a regulatory event waiting to happen in a Tier 1 financial institution.

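MCP, as deployed in this architecture, provides four things. MCP servers per source: each data source gets its own server that handles the details of connection, formatting, and refresh. Tools and resources: the callable actions the AI system can invoke on each source, defined explicitly rather than discovered implicitly. Auth and audit trail: every data retrieval is authenticated and logged, which is the prerequisite for any regulatory environment. Policy-compliant sampling: access control is enforced at retrieval time, not as a post-hoc filter. One caveat on terminology: the published MCP specification uses "sampling" for a different feature (a server asking the client for a model completion) and does not by itself guarantee an audit trail, so the logging and policy enforcement described here have to be built into each server.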
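To show how those four responsibilities fit together, here is a toy in-process stand-in for a per-source server. It deliberately does not use the real MCP SDK or wire protocol; `SourceServer`, its methods, and the role names are all hypothetical, and the only point is the shape: tools are declared explicitly, access is checked before retrieval, and every attempt is logged.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class SourceServer:
    """Toy stand-in for a per-source server: explicit tools, access check, audit log."""
    source: str
    tools: dict[str, Callable[..., object]] = field(default_factory=dict)
    allowed_roles: dict[str, set[str]] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def register(self, name: str, roles: set[str], fn: Callable[..., object]) -> None:
        """Declare a callable tool and the roles permitted to invoke it."""
        self.tools[name] = fn
        self.allowed_roles[name] = roles

    def call(self, user: str, role: str, tool: str, **args: object) -> object:
        """Check access before retrieval and log the attempt either way."""
        allowed = tool in self.tools and role in self.allowed_roles[tool]
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "source": self.source, "user": user, "role": role,
            "tool": tool, "args": args, "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{role} may not call {tool} on {self.source}")
        return self.tools[tool](**args)


crm = SourceServer("crm")
crm.register("get_customer", {"relationship_manager"}, lambda customer_id: {"id": customer_id})
```

Calling `crm.call("asha", "relationship_manager", "get_customer", customer_id="C1")` returns the record and writes an allowed entry; the same call with an unauthorised role raises `PermissionError` and still writes a denied entry. Logging denials as well as successes is a design choice that tends to matter in audits.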

Layer 2 — MCP

Why this layer matters for regulated industries

In banking, healthcare, and any environment subject to data governance regulation, the audit trail is not optional. An AI system that retrieves customer data without a retrievable log of what was accessed, when, by whom, and under what policy authorisation is not a compliant system — regardless of how accurate its responses are. MCP creates that trail as a first-class output, not an afterthought.


Layer 3 — the infrastructure layer

This is where raw data becomes usable context. The infrastructure layer is a four-step processing pipeline that runs from retrieval through to prompt assembly. Each step is a decision point with real consequences for response quality, cost, latency, and compliance.

Retrieve and rank — RAG

The retrieve-and-rank step is where Retrieval-Augmented Generation lives. Vector embeddings convert source documents into numerical representations that capture semantic meaning. When a query arrives, the system finds the most semantically similar chunks from the vector store and ranks them by relevance. The model then generates its response from those chunks, not from training memory alone.
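The ranking step reduces to a similarity search. The sketch below uses cosine similarity over hand-written three-dimensional vectors purely for illustration; a real system would obtain embeddings from a model and query a vector store rather than a Python list, and the chunk names are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Rank stored chunks by similarity to the query and return the k best IDs."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Hand-written toy "embeddings"; a real system would get these from an embedding model.
store = [
    ("aml-policy", [0.9, 0.1, 0.0]),
    ("holiday-rota", [0.0, 0.2, 0.9]),
    ("kyc-checklist", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]

best = top_k(query, store)
```

With these vectors the two compliance chunks rank ahead of the unrelated rota, which is the behaviour the retrieve-and-rank step relies on.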

The practical implication: the quality of your embeddings and the granularity of your chunking strategy determine what the model can and cannot find. Poor chunking splits concepts across boundaries. Poor embedding models fail to capture domain-specific terminology — a problem that is acute in specialised fields like medicine, law, and banking, where the vocabulary diverges sharply from general training corpora.
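A common mitigation for concepts split across chunk boundaries is to overlap adjacent chunks. This is a minimal sketch of word-window chunking with overlap; the window and overlap sizes are arbitrary, and production systems often chunk on sentence or section boundaries instead.

```python
def chunk_words(text, size=8, overlap=3):
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(1, 13))  # twelve placeholder words
chunks = chunk_words(sample)
```

Because the last three words of one chunk reappear at the start of the next, a sentence that straddles the boundary is still retrievable as a whole from at least one chunk, at the price of some duplicated storage.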

Filter and govern — the compliance gate

Before retrieved content reaches the model, it passes through a filter layer that applies PII removal, access control checks, and policy enforcement. This is where the business rules live — the same business rules that, in core banking migration projects, often exist only in senior analysts' memories and surface as migration failures when the new platform asks why things are done a certain way.

In practice, this step is where most enterprise AI governance work concentrates. Every exception, every data classification, every access control rule has to be codified here. The quality of this layer is directly proportional to the institutional knowledge invested in building it.
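Codifying the rules can start very small. The sketch below assumes each retrieved chunk carries a classification label and that the caller's clearance is known; the clearance ordering and the single email pattern are illustrative only, and real PII removal needs far broader coverage than one regular expression.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Illustrative clearance ordering; real classifications come from the governance policy.
LEVELS = {"public": 0, "internal": 1, "restricted": 2}


def govern(chunks, user_clearance):
    """Drop chunks above the user's clearance, then mask email addresses in the rest."""
    allowed = [c for c in chunks if LEVELS[c["classification"]] <= LEVELS[user_clearance]]
    return [EMAIL.sub("[EMAIL]", c["text"]) for c in allowed]


chunks = [
    {"text": "Escalate to jane.doe@example.com within 24h.", "classification": "internal"},
    {"text": "Board minutes: merger terms.", "classification": "restricted"},
]

visible = govern(chunks, "internal")
```

A user with internal clearance sees only the first chunk, with the address masked. Applying the access check before redaction means restricted text is never processed further for a user who should not see it.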

Compress — fitting the context window

Context windows are finite. The compress step takes filtered content and reduces it to what actually fits — through summarisation, truncation, deduplication, and relevance scoring. The decisions made here directly affect what the model can reason about. Material that is compressed out of the context window is material the model cannot reference.
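The simplest form of this step is deduplication plus a greedy fill against a token budget. The sketch assumes chunks arrive already ranked by relevance and approximates tokens as whitespace-separated words, which is a crude stand-in for a real tokenizer.

```python
def compress(ranked_chunks, budget_tokens):
    """Keep chunks in rank order, skipping repeats and anything that would overflow the budget."""
    kept, seen, used = [], set(), 0
    for text in ranked_chunks:
        key = " ".join(text.lower().split())  # normalise case and whitespace
        cost = len(text.split())  # crude token estimate: one token per word
        if key in seen or used + cost > budget_tokens:
            continue
        seen.add(key)
        kept.append(text)
        used += cost
    return kept


ranked = [
    "Limit is 5m per day.",
    "limit is 5m  per day.",
    "Approval needs two signatories.",
    "See appendix.",
]

kept = compress(ranked, budget_tokens=8)
```

With a budget of eight, the duplicate is dropped, the approval rule does not fit, and the short appendix pointer does. That outcome is the point of the paragraph above: the approval rule was relevant, but the model will never see it.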

Assemble — the prompt template

The final infrastructure step assembles the context into a prompt template. This is where the context window is divided into its three functional zones: system prompt, retrieved context, and user turn.

System prompt: persona, policies, instructions, constraints
Retrieved context: RAG results, relevant documents, memory
User turn: the actual query from the end user

The allocation of context window budget across these three zones — governed at pipeline level, not by the model — is one of the most consequential architectural decisions in an enterprise AI deployment. Too much system prompt leaves no room for retrieved context. Too little leaves the model without constraints. The balance determines whether the system behaves correctly at the edges.
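The budget arithmetic is simple enough to enforce in code. This sketch treats the system prompt and user turn as fixed costs and gives whatever remains to retrieved context. The reserve for the model's answer is an added assumption beyond the three zones named here (many models count output tokens against the same window), and all the numbers are illustrative.

```python
def allocate(window_tokens, system_tokens, user_tokens, reserve_for_answer):
    """Return the token budget left for retrieved context after fixed zones are paid for."""
    remaining = window_tokens - system_tokens - user_tokens - reserve_for_answer
    if remaining <= 0:
        raise ValueError("no room left for retrieved context")
    return {
        "system_prompt": system_tokens,
        "retrieved_context": remaining,
        "user_turn": user_tokens,
        "answer_reserve": reserve_for_answer,  # assumption: output shares the window
    }


budget = allocate(window_tokens=8000, system_tokens=1200, user_tokens=300, reserve_for_answer=1000)
```

Here 5,500 tokens are left for retrieved context. Failing loudly when nothing is left is preferable to silently sending a prompt with no evidence in it.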

Layer 3 — Infrastructure

The four processing stages and their failure modes

Retrieve + rank fails when embeddings miss domain-specific semantics or chunking splits concepts across boundaries. Filter + govern fails when business rules are incomplete or inconsistently applied. Compress fails when summarisation loses critical detail or truncation removes the most relevant content. Assemble fails when context window allocation leaves the model without enough retrieved material to answer accurately.


Layer 4 — the application layer

The application layer is where the pipeline meets operational reality — and where most organisations stop investing too early. It has four components, and the fourth is the one that makes the entire system get better over time rather than degrade.

Model router — not every query requires the most capable model. A well-designed model router sends simple queries to fast, cheap models and complex or sensitive queries to more capable ones. This is both a cost optimisation and a latency optimisation. In high-volume enterprise deployments, the routing logic is as important as the model selection itself.
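Routing logic can begin as a handful of explicit rules. The sketch below is deliberately naive: the keyword list, the length threshold, and the tier names `fast-model` and `capable-model` are placeholders, and a production router would more likely use a classifier plus policy metadata.

```python
SENSITIVE_TERMS = {"sanctions", "fraud", "complaint"}  # illustrative keyword list


def route(query: str) -> str:
    """Pick a model tier from query sensitivity and length (deliberately simple rules)."""
    words = query.lower().split()
    if SENSITIVE_TERMS & {w.strip(".,?!") for w in words}:
        return "capable-model"
    if len(words) > 40:
        return "capable-model"
    return "fast-model"


simple = route("What are the branch opening hours?")
sensitive = route("Is this transfer a sanctions risk?")
```

Keeping the rules explicit makes the routing decision itself auditable, which matters as much as the cost saving in a regulated deployment.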

Evaluation and scoring: the core metrics are faithfulness (does the response accurately reflect the retrieved context?), recall (did the system find the relevant material?), precision (is the retrieved material actually relevant?), and answer relevance (does the response address the question?). Without a systematic evaluation framework, you cannot know whether your system is degrading as the underlying data changes.
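Two of these metrics, retrieval precision and recall, can be computed directly once a labelled set of relevant chunks exists for a query. The sketch assumes such labels are available, which is usually the expensive part; faithfulness and answer relevance need a judge (human or model) and are not covered here.

```python
def retrieval_scores(retrieved, relevant):
    """Precision and recall of retrieved chunk IDs against a labelled relevant set."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


precision, recall = retrieval_scores(
    retrieved=["c1", "c2", "c3", "c4"],
    relevant=["c2", "c4", "c7"],
)
```

In this example two of the four retrieved chunks are relevant (precision 0.5) and two of the three relevant chunks were found (recall of two thirds). Tracking both over time shows whether retrieval is drifting as the underlying data changes.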

Observability — traces, cost per query, latency distributions, error rates. This is the operational telemetry layer. In regulated environments, it is also the audit evidence layer — the record of what the system did, when, at what cost, and with what outcome.
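As a rough illustration of what this telemetry looks like once aggregated, the sketch below summarises a batch of per-query traces. The trace fields and values are invented; a real deployment would export them from a tracing system rather than build them by hand.

```python
import statistics


def summarise(traces):
    """Aggregate per-query traces into the headline operational numbers."""
    latencies = sorted(t["latency_ms"] for t in traces)
    return {
        "median_latency_ms": statistics.median(latencies),
        "max_latency_ms": latencies[-1],
        "total_cost_usd": round(sum(t["cost_usd"] for t in traces), 6),
        "error_rate": sum(t["error"] for t in traces) / len(traces),
    }


traces = [
    {"latency_ms": 420, "cost_usd": 0.004, "error": False},
    {"latency_ms": 380, "cost_usd": 0.003, "error": False},
    {"latency_ms": 2100, "cost_usd": 0.020, "error": True},
    {"latency_ms": 450, "cost_usd": 0.005, "error": False},
]

summary = summarise(traces)
```

The median hides the 2.1-second outlier that the maximum exposes, which is why latency is better reported as a distribution than as a single average.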

Feedback loop — this is the component shown as the right-side arrow in the diagram, running from the application layer back up to the source layer. The feedback loop is what converts operational experience into pipeline improvement: RLHF signals, fine-tuning data, catalogue updates, source quality assessments. A pipeline without a feedback loop is a pipeline that is guaranteed to degrade as the world changes around it.

The feedback loop is not a nice-to-have. It is the mechanism by which the system learns that a data source has become unreliable, that a business rule needs updating, or that a particular query type is consistently producing poor results. Without it, you are flying blind.
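One concrete form of that mechanism is to roll evaluation scores up by data source and flag the sources that consistently score badly. The sketch assumes each evaluated response records which source supplied its context and a faithfulness score between 0 and 1; the threshold, the minimum sample count, and the source names are illustrative.

```python
from collections import defaultdict


def flag_sources(eval_results, threshold=0.7, min_samples=3):
    """Return sources whose mean faithfulness falls below the threshold."""
    scores = defaultdict(list)
    for result in eval_results:
        scores[result["source"]].append(result["faithfulness"])
    return sorted(
        source for source, values in scores.items()
        if len(values) >= min_samples and sum(values) / len(values) < threshold
    )


results = [
    {"source": "policy-wiki", "faithfulness": 0.9},
    {"source": "policy-wiki", "faithfulness": 0.8},
    {"source": "policy-wiki", "faithfulness": 0.85},
    {"source": "legacy-sharepoint", "faithfulness": 0.4},
    {"source": "legacy-sharepoint", "faithfulness": 0.6},
    {"source": "legacy-sharepoint", "faithfulness": 0.5},
    {"source": "crm", "faithfulness": 0.2},
]

flagged = flag_sources(results)
```

Only `legacy-sharepoint` is flagged. The single poor `crm` result is ignored until more samples accumulate, so one bad answer does not trigger a source remediation exercise.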

Layer 4 — Application

What most enterprise AI deployments are missing

Most enterprise AI deployments have a model router and some basic observability. Very few have a rigorous evaluation framework that runs continuously in production. Almost none have a systematic feedback loop that connects evaluation results back to source quality improvements. That gap is the gap between a system that works on day one and a system that is still working on day three hundred.


What this means for technology leaders in banking

In 24 years across Citi and Standard Chartered, I have seen the same failure mode across every technology platform transition. Organisations invest heavily in the visible layer — the model, the interface, the demo — and underinvest in the invisible layers that determine whether the system actually works in production.

For enterprise AI, the invisible layers are the source layer and the MCP connectivity layer. The model is the easy part. Every major cloud provider gives you access to frontier models. The hard part is connecting those models to your data in a way that is governed, auditable, accurate, and maintainable.

In banking specifically, the source layer contains the exact problems that made AML systems fail when built on corrupted customer master data. The same customer data problems. The same beneficial ownership linkage gaps. The same stale identity records. If you build an AI system on top of that data without fixing the data quality, you are not building a compliance tool. You are building a sophisticated mechanism for generating confident wrong answers at scale.

Fix the pipeline first. Then optimise the model.


Frequently asked

What is the Model Context Protocol (MCP)?

MCP is a standardised connectivity layer between an enterprise's data sources and its AI pipeline. It provides per-source MCP servers, callable tools and resources, authentication and audit trails, and policy-compliant sampling — so every data source connects to the AI system in a governed, traceable, and reproducible way.

What is a context pipeline in enterprise AI?

A context pipeline is the sequence of steps that transforms raw enterprise data into the precise, policy-compliant text that fills an AI model's context window. It includes retrieval and ranking (RAG), filtering and governance (PII, access control), compression (summarisation, truncation), and assembly into prompt templates. The model's response quality is bounded by the quality of what this pipeline produces.

What is RAG and where does it fit?

RAG — Retrieval-Augmented Generation — is the process of retrieving relevant documents from a vector store before generating an AI response, so the model answers from actual enterprise data rather than training memory alone. It sits at the retrieve-and-rank stage of the infrastructure layer, using vector embeddings to find semantically relevant chunks.

How is the context window managed in enterprise AI systems?

The context window is allocated across three zones governed at pipeline level: system prompt (policies, persona, instructions), retrieved context (RAG results, relevant documents), and user turn (the actual query). The infrastructure layer controls how much space each zone gets through compression and truncation — before the prompt ever reaches the model.


Raj Thilak is Head of Technology for Data & Analytics and Director of Engineering & Delivery with 24 years of experience in banking and financial services, including leadership roles at Citi and Standard Chartered. He holds certifications as an Azure Architect, GCP Architect, and Project Management Professional. Based in Pune, India.
