Medallion Architecture: What the Theory Leaves Out
I have built production medallion architectures across 100+ data sources and 30+ business domains — at Unilever, DIAGEO, Citi, and JIO Financial Services. What follows is not a restatement of the Databricks documentation. It is what I have learned from the gap between the theory and the production reality.
Most articles on medallion architecture were written for the BI era. They describe a clean, three-layer pattern in which data moves from raw ingestion through cleansing to business-ready reporting. The pattern is correct. The emphasis is wrong.
When AI models consume your Gold layer in production — not dashboards, not analysts, but automated inference pipelines with no human in the loop — the architecture decisions that matter change fundamentally. Quality gates, lineage depth, certification standards, and the organisational model around the data all need to be designed differently from the outset.
This article covers the full architecture in the sequence a practitioner needs it: what each layer actually does in production, where the standard advice fails, how data quality and lineage work across the stack, what SCD Type 1 and Type 2 mean and when to use each, what a data catalog must do to be useful, and the four moves a CDO can make today to shift from a BI-era medallion to an AI-ready one.
The Bronze layer — raw, immutable, and the only place you can replay from
Bronze is the ingestion layer. Its single job is to receive data from source systems and preserve it exactly as it arrived — format, schema, anomalies, and all. Nothing is corrected in Bronze. Nothing is rejected. Everything lands.
This is not a philosophical position about data fidelity. It is an operational requirement. Bronze is the only layer from which you can replay a pipeline end to end. If a transformation bug corrupts your Silver or Gold tables, you reprocess from Bronze. If a regulatory audit requires you to demonstrate exactly what data you held at a specific point in time, Bronze is your answer. If a machine learning model produces an unexpected output and you need to trace it back to its training data, Bronze is where the chain begins.
In practice, Bronze tables carry a small set of mandatory metadata columns that are added at ingestion time, regardless of the source format:
The ingestion timestamp records when the record arrived in the platform — distinct from any business timestamp in the source data. The source system identifier records which upstream system the data came from. The batch or pipeline run ID ties the record to the specific execution that brought it in. The raw payload hash is a deterministic fingerprint of the record as received, used for deduplication and change detection downstream.
These four columns are non-negotiable. Every Bronze table carries them. They are the foundation of data lineage — without them, the chain from Gold back to source cannot be reliably reconstructed.
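To make the pattern concrete, here is a minimal Python sketch of attaching the four metadata columns at ingestion time. The column names and the SHA-256 canonical-JSON fingerprint are my illustrative assumptions, not a fixed standard; a real platform would do this inside the ingestion framework rather than per record.

```python
# Sketch: wrap a raw source record with the four mandatory Bronze
# metadata columns, without touching the raw payload itself.
import hashlib
import json
from datetime import datetime, timezone

def to_bronze_record(raw: dict, source_system: str, run_id: str) -> dict:
    """Attach ingestion metadata; the payload is preserved as received."""
    # Deterministic fingerprint of the record as received: canonical
    # JSON (sorted keys) hashed with SHA-256, so the same payload
    # always yields the same hash regardless of field order.
    payload_hash = hashlib.sha256(
        json.dumps(raw, sort_keys=True, default=str).encode("utf-8")
    ).hexdigest()
    return {
        **raw,
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_source_system": source_system,
        "_run_id": run_id,
        "_payload_hash": payload_hash,
    }
```

The hash is what makes downstream deduplication and change detection cheap: two arrivals of the same record compare by fingerprint, not by field-by-field inspection.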
What Bronze is not
Bronze is not a data quality layer. It is not a transformation layer. It is not accessible to business analysts. In every implementation I have led — including the Unilever global data platform covering 190+ markets and the JIO Financial Services lakehouse — Bronze access is restricted to data engineering. Business stakeholders never see Bronze. The data they see has been quality-checked, conformed, and certified in Silver and Gold.
The temptation to perform light transformations in Bronze — standardising date formats, correcting obvious encoding issues — should be resisted. Every transformation applied in Bronze reduces the fidelity of the record as a replay source and creates a dependency between the ingestion layer and business logic that will eventually break in a way that is difficult to diagnose.
Left-shifting data quality to Bronze
Here is where standard advice diverges from production practice. Most documentation places data quality checks in the Silver layer. This is a mistake that compounds over time.
Left-shifting DQ to Bronze means running a lightweight quality check at the point of ingestion — before any downstream transformation has consumed the data. The check does not reject records or block ingestion. It flags them, quarantines them, and alerts the owning team. The key distinction is that flagging happens at source arrival, not two hops downstream where the cost of remediation is exponentially higher.
In the Unilever implementation, we ran three Bronze-layer checks on every ingested dataset: schema conformance against the expected source schema, null count against mandatory fields defined in the source contract, and duplicate key detection against the natural key of the entity. Records that failed any of these checks were routed to a quarantine table alongside their failure reason, while records that passed proceeded to Silver processing. The quarantine table was monitored by the source system owner, not the data engineering team — because the failure was a source system problem, not a pipeline problem.
This distinction — source system owner owns Bronze quarantine, data engineering owns Silver transformation quality — is the organisational design that makes left-shift DQ work in practice.
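The three Bronze-layer checks can be sketched in a few lines. This is an illustrative pure-Python version of the flag-and-quarantine split, assuming simple dict records; the check names and the `_dq_failures` field are my own labels, and a production implementation would run inside the ingestion pipeline at scale.

```python
# Sketch: the three Bronze checks described above — schema conformance,
# mandatory-field nulls, duplicate natural keys. Records are flagged
# and quarantined with their failure reason, never rejected.
def bronze_dq_split(records, expected_columns, mandatory_fields, natural_key):
    passed, quarantined = [], []
    seen_keys = set()
    for rec in records:
        failures = []
        if set(rec) != set(expected_columns):
            failures.append("schema_conformance")
        if any(rec.get(f) is None for f in mandatory_fields):
            failures.append("mandatory_null")
        key = tuple(rec.get(k) for k in natural_key)
        if key in seen_keys:
            failures.append("duplicate_key")
        seen_keys.add(key)
        if failures:
            # The quarantine row carries its failure reasons for the
            # source system owner to act on — not data engineering.
            quarantined.append({**rec, "_dq_failures": failures})
        else:
            passed.append(rec)
    return passed, quarantined
```

The important property is that `passed` and `quarantined` together contain every record that arrived: nothing is dropped, and the split happens before any downstream transformation runs.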
The Silver layer — domain-organised, MDM-ready, and the source of enterprise truth
Silver is where data becomes an enterprise asset. The Bronze layer is organised by source system — one table per source entity, preserving source structure. The Silver layer is organised by business domain — customer, product, transaction, supplier, employee — regardless of which source system the data came from.
This is the most consequential structural decision in the entire medallion architecture, and it is the decision most implementations get wrong.
Organising Silver by source system is natural and fast. You have a Salesforce opportunity table in Bronze, you create a Salesforce opportunity table in Silver. You have an SAP sales order in Bronze, you create an SAP sales order in Silver. The pipeline is simple. The schema is familiar. The data engineers are productive immediately.
The problem surfaces six months later when the business asks for a unified customer view across Salesforce, SAP, and the proprietary CRM system. At that point you discover that your Silver layer is not a conformed enterprise view — it is three separate source system mirrors with different customer identifiers, different date conventions, different status codes, and no common key that would allow them to be joined reliably.
Organising Silver by domain means making the harder decision upfront: define what a Customer entity means across the enterprise, map every source system's customer representation to that canonical definition, resolve identity across systems using a matching and merging process, and store the result as a single canonical Customer entity in Silver that every downstream consumer can rely on.
This is Master Data Management embedded in the Silver layer architecture. It is harder to build. It takes longer to design. It pays back within the first year when the business can answer a question — how many unique customers do we have across all channels — that was previously unanswerable.
Data quality in Silver
Silver carries the full DQ certification stack — not just the lightweight checks that Bronze performs at ingestion, but the complete set of business rule validations that determine whether a record is fit for downstream consumption.
The DQ checks in Silver operate in five dimensions. Completeness validates that mandatory attributes are populated. Accuracy validates that attribute values conform to the expected domain — a valid ISO country code, a date within a plausible range, a numeric value within defined bounds. Consistency validates that related attributes are internally coherent — a closed date that is not before an open date, a unit price that is consistent with the total price and quantity. Uniqueness validates that the natural key of the entity does not contain duplicates within the Silver table. Timeliness validates that the data arrived within the expected SLA window from the source system.
Each record in Silver carries a DQ score — a composite measure of the checks it has passed and failed — alongside individual check results. Records with DQ scores below a defined threshold are excluded from Gold promotion. Records with isolated failures on non-critical attributes are promoted to Gold with the failure flags preserved, allowing Gold consumers to make informed decisions about whether to include those records in their calculations.
SCD Type 1 and SCD Type 2 in the Silver layer
Slowly Changing Dimensions — the mechanism for tracking how entity attributes change over time — are implemented in the Silver layer, not Gold. The choice between Type 1 and Type 2 is one of the most consequential data modelling decisions in the architecture, and it needs to be made deliberately for each entity attribute, not applied uniformly to entire tables.
SCD Type 1 — current state only. When an attribute changes, the existing record is overwritten. No history is preserved. The table always reflects the current value of every attribute.
SCD Type 1 is appropriate when historical values are operationally irrelevant. A customer's email address, for most analytics purposes, should reflect the current address — you do not need to know what their email was three years ago to calculate their current lifetime value. A product's current list price, when all you need is the current price for catalogue display, does not require historical tracking. A store's current status — open or closed — when you are counting active stores today, does not need a history of every status transition.
In the JIO Financial Services implementation, we applied SCD Type 1 to customer contact attributes — email, mobile number, communication preference — because the analytics questions those attributes answer are always about the current state. Applying SCD Type 2 to those attributes would have added table complexity and storage cost with no analytical value.
SCD Type 2 — full history preserved. When an attribute changes, a new record is inserted alongside the existing one. The new record carries an effective date (when the change took effect), and the previous record receives an expiry date. The active record is marked either by a current flag or by an expiry date set to a sentinel value, typically 9999-12-31.
SCD Type 2 is appropriate when the historical value of an attribute is needed to answer analytics questions accurately. A customer's segment classification — premium, standard, basic — affects revenue calculations for historical periods. If a customer was classified as standard for six months and premium for the last three months, calculating their revenue contribution during the standard period requires knowing they were standard at that time. Overwriting with the current premium classification — SCD Type 1 — would produce an inaccurate historical revenue attribution.
Similarly, a product's category hierarchy changes when a business restructures. Reporting historical sales by the current category hierarchy misattributes past performance. SCD Type 2 on category hierarchy allows you to report historical periods under the category structure that was in place at that time.
In the DIAGEO implementation, we maintained SCD Type 2 on product hierarchy, market classification, and customer segment — the three dimensions where historical accuracy directly affected financial reporting reconciliation. We maintained SCD Type 1 on all other product and customer attributes. The distinction was not made at the table level but at the attribute level within each table, with each attribute's change-tracking strategy documented in the data catalog.
The practical implementation of SCD Type 2 in a lakehouse environment uses a merge operation — a MERGE INTO statement that compares incoming records against the existing Silver table, closes out changed records by setting their expiry date to the current effective date, and inserts new active records for the changed attribute values. Delta Lake's MERGE capability makes this pattern efficient and ACID-compliant.
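For readers who want the mechanics spelled out, here is a pure-Python sketch of the close-and-insert semantics that the MERGE implements. The record shape, the sentinel date, and the single-table in-memory merge are simplifying assumptions; in a lakehouse this is one Delta MERGE INTO statement, not application code.

```python
# Sketch: SCD Type 2 close-and-insert. Changed active rows are
# expired at the new effective date; new active versions are
# inserted with the sentinel expiry date.
from datetime import date

SENTINEL = date(9999, 12, 31)  # marks the currently active record

def scd2_merge(dimension, incoming, key, tracked_attrs, effective_date):
    """Expire changed active rows and insert new active versions."""
    current = {row[key]: row for row in dimension
               if row["expiry_date"] == SENTINEL}
    result = list(dimension)
    for rec in incoming:
        active = current.get(rec[key])
        changed = active is None or any(
            active[a] != rec[a] for a in tracked_attrs)
        if not changed:
            continue  # no tracked attribute changed: do nothing
        if active is not None:
            # Close out the old version at the new effective date.
            active["expiry_date"] = effective_date
        result.append({**rec,
                       "effective_date": effective_date,
                       "expiry_date": SENTINEL})
    return result
```

Note the comparison is restricted to `tracked_attrs`: this is how the attribute-level Type 1 / Type 2 distinction from the DIAGEO example is enforced in practice — only changes to Type 2 attributes trigger a new version.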
The Gold layer — certified, trusted, and AI-ready
Gold is where data becomes a product. Not a dataset. Not a table. A certified data product with a defined owner, a documented quality standard, a measured freshness SLA, and a DQ certification badge that tells every consumer — human or machine — whether this data can be trusted for its intended use.
The Gold layer contains two categories of data product. KPI and metric tables aggregate Silver entities into the business performance measures that drive decisions — net revenue, customer acquisition cost, product return rate, settlement volume, transaction throughput. These are the numbers that appear in board decks, operational dashboards, and regulatory submissions. They carry formal business definitions, documented calculation methodologies, and sign-off from business owners.
Certified cross-domain data products join entities from multiple Silver domains into enriched views that answer specific business questions. A customer transaction profile joins the Customer entity from the Customer domain with the Transaction entity from the Payments domain and the Product entity from the Product domain to produce a view that supports customer lifetime value calculations, churn prediction, and personalisation. This join cannot happen in Silver — Silver entities are domain-isolated. It happens in Gold, where the domain isolation of Silver is deliberately relaxed to serve a defined analytical or operational use case.
DQ certification badges
Every Gold table carries a DQ certification badge — a structured metadata object that records the quality status of the data product at a point in time. The badge contains the overall DQ score, the result of each individual DQ check, the timestamp of the last quality assessment, the name of the data product owner who has certified the data, and the use cases for which the data has been certified as fit.
The certification badge is not cosmetic. It is the mechanism by which AI model pipelines make automated decisions about whether to consume a Gold table. In the JIO implementation, every model inference pipeline reads the DQ certification badge of its input datasets at run time. If the overall DQ score is below a defined threshold, or if the data freshness exceeds the SLA, the inference run is suspended and an alert is raised — rather than producing predictions from degraded input data and allowing those predictions to propagate through production systems undetected.
This is what changes when AI consumes your Gold layer. A dashboard with stale data is an inconvenience. A credit decisioning model that produces recommendations from three-day-old data is a risk event.
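The run-time gate itself is simple. This sketch assumes a badge with a `dq_score` and an ISO-format `assessed_at` timestamp, plus illustrative threshold and freshness values; the field names and the alert hook are my assumptions, not a specific platform API.

```python
# Sketch: an inference pipeline reads its input's DQ certification
# badge at run time and suspends itself rather than consuming
# degraded or stale data.
from datetime import datetime, timedelta, timezone

def gate_inference(badge: dict,
                   min_score: float = 0.95,
                   max_staleness: timedelta = timedelta(hours=24)) -> bool:
    """Return True only if the inference run may proceed."""
    assessed_at = datetime.fromisoformat(badge["assessed_at"])
    fresh = datetime.now(timezone.utc) - assessed_at <= max_staleness
    certified = badge["dq_score"] >= min_score
    if not (fresh and certified):
        # In production: suspend the run and raise an alert here.
        return False
    return True
```

The gate costs one metadata read per run, which is why it can sit in front of every inference pipeline without meaningful latency overhead.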
Data lineage across the full stack
Data lineage is the ability to trace any data point in Gold back through Silver and Bronze to the source system record that originated it — and forward from any source record through every transformation to every Gold table and downstream system that consumed it.
End-to-end lineage is not a reporting feature. It is an operational capability that serves four distinct functions in production.
Impact analysis: when a source system changes its schema — a field is renamed, a code value is added, a mandatory field becomes nullable — lineage tells you every Silver and Gold table that will be affected before you deploy the change. Without lineage, schema change impact is discovered in production failures.
Root cause diagnosis: when a Gold metric produces an unexpected value, lineage allows you to trace which Silver entities contributed to the calculation, which Bronze tables fed those Silver entities, and which source system records are the origin of the anomaly. Without lineage, root cause diagnosis is a manual investigation that takes days.
Regulatory compliance: when a regulator asks you to demonstrate that a specific calculation is based on data that was complete and accurate as of a specific date, lineage allows you to produce that evidence programmatically. Without lineage, compliance evidence is assembled from audit logs, pipeline documentation, and institutional memory — a process that is slow, error-prone, and difficult to repeat.
AI model governance: when a production AI model produces a harmful output and needs to be audited, lineage allows you to identify exactly which training data contributed to the model's behaviour. This is not a theoretical requirement — it is increasingly a regulatory expectation in financial services, where the FCA and RBI have both indicated that model explainability requirements extend to data provenance.
In practice, lineage is captured at three levels. Column-level lineage tracks which source columns contribute to which target columns through each transformation. Table-level lineage tracks which tables feed which tables at each layer boundary. System-level lineage tracks which source systems, pipelines, and execution runs contributed to which data products.
Modern lakehouse platforms — Databricks Unity Catalog, Microsoft Purview, Apache Atlas — capture column-level lineage automatically for SQL transformations run within the platform. Custom Python transformations and external system integrations require explicit lineage instrumentation, which is a design constraint that every architecture team needs to plan for explicitly rather than discovering post-implementation.
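For custom transformations that fall outside automatic capture, the instrumentation reduces to emitting one edge per source-to-target column mapping. The record shape below is an assumption for illustration, not a specific catalog's API; real platforms each define their own lineage event schema.

```python
# Sketch: explicit column-level lineage emission for a custom
# transformation — one edge per contributing source column,
# tied to the pipeline run that produced it.
def lineage_edges(source_table, target_table, column_map, run_id):
    """column_map: target column -> list of contributing source columns."""
    edges = []
    for target_col, source_cols in column_map.items():
        for source_col in source_cols:
            edges.append({
                "source": f"{source_table}.{source_col}",
                "target": f"{target_table}.{target_col}",
                "run_id": run_id,
            })
    return edges
```

A derived column with two inputs yields two edges, which is exactly what lets impact analysis answer "which Gold columns break if this source field is renamed" by graph traversal rather than code archaeology.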
The data catalog — discoverability, governance, and the bridge between layers
A data catalog is a metadata management system that serves as the single register of all data assets across the Bronze, Silver, and Gold layers. Without a catalog, the Gold layer becomes as opaque as the data swamp the medallion pattern was designed to prevent — certified data products exist but nobody knows they exist, nobody knows what they contain, and nobody knows whether they are appropriate for a given use case.
A production data catalog in a medallion architecture performs five functions that documentation typically underspecifies.
Asset registration. Every table, view, and data product in every layer is registered in the catalog with a unique identifier, a description written in business language (not technical jargon), the data domain it belongs to, and the owning team. Registration is automated — tables created by pipeline runs are registered at creation time, not manually catalogued after the fact.
Business glossary alignment. Every entity and attribute in the Silver layer is mapped to the business glossary — the authoritative definition of what Customer, Product, Transaction, and every other business concept means in the organisation. This alignment is what allows a business analyst to search the catalog for customer data and find the canonical Silver customer entity rather than twelve source system extracts with different definitions of the same concept.
DQ score publication. The catalog publishes the current DQ score and certification status of every Gold data product. Consumers — human and automated — can query the catalog before accessing a data product to determine its quality status without needing to run the quality checks themselves.
Lineage visualisation. The catalog exposes end-to-end lineage as a graph that data stewards, engineers, and architects can navigate. From any data product, you can traverse upstream to its Bronze sources or downstream to every system and model that consumes it.
Access control integration. The catalog is the interface through which data access requests are made, reviewed, and approved. Access policies maintained in the catalog are enforced by the lakehouse platform — analysts who have been granted access to a specific Gold data product can query it; analysts who have not cannot see it exists.
In the Citi implementation, the data catalog was the first deliverable of the platform build — not the last. This sequencing decision — catalog first, data product second — forced the architecture team to define business glossary, ownership model, and DQ standards before writing a single pipeline. It added three months to the build timeline. It prevented the six-month remediation programme that typically follows a platform build where cataloguing is treated as documentation rather than architecture.
Beyond Gold — ontology, knowledge graphs, and the next order of value
The medallion architecture as traditionally described terminates at Gold. In practice, the most analytically advanced organisations are building a layer beyond Gold that uses ontology and knowledge graphs to represent the relationships between entities rather than the entities themselves.
A Gold table contains Customer data. A Gold table contains Product data. A Gold table contains Transaction data. What a knowledge graph adds is the explicit representation of the relationships between these entities — this Customer purchased this Product through this Channel at this Time, creating a graph structure that can be traversed to answer questions that relational joins cannot efficiently answer.
Recommendation engines, supply chain optimisation, fraud network detection, and drug interaction analysis are all problems that are naturally expressed as graph queries rather than SQL joins. The knowledge graph layer does not replace Gold — it consumes Gold's certified data products and represents their relationships in a graph structure that a different class of analytics engine can query.
In the Unilever implementation, we modelled the relationship between brand, product variant, market, channel, and consumer segment as a knowledge graph. This allowed the commercial team to answer questions about substitution patterns — when a specific variant was out of stock in a specific market, which alternative variants the same consumer segment had historically selected — without the multi-table join complexity that the relational Gold structure would have required.
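A toy version of that substitution traversal shows why the graph representation fits the question. The triples, entity names, and relation label here are entirely illustrative; a production knowledge graph would use a graph database and a richer query over variant, market, and stock-out context.

```python
# Sketch: substitution lookup over a triple store of Gold entities.
# Triples are (subject, relation, object); here, which variants a
# consumer segment has purchased.
def substitutes(triples, out_of_stock_variant, segment):
    """Variants the segment has purchased, excluding the unavailable one."""
    bought_by_segment = {obj for subj, rel, obj in triples
                         if subj == segment and rel == "purchased"}
    return sorted(bought_by_segment - {out_of_stock_variant})
```

The query is a one-hop traversal from the segment node; the relational equivalent is a multi-table join that has to reconstruct the same relationship on every query.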
Four moves a CDO can make today
These are not multi-year transformation programmes. They are specific, bounded changes that can be made within the current architecture to materially improve the Gold layer's readiness for AI consumption.
Move 1: Instrument Bronze with DQ flags this quarter. Add three columns to every Bronze table — a null count flag for mandatory fields, a schema conformance flag, and a duplicate key flag. Route records that fail any check to a quarantine table. Assign quarantine ownership to source system teams. This single change makes source data quality visible before it propagates downstream and shifts accountability to where the problem originates.
Move 2: Audit Silver for source-system organisation and begin domain consolidation for your highest-value entity. Identify which Silver tables are source-system mirrors rather than domain-organised entities. Pick the single most important entity — Customer, Product, or whatever drives the most downstream consumption — and build the domain-consolidated version. One entity, done properly, demonstrates the pattern and builds the organisational muscle before you attempt the full migration.
Move 3: Attach DQ certification badges to your three most-consumed Gold tables. Define the DQ threshold above which each table is certified as fit for production consumption. Automate the badge update on every pipeline run. This is the change that allows AI inference pipelines to gate themselves on data quality without human intervention.
Move 4: Mandate lineage instrumentation for all new pipelines. New pipelines are easier to instrument than existing ones. Make column-level lineage instrumentation a pipeline acceptance criterion — no new pipeline moves to production without lineage being registered in the catalog. The existing pipeline debt can be addressed incrementally; the new debt stops accumulating immediately.
Frequently asked questions
What is medallion architecture in data engineering?
A layered data design pattern that organises a lakehouse into three progressive quality zones: Bronze (raw, immutable ingestion), Silver (cleansed, conformed, domain-organised enterprise entities), and Gold (certified, business-ready data products). Each layer has distinct quality standards, transformation rules, and consumer audiences. The pattern provides end-to-end lineage and prevents data lakes from becoming data swamps.
What is the difference between SCD Type 1 and SCD Type 2?
SCD Type 1 overwrites the existing record when an attribute changes, preserving only the current state. Use it when historical values are irrelevant — current email address, current product price. SCD Type 2 inserts a new record for each change with effective and expiry dates, preserving full history. Use it when historical accuracy matters for analytics — customer segment at time of transaction, product category at time of sale. The choice should be made per attribute, not per table.
What is left-shift data quality in medallion architecture?
Moving data quality checks to the Bronze layer — the point of ingestion — rather than discovering quality failures in Silver or Gold where remediation costs are exponentially higher. Bronze DQ flags records at source arrival without blocking ingestion, routes failures to quarantine tables owned by source system teams, and prevents bad data from propagating through the transformation chain.
What is a data catalog and why does it matter?
A metadata management system that registers every data asset across all layers, aligns entities to the business glossary, publishes DQ scores and certification status, exposes end-to-end lineage as a navigable graph, and integrates with access control. Without a catalog, Gold data products exist but are undiscoverable, their quality is opaque, and their lineage cannot be traced — recreating the data swamp problem at a higher layer of abstraction.
Why does AI consumption change the Gold layer requirements?
A dashboard with stale or degraded data is an inconvenience that a human analyst can detect and account for. An AI model consuming the same data produces outputs that propagate through production systems before anyone notices the input quality degraded. Gold tables consumed by AI pipelines need machine-readable DQ certification badges that inference pipelines can query at run time to gate themselves on data quality — a requirement that does not exist in a pure BI consumption model.
Found this useful? I write weekly on data engineering, platform architecture, and the lessons 24 years across Citi, Standard Chartered, and global enterprises taught me about building data systems that actually work in production. Subscribe to the newsletter — no spam, unsubscribe anytime.
Building or transforming a medallion architecture? Get in touch directly.
Raj Thilak is Head of Technology for Data & Analytics with 24 years across Citi, Standard Chartered, Accenture, and Inferyx. He has led data platform implementations at Unilever, DIAGEO, Citi, and JIO Financial Services covering 100+ data sources and 30+ business domains. Based in Pune, India. rajthilak.dev