What AI-Ready Data Actually Looks Like: A Practical Checklist

Marc de Haas on Jun 15, 2026 16:46 | Updated: Jun 15, 2026 18:05

In many organisations right now, an AI project has been approved, a vendor has been selected, and a go-live date is somewhere on a roadmap. Six months in, the team is doing something different from what it planned: rebuilding data infrastructure rather than building on top of it.

The gap between where AI initiatives start and where they actually land usually has the same root cause: the data was not in the state the project needed, and nobody assessed that systematically before the scope was set.

AI readiness is not about having the most advanced tooling. It comes down to five things, most of which are organisational before they are technical.

Clean data and knowing where it is not

"Clean data" tends to get treated as a binary: either it is clean or it isn't, and the answer is usually "mostly". What matters more is whether the team understands where the quality gaps are and what downstream impact those gaps would have on the AI use case in question.

Undocumented quality issues are harder to manage than known ones. A dataset with acknowledged gaps, where the team has agreed on how to handle them, is far more reliable as a foundation than one that is assumed to be correct because nobody has looked closely. AI outputs derived from the second type fail quietly: the model continues producing results without any visible sign that the inputs were unreliable.

In practice, this means auditing the specific datasets your AI use case will depend on before the project scope is finalised. That does not require a full data quality programme, only a targeted review of completeness, duplication rates, and freshness for those inputs specifically.
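A targeted review of this kind can be small. The sketch below is one way to compute the three measures for a single input table, assuming the records are already loaded as Python dictionaries. The function name `audit_dataset`, the field names, and the sample rows are illustrative assumptions, not part of any specific toolchain.

```python
from datetime import datetime, timezone


def audit_dataset(rows, key_field, required_fields, timestamp_field, now=None):
    """Summarise completeness, duplication and freshness for a list of dict records."""
    now = now or datetime.now(timezone.utc)
    total = len(rows)
    if total == 0:
        return {"rows": 0, "completeness": {}, "duplication_rate": 0.0,
                "newest_record_age": None}

    # Completeness: share of rows where each required field is present and non-empty.
    completeness = {
        name: sum(1 for row in rows if row.get(name) not in (None, "")) / total
        for name in required_fields
    }

    # Duplication rate: share of rows that repeat a key already seen elsewhere.
    distinct_keys = len({row.get(key_field) for row in rows})
    duplication_rate = (total - distinct_keys) / total

    # Freshness: how old the most recent record is relative to "now".
    timestamps = [row[timestamp_field] for row in rows
                  if row.get(timestamp_field) is not None]
    newest_record_age = now - max(timestamps) if timestamps else None

    return {"rows": total, "completeness": completeness,
            "duplication_rate": duplication_rate,
            "newest_record_age": newest_record_age}


# Hypothetical sample: a customer table feeding a churn model.
utc = timezone.utc
sample_rows = [
    {"customer_id": 1, "email": "a@example.com", "updated_at": datetime(2026, 6, 1, tzinfo=utc)},
    {"customer_id": 2, "email": "", "updated_at": datetime(2026, 6, 10, tzinfo=utc)},
    {"customer_id": 2, "email": "b@example.com", "updated_at": datetime(2026, 6, 12, tzinfo=utc)},
    {"customer_id": 3, "email": None, "updated_at": datetime(2026, 6, 14, tzinfo=utc)},
]
report = audit_dataset(
    sample_rows,
    key_field="customer_id",
    required_fields=["customer_id", "email"],
    timestamp_field="updated_at",
    now=datetime(2026, 6, 15, tzinfo=utc),
)
print(report)
```

In this sample the email field is half populated, one row in four repeats a key, and the newest record is a day old. Whether those numbers are acceptable depends on the use case. What matters is that they are known and agreed rather than assumed.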

Clear ownership per dataset

Every dataset an AI system relies on needs an accountable owner, someone whose responsibility includes quality, definition, and availability over time, not just someone whose name appears in a system.
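One lightweight way to make ownership checkable is a small register that records, per dataset, who is accountable and when the definition was last reviewed. The sketch below is illustrative only: the `DatasetEntry` fields, the dataset names, and the team names are assumptions, and in practice this information would more likely live in a data catalogue than in code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetEntry:
    name: str
    owner: Optional[str]          # accountable person or team, not a system account
    definition: Optional[str]     # what the dataset is agreed to contain
    last_reviewed: Optional[str]  # ISO date of the last ownership review


def ownership_gaps(entries):
    """Return {dataset name: [missing attributes]} for entries with gaps."""
    gaps = {}
    for entry in entries:
        missing = [
            attribute
            for attribute in ("owner", "definition", "last_reviewed")
            if not getattr(entry, attribute)
        ]
        if missing:
            gaps[entry.name] = missing
    return gaps


# Hypothetical register for the inputs of one AI use case.
register = [
    DatasetEntry("crm.contracts", "sales-ops",
                 "Signed customer contracts, excluding pilots", "2026-03-01"),
    DatasetEntry("web.events", None, "Raw clickstream events", None),
]
gaps = ownership_gaps(register)
print(gaps)
```

A register like this does not create accountability by itself, but it makes the missing entries visible before a model starts depending on them.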

Ownership gaps tend to show up in AI projects in specific ways: a training dataset that turned out to include test accounts, a churn model trained on contract data that included cancelled pilots, a recommendation engine fed by a field nobody maintains anymore. In each case, the model was doing what it was designed to do. The problem was in the data it trusted.

Assigning ownership creates friction, and teams often resist it. The pushback is understandable: ownership adds accountability where there was none before. But the cost of unowned data propagating through a production AI system is higher than the cost of the conversation needed to establish who is responsible for what.

Documented pipelines

If the path data takes from source system to model input cannot be explained clearly, the outputs of that model cannot be fully trusted. That is not because the model is wrong, but because the inputs are not understood well enough to know when they change in ways that matter.

The most common gap here is not the absence of documentation across the board, but the absence of documentation at transformation points. What happened to this field between the source system and the warehouse? Was it filtered, aggregated, or imputed at some stage? When a data engineer is the only person who knows the answer to that question, the system has a dependency that will eventually create a problem.
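Documentation at transformation points does not need heavy tooling to get started. As a minimal sketch, the record below captures what happened to one field at each stage, so that the answer no longer lives in one person's head. The class names, stage names, and the example field are hypothetical, and a structure like this is not a substitute for dedicated lineage tooling.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TransformationStep:
    stage: str        # where in the pipeline the change happens
    operation: str    # e.g. "filter", "aggregate", "impute"
    description: str  # plain-language explanation of the change


@dataclass
class FieldLineage:
    field_name: str
    source: str
    steps: list = field(default_factory=list)

    def record(self, stage, operation, description):
        """Append one transformation step and return self for chaining."""
        self.steps.append(TransformationStep(stage, operation, description))
        return self

    def operations(self):
        """List the operations applied to the field, in order."""
        return [step.operation for step in self.steps]

    def describe(self):
        """Render the path from source to model input as readable text."""
        lines = [f"{self.field_name} (source: {self.source})"]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"  {number}. [{step.stage}] {step.operation}: {step.description}")
        return "\n".join(lines)


# Hypothetical field feeding a model.
lineage = (
    FieldLineage("monthly_revenue", source="billing.invoices.amount")
    .record("ingestion", "filter", "drop invoices flagged as test accounts")
    .record("warehouse", "impute", "missing amounts set to 0")
    .record("feature build", "aggregate", "sum per customer per calendar month")
)
print(lineage.describe())
```

With a record like this, "was this field imputed?" becomes a lookup (`"impute" in lineage.operations()`) rather than a question for whoever built the pipeline.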

Lineage tooling, such as BigQuery's built-in data lineage, Dataplex, or a lightweight catalogue, makes these transformation paths visible and auditable. For AI systems that need to be monitored and debugged over time, that visibility is not optional infrastructure. It is what makes the difference between a system you can maintain and one you can only replace.


A mechanism for detecting when outputs degrade

Most AI projects plan carefully for launch. Fewer plan for what happens six months later, when data distributions shift, when source systems change upstream, or when the model's assumptions about the world gradually stop reflecting reality.

A feedback loop in this context means a monitored signal that tells someone when something has gone wrong: data drift detection, output distribution monitoring, or downstream business metrics that correlate with model performance. Without it, degradation happens quietly. The model keeps producing outputs, the team keeps trusting them, and the first visible sign of a problem tends to be a business decision that nobody can explain.
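As one illustration of such a signal, the sketch below compares the current distribution of a numeric input against a baseline using the population stability index (PSI), a commonly used drift measure. It is a simplified, standard-library version: the equal-width binning, the small floor used to avoid taking the logarithm of zero, and the alert threshold of 0.2 are all assumptions to tune for a given use case, not fixed rules.

```python
import math

PSI_ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb value; adjust per use case


def population_stability_index(baseline, current, bins=10):
    """PSI between two samples, using equal-width bins over the baseline range."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    low, high = min(baseline), max(baseline)
    if low == high:
        raise ValueError("baseline has no spread to bin")
    width = (high - low) / bins

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            # Values outside the baseline range fall into the first or last bin.
            index = min(max(index, 0), bins - 1)
            counts[index] += 1
        # Floor each share so empty bins do not produce log(0).
        return [max(count / len(values), 1e-6) for count in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Hypothetical data: the same feature at training time and six months later.
training_values = list(range(100))
recent_values = [value + 40 for value in training_values]

psi_same = population_stability_index(training_values, training_values)
psi_shifted = population_stability_index(training_values, recent_values)
needs_review = psi_shifted > PSI_ALERT_THRESHOLD
print(psi_same, psi_shifted, needs_review)
```

An unchanged distribution scores zero, while the shifted sample scores well above the threshold. The number matters less than the fact that someone is notified when it crosses a line the team agreed on in advance.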

Building this into the project design from the start is easier than adding it retrospectively. It also requires a clear answer to the question of who is responsible for monitoring it, which tends to surface the same ownership questions the rest of this list raises.

Discipline around fixing before extending

There is a pattern that recurs in AI project post-mortems: a quality issue appeared, and rather than addressing it directly, the team routed around it. A transformation was added to compensate. A separate pipeline was introduced. The workaround worked well enough to keep the project moving, and the underlying issue stayed in place.

Over time, these compensations compound. The data foundation becomes more complex, the number of people who understand it shrinks, and the cost of eventually fixing the root cause increases. AI readiness often requires a period of deliberate remediation before a new capability is built on top: retiring pipelines that have outlived their original purpose, establishing definitions that multiple teams have actually agreed on, and addressing quality issues rather than encoding workarounds.

This phase tends not to appear in project proposals, and it is often where timelines first slip. Organisations that accept it as necessary, scope for it explicitly, and treat it as foundational investment rather than delay are in a meaningfully different position when the model eventually launches.

Where to start

These five conditions do not need to be perfect before a project begins, but they do need to be assessed honestly. A data readiness review before the scope is set, covering which datasets are involved, who owns them, how they move, and what quality issues are known, is a much smaller investment than discovering the gaps twelve months into a project.

If you want to understand where your data actually stands before the next AI initiative, we are happy to work through that with you.

Let's talk.

Topics: AI and Machine Learning

Marc de Haas

Head of Development

Marc is Head of Development at Crystalloids and works closely on client projects as a Solution Architect and Cloud Architect.

With more than 30 years of experience, he designs and builds modern data platforms and scalable cloud architectures for data-centric and AI-driven solutions. He studied Computer Science at the...
