Every enterprise AI system is only as good as the data that feeds it. Not the data you have, but the data you can actually get to the model, in the right format, at the right time. That's the job of a data pipeline, and most enterprises are building them wrong.
What You Need to Know
- Data pipelines for AI are fundamentally different from traditional ETL. They need to handle unstructured data, produce embeddings, maintain provenance, and operate continuously, not just move rows between databases.
- The biggest pipeline failures aren't engineering failures. They're organisational: data locked in silos, inconsistent formats across departments, and no ownership of data quality at the source.
- You don't need a perfect pipeline on day one. Start with one data source, one AI capability, and iterate. But design the architecture to scale from the start.
- Monitoring is not optional. A pipeline that silently degrades will poison your AI outputs without anyone noticing until users lose trust.
73% of enterprise AI teams spend more time on data preparation than on model development. Source: Anaconda, State of Data Science Report, 2023.
The Enterprise AI Data Pipeline
Traditional data pipelines move structured data from point A to point B. AI data pipelines need to do something fundamentally harder: take messy, unstructured enterprise knowledge and make it queryable by meaning.
Here's the architecture, stage by stage.
Stage 1: Ingestion
What You're Solving
Enterprise knowledge lives everywhere: SharePoint, Confluence, file shares, email archives, CRM notes, legacy databases. The first job is getting it all into one pipeline without losing context.
The Patterns That Work
Connector-based ingestion. Build or buy connectors for each source system. Each connector handles authentication, pagination, change detection, and format normalisation for its source. Don't try to build a universal connector. The APIs are too different.
Change detection over full sync. After the initial load, your pipeline should detect and process only what's changed. Full re-syncs are expensive and slow. Most source systems support webhooks, change feeds, or timestamp-based queries.
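As a minimal sketch of the cursor pattern: keep a per-source timestamp, fetch only what changed since then, and advance the cursor only after the batch succeeds. The `fetch_changed_since` method here is hypothetical; real connectors would back it with webhooks, change feeds, or modified-since queries.

```python
from datetime import datetime, timezone

def incremental_sync(source, cursor_store, source_id, process):
    """Pull and process only documents changed since the last successful run.

    `source.fetch_changed_since(ts)` is a hypothetical connector method;
    a cursor of None means this is the initial full load.
    """
    last_run = cursor_store.get(source_id)
    changed = source.fetch_changed_since(last_run)
    for doc in changed:
        process(doc)  # hand off to the processing stage
    # Advance the cursor only after the whole batch succeeds, so a
    # mid-run crash causes a re-fetch rather than silent data loss.
    cursor_store[source_id] = datetime.now(timezone.utc)
    return len(changed)
```

The cursor-after-success ordering matters: it trades occasional duplicate processing (which deduplication absorbs) for never silently skipping a document.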
Metadata preservation. Every document entering the pipeline needs metadata: source system, author, creation date, last modified date, access permissions, document type. This metadata is critical for filtering, attribution, and access control downstream.
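One way to make that metadata non-optional is to define a single record type every connector must emit. The field names below are illustrative, not a standard; match them to what your source systems actually expose.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(frozen=True)
class IngestedDocument:
    """A document plus the metadata the pipeline needs downstream.

    Field names are illustrative; the point is that every connector
    produces the same shape, so downstream stages can rely on it.
    """
    doc_id: str
    content: bytes
    source_system: str      # e.g. "sharepoint", "confluence"
    author: str
    created_at: datetime
    modified_at: datetime
    doc_type: str           # e.g. "policy", "meeting-notes"
    # Access control travels with the document from ingestion onward.
    allowed_groups: frozenset = field(default_factory=frozenset)
```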
Common Mistakes
- Ignoring access control at ingestion. If document A is restricted to the legal team, that restriction must travel with the document through the entire pipeline. Bolt-on access control after the fact is fragile.
- Treating all sources equally. A policy document and a Slack thread require very different processing. Classify at ingestion, not later.
- No deduplication. The same document often exists in multiple source systems. Ingest it once. Hash-based deduplication works for exact copies; semantic deduplication handles near-duplicates.
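Hash-based deduplication is a few lines; a sketch, assuming documents arrive as dicts with a `content` string:

```python
import hashlib

def dedupe_exact(documents):
    """Drop byte-identical copies; the first occurrence wins.

    Normalise whitespace before hashing if sources differ only in
    formatting. Near-duplicates need semantic dedup (e.g. embedding
    similarity above a threshold) instead.
    """
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```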
Stage 2: Processing and Transformation
What You're Solving
Raw documents need to become AI-ready chunks: clean text, appropriately sized, with preserved structure and context.
The Patterns That Work
Document parsing. PDFs, Word documents, HTML pages, and slide decks all need different parsers. Invest in solid parsing. This is where most quality issues originate. Tools like Apache Tika, Unstructured.io, and cloud document intelligence APIs handle the heavy lifting.
Intelligent chunking. Split documents into chunks that preserve coherent ideas. The naive approach (split every N tokens) breaks mid-sentence and loses context. Better approaches:
- Semantic chunking. Split at paragraph or section boundaries
- Hierarchical chunking. Preserve document structure (heading to content relationships)
- Overlapping chunks. Include 10-20% overlap between chunks to maintain context at boundaries
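The semantic and overlapping approaches combine naturally: split at paragraph boundaries, pack paragraphs up to a size budget, and carry the tail of each chunk into the next. A sketch, with token counts approximated by whitespace words — swap in a real tokenizer for production:

```python
def chunk_paragraphs(text, max_tokens=400, overlap_ratio=0.15):
    """Split at paragraph boundaries with ~15% overlap between chunks.

    Token counts are approximated by whitespace-separated words; use a
    real tokenizer (e.g. tiktoken) when accuracy matters.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, new_since_flush = [], [], False
    for para in paragraphs:
        current.append(para)
        new_since_flush = True
        if sum(len(p.split()) for p in current) >= max_tokens:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward to preserve context
            # at the chunk boundary.
            keep = max(1, int(len(current) * overlap_ratio))
            current = current[-keep:]
            new_since_flush = False
    if current and new_since_flush:
        chunks.append("\n\n".join(current))
    return chunks
```

Because splits only ever happen at paragraph boundaries, no chunk breaks mid-sentence, and the overlap means a fact straddling two chunks is retrievable from either.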
Chunk Size Matters
For RAG applications, chunks of 200-500 tokens typically perform best. Too small and you lose context. Too large and you dilute relevance. Test with your actual documents; optimal size varies by content type.
Context enrichment. Each chunk should carry context beyond its own text. Prepend the document title, section heading, and key metadata. A chunk that says "The policy applies from 1 January 2024" is useless without knowing which policy it's from.
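Enrichment can be as simple as a structured header prepended before embedding. The header format below is illustrative; anything stable and consistent works:

```python
def enrich_chunk(chunk_text, doc_title, section_heading, metadata):
    """Prepend document context so the chunk is meaningful on its own.

    The header layout is illustrative; the key is that title, section,
    and source are embedded alongside the chunk's own text.
    """
    header = (
        f"Document: {doc_title}\n"
        f"Section: {section_heading}\n"
        f"Source: {metadata.get('source_system', 'unknown')}\n\n"
    )
    return header + chunk_text
```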
Common Mistakes
- Skipping table extraction. Tables in PDFs contain critical structured data that most parsers handle poorly. Invest in table-aware parsing or extract tables separately.
- Ignoring images and diagrams. Enterprise documents are full of charts, diagrams, and annotated images. Modern multimodal models can process these, so don't discard them.
- One-size-fits-all chunking. Legal contracts need different chunking than meeting notes. Build chunking strategies per document type.
Stage 3: Embedding and Storage
What You're Solving
Chunks need to become vectors stored in a database optimised for similarity search.
The Patterns That Work
Embedding model selection. This is your most consequential engineering decision. The embedding model determines retrieval quality. A mediocre model with a great database still produces mediocre results. As of mid-2024, strong options include OpenAI's text-embedding-3-large, Cohere's embed-v3, and open-source models like BGE and E5.
Batch processing with incremental updates. Initial embedding is a batch job. After that, embed new and changed chunks incrementally. Store the embedding model version alongside each vector. When you upgrade models, you'll need to re-embed.
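Storing the model version next to each vector makes the eventual re-embed a query rather than a guessing game. A sketch, with `embed_fn` and `store` standing in for your embedding API and vector database client:

```python
def embed_and_store(chunks, embed_fn, model_version, store):
    """Embed chunks and record the model version alongside each vector.

    `embed_fn` and `store` are stand-ins for a real embedding API
    and vector database client.
    """
    for chunk in chunks:
        store[chunk["id"]] = {
            "vector": embed_fn(chunk["text"]),
            "model_version": model_version,  # checked when upgrading models
            "text": chunk["text"],
        }

def chunks_needing_reembed(store, current_version):
    """List vectors produced by an older model; these must be re-embedded,
    since vectors from different models aren't comparable."""
    return [cid for cid, rec in store.items()
            if rec["model_version"] != current_version]
```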
Hybrid storage. Store vectors in a vector database for similarity search, and store the raw text in a standard database for keyword search and retrieval. This enables hybrid search, the combination that consistently outperforms either approach alone.
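One common way to merge the two result lists is reciprocal rank fusion (RRF), which needs only the ranks, not the incompatible raw scores. A sketch:

```python
def reciprocal_rank_fusion(vector_ranking, keyword_ranking, k=60):
    """Merge two ranked lists of document ids with reciprocal rank fusion.

    Each document scores 1/(k + rank) per list it appears in; k=60 is a
    conventional smoothing constant, worth tuning against your test set.
    """
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists rise to the top, which is exactly the behaviour that makes hybrid search outperform either signal alone.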
Namespace separation. Organise vectors by source, department, or access level. When a user queries the system, you can scope the search to only the namespaces they're authorised to access. This is simpler and faster than post-retrieval filtering.
Common Mistakes
- Not versioning embeddings. When you change your embedding model (and you will), old and new vectors aren't compatible. Track which model produced each vector.
- Embedding everything at once. Start with your highest-value documents. Embedding 100,000 documents and discovering your chunking strategy is wrong is expensive to fix.
Stage 4: Quality and Monitoring
What You're Solving
Data pipelines degrade silently. Source systems change their APIs. Document formats shift. Embedding quality drifts. Without monitoring, your AI gradually gets worse and nobody knows until users stop trusting it.
87% of ML models experience performance degradation within the first year of deployment. Source: Gartner, Managing AI Model Risk, 2023.
The Patterns That Work
Pipeline health monitoring. Track basic metrics at every stage: documents ingested per run, processing failures, chunks produced, embedding throughput. Alert on anomalies. A sudden drop in ingested documents usually means a connector is broken.
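The broken-connector check is simple enough to sketch: compare today's ingestion count against a trailing average and alert on a steep drop. Thresholds here are illustrative defaults, not recommendations:

```python
def ingestion_alert(daily_counts, drop_threshold=0.5):
    """Flag a likely broken connector: today's count fell more than
    `drop_threshold` (default 50%) below the trailing average.

    `daily_counts` is oldest-to-newest, one entry per day for one source.
    """
    *history, today = daily_counts
    window = history[-7:]  # trailing-week baseline
    baseline = sum(window) / len(window)
    return today < baseline * (1 - drop_threshold)
```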
Retrieval quality testing. Maintain a test set of questions with known correct source documents. Run this test set regularly (weekly minimum) and track retrieval accuracy over time. When accuracy drops, diagnose whether the issue is in the data, the chunking, or the embedding.
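The core metric is hit rate (recall@k) over that fixed test set: how often the known source document appears in the top-k results. A sketch, with `retrieve` standing in for your retrieval call:

```python
def retrieval_accuracy(test_set, retrieve, k=5):
    """Fraction of test questions whose known source document appears
    in the top-k retrieved results (hit rate / recall@k).

    `retrieve(question, k)` is a stand-in for the real retrieval call;
    each test case pairs a question with its expected source document.
    """
    hits = sum(
        1 for case in test_set
        if case["expected_doc"] in retrieve(case["question"], k)
    )
    return hits / len(test_set)
```

Track this number over time; a drop tells you *that* something broke, and comparing against per-stage metrics tells you *where*.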
Freshness monitoring. Track the age of your most recently indexed document per source. If your HR policy connector hasn't indexed anything in 30 days, something is wrong, even if the pipeline reports no errors.
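A freshness check is a per-source age comparison, with the allowed age configurable per source (an HR policy wiki and a CRM have very different update cadences). A sketch:

```python
from datetime import datetime, timezone, timedelta

def stale_sources(last_indexed, max_age_days):
    """Return sources whose newest indexed document exceeds their age limit.

    `last_indexed` maps source -> timestamp of the most recent document;
    `max_age_days` maps source -> allowed age, defaulting to 30 days.
    """
    now = datetime.now(timezone.utc)
    return [
        src for src, ts in last_indexed.items()
        if now - ts > timedelta(days=max_age_days.get(src, 30))
    ]
```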
Feedback loops. Give users a way to flag incorrect or irrelevant AI responses. Route these flags back to the pipeline team. User feedback is the most reliable signal of pipeline quality issues.
The Monitoring Dashboard
At minimum, track these metrics:
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Documents ingested (per source, per day) | Connector health | >50% drop from baseline |
| Processing failure rate | Parser/transformation issues | >5% failure rate |
| Average chunk size | Chunking consistency | >20% deviation from target |
| Retrieval accuracy (test set) | End-to-end quality | >10% drop from baseline |
| Source freshness | Data currency | Configurable per source |
| User-flagged errors | Real-world quality | Trending upward |
Stage 5: Governance and Access Control
What You're Solving
Enterprise data has access restrictions for good reason. Your AI pipeline must respect those restrictions end-to-end, from ingestion through to the response the user sees.
The Patterns That Work
Source-level permissions. Carry document permissions from the source system into the pipeline. When a user queries the AI, filter results to only documents they're authorised to access.
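With permissions carried as metadata from ingestion, the query-time filter is a set intersection. A sketch, assuming each result carries the `allowed_groups` set from its source system:

```python
def authorised_results(results, user_groups):
    """Keep only documents the querying user may see, using the
    allowed-groups metadata carried from the source system.

    An empty allowed_groups set here means 'unrestricted'; invert that
    default if your organisation is deny-by-default.
    """
    return [
        r for r in results
        if not r["allowed_groups"] or r["allowed_groups"] & user_groups
    ]
```

Applying this filter (or the namespace scoping from Stage 3) before results reach the model is what stops a restricted document from leaking into a generated answer.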
Audit logging. Log every query, every retrieved document, and every generated response. This isn't optional for regulated industries, and it's good practice for everyone. The governance frameworks catching up to enterprise AI all require auditability.
Data retention policies. Define how long documents stay in the pipeline after they're deleted from the source system. Embedding a document that no longer exists in the source creates a ghost: the AI can reference information the organisation has deliberately removed.
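Ghost removal reduces to a set difference between what the index holds and what still exists at the source, run as a periodic reconciliation sweep. A sketch, with `index` standing in for your vector store:

```python
def purge_ghosts(index, live_source_ids):
    """Remove vectors for documents that no longer exist at the source,
    so the AI cannot cite deliberately deleted material.

    `index` maps doc_id -> stored record; `live_source_ids` is the full
    id listing from a reconciliation sweep of the source system.
    """
    ghosts = set(index) - set(live_source_ids)
    for doc_id in ghosts:
        del index[doc_id]
    return ghosts
```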
Putting It Together
A well-built enterprise AI data pipeline is a living system, not a one-time build. It ingests continuously, processes intelligently, embeds accurately, and monitors relentlessly. It respects access control from end to end. And it gets better over time as you refine chunking strategies, upgrade embedding models, and respond to user feedback.
The most common mistake? Treating the pipeline as a one-off setup task. Teams build it, load their documents, and move on to the "exciting" AI work. Six months later, the pipeline is stale, the AI is returning outdated information, and users have quietly stopped using it.
Build the pipeline like infrastructure. Monitor it like a production system. Maintain it like the foundation it is.
Frequently Asked Questions
Should we build or buy our data pipeline?
For most enterprises, a hybrid approach works best. Use managed services for connectors and document parsing (cloud providers offer these), but own the chunking, embedding, and quality monitoring layers. These are where your competitive advantage and data sovereignty requirements live. The commodity parts are interchangeable; the differentiating parts need your attention.
How long does it take to build an enterprise AI data pipeline?
A functional pipeline for a single data source and capability takes 2-4 weeks. A multi-source pipeline with proper monitoring and access control takes 6-12 weeks. The key variable isn't technology. It's data access. Getting API credentials, understanding source system quirks, and navigating data governance approvals typically takes longer than the engineering build.
What's the minimum viable pipeline?
One source system, document parsing, semantic chunking, embedding, vector storage, and basic retrieval. Skip hybrid search, skip monitoring dashboards, skip multi-source connectors. Get one end-to-end flow working and validate retrieval quality before adding complexity. You can build this in a week.

