OpenAI · LangChain · Next.js · PostgreSQL · TypeScript · RAG

Building an AI Content Pipeline for a Publishing Platform

How we automated editorial production with a RAG pipeline and GPT-4o, cutting total time per article from over 4 hours to just over an hour (first drafts in under 20 minutes) while maintaining editorial quality standards.


74% Faster Production · 3.4x Content Output · 18 min Draft Generation · 94% Quality Score (vs. pre-pipeline baseline)

The Problem

The editorial team was producing 8 articles per week. Each one took 4.2 hours on average, with a typical breakdown of 1.5 hours of research, 2 hours of drafting, 45 minutes of editing, and 15 minutes of formatting and publishing. The math was brutal. To double output, you'd need to double headcount. To triple it, triple headcount. There was no leverage.

The content itself was good. The writers were skilled. The bottleneck wasn't talent — it was the mechanical, repeatable parts of the process: pulling sources, synthesizing research, producing a first draft that matched the publication's voice. These are exactly the tasks that language models are built for.

The brief was clear: build a pipeline that handles the mechanical work so writers can focus on what they're actually good at — judgment, voice, and editorial decisions. Don't replace the writers. Give them a 10x multiplier.

The Approach

The core architecture is a RAG (Retrieval-Augmented Generation) pipeline. The idea: instead of asking GPT-4o to generate content from scratch (which produces generic, hallucination-prone output), you first retrieve relevant source material from a curated knowledge base, then use that material as grounding context for generation.

This matters enormously for a publishing platform. Readers expect accuracy. A model that confidently fabricates statistics is worse than no model at all. RAG gives you a way to constrain generation to verified sources while still getting the fluency and speed of a large language model.

The pipeline has five stages: ingestion, embedding, retrieval, generation, and human review. Each stage is a discrete service. They communicate through a job queue backed by Redis, which means any stage can be scaled independently and failures don't cascade.
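A minimal sketch of how one stage hangs off the queue, assuming BullMQ as the Redis queue client (the specific library isn't part of the write-up above):

```typescript
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 };

// Producer side: the ingestion stage enqueues chunks for embedding.
const embeddingQueue = new Queue("embedding", { connection });

export async function enqueueChunk(chunkId: string, text: string) {
  await embeddingQueue.add(
    "embed-chunk",
    { chunkId, text },
    {
      attempts: 3, // retry transient failures
      backoff: { type: "exponential", delay: 5_000 },
    }
  );
}

// Consumer side: the embedding stage processes jobs independently, so it can
// be scaled (or fail) without touching the other stages.
new Worker(
  "embedding",
  async (job) => {
    const { chunkId, text } = job.data;
    // ... call the embedding model and write the vector to Postgres
  },
  { connection, concurrency: 4 }
);
```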

Architecture

Stage 1: Ingestion

Source material comes in from three channels: RSS feeds from industry publications, manual uploads (PDFs, Word docs), and a web scraper for specific domains. Each document goes through a preprocessing step: HTML stripping, encoding normalization, and deduplication via content hash.
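A simplified sketch of that preprocessing step, with illustrative field names (the real pipeline does more aggressive cleanup than a couple of regexes):

```typescript
import { createHash } from "node:crypto";

// Strip tags, normalize encoding and whitespace, and derive a content hash
// that the ingestion stage uses to skip documents it has already seen.
export function preprocess(rawHtml: string) {
  const text = rawHtml
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, " ") // strip remaining HTML tags
    .normalize("NFKC")        // encoding normalization
    .replace(/\s+/g, " ")
    .trim();

  const contentHash = createHash("sha256").update(text).digest("hex");
  return { text, contentHash };
}
```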

The chunking strategy here was non-trivial. Naive chunking by character count destroys semantic coherence — you end up splitting sentences mid-thought, which degrades retrieval quality. I settled on a recursive character text splitter with a 512-token chunk size and 64-token overlap. The overlap ensures that context at chunk boundaries isn't lost. For structured documents (PDFs with clear section headers), I used a header-aware splitter that respects document structure.
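In code, the chunking step looks roughly like this. Note the caveat in the comments: the splitter's default size unit is characters, so matching the 512/64 token figures exactly needs a token-aware length function. Depending on the LangChain version, the splitter lives in `@langchain/textsplitters` or `langchain/text_splitter`.

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Chunking sketch. chunkSize/chunkOverlap are in the splitter's default unit
// (characters); the 512/64 figures above are tokens, so a token-aware length
// function would be needed to reproduce the production behavior exactly.
export async function chunkDocument(text: string): Promise<string[]> {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 512,
    chunkOverlap: 64,
    // Prefer paragraph, then sentence, then word boundaries before a hard split.
    separators: ["\n\n", "\n", ". ", " ", ""],
  });
  return splitter.splitText(text);
}
```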

Stage 2: Embedding

Each chunk gets embedded using OpenAI's text-embedding-3-small model and stored in PostgreSQL with the pgvector extension. The choice of text-embedding-3-small over text-embedding-3-large was deliberate: the quality difference for retrieval tasks is marginal, but the cost difference is 5x. At scale, that matters.
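A sketch of the embedding step, with an illustrative `chunks` table; batching the inputs keeps the number of API calls down:

```typescript
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Embed a batch of chunks and store the vectors. Table and column names are
// illustrative, not the production schema.
export async function embedChunks(chunks: { id: string; text: string }[]) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks.map((c) => c.text), // batch request, one vector per input
  });

  for (const [i, chunk] of chunks.entries()) {
    const embedding = res.data[i].embedding; // 1536-dimensional float array
    await pool.query(
      "UPDATE chunks SET embedding = $1::vector WHERE id = $2",
      [JSON.stringify(embedding), chunk.id]
    );
  }
}
```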

The vector index uses HNSW (Hierarchical Navigable Small World) rather than IVFFlat. HNSW has better recall at the cost of slightly higher memory usage. For a knowledge base of this size (roughly 40,000 chunks at peak), the memory overhead was acceptable and the recall improvement was measurable.
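The corresponding schema and index setup, roughly; `m` and `ef_construction` are pgvector's defaults and worth tuning for larger collections:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// One-time migration sketch; table and column names are illustrative.
await pool.query(`
  CREATE EXTENSION IF NOT EXISTS vector;

  CREATE TABLE IF NOT EXISTS chunks (
    id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id uuid NOT NULL,
    content     text NOT NULL,
    embedding   vector(1536)   -- text-embedding-3-small output dimension
  );

  -- HNSW index on cosine distance; m / ef_construction are pgvector defaults.
  CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
`);
```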

Stage 3: Retrieval

When a writer starts a new article, they provide a topic and a set of seed keywords. The retrieval stage embeds the query and runs a cosine similarity search against the vector store, returning the top 20 chunks. These chunks are then re-ranked using a cross-encoder model (a smaller BERT-based model fine-tuned for relevance scoring) to get the top 8.

The two-stage retrieval approach — embedding search followed by cross-encoder re-ranking — consistently outperforms single-stage retrieval. The embedding model is fast but imprecise; the cross-encoder is slow but accurate. Using them in sequence gives you the best of both.
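Sketched out, the two stages look something like this; `rerank` is a stand-in for the cross-encoder, whose model and serving setup I'm leaving abstract here:

```typescript
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Hypothetical wrapper around the cross-encoder relevance model.
declare function rerank(query: string, passage: string): Promise<number>;

export async function retrieve(query: string, k = 8) {
  // Stage 1: embed the query, pull the 20 nearest chunks by cosine distance.
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const queryVector = JSON.stringify(data[0].embedding);

  const { rows } = await pool.query(
    `SELECT id, content, embedding <=> $1::vector AS distance
       FROM chunks
      ORDER BY embedding <=> $1::vector
      LIMIT 20`,
    [queryVector]
  );

  // Stage 2: re-score each candidate with the cross-encoder, keep the top k.
  const scored = await Promise.all(
    rows.map(async (row) => ({ ...row, score: await rerank(query, row.content) }))
  );
  return scored.sort((a, b) => b.score - a.score).slice(0, k);
}
```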

Stage 4: Generation

The retrieved chunks, along with the topic and any writer-provided notes, get assembled into a structured prompt. The prompt engineering here took the most iteration. Early versions produced technically accurate but tonally inconsistent output — the model would drift between formal and casual registers mid-article.

The fix was a detailed system prompt that included: the publication's style guide (condensed to ~800 tokens), three example articles with annotations explaining specific voice choices, and explicit instructions about what to avoid (passive voice, hedging language, unsupported superlatives). The system prompt is versioned in the database alongside the generated content, so you can trace exactly which prompt version produced each draft.

GPT-4o handles the generation. The context window is large enough to fit all retrieved chunks plus the full system prompt without truncation. Temperature is set to 0.4 — low enough for consistency, high enough to avoid robotic repetition.
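A simplified version of the generation call; the system prompt is the versioned artifact described above, and the way sources and writer notes are packed into the user message is condensed here:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Generation sketch. `systemPrompt` is loaded from the database (style guide,
// annotated examples, avoid-list); the prompt assembly is simplified.
export async function generateDraft(opts: {
  systemPrompt: string;
  topic: string;
  writerNotes: string;
  chunks: { content: string }[];
}) {
  const sources = opts.chunks
    .map((c, i) => `[Source ${i + 1}]\n${c.content}`)
    .join("\n\n");

  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0.4, // consistent, but not robotically repetitive
    messages: [
      { role: "system", content: opts.systemPrompt },
      {
        role: "user",
        content:
          `Topic: ${opts.topic}\n\nWriter notes: ${opts.writerNotes}\n\n` +
          `Ground every claim in the sources below.\n\n${sources}`,
      },
    ],
  });

  return completion.choices[0].message.content;
}
```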

Stage 5: Human Review

The generated draft lands in a custom editorial UI built in Next.js. Writers see the draft alongside the source chunks that informed each section (highlighted inline). They can accept, edit, or reject any paragraph. Rejected paragraphs trigger a regeneration request with additional context.

The review UI tracks edit distance between the AI draft and the final published article. This data feeds back into prompt refinement — sections with high edit distance indicate where the model is consistently missing the mark.
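Edit distance here is nothing exotic; a word-level Levenshtein distance normalized by article length is enough to compare drafts of different sizes. A sketch (the production UI may lean on a library instead):

```typescript
// Word-level Levenshtein distance, normalized by the longer text's length so
// scores are comparable across articles of different sizes.
export function normalizedEditDistance(draft: string, published: string): number {
  const a = draft.split(/\s+/);
  const b = published.split(/\s+/);
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );

  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }

  return dp[a.length][b.length] / Math.max(a.length, b.length, 1);
}
```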

Key Technical Decisions

Why LangChain Over Raw API Calls

LangChain gets a lot of criticism, and some of it is fair. The abstraction layer adds complexity. But for a pipeline with multiple stages, retrieval logic, and prompt chaining, the alternative is writing a lot of boilerplate. LangChain's ConversationalRetrievalChain handled the retrieval-augmented generation pattern cleanly, and its callback system made it straightforward to add logging and tracing without instrumenting every API call manually.

The decision point: if you're building a single-purpose chatbot, skip LangChain. If you're building a multi-stage pipeline with retrieval, memory, and tool use, the abstraction pays for itself.
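For reference, the retrieval-augmented pattern in LangChain.js looks roughly like this; exact import paths and option names shift between versions, and the vector store setup is elided:

```typescript
import { ChatOpenAI } from "@langchain/openai";
import { ConversationalRetrievalQAChain } from "langchain/chains";
import type { VectorStore } from "@langchain/core/vectorstores";

// pgvector-backed store, set up elsewhere in the pipeline.
declare const vectorStore: VectorStore;

const model = new ChatOpenAI({ modelName: "gpt-4o", temperature: 0.4 });

const chain = ConversationalRetrievalQAChain.fromLLM(
  model,
  vectorStore.asRetriever(8),       // without the re-rank stage, just take the top 8
  { returnSourceDocuments: true }   // keep the chunks that grounded each answer
);

const result = await chain.call({
  question: "Draft an outline on <topic>",
  chat_history: [],
});
```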

Chunking Strategy

The 512-token chunk size with 64-token overlap wasn't arbitrary. I tested chunk sizes from 256 to 1024 tokens and measured retrieval precision on a held-out evaluation set. 512 tokens hit the sweet spot: large enough to contain a complete thought, small enough that retrieved chunks are topically focused. Larger chunks improved recall but hurt precision — you'd retrieve the right document but with a lot of irrelevant surrounding content.

Guardrails for Factual Accuracy

The pipeline includes a post-generation verification step. After the draft is generated, a second LLM call checks each factual claim against the retrieved source chunks. Claims that can't be grounded in the sources are flagged for writer review. This isn't foolproof — the verifier can miss things — but it catches the most egregious hallucinations before they reach a human.
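The verification call itself is a second, low-temperature model call fed the draft and the retrieved sources; the prompt wording and JSON shape below are illustrative rather than the exact production prompt:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Check each factual claim in the draft against the retrieved sources and
// return the claims that couldn't be grounded, for writer review.
export async function verifyDraft(draft: string, sources: string[]) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0, // as deterministic as possible for a checking task
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You are a fact-checking assistant. For each factual claim in the draft, " +
          "decide whether it is supported by the provided sources. Respond as JSON: " +
          '{ "flags": [{ "claim": string, "reason": string }] } listing only unsupported claims.',
      },
      {
        role: "user",
        content: `Sources:\n${sources.join("\n---\n")}\n\nDraft:\n${draft}`,
      },
    ],
  });

  return JSON.parse(completion.choices[0].message.content ?? '{"flags":[]}').flags;
}
```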

The verification step adds roughly 45 seconds to the pipeline. Worth it.

Results

After a 2-week ramp-up period where writers learned the tool and we tuned the prompts, the numbers stabilized.

Monthly Content Output

Content output went from 8 articles per week to an average of 27 per week in the third month after launch. The team didn't grow. The same writers, producing 3.4x the content, with quality scores (measured by editor review) holding at 94% of pre-pipeline baseline.

Time Per Article by Stage (hours)

The time breakdown tells the real story. Research dropped from 1.5 hours to 12 minutes — the retrieval stage surfaces relevant sources instantly. Drafting dropped from 2 hours to 18 minutes — writers are editing and refining rather than writing from scratch. Editing time actually stayed relatively high at 30 minutes, which is expected: the AI draft needs a human voice applied to it. Publishing dropped from 18 minutes to 6 minutes because the pipeline handles formatting automatically.

Total time per article: 4.2 hours before, 66 minutes after. That's a 74% reduction in time per article, which is what drives the 3.4x increase in output.

What Worked

The two-stage retrieval. The cross-encoder re-ranking step was the single biggest quality improvement. Without it, the retrieved chunks were topically relevant but not always the most useful. With it, the model consistently gets the most informative sources.

Versioned prompts. Treating the system prompt as a versioned artifact — stored in the database, linked to every generated draft — made iteration disciplined. You can A/B test prompt changes and measure their effect on edit distance. Without versioning, prompt engineering is guesswork.

The editorial UI. Writers adopted the tool faster than expected because the UI made the AI's reasoning transparent. Seeing which source chunks informed which paragraphs built trust. Writers weren't accepting a black box — they were reviewing a sourced draft.

What I'd Reconsider

The Redis job queue. For the scale of this project, Redis was overkill. A simple PostgreSQL-backed queue (using pg-boss or similar) would have been simpler to operate and debug. Redis added an infrastructure dependency that required monitoring and occasional intervention. The complexity wasn't justified by the throughput requirements.

Chunk overlap strategy. The fixed 64-token overlap works, but a smarter approach would use sentence boundaries. Splitting mid-sentence and relying on overlap to recover context is a hack. A sentence-aware chunker would produce cleaner chunks and likely improve retrieval precision by another few percentage points.
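A sketch of what that sentence-aware approach could look like, packing whole sentences into a size budget (character-based here; a token counter would slot in for `.length`):

```typescript
// Pack whole sentences into chunks up to a size budget instead of splitting
// mid-sentence. Sentence detection uses Intl.Segmenter (Node 16+).
export function sentenceChunks(text: string, maxChars = 2000): string[] {
  const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });
  const sentences = Array.from(segmenter.segment(text), (s) => s.segment);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```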

The verification step latency. 45 seconds of post-generation verification is noticeable. Writers sit and wait. A better architecture would run verification asynchronously and surface flagged claims in the editorial UI without blocking the draft from appearing. The draft could load immediately with a "verifying sources" indicator, and flags would appear as the verification completes.


Built with: Next.js · OpenAI GPT-4o · LangChain · PostgreSQL + pgvector · Redis · Vercel