The Legacy Code Archaeologist: Turning 15-Year-Old Code Bases Into Living Documentation

Every day in 2026, an average of 11,400 Americans turn 65 — and among them are thousands of senior developers, ops engineers, and sysadmins who carry decades of undocumented institutional knowledge about the systems that keep businesses running. A historic 4.18 million people are reaching retirement age this year alone, the largest wave in U.S. history. When they leave, they take with them the answers to questions nobody thought to write down: Why was this function written this way? What was the workaround for the 2014 vendor limitation? Which database table drives the nightly report that the CFO depends on?

The systems they built — or inherited, or hacked together at 2 AM to meet a shipment deadline — aren’t going anywhere. That VB.NET inventory app? It processes more orders this quarter than any new system ever could. Those COBOL batch jobs? They still balance the books. The Excel spreadsheet labeled FINAL_v3_Maybe in a shared drive nobody checks? It’s the de facto ERP for your logistics team.

The person who knows every hack, every workaround, and every undocumented “because I’m afraid to touch it” moment is looking at a golf bag. What happens when they walk out that door?

The Knowledge Cliff: Why This Is an Emergency (Not a Paperwork Exercise)

Let’s be explicit about the stakes. This isn’t about tidying up documentation. This is about the survival of the systems your business depends on.

Enterprises spend roughly 30% of their IT budgets just managing technical debt, with financial services sectors dedicating up to 39% to it (Protiviti Global Technology Executive Survey, 2025). The cost of maintenance alone averages $2,955,000 per organization annually (SnapLogic / Censuswide survey of 750 IT leaders, 2024). Meanwhile, developers waste an estimated 33% of their time on technical debt maintenance rather than building new capabilities (Deloitte Insights, 2024). And when key people leave, the damage multiplies: knowledge workers lose roughly 30% of their time just hunting for data and information (Forrester study).

The cost of unplanned downtime for SMBs runs $25,000 to $75,000 per hour (ITIC 2024 Hourly Cost of Downtime Report). When the only person who understands how the payment module talks to the legacy database is on vacation, or has already had their retirement party, every hour spent rediscovering what they knew is billed at that rate.

Meanwhile, three in five organizations in the UK public sector report that legacy systems are already blocking their AI adoption (Cloudhouse State of Technical Debt report, 2025). The same systems that are the backbone of the business are also the wall between the business and its future.

Even the 220 billion lines of COBOL still keeping 80% of in-person credit card sales and 95% of ATM transactions running worldwide (BizTech / CDW, April 2025) face a shrinking pool of maintainers: 91% of IT leaders want to expand their mainframe capabilities, but 71% say those teams are understaffed and 54% say they’re underfunded. This isn’t a niche problem. It’s the entire technology industry facing a knowledge cliff.

The gap between what the world depends on and what anyone actually understands is widening every single day. That’s not a maintenance problem. It’s a risk problem.

Why Vector Databases Fail at Legacy Code Documentation

Before we talk about what works, let’s talk about why the industry’s favorite answer doesn’t solve the legacy coding problem.

Vector databases and Retrieval-Augmented Generation have become the go-to pattern for connecting LLMs to documentation. And on paper, it sounds elegant: ingest everything, chunk it, embed it, query it. In practice, a vector DB approach to legacy code documentation fails at exactly the things legacy documentation needs most.

A vector DB retrieves isolated fragments on every query. Each time you ask a question, the LLM rediscovers everything from scratch. There’s no cross-page understanding. Each retrieved chunk is independent — it doesn’t know that the database table mentioned in the retrieved docstring is the same one described in a separate data dictionary page three chapters away. There’s no memory between queries. No synthesis. No accumulated understanding.

The architectural burden compounds the problem. A vector solution requires an embedding pipeline, a vector store, an orchestration layer, and continuous re-indexing as code changes. Context windows get exhausted with large codebases. And critically, nothing accumulates between sessions — the system doesn’t get smarter or more helpful over time. Every query starts at zero.

The real differentiator is far simpler: the tedious part of maintaining a knowledge base is neither the reading nor the thinking. It’s the bookkeeping. Cross-references, table of contents updates, section reorganization. LLMs don’t forget to update a cross-reference, and an agent can touch fifteen linked files in one pass without thinking about it.

Wiki pages are the infrastructure. They’re human-readable by default. They grow in value with each page added.

Karpathy’s LLM Wiki Pattern: What Makes It Perfect for Legacy Code Documentation

In April 2026, Andrej Karpathy published a GitHub gist that quickly accumulated over 5,000 stars: a simple pattern describing how an LLM can maintain a persistent, interlinked knowledge base by reading raw sources and incrementally building structured, cross-referenced documentation (Karpathy, “LLM Wiki,” GitHub Gist, April 4, 2026).

What makes this pattern revolutionary for the legacy code problem isn’t the technology. It’s the insight that the same LLM that can query your codebase can also build the knowledge base that makes that codebase queryable.

Karpathy’s pattern uses a three-layer architecture:

  1. Immutable raw sources — the original code, docs, tickets, emails, and vendor manuals. Never modified by the LLM.
  2. An LLM-maintained wiki — structured, interlinked markdown pages that the LLM reads and writes.
  3. A schema file (AGENTS.md or CLAUDE.md) — instructions that tell the LLM how to organize, name pages, maintain cross-references, and follow conventions.

The pattern spawned over 100 community implementations on GitHub within months. But almost all of those implementations target individual developers or small teams working on greenfield projects. Nobody has adapted this for the messy reality of SMB legacy systems: five different version control tools, tickets scattered across three platforms, vendor documentation in PDFs nobody reads, and a knowledge base that should help non-technical operations staff understand how the payment system actually works.

That’s exactly what we’re going to build. And this time, it’s not a gist. It’s a production architecture.

The Four-Layer System: What the Legacy Code Archaeologist Actually Looks Like

Here’s the architecture. It’s deliberately simple. Every layer exists for a reason.

Layer One: Raw Sources (The Immutable Ground Truth)

Everything that ever touched your legacy system. Code repositories (SVN, Git, and Mercurial are each handled differently, and the agent normalizes them). Design documents. Jira and ServiceNow ticket archives. Vendor manuals. Email threads about the system. Spreadsheet-based workflows. Operational runbooks. This layer is read-only. The LLM never modifies a single character of the source material. It’s the ground truth.
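One way to make the read-only rule checkable is to fingerprint the raw sources before and after each agent run. The following is a minimal sketch, assuming the raw sources sit under a single directory; the function names `build_manifest` and `changed_files` are hypothetical and not part of any tool named in this article.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or modified between two snapshots."""
    paths = set(before) | set(after)
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

Taking a manifest before an ingestion and comparing it afterwards should yield an empty list. Anything else means something wrote to the layer that is supposed to be immutable.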

Layer Two: The Wiki (The Living Knowledge Base)

This is where knowledge actually lives. Using BookStack as the platform, the wiki is organized into books that map directly to what your business needs to know about its legacy estate:

  • System Anatomy — Every source file, module, table, stored procedure, and API endpoint with a summary of purpose, dependencies, and patterns.
  • Why We Did It This Way — Undocumented rationale extracted from tickets, emails, and code comments.
  • Operator Playbooks — Step-by-step runbooks for daily operations and emergency procedures.
  • Data Dictionary — Table schemas, view relationships, and spreadsheet mappings.
  • Migration Tracker — What’s been modernized, what’s next, what’s safe to leave alone.
  • Decision Log — Rationale behind every hack and workaround.

Each page is auto-generated by the LLM from raw sources, reviewed by human editors, and enriched over time with every new query and ingestion. The wiki compounds in value. Every page added makes every other page more useful.
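To give a feel for what feeds a System Anatomy page, here is a rough sketch of pulling procedure declarations out of a VB.NET source file and rendering a page skeleton for the LLM to fill in. The regular expression is a deliberate simplification (it handles only a handful of common modifiers and is no substitute for a real parser), and `Procedure`, `extract_procedures`, and `render_anatomy_page` are hypothetical names used for illustration.

```python
import re
from dataclasses import dataclass

# Simplified assumption: a declaration is optional modifiers, then Sub/Function, then a name.
PROC_RE = re.compile(
    r"^\s*(?:(?:Public|Private|Friend|Protected|Shared|Overrides|Overridable)\s+)*"
    r"(Sub|Function)\s+(\w+)",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class Procedure:
    kind: str   # "Sub" or "Function"
    name: str
    line: int   # 1-based line number of the declaration


def extract_procedures(source: str) -> list[Procedure]:
    """Find Sub/Function declarations; 'End Sub' and comment lines do not match."""
    procs = []
    for match in PROC_RE.finditer(source):
        line = source.count("\n", 0, match.start(2)) + 1
        procs.append(Procedure(match.group(1).capitalize(), match.group(2), line))
    return procs


def render_anatomy_page(filename: str, procs: list[Procedure]) -> str:
    """Render a markdown skeleton listing each procedure found in the file."""
    lines = [f"# {filename}", "", "## Procedures", ""]
    lines += [f"- {p.kind} `{p.name}` (line {p.line})" for p in procs]
    return "\n".join(lines) + "\n"
```

The skeleton only records what exists and where. The summary of purpose, dependencies, and patterns would still come from the LLM reading the code alongside tickets and emails.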

Layer Three: The Schema (AGENTS.md — The Brain)

The schema file is the rulebook that tells local LLMs exactly how to operate the wiki. How to name pages. What structure each page follows. How to maintain cross-references between pages. How to flag contradictions between sources. How to run periodic lint passes that detect orphaned pages, broken links, and missing rationale. This is what separates a helpful wiki from a chaotic one.
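The lint pass described above can be approximated in a few lines. This sketch assumes pages are held as a name-to-markdown mapping, that cross-references use a `[[Page Name]]` convention, and that the schema requires a `## Rationale` heading on every page; all three are illustrative assumptions, as is the `lint_wiki` name.

```python
import re

# Assumed link convention: [[Page Name]]
LINK_RE = re.compile(r"\[\[([^\]|]+)\]\]")


def lint_wiki(pages: dict[str, str], roots: frozenset = frozenset({"index"})) -> dict[str, list]:
    """Report broken links, orphaned pages, and pages without a rationale section."""
    broken = []
    linked = set()
    for name, body in pages.items():
        for target in LINK_RE.findall(body):
            target = target.strip()
            if target not in pages:
                broken.append((name, target))
            elif target != name:
                linked.add(target)  # self-links do not rescue a page from orphan status
    orphans = sorted(n for n in pages if n not in linked and n not in roots)
    missing = sorted(
        n for n, body in pages.items() if n not in roots and "## Rationale" not in body
    )
    return {"broken_links": sorted(broken), "orphans": orphans, "missing_rationale": missing}
```

A report like this can be handed back to the LLM as a to-do list, which keeps the bookkeeping in the agent's hands rather than a human's.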

Layer Four: Navigation (index.md + log.md)

Two small files handle navigation as the wiki grows toward hundreds of pages:

  • index.md — A content-oriented catalog organized by category. Every page is listed with a link, a one-line summary, and metadata (date, source count, review status). The LLM reads this first on every query to find relevant sections before diving deep. This works at moderate scale (around 100 sources and hundreds of pages) with zero infrastructure overhead.
  • log.md — Append-only chronological record of all activity. Ingestions, queries, lint passes. Consistently formatted for parsing with simple Unix tools.
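As an illustration of what "consistently formatted" can mean for log.md, the sketch below writes and parses one-line entries. The entry layout (`## [date] kind | title`) and the helper names are assumptions made for this example, not a format prescribed by the pattern.

```python
import re
from datetime import date

# Assumed entry layout: "## [YYYY-MM-DD] kind | title"
ENTRY_RE = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] (\w+) \| (.+)$", re.MULTILINE)


def format_entry(day: date, kind: str, title: str) -> str:
    """Render one log line; kind is a single word such as ingest, query, or lint."""
    return f"## [{day.isoformat()}] {kind} | {title}\n"


def append_entry(log_path: str, day: date, kind: str, title: str) -> None:
    """Append only; existing log content is never rewritten."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(format_entry(day, kind, title))


def parse_log(text: str) -> list[tuple[str, str, str]]:
    """Return (date, kind, title) for every entry line, skipping free-form notes."""
    return ENTRY_RE.findall(text)
```

Because every entry starts with the same prefix, a plain `grep "^## \[" log.md | tail -5` is enough to see the five most recent activities.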

Why BookStack: The Platform Choice That Makes This Actually Workable

Several wiki platforms exist. But for the legacy documentation problem, the platform choice determines whether the system dies in week one or scales for years. Here’s why BookStack, deployed via the linuxserver/docker-bookstack image, is the right choice — and why alternatives fail for SMBs.

BookStack’s book > chapter > page hierarchy maps directly to the natural organization of a legacy knowledge base. It’s the exact structure that operations managers, QA leads, and new hires intuitively understand without training.

The critical advantage for our use case? BookStack already has a proven MCP server integration. The ttpears/bookstack-mcp and pnocera/bookstack-mcp-server projects each provide 47+ tools for reading, creating, searching, and managing BookStack content programmatically. We’re starting from working integration, not rebuilding.

Other wiki platforms don’t offer this. Wiki.js, DokuWiki, and XWiki have no built-in MCP server ecosystem. For a system whose entire value comes from LLM-driven document generation and maintenance, that’s a blocker.

And for SMBs, the economics are decisive: BookStack is free and open-source. Run it on your existing server. No per-user licensing. No vendor lock-in. Your data — your code, your tickets, your documentation — stays entirely on-premise in MariaDB/MySQL. Zero cloud dependency. Zero external data transmission.

Confluence runs $7 per user per month and stores your most sensitive documentation in the cloud. Confluence’s complexity overwhelms non-technical staff who need to contribute. BookStack gives you enterprise wiki capability at zero cost with a learning curve measured in minutes, not weeks.

How It All Fits Together: The Complete Loop

The architecture connects your existing AI stack to your existing documentation chaos. Here’s the flow:

  1. Deploy BookStack on Docker — Using the linuxserver image in roughly 15 minutes. The wiki platform is running.
  2. Connect the BookStack MCP server — The existing 47+ tool MCP server hooks into your BookStack instance. Your local LLM can now read and write wiki pages programmatically.
  3. Build the codebase crawler MCP server — This reads your SVN, Git, or Mercurial repos, extracts files, modules, functions, and data schemas, and feeds structured information into the BookStack MCP server for auto-generation of wiki pages.
  4. Route queries through Open WebUI — Every question about the legacy system is answered from the wiki, not from raw documents. The raw sources remain the ground truth, the wiki is the working knowledge base, the LLM is the analyst, and the results are presented in a familiar chat interface.
  5. Non-technical staff contribute — Operations managers can add notes, flag outdated pages, and write runbooks directly in BookStack’s intuitive interface. No Markdown. No git commits. No developer dependency.
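For readers who want to see what a page write looks like underneath the MCP tooling, here is a hedged sketch that builds a request against BookStack's REST API using only the standard library. It assumes the `POST /api/pages` endpoint with `book_id`, `name`, and `markdown` fields and token authentication; check the field names against the `/api/docs` page of your own instance. The function name and the example values are hypothetical.

```python
import json
import urllib.request


def build_create_page_request(
    base_url: str,
    token_id: str,
    token_secret: str,
    book_id: int,
    name: str,
    markdown: str,
) -> urllib.request.Request:
    """Build (but do not send) a request that would create a page in a book."""
    payload = {"book_id": book_id, "name": name, "markdown": markdown}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/pages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # BookStack API tokens are sent as "Token <id>:<secret>"
            "Authorization": f"Token {token_id}:{token_secret}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the write. Keeping construction separate from sending makes it easy to log or review what the agent is about to publish.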

This isn’t a theoretical architecture. Every component already exists. The BookStack Docker image from linuxserver is battle-tested. The MCP servers are published and functional. Open WebUI already connects to MCP servers via our existing tutorial stack. The knowledge pattern has been endorsed by the 5,000+ developers who’ve starred Karpathy’s gist. And the LLMs running locally provide the processing power.

What nobody has done — until now — is put these pieces together specifically for the SMB legacy problem.

Why This Only Gets Better Over Time

A vector database retrieves. A wiki compounds.

Every ingestion adds new pages and cross-references. Every query that uncovers a gap creates a new page to fill it. Every human edit in BookStack improves accuracy over the LLM’s initial draft. Every periodic lint pass catches orphaned links and missing rationale before they multiply.

According to Knowron’s own marketing material, companies that invest in AI-assisted knowledge capture can minimize knowledge loss and shorten onboarding times by up to 40%. That’s not a marginal improvement. At the upper end, it turns a six-month onboarding cycle into one of roughly three and a half months. It’s the difference between a team that panics when someone leaves and a team that simply reassigns their pages.

The cost of inaction is not abstract. $2,955,000 per organization goes annually to legacy maintenance and updates alone (SnapLogic / Censuswide survey of 750 IT leaders across the US, UK, and Germany, 2024). Of those who spend their time wrestling with it, over 75% report spending 5 to 25 hours per week on legacy patches and updates (same SnapLogic / Censuswide survey, 2024). And 63% of organizations report that technical debt has reached moderate to severe impact on their data stacks (same source).

The wiki is the asset that pays dividends. Every query answered, every gap filled, every cross-reference maintained makes the system more resilient. And it costs nothing beyond what you’re already spending on your local AI infrastructure — a $3,000 server and a BookStack container.

Key Takeaways

Financial: The legacy knowledge base costs nothing in licensing, runs on existing $3,000 server infrastructure, and directly prevents unplanned downtime that ITIC values at $25,000 to $75,000 per hour for SMBs (ITIC, 2024).

Security: Zero cloud dependency. Your code, tickets, and documentation stay entirely on-premise. This is the anti-cloud knowledge capture system — and it’s free. BookStack runs on MariaDB/MySQL with no external transmission. Ever.

Strategic: The wiki only improves over time. Every ingestion, every query, every new page compounds in value. It’s a growing asset, not a one-time investment. And with 12% of small businesses now actively paying for generative AI (JP Morgan Chase Institute, 2025) while three in five organizations cite legacy systems as a blocker to AI adoption (Cloudhouse State of Technical Debt report, 2025), the competitive white space is enormous.

Call to Action

Your legacy code doesn’t care that the person who built it is retiring. It will keep running until someone breaks something they don’t understand — and then the cost stops being abstract.

This architecture doesn’t require new software, expensive consultants, or months of documentation sprints. It requires a $3,000 server, a BookStack container, and five hours of your team’s time to set up. The components you need are already published, already tested, and already working.

Deploy the wiki before the retirement wave takes the knowledge that can’t be rebuilt. Start with the systems that would bring your business to a halt if the documentation vanished today. Ingest those first. The rest can wait.

And if you want to see the technical deep-dive — exactly how to deploy BookStack on Docker, connect the MCP server, and build the codebase crawler — we’ve got that tutorial coming next. Because this isn’t theory. It’s the next post in the series.
