How We Built a Knowledge Graph for a 10,000-Page Document Library
A real case study: ingesting a decade of enterprise documents into a queryable knowledge graph. The technical decisions, failure modes, and what the system can now answer that no search engine could.
This is the story of a specific project: a professional services firm with roughly 10,000 documents spanning a decade of client engagements, internal policies, regulatory filings, and training materials. They had a SharePoint. They had people who knew where things were. They had a search bar that returned filenames. The operational problem was that knowledge lived in people's heads, and when those people left, the knowledge went with them.
The first decision was schema design. A knowledge graph is only as useful as the entity types and relationship types it models. Before writing a line of extraction code, we spent two weeks running entity taxonomy workshops: what are the nouns in this organization's world that matter for answering questions? The answer was: clients, projects, deliverables, regulatory frameworks, clauses (in contracts and policies), personnel, dates, and monetary values. The relationship types followed: client has project, project produced deliverable, deliverable references regulatory framework, clause binds client, personnel worked on project.
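To make that concrete, here is roughly what the schema looks like expressed in code. This is an illustrative sketch in Python dataclasses, not the project's actual schema definition; the field names (aliases, evidence, confidence, doc_date) are assumptions that anticipate how the later stages consume the data.

```python
from dataclasses import dataclass, field
from enum import Enum

class EntityType(Enum):
    CLIENT = "client"
    PROJECT = "project"
    DELIVERABLE = "deliverable"
    REGULATORY_FRAMEWORK = "regulatory_framework"
    CLAUSE = "clause"
    PERSON = "person"
    DATE = "date"
    MONETARY_VALUE = "monetary_value"

# Relationship types, constrained to the (source, target) entity types they connect.
RELATION_SCHEMA = {
    "HAS_PROJECT": (EntityType.CLIENT, EntityType.PROJECT),
    "PRODUCED": (EntityType.PROJECT, EntityType.DELIVERABLE),
    "REFERENCES": (EntityType.DELIVERABLE, EntityType.REGULATORY_FRAMEWORK),
    "BINDS": (EntityType.CLAUSE, EntityType.CLIENT),
    "WORKED_ON": (EntityType.PERSON, EntityType.PROJECT),
}

@dataclass
class Entity:
    name: str                                       # canonical name, e.g. "Acme Corporation"
    type: EntityType
    aliases: set[str] = field(default_factory=set)  # surface forms seen in documents

@dataclass
class Relation:
    source: str                # canonical entity name
    relation: str              # key in RELATION_SCHEMA
    target: str                # canonical entity name
    evidence: str              # supporting passage from the source document
    doc_id: str
    doc_date: str              # ISO date of the source document, used for conflict resolution
    confidence: float = 1.0    # extraction confidence, used during adjudication
```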
The extraction pipeline runs in three stages. In the first stage, each document goes through a classification step that assigns it a type (contract, policy, report, proposal, email thread) and extracts document-level metadata (date, parties, subject). This uses a relatively simple classifier that we fine-tuned on 200 labeled examples. Document type matters because different types have different entity extraction strategies: a contract extraction prompt is different from a policy extraction prompt.
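The classifier itself was described above only as "relatively simple," so the sketch below shows one plausible shape for it: a TF-IDF plus logistic regression pipeline trained on the ~200 labeled examples. Treat the function names and hyperparameters as illustrative rather than the actual implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

DOC_TYPES = ["contract", "policy", "report", "proposal", "email_thread"]

# Stage 1: document-type classifier, trained on the ~200 hand-labeled examples.
classifier = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

def train_classifier(texts: list[str], labels: list[str]) -> None:
    classifier.fit(texts, labels)

def classify_document(text: str) -> str:
    """Return one of DOC_TYPES; the result selects the extraction prompt used downstream."""
    return classifier.predict([text])[0]
```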
The second stage is entity extraction. For each classified document, we run a structured extraction pass that pulls out entities of the types defined in the schema. This is where most of the engineering complexity lives. LLMs are inconsistent about entity naming: the same client might be referred to as "Acme Corp", "Acme Corporation", "Acme", and "the client" across 50 documents. The extraction prompt forces canonical naming by providing the list of known entities as context, but this only works once you have a seed set of canonical entities to provide. We bootstrapped the seed set with a pass over the 100 most recent documents, then ran successive canonicalization passes over the rest of the corpus, each one more complete as the canonical entity list grew.
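A sketch of what the canonical-naming extraction pass might look like, reusing the Entity and EntityType types from the schema sketch above. The `llm` callable and the exact prompt wording are assumptions standing in for whatever model endpoint the pipeline actually uses; the point is that the known canonical names go into the prompt, and anything the model maps onto them is recorded as an alias rather than a new node.

```python
import json

def extract_entities(doc_text: str, known_entities: dict[str, Entity], llm) -> list[Entity]:
    """Stage 2: structured entity extraction for one document.

    `llm` is a hypothetical callable (prompt -> JSON string). Known canonical
    names are injected into the prompt so the model maps surface forms
    ("Acme", "the client") onto existing entities instead of minting new ones.
    """
    prompt = (
        "Extract entities of these types: client, project, deliverable, "
        "regulatory_framework, clause, person, date, monetary_value.\n"
        "Known canonical entities (reuse these names where they apply): "
        + json.dumps(sorted(known_entities)) + "\n"
        'Return a JSON list of {"name": ..., "type": ..., "alias_of": ...}.\n\n'
        "Document:\n" + doc_text
    )
    extracted = []
    for item in json.loads(llm(prompt)):          # assumes the model returns valid JSON
        canonical = item.get("alias_of") or item["name"]
        if canonical in known_entities:
            known_entities[canonical].aliases.add(item["name"])
            extracted.append(known_entities[canonical])
        else:
            entity = Entity(name=canonical, type=EntityType(item["type"]))
            known_entities[canonical] = entity    # grows the seed set for later documents
            extracted.append(entity)
    return extracted
```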
The third stage is relationship extraction. For each pair of co-occurring entities within a document or paragraph, we ask the LLM to classify the relationship type and extract supporting evidence. This is more expensive than entity extraction because it's O(n²) in the number of entities per document. We restrict it to entity pairs that co-occur within a 500-token window, which catches most meaningful relationships while keeping the cost tractable.
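The windowing itself is simple. A minimal sketch, assuming stage two emits (canonical entity name, token offset) mentions per document; only pairs that survive this filter are sent to the LLM for relationship classification against the relation types in the schema.

```python
from itertools import combinations

WINDOW_TOKENS = 500

def cooccurring_pairs(mentions: list[tuple[str, int]]) -> set[tuple[str, str]]:
    """Entity pairs whose mentions fall within a 500-token window.

    `mentions` holds (canonical_entity_name, token_offset) for one document.
    Restricting relationship classification to these pairs keeps the
    otherwise quadratic pass tractable.
    """
    pairs = set()
    for (a, pos_a), (b, pos_b) in combinations(mentions, 2):
        if a != b and abs(pos_a - pos_b) <= WINDOW_TOKENS:
            pairs.add(tuple(sorted((a, b))))
    return pairs
```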
The adjudication layer — normalizing, deduplicating, and resolving conflicts across extractions — took more time than the extraction itself. When two documents say different things about the same relationship, which version is canonical? We implemented a recency-weighted conflict resolution: more recent documents win, with human review flagged for high-confidence contradictions. This required building a simple review UI that the client's team used to adjudicate about 400 conflicts over two weeks.
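A sketch of the recency-weighted resolution rule, using the Relation type from the schema sketch above. The 0.8 confidence threshold is an illustrative number, not a figure from the project, and the real logic around the review queue is certainly richer than this.

```python
from datetime import date

HIGH_CONFIDENCE = 0.8   # illustrative threshold, not a figure from the project

def resolve_conflict(candidates: list[Relation], review_queue: list[Relation]) -> Relation:
    """Pick one Relation when several documents assert different targets for
    the same (source, relation) pair.

    Recency-weighted: the assertion from the most recent document wins. If a
    losing assertion contradicts the winner and both were extracted with high
    confidence, the pair is also pushed onto the human review queue rather
    than being silently resolved.
    """
    ranked = sorted(candidates, key=lambda r: date.fromisoformat(r.doc_date), reverse=True)
    winner = ranked[0]
    for other in ranked[1:]:
        if (other.target != winner.target
                and other.confidence >= HIGH_CONFIDENCE
                and winner.confidence >= HIGH_CONFIDENCE):
            review_queue.extend([winner, other])
            break
    return winner
```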
The query layer is where the investment pays off. The questions the system can now answer include: "Which clients have active contracts that reference SEBI circular X?" — a cross-document traversal that took hours of manual lookup before. "Which projects involved both regulatory compliance work and technology implementation?" — a multi-hop query across project, deliverable, and engagement type nodes. "What are all the contractual obligations that expire in Q3 2025?" — a temporal query across the clause and date subgraph.
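To show the shape of such a multi-hop traversal, here is a pure-Python sketch of a "which clients are exposed to a given regulatory framework" query over an in-memory adjacency structure, using only the relation types from the schema sketch above. The production system presumably sits on a proper graph store, and the actual contract-and-clause hop structure may differ; this is just the idea.

```python
from collections import defaultdict

# Minimal in-memory view of the graph: adjacency lists keyed by relation type.
graph: dict[str, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))

def add_edge(source: str, relation: str, target: str) -> None:
    graph[relation][source].add(target)

def clients_exposed_to(framework: str) -> set[str]:
    """Multi-hop traversal: client -> project -> deliverable -> regulatory framework."""
    exposed = set()
    for client, projects in graph["HAS_PROJECT"].items():
        for project in projects:
            for deliverable in graph["PRODUCED"].get(project, set()):
                if framework in graph["REFERENCES"].get(deliverable, set()):
                    exposed.add(client)
    return exposed

# e.g. clients_exposed_to("SEBI circular X") resolves in a handful of
# dictionary lookups instead of hours of manual cross-document search.
```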
The system now handles roughly 200 queries per week from a team of 40 professionals. The queries that previously took 30-90 minutes of manual search now return in under 10 seconds. The harder-to-quantify benefit is the queries that no one asked before because the manual effort would have been prohibitive — cross-client pattern analysis, regulatory exposure mapping, historical precedent lookup — that are now part of how the team operates.