add mini-context-graph skill (#1580)

* add mini-context-graph skill

* remove pycache files

* filename case update to SKILL.md

* update readme
Nixon Kurian
2026-05-05 09:34:37 +05:30
committed by GitHub
parent 1f96bce626
commit 746ba555b6
16 changed files with 2343 additions and 0 deletions
@@ -230,6 +230,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [microsoft-skill-creator](../skills/microsoft-skill-creator/SKILL.md)<br />`gh skills install github/awesome-copilot microsoft-skill-creator` | Create agent skills for Microsoft technologies using Learn MCP tools. Use when users want to create a skill that teaches agents about any Microsoft technology, library, framework, or service (Azure, .NET, M365, VS Code, Bicep, etc.). Investigates topics deeply, then generates a hybrid skill storing essential knowledge locally while enabling dynamic deeper investigation. | `references/skill-templates.md` |
| [migrating-oracle-to-postgres-stored-procedures](../skills/migrating-oracle-to-postgres-stored-procedures/SKILL.md)<br />`gh skills install github/awesome-copilot migrating-oracle-to-postgres-stored-procedures` | Migrates Oracle PL/SQL stored procedures to PostgreSQL PL/pgSQL. Translates Oracle-specific syntax, preserves method signatures and type-anchored parameters, leverages orafce where appropriate, and applies COLLATE "C" for Oracle-compatible text sorting. Use when converting Oracle stored procedures or functions to PostgreSQL equivalents during a database migration. | None |
| [minecraft-plugin-development](../skills/minecraft-plugin-development/SKILL.md)<br />`gh skills install github/awesome-copilot minecraft-plugin-development` | Use this skill when building or modifying Minecraft server plugins for Paper, Spigot, or Bukkit, including plugin.yml setup, commands, listeners, schedulers, player state, team or arena systems, persistent progression, economy or profile data, configuration files, Adventure text, and version-safe API usage. Trigger for requests like "build a Minecraft plugin", "add a Paper command", "fix a Bukkit listener", "create plugin.yml", "implement a minigame mechanic", "add a perk or quest system", or "debug server plugin behavior". | `references/bootstrap-registration.md`<br />`references/build-test-and-runtime-validation.md`<br />`references/config-data-and-async.md`<br />`references/maps-heroes-and-feature-modules.md`<br />`references/minigame-instance-flow.md`<br />`references/persistent-progression-and-events.md`<br />`references/project-patterns.md`<br />`references/state-sessions-and-phases.md` |
| [mini-context-graph](../skills/mini-context-graph/SKILL.md)<br />`gh skills install github/awesome-copilot mini-context-graph` | A persistent, compounding knowledge base combining Karpathy's LLM Wiki pattern<br />with a structured knowledge graph. Ingest documents once — the LLM writes wiki<br />pages, extracts entities/relations into the graph, and stores raw content for<br />evidence retrieval. Knowledge accumulates and cross-references; it is never<br />re-derived from scratch. | `references/ingestion.md`<br />`references/lint.md`<br />`references/ontology.md`<br />`references/retrieval.md`<br />`scripts/config.py`<br />`scripts/contextgraph.py`<br />`scripts/template_agent_workflow.py`<br />`scripts/tools` |
| [mkdocs-translations](../skills/mkdocs-translations/SKILL.md)<br />`gh skills install github/awesome-copilot mkdocs-translations` | Generate a language translation for a mkdocs documentation stack. | None |
| [model-recommendation](../skills/model-recommendation/SKILL.md)<br />`gh skills install github/awesome-copilot model-recommendation` | Analyze chatmode or prompt files and recommend optimal AI models based on task complexity, required capabilities, and cost-efficiency | None |
| [msstore-cli](../skills/msstore-cli/SKILL.md)<br />`gh skills install github/awesome-copilot msstore-cli` | Microsoft Store Developer CLI (msstore) for publishing Windows applications to the Microsoft Store. Use when asked to configure Store credentials, list Store apps, check submission status, publish submissions, manage package flights, set up CI/CD for Store publishing, or integrate with Partner Center. Supports Windows App SDK/WinUI, UWP, .NET MAUI, Flutter, Electron, React Native, and PWA applications. | None |
@@ -0,0 +1,194 @@
---
name: mini-context-graph
description: |
A persistent, compounding knowledge base combining Karpathy's LLM Wiki pattern
with a structured knowledge graph. Ingest documents once — the LLM writes wiki
pages, extracts entities/relations into the graph, and stores raw content for
evidence retrieval. Knowledge accumulates and cross-references; it is never
re-derived from scratch.
---
# Mini Context Graph Skill
## The Core Idea
Standard RAG re-discovers knowledge from scratch on every query. This skill is different:
1. **Wiki layer** — The LLM writes and maintains persistent markdown pages (summaries, entity pages, topic syntheses). Cross-references are already there. The wiki gets richer with every ingest.
2. **Graph layer** — Entities and relations are extracted once and stored as a navigable knowledge graph. BFS traversal answers structural queries without re-reading sources.
3. **Raw source layer** — Original documents are stored immutably with chunks. Provenance links tie every graph node and edge back to the exact text that supports it.
> The LLM writes; the Python tools handle all bookkeeping.
---
## Three Layers
| Layer | Where | What the LLM does | What Python does |
|-------|-------|-------------------|-----------------|
| **Raw Sources** | `data/documents.json` | Reads (never modifies) | Stores chunks + metadata |
| **Wiki** | `wiki/` (markdown) | Writes/updates pages | Manages index.md + log.md |
| **Graph** | `data/graph.json` | Extracts entities + relations | Persists, deduplicates, traverses |
---
## ⚡ Quick Start for Agents
```python
from scripts.contextgraph import ContextGraphSkill
from scripts.tools import wiki_store
skill = ContextGraphSkill()
# ===== INGEST WITH FULL RAG + WIKI =====
# 1. Read references/ingestion.md and references/ontology.md first
# 2. Extract entities and relations (LLM reasoning step)
entities = [
{"name": "memory leak", "type": "issue", "supporting_text": "memory leaks cause crashes"},
{"name": "system crash", "type": "issue", "supporting_text": "system crashes due to memory leaks"},
]
relations = [
{"source": "memory leak", "target": "system crash", "type": "causes",
"confidence": 1.0, "supporting_text": "System crashes due to memory leaks."},
]
result = skill.ingest_with_content(
doc_id="doc_001",
title="System Crash Analysis",
source="/docs/incident_report.pdf",
raw_content="System crashes due to memory leaks. Memory leaks occur when objects are not released.",
entities=entities,
relations=relations,
)
# result = {"doc_id": "doc_001", "chunk_count": 1, "nodes_added": 2, "edges_added": 1}
# 3. Write a wiki summary page for this document
wiki_store.write_page(
category="summary",
title="System Crash Analysis Summary",
content="""---
title: System Crash Analysis
source_document: doc_001
tags: [summary, incident]
---
# System Crash Analysis
**Source:** incident_report.pdf
## Key Claims
- [[memory-leak]] causes [[system-crash]] (confidence: 1.0)
## Entities
- [[memory-leak]] (issue)
- [[system-crash]] (issue)
""",
summary="Incident report: memory leaks cause system crashes.",
)
# ===== QUERY WITH EVIDENCE =====
result = skill.query_with_evidence("Why does the system crash?")
# Returns: {"query": ..., "subgraph": ..., "supporting_documents": [...], "evidence_chain": ...}
# ===== WIKI SEARCH (read wiki before answering) =====
pages = wiki_store.search_wiki("memory leak")
# Returns: [{slug, category, path, snippet}, ...]
```
---
## Operations
### Ingest
When a user provides a new document:
1. Read `references/ingestion.md` — entity/relation extraction rules.
2. Read `references/ontology.md` — type normalization rules.
3. Extract entities and relations using your LLM reasoning.
4. Call `skill.ingest_with_content(...)` — stores raw content + chunks + graph nodes + provenance.
5. **Write a wiki summary page** using `wiki_store.write_page(category="summary", ...)`.
6. **Update entity pages** — for each new/updated entity, write or update `wiki_store.write_page(category="entity", ...)`.
7. **Update topic pages** if the document touches an existing synthesis topic.
8. A single document ingest will typically touch 3–10 wiki pages.
### Query
When a user asks a question:
1. **Check the wiki first** — `wiki_store.search_wiki(query)` to find relevant pages. Read them (see the sketch after this list).
2. If the wiki has a good answer, synthesize from wiki pages (fast path).
3. If deeper graph traversal is needed, call `skill.query_with_evidence(query)`.
4. Return the answer with evidence citations from `supporting_documents`.
5. If the answer is valuable, file it back as a new wiki topic page.
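A minimal sketch of this wiki-first flow is shown below (illustrative only; it reuses the imports from the Quick Start above, and the actual answer synthesis is an LLM reasoning step, not a provided function):
```python
from scripts.contextgraph import ContextGraphSkill
from scripts.tools import wiki_store

skill = ContextGraphSkill()
question = "Why does the system crash?"

# Steps 1-2: wiki first. Read matching pages and answer from them if they suffice.
pages = wiki_store.search_wiki(question)
wiki_context = [wiki_store.read_page_by_slug(p["slug"]) for p in pages]

if wiki_context:
    answer_source = "wiki"   # synthesize the answer from these pages (LLM step)
else:
    # Step 3: deep path. Graph traversal with provenance for citations.
    result = skill.query_with_evidence(question)
    answer_source = "graph"
    citations = [d["doc_id"] for d in result["supporting_documents"]]  # cite these in the answer

# Step 5: optionally file a valuable answer back as a topic page.
wiki_store.write_page(
    category="topic",
    title=question,
    content=f"# {question}\n\n(Answer synthesized from the {answer_source} layer.)",
    summary=f"Answer to: {question}",
)
```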
### Lint
Periodically health-check the wiki:
```python
from scripts.tools import wiki_store
issues = wiki_store.lint_wiki()
# Returns: {orphan_pages, missing_pages, broken_wikilinks, isolated_pages}
```
Ask the LLM to review and fix: broken links, orphan pages, stale claims, missing cross-references. See `references/lint.md` for full lint workflow.
---
## Ingestion Constraints
- ❌ Do NOT hallucinate entities not present in the text
- ❌ Do NOT add relations without explicit textual evidence
- ❌ Do NOT add edges with confidence < 0.6
- ✅ Provide `supporting_text` for every entity and relation — this enables provenance
- ✅ Write a wiki summary page for every ingested document
- ✅ Update existing entity pages when new information arrives
- ✅ Flag contradictions in wiki pages when new data conflicts with old claims
---
## Retrieval Constraints
- 🔒 Traversal depth MUST NOT exceed 2 (config: MAX_GRAPH_DEPTH)
- 🔒 Only edges with confidence ≥ 0.6 (config: MIN_CONFIDENCE)
- 🔒 Maximum 50 nodes returned (config: MAX_NODES)
- ❌ Do NOT fabricate nodes or edges not in the graph
---
## Full Python API Reference
| Method | Purpose | When to Use |
|--------|---------|-------------|
| `skill.ingest_with_content(doc_id, title, source, raw_content, entities, relations)` | Full RAG ingest: raw docs + graph + provenance | Every new document |
| `skill.add_node(name, node_type)` | Add single entity (no provenance) | Quick additions without a source doc |
| `skill.add_edge(source_name, target_name, relation, confidence)` | Add single relation | Quick additions without a source doc |
| `skill.query(query)` | Graph-only retrieval → subgraph | Structural queries |
| `skill.query_with_evidence(query)` | Graph + provenance → subgraph + source chunks | Queries requiring citations |
| `wiki_store.write_page(category, title, content, summary)` | Write/update a wiki page | After every ingest; after answering queries |
| `wiki_store.read_page(category, title)` | Read a wiki page | Before answering; for cross-referencing |
| `wiki_store.search_wiki(query)` | Keyword search across wiki | Fast path before graph traversal |
| `wiki_store.list_pages(category)` | List all wiki pages | Getting an overview |
| `wiki_store.get_log(last_n)` | Read recent operations | Understanding wiki history |
| `wiki_store.lint_wiki()` | Health check | Periodic maintenance |
| `documents_store.list_documents()` | List all ingested raw sources | Audit / provenance checking |
| `documents_store.search_chunks(query)` | Chunk-level search | Finding specific evidence |
---
## Design Philosophy
> "The wiki is a persistent, compounding artifact. The cross-references are already there. The synthesis already reflects everything you've read." — Karpathy
| Layer | What Happens | Who Owns It |
|-------|-----------|-------------|
| **LLM Reasoning** | Extraction, synthesis, writing wiki pages | Agent (.md guidance files) |
| **Wiki Persistence** | Index, log, file I/O | `wiki_store.py` |
| **Graph Persistence** | Dedup, index, BFS traverse | `graph_store.py`, `retrieval_engine.py` |
| **Raw Source Storage** | Immutable docs + chunks + provenance | `documents_store.py` |
The human curates sources and asks questions. The LLM writes the wiki, extracts the graph, and answers with citations. Python handles all bookkeeping.
@@ -0,0 +1,196 @@
# Ingestion Instructions
This file defines how the agent extracts entities and relations from a raw document.
---
## Step 1: Read the Document
Read the provided text carefully. Identify:
- **Entities**: noun phrases that refer to real-world objects, systems, components, actors, concepts, or events.
- **Relations**: verb phrases that describe how one entity affects, contains, causes, uses, or is related to another.
---
## Step 2: Extract Entities
For each entity:
- Record its **name** (normalized: lowercase, strip leading/trailing whitespace)
- Assign a **type**: a short label (1–3 words) that categorizes the entity
### Entity Type Examples
| Entity Name | Suggested Type |
|-------------|---------------|
| Python interpreter | software |
| memory leak | issue |
| operating system | system |
| database | infrastructure |
| user | actor |
| API endpoint | interface |
| server | infrastructure |
**Rules:**
- Types must be general enough to reuse across documents
- Do NOT create unique types per entity (e.g., avoid `python-interpreter-type`)
- Use `ontology.md` normalization rules to canonicalize types
---
## Step 3: Extract Relations
For each pair of entities with an explicit connection in the text:
- Record the **source** entity name
- Record the **target** entity name
- Record the **relation type**: a verb or verb phrase (normalized: lowercase)
- Assign a **confidence** score between 0 and 1:
- 1.0 = stated explicitly ("A causes B")
- 0.8 = strongly implied ("A is linked to B")
- 0.6 = weakly implied ("A may affect B")
- < 0.6 = do NOT include
---
## Step 4: Output Format
Produce a JSON object in this exact format:
```json
{
"entities": [
{ "name": "entity name", "type": "entity type", "supporting_text": "exact quote mentioning this entity" }
],
"relations": [
{
"source": "source entity name",
"target": "target entity name",
"type": "relation type",
"confidence": 0.9,
"supporting_text": "exact quote that justifies this relation"
}
]
}
```
The `supporting_text` field is **required for provenance**. It must be a verbatim or near-verbatim quote from the document that mentions or supports the entity/relation. This is what links graph nodes and edges back to their source.
---
## Rules
- All names and types must be **lowercase**
- Only include relations where **both entities** are present in the entities list
- Do NOT invent entities or relations not supported by the text
- Prefer **reusing existing entity and relation types** from the ontology over creating new ones
- One entity can appear in multiple relations (as source or target)
- Always include `supporting_text` — this enables evidence retrieval and audit trails
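A small pre-flight check that mirrors these rules before calling `skill.ingest_with_content(...)` might look like the following sketch (the helper name `validate_extraction` is hypothetical, not part of the skill):
```python
def validate_extraction(entities: list[dict], relations: list[dict]) -> list[str]:
    """Return a list of rule violations; an empty list means the extraction is clean."""
    problems = []
    names = {e["name"] for e in entities}
    for e in entities:
        if e["name"] != e["name"].strip().lower() or e["type"] != e["type"].strip().lower():
            problems.append(f"entity not lowercase/stripped: {e['name']!r}")
        if not e.get("supporting_text"):
            problems.append(f"entity missing supporting_text: {e['name']!r}")
    for r in relations:
        if r["source"] not in names or r["target"] not in names:
            problems.append(f"relation endpoints missing from entity list: {r['source']} -> {r['target']}")
        if r.get("confidence", 0) < 0.6:
            problems.append(f"relation below confidence threshold: {r['source']} -> {r['target']}")
        if not r.get("supporting_text"):
            problems.append(f"relation missing supporting_text: {r['source']} -> {r['target']}")
    return problems
```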
---
## Step 5: Write Wiki Pages (Required)
After calling `skill.ingest_with_content(...)`, you MUST write wiki pages:
### 5a. Write a summary page for the document
```python
from scripts.tools import wiki_store
wiki_store.write_page(
category="summary",
title=f"{title} Summary",
content=f"""---
title: {title}
source_document: {doc_id}
tags: [summary]
---
# {title}
**Source:** {source}
## Key Claims
{chr(10).join(f'- [[{r["source"].replace(" ", "-")}]] {r["type"]} [[{r["target"].replace(" ", "-")}]] (confidence: {r["confidence"]})' for r in relations)}
## Entities
{chr(10).join(f'- [[{e["name"].replace(" ", "-")}]] ({e["type"]})' for e in entities)}
## Open Questions
- (Add questions from reading the document here)
""",
summary=f"Summary of {title}",
)
```
### 5b. Write or update entity pages
For each **new** entity not already in the wiki, write an entity page:
```python
wiki_store.write_page(
category="entity",
title=entity_name,
content=f"""---
title: {entity_name}
type: {entity_type}
source_document: {doc_id}
tags: [{entity_type}]
---
# {entity_name}
(Description from the document or prior knowledge.)
## Relations
(List any wikilinks to related entities extracted from relations.)
## Mentioned in
- [[{doc_id}-summary]]
""",
summary=f"{entity_name}: {entity_type}",
)
```
For **existing** entity pages, read the current page and append new information, updated relations, or flag contradictions.
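One way to perform that read-modify-write is sketched below (it assumes `wiki_store.read_page(category, title)` returns the page content as a string, per the API table in SKILL.md; `related_slug`, `relation_type`, and `confidence` stand in for values from the current extraction):
```python
existing = wiki_store.read_page(category="entity", title=entity_name)
addition = (
    f"\n## Update from {doc_id}\n"
    f"- New relation: [[{related_slug}]] ({relation_type}, confidence {confidence})\n"
)
wiki_store.write_page(
    category="entity",
    title=entity_name,
    # Append new information; add a "## Contradictions" section instead if claims conflict.
    content=existing + addition,
    summary=f"{entity_name}: updated from {doc_id}",
)
```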
---
## Example
**Input document:**
```
System crashes due to memory leaks.
Memory leaks occur when objects are not released.
```
**Expected extraction output:**
```json
{
"entities": [
{ "name": "system crash", "type": "issue", "supporting_text": "system crashes due to memory leaks" },
{ "name": "memory leak", "type": "issue", "supporting_text": "memory leaks occur when objects are not released" },
{ "name": "object", "type": "component", "supporting_text": "objects are not released" }
],
"relations": [
{
"source": "memory leak",
"target": "system crash",
"type": "causes",
"confidence": 1.0,
"supporting_text": "System crashes due to memory leaks."
},
{
"source": "object",
"target": "memory leak",
"type": "contributes to",
"confidence": 0.9,
"supporting_text": "Memory leaks occur when objects are not released."
}
]
}
```
@@ -0,0 +1,163 @@
# Lint Instructions
This file defines the wiki health-check workflow.
Run this periodically (or after a large batch of ingests) to keep the wiki
clean and accurate. The pattern is from Karpathy's LLM Wiki: detect contradictions,
orphans, broken links, stale claims, and data gaps.
---
## When to Run
- After ingesting 5+ documents
- When the user asks "check the wiki" or "health check"
- When answers seem inconsistent or contradictory
- Before a major synthesis or presentation
---
## Step 1: Run the Automated Health Check
```python
from scripts.tools import wiki_store
issues = wiki_store.lint_wiki()
# Returns:
# {
# "orphan_pages": [list of slugs in files but not in index],
# "missing_pages": [list of slugs in index but file deleted],
# "broken_wikilinks": {slug: [broken link targets]},
# "isolated_pages": [slugs with no wikilinks at all],
# }
```
---
## Step 2: Triage Each Issue Type
### Orphan Pages
Pages exist on disk but are not in the index. They are invisible to search.
**Fix**: Add them to the index or delete if stale.
```python
# To add to index, re-write the page (this auto-updates the index):
wiki_store.write_page(category="...", title="...", content=existing_content)
# To delete (manual step — confirm with user first):
# rm wiki/{category}/{slug}.md
```
### Missing Pages
In the index but the file was deleted. Dangling references.
**Fix**: Either recreate the page from knowledge or remove from index.
### Broken Wikilinks
`[[slug]]` references that point to pages that don't exist.
**Fix**: Create the missing page, or correct the link.
### Isolated Pages
Pages with no `[[wikilinks]]` — they are unreachable via link traversal.
**Fix**: Add links from/to related pages.
---
## Step 3: Check for Contradictions
Read the wiki index and scan for pages that might contradict each other:
```python
pages = wiki_store.list_pages()
# Returns [{slug, category, summary, date}, ...]
```
Look for:
- Same entity with conflicting `type` in different pages
- Same relation with different direction in different pages
- Newer ingests that update/supersede older claims
**When you find a contradiction:**
- Add a `## Contradictions` section to the relevant entity/topic pages:
```markdown
## Contradictions
- doc_001 says X; doc_003 says not-X — unresolved
```
- Flag it in the log:
```python
# Handled by wiki_store.write_page which auto-appends to log.md
```
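One rough way to surface type conflicts across pages is sketched below (illustrative only; it assumes entity pages keep `type:` in YAML front matter and summary pages list entities as `- [[slug]] (type)`, as in the ingestion templates):
```python
import re
from collections import defaultdict
from scripts.tools import wiki_store

types_seen: dict[str, set[str]] = defaultdict(set)  # entity slug -> types observed

# Front-matter type declared on each entity page.
for page in wiki_store.list_pages(category="entity"):
    content = wiki_store.read_page_by_slug(page["slug"])
    match = re.search(r"^type:\s*(.+)$", content, flags=re.MULTILINE)
    if match:
        types_seen[page["slug"]].add(match.group(1).strip().lower())

# "- [[slug]] (type)" entries in the Entities section of summary pages.
for page in wiki_store.list_pages(category="summary"):
    content = wiki_store.read_page_by_slug(page["slug"])
    for slug, etype in re.findall(r"\[\[([\w-]+)\]\]\s*\(([\w ]+)\)", content):
        types_seen[slug].add(etype.strip().lower())

conflicts = {slug: types for slug, types in types_seen.items() if len(types) > 1}
# Add a "## Contradictions" note to each conflicting page and flag it for human review.
```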
---
## Step 4: Check for Stale Claims
Review pages ingested more than N days ago (use the `date` field from the index).
Ask: "Has any newer document superseded this claim?"
**When a claim is stale:**
- Update the page: add a `## Superseded` section or update the body.
- Mark the old claim with _(superseded by [[newer-doc-summary]])_.
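To shortlist candidates, the index can be filtered by age (a sketch; it assumes the `date` field returned by `list_pages()` is an ISO-format timestamp):
```python
from datetime import datetime, timedelta, timezone
from scripts.tools import wiki_store

cutoff = datetime.now(timezone.utc) - timedelta(days=30)  # N = 30 days; adjust as needed
stale_candidates = [
    page for page in wiki_store.list_pages()
    if datetime.fromisoformat(page["date"]) < cutoff
]
# Review each candidate and mark superseded claims as described above.
```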
---
## Step 5: Check for Missing Cross-References
For each entity page, check: does it link back to all summary pages that mention it?
For each summary page, check: does it link to all entity pages it extracted?
**Fix**: Read the page and add missing `[[slug]]` links.
---
## Step 6: Identify Data Gaps
Review entity pages that lack:
- A proper description (just a stub)
- Any `## Relations` section
- Any `## Mentioned in` links
These are candidates for deeper research or new ingests.
---
## Step 7: Log the Lint Pass
```python
# wiki_store.write_page automatically logs the activity.
# For a manual lint summary, append to log.md via write_page on a topic:
wiki_store.write_page(
category="topic",
title="Lint Pass YYYY-MM-DD",
content="# Lint Pass\n\n## Issues Found\n\n...\n\n## Fixed\n\n...",
summary="Lint pass results",
)
```
---
## Quick Lint Commands
```python
from scripts.tools import wiki_store
# Full health check
issues = wiki_store.lint_wiki()
# Get recent history
log = wiki_store.get_log(last_n=10)
# List all pages
all_pages = wiki_store.list_pages()
# Search for a concept across wiki
results = wiki_store.search_wiki("memory leak")
```
---
## Rules
- NEVER delete pages without user confirmation
- NEVER auto-resolve a contradiction — flag it for human review
- File all lint results as a topic page in the wiki (so the history is visible)
- Prefer adding cross-references over rewriting existing content
@@ -0,0 +1,99 @@
# Ontology Instructions
This file defines the rules for maintaining and evolving the dynamic ontology used by the Context Graph.
---
## Core Principle
The ontology is **NOT fixed**. Types and relations emerge from documents as they are ingested.
However, the ontology must remain **compact, consistent, and reusable**.
---
## Entity Type Rules
### Normalization
When assigning an entity type, apply these transformations:
1. Convert to **lowercase**
2. Strip leading/trailing whitespace
3. Replace underscores and hyphens with spaces
4. Merge synonymous types using the mapping table below
### Synonym Mapping (Entity Types)
| Variant | Canonical Type |
|---------|---------------|
| component, module, class, function | component |
| bug, defect, fault, error, failure | issue |
| server, host, machine, node | infrastructure |
| user, person, operator, admin, actor | actor |
| app, application, service, program, software | software |
| database, datastore, db, storage | storage |
| api, endpoint, interface, connection | interface |
| event, incident, occurrence, trigger | event |
| concept, idea, principle, theory | concept |
| process, thread, task, job, workflow | process |
### Adding New Types
If an entity does not match any existing type:
- Create a **new type** if it is genuinely distinct
- Keep the label short (1–3 words, lowercase)
- Consider whether an existing type is close enough before creating a new one
### Constraint
- Maximum ~50 distinct entity types across the entire ontology
- If the limit is approached, merge similar types rather than creating new ones
---
## Relation Type Rules
### Normalization
When assigning a relation type:
1. Convert to **lowercase**
2. Strip whitespace
3. Use verb phrases in **present tense** (e.g., "causes", "contains", "uses")
4. Merge synonyms using the mapping table below
### Synonym Mapping (Relation Types)
| Variant | Canonical Relation |
|---------|-------------------|
| triggers, leads to, results in, produces | causes |
| is part of, belongs to, lives in, sits in | contains |
| depends on, requires, needs | depends on |
| uses, calls, invokes, consumes | uses |
| affects, impacts, influences | affects |
| creates, instantiates, spawns | creates |
| connects to, links to, references | connects to |
| inherits from, extends, subclasses | extends |
| reads from, queries, fetches | reads from |
| writes to, stores in, persists to | writes to |
### Adding New Relations
- Only add new relation types if no existing type accurately describes the relationship
- Prefer canonical relations over creating new ones
---
## Ontology Update Protocol
When processing extracted entities/relations from `ingestion.md`:
1. For each entity type:
- Run through the synonym mapping
- Call `ontology_store.normalize_type(type_name)` to get the canonical form
- Call `ontology_store.add_type(canonical_type)` to register it
2. For each relation type:
- Run through the synonym mapping
- Call `ontology_store.normalize_relation(relation_name)` to get the canonical form
- Call `ontology_store.add_relation(canonical_relation)` to register it
3. Use the **canonical** type/relation names when creating nodes and edges in the graph.
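In code, the protocol is a normalization pass before any graph writes. The sketch below is illustrative, using the store functions named above and the same import layout as the other reference files; note that `skill.add_node(...)` and `skill.add_edge(...)` in `contextgraph.py` already wrap these calls.
```python
from scripts.tools import graph_store, index_store, ontology_store

def register_extraction(entities: list[dict], relations: list[dict]) -> None:
    """Normalize types/relations through the ontology, then persist nodes and edges."""
    for entity in entities:
        canonical_type = ontology_store.normalize_type(entity["type"])
        ontology_store.add_type(canonical_type)
        node_id = graph_store.add_node(entity["name"], canonical_type)
        index_store.add_entity(entity["name"], node_id)

    for rel in relations:
        canonical_relation = ontology_store.normalize_relation(rel["type"])
        ontology_store.add_relation(canonical_relation)
        source_id = graph_store.find_node_by_name(rel["source"])
        target_id = graph_store.find_node_by_name(rel["target"])
        if source_id and target_id:
            graph_store.add_edge(source_id, target_id, canonical_relation, rel["confidence"])
```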
@@ -0,0 +1,163 @@
# Retrieval Instructions
This file defines how the agent answers queries using the two-layer retrieval strategy:
**wiki-first** (fast path), then **graph traversal with evidence** (deep path).
---
## Overview
Retrieval is a 7-step process:
1. Parse the query
2. **Check the wiki first** (fast path)
3. Find seed nodes in the graph
4. Expand the graph via BFS
5. Prune noisy nodes
6. Build the subgraph with provenance
7. Return structured context
---
## Step 1: Parse the Query
Read the query string and identify:
- **Key noun phrases**: potential entity names (e.g., "system crash", "memory leak")
- **Keywords**: individual meaningful words (e.g., "crash", "leak", "memory")
- Normalize all terms to **lowercase**
Ignore stopwords (e.g., "the", "a", "is", "why", "does", "how", "what").
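A minimal keyword pass might look like this (a sketch; the stopword list is abbreviated here, and the stores ship their own fuller list):
```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "why", "does", "how", "what", "when", "where"}

def parse_query(query: str) -> list[str]:
    """Lowercase, split into alphanumeric tokens, drop stopwords and single characters."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

parse_query("Why does the system crash?")  # -> ["system", "crash"]
```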
---
## Step 2: Check the Wiki First (Fast Path)
Before touching the graph, search the wiki. The wiki contains compiled knowledge —
cross-references already resolved, contradictions flagged, syntheses written.
```python
from scripts.tools import wiki_store
results = wiki_store.search_wiki(query)
```
For each relevant result, read the page:
```python
content = wiki_store.read_page_by_slug(result["slug"])
```
**If the wiki has a sufficient answer:**
- Synthesize from wiki pages.
- Cite the source pages (e.g., "According to [[memory-leak]] and [[system-crash]]...").
- File the answer as a new wiki topic page if it's valuable and not already captured:
```python
wiki_store.write_page(category="topic", title="Why System Crashes", content=..., summary=...)
```
- **Return early** — no graph traversal needed.
**If the wiki answer is incomplete or missing:** proceed to Step 3.
---
## Step 3: Find Seed Nodes
Call `index_store.search(query)` with the original query string.
This returns node IDs matching entity names or keywords.
If no seed nodes are found:
- Try searching with individual keywords from Step 1.
- If still no results, return an empty subgraph: "No relevant entities found."
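A sketch of the seed search with the keyword fallback (assuming `index_store` is importable the same way as the other tools, and reusing `parse_query` from the Step 1 sketch):
```python
from scripts.tools import index_store

query = "Why does the system crash?"
seed_ids = index_store.search(query)

if not seed_ids:
    # Fall back to individual keywords from Step 1.
    for keyword in parse_query(query):
        seed_ids.extend(index_store.search(keyword))
    seed_ids = list(dict.fromkeys(seed_ids))  # dedupe while keeping order

if not seed_ids:
    result = {"nodes": {}, "edges": []}  # "No relevant entities found."
```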
---
## Step 4: Expand the Graph (BFS)
Call `retrieval_engine.retrieve(seed_node_ids, depth=2)`.
BFS from seed nodes:
- **Depth 1**: direct neighbors
- **Depth 2**: neighbors of neighbors
Rules:
- Only traverse edges with confidence ≥ MIN_CONFIDENCE (from config.py)
- Do NOT traverse beyond depth 2
- Collect all visited node IDs
---
## Step 5: Prune Nodes
- Limit total nodes to MAX_NODES (from config.py)
- Prioritize:
1. Seed nodes (always include)
2. Nodes at depth 1
3. Nodes at depth 2 (as space allows)
- Remove nodes only weakly connected (edge confidence < MIN_CONFIDENCE)
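Steps 4 and 5 are what `retrieval_engine.retrieve(...)` does for you; conceptually, the traversal looks like the sketch below (illustrative only, built on `graph_store.get_neighbors`, with the config values inlined rather than imported):
```python
from collections import deque
from scripts.tools import graph_store

MAX_DEPTH, MIN_CONF, MAX_NODES = 2, 0.6, 50  # mirror config.py

def bfs_expand(seed_ids: list[str]) -> list[str]:
    """Breadth-first expansion from seed nodes, capped by depth, confidence, and node count."""
    visited = dict.fromkeys(seed_ids)          # dict preserves seed-first priority order
    queue = deque((node_id, 0) for node_id in seed_ids)
    while queue and len(visited) < MAX_NODES:
        node_id, depth = queue.popleft()
        if depth >= MAX_DEPTH:
            continue
        for neighbor in graph_store.get_neighbors(node_id, min_confidence=MIN_CONF):
            if neighbor not in visited:
                visited[neighbor] = None       # depth-1 nodes enqueue before depth-2
                queue.append((neighbor, depth + 1))
    return list(visited)[:MAX_NODES]

subgraph = graph_store.get_subgraph(bfs_expand(seed_ids))  # seed_ids from Step 3
```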
---
## Step 6: Build the Subgraph with Provenance
For a standard query, call:
```python
subgraph = skill.query(query)
# Returns: {"nodes": {node_id: {name, type, source_document, source_chunks}},
# "edges": [{source, target, type, confidence, source_document, supporting_text, chunk_id}]}
```
For queries requiring evidence (citations, fact-checking), call:
```python
result = skill.query_with_evidence(query)
# Returns:
# {
# "query": str,
# "subgraph": {"nodes": {...}, "edges": [...]},
# "supporting_documents": [
# {
# "doc_id": str,
# "doc_title": str,
# "supporting_chunks": [{"chunk_id": str, "text": str}, ...]
# }
# ],
# "evidence_chain": "memory leak --[causes]--> system crash"
# }
```
---
## Step 7: Return Structured Context
Return the result with:
- **Subgraph**: nodes + edges (the graph answer)
- **Supporting documents**: source chunks that prove each relation
- **Evidence chain**: human-readable path summary
- **Wiki references**: links to relevant wiki pages found in Step 2
**If valuable, file the answer back into the wiki:**
```python
wiki_store.write_page(
category="topic",
title=query,
content=f"# {query}\n\n**Evidence chain:** {result['evidence_chain']}\n\n...",
summary="...",
)
```
This way, future queries on the same topic find the answer instantly in the wiki.
---
## Rules
- NEVER fabricate nodes or edges not present in the graph
- NEVER traverse deeper than depth 2
- ALWAYS check the wiki before the graph (wiki-first)
- Always include seed nodes in the result, even if they have no edges
- Prefer edges with higher confidence when pruning
- File valuable answers back into the wiki as topic pages
- Return an empty subgraph (not an error) if no relevant nodes are found
@@ -0,0 +1,23 @@
"""
config.py — Global configuration constants for the Context Graph Skill.
Data directories are resolved from environment variables so the skill can be
used from any project without writing data inside the skill package itself.
MINI_CONTEXT_GRAPH_DATA_DIR — where graph.json, index.json, etc. live
MINI_CONTEXT_GRAPH_WIKI_DIR — where wiki pages, index.md, and log.md live
Both default to subdirectories of the current working directory when the env
vars are not set, so data ends up in the consuming project's directory.
"""
import os
from pathlib import Path
_BASE = Path(os.environ.get("MINI_CONTEXT_GRAPH_BASE", str(Path.cwd())))
DATA_DIR = Path(os.environ.get("MINI_CONTEXT_GRAPH_DATA_DIR", str(_BASE / "data")))
WIKI_DIR = Path(os.environ.get("MINI_CONTEXT_GRAPH_WIKI_DIR", str(_BASE / "wiki")))
MAX_GRAPH_DEPTH: int = 2
MIN_CONFIDENCE: float = 0.6
MAX_NODES: int = 50
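# Example (illustrative, not part of the skill's API): to keep all data inside a
# specific project, set the environment variables before importing any of the
# stores, e.g. in the consuming agent's bootstrap code:
#
#   import os
#   os.environ["MINI_CONTEXT_GRAPH_BASE"] = "/path/to/my-project"
#   from scripts.contextgraph import ContextGraphSkill  # config reads the env vars at import time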
@@ -0,0 +1,296 @@
"""
contextgraph.py — Main interface for the Context Graph Skill.
This file is orchestration-only. All LLM reasoning lives in the .md files.
Python here only wires together the deterministic storage and retrieval tools.
Agent usage:
- ingest(): agent reads ingestion.md + ontology.md, extracts entities/relations,
then calls the tool methods directly.
- query(): agent reads retrieval.md, calls index_store.search + retrieval_engine.retrieve,
then calls graph_store.get_subgraph and returns the result.
"""
from __future__ import annotations
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent))
import config
from tools import graph_store, index_store, ontology_store, retrieval_engine, documents_store
class ContextGraphSkill:
def ingest(self, documents: list[str]) -> None:
"""
Orchestration entry point for ingesting documents into the context graph.
The agent (Copilot) MUST:
1. Read ingestion.md to understand entity/relation extraction rules.
2. Read ontology.md to apply type normalization.
3. For each document, produce a JSON with entities + relations.
4. For each entity:
- ontology_store.add_type(entity["type"])
- node_id = graph_store.add_node(entity["name"], entity["type"])
- index_store.add_entity(entity["name"], node_id)
5. For each relation (if confidence >= MIN_CONFIDENCE):
- ontology_store.add_relation(relation["type"])
- source_id = graph_store.find_node_by_name(relation["source"])
- target_id = graph_store.find_node_by_name(relation["target"])
- graph_store.add_edge(source_id, target_id, relation["type"], relation["confidence"])
This method does NOT call any LLM. It documents the agent contract only.
"""
raise NotImplementedError(
"ingest() must be driven by the Copilot agent following ingestion.md. "
"Call the tool methods directly after LLM extraction."
)
def query(self, query: str) -> dict:
"""
Orchestration entry point for retrieving a subgraph for a query.
The agent (Copilot) MUST:
1. Read retrieval.md to understand the retrieval strategy.
2. Call index_store.search(query) to get seed node_ids.
3. Call retrieval_engine.retrieve(seed_ids, depth=MAX_GRAPH_DEPTH) to expand.
4. Call graph_store.get_subgraph(node_ids) to build the result.
5. Return the subgraph dict.
This method does NOT call any LLM. It documents the agent contract only.
Returns an empty subgraph if called directly.
"""
seed_ids = index_store.search(query)
if not seed_ids:
return {"nodes": {}, "edges": []}
node_ids = retrieval_engine.retrieve(
seed_ids,
depth=config.MAX_GRAPH_DEPTH,
min_confidence=config.MIN_CONFIDENCE,
max_nodes=config.MAX_NODES,
)
return graph_store.get_subgraph(node_ids)
# ------------------------------------------------------------------
# Convenience wrappers — agents may call these directly
# ------------------------------------------------------------------
def add_node(self, name: str, node_type: str) -> str:
"""Add a node to the graph and index. Returns node_id."""
canonical_type = ontology_store.normalize_type(node_type)
ontology_store.add_type(canonical_type)
node_id = graph_store.add_node(name, canonical_type)
index_store.add_entity(name, node_id)
return node_id
def add_edge(
self, source_name: str, target_name: str, relation: str, confidence: float
) -> None:
"""Add an edge between two nodes (by name) if both exist and confidence qualifies."""
if confidence < config.MIN_CONFIDENCE:
return
source_id = graph_store.find_node_by_name(source_name)
target_id = graph_store.find_node_by_name(target_name)
if source_id is None or target_id is None:
return
canonical_relation = ontology_store.normalize_relation(relation)
ontology_store.add_relation(canonical_relation)
graph_store.add_edge(source_id, target_id, canonical_relation, confidence)
# ------------------------------------------------------------------
# LLM Wiki + RAG methods — store raw content & provenance
# ------------------------------------------------------------------
def ingest_with_content(
self,
doc_id: str,
title: str,
source: str,
raw_content: str,
entities: list[dict],
relations: list[dict],
) -> dict:
"""
Full RAG ingestion: stores raw document + chunks, then wires provenance
links from each graph node/edge back to source chunks.
The agent MUST:
1. Read the raw_content.
2. Read ingestion.md and ontology.md for extraction rules.
3. Extract entities and relations (LLM reasoning step).
4. Call this method with the results.
Args:
doc_id: Stable document identifier (e.g. "doc_001").
title: Human-readable document title.
source: Origin path or URL (immutable, never modified).
raw_content: Full text of the document.
entities: List of dicts: [{name, type, supporting_text?}, ...]
relations: List of dicts: [{source, target, type, confidence,
supporting_text?, chunk_hint?}, ...]
Returns:
Summary dict: {doc_id, chunk_count, nodes_added, edges_added}
"""
# Step 1: Store raw document and auto-chunk
doc = documents_store.add_document(doc_id, title, source, raw_content)
chunks = doc["chunks"]
def _find_best_chunk(text: str) -> str | None:
"""Find the chunk whose text most overlaps with the given span."""
if not text or not chunks:
return None
text_lower = text.lower()
best_chunk_id = None
best_score = 0
for chunk in chunks:
if text_lower in chunk["text"].lower():
return chunk["chunk_id"]
# Fallback: count overlapping words
words_text = set(text_lower.split())
words_chunk = set(chunk["text"].lower().split())
score = len(words_text & words_chunk)
if score > best_score:
best_score = score
best_chunk_id = chunk["chunk_id"]
return best_chunk_id
nodes_added = 0
# Step 2: Ingest entities with provenance
for entity in entities:
supporting = entity.get("supporting_text", "")
chunk_id = _find_best_chunk(supporting)
chunk_ids = [chunk_id] if chunk_id else []
canonical_type = ontology_store.normalize_type(entity["type"])
ontology_store.add_type(canonical_type)
node_id = graph_store.add_node(
entity["name"],
canonical_type,
source_document=doc_id,
source_chunks=chunk_ids,
)
index_store.add_entity(entity["name"], node_id)
nodes_added += 1
edges_added = 0
# Step 3: Ingest relations with provenance
for rel in relations:
if rel.get("confidence", 0) < config.MIN_CONFIDENCE:
continue
supporting = rel.get("supporting_text", "")
chunk_id = _find_best_chunk(supporting) or rel.get("chunk_hint")
source_id = graph_store.find_node_by_name(rel["source"])
target_id = graph_store.find_node_by_name(rel["target"])
if source_id is None or target_id is None:
continue
canonical_relation = ontology_store.normalize_relation(rel["type"])
ontology_store.add_relation(canonical_relation)
graph_store.add_edge(
source_id,
target_id,
canonical_relation,
rel["confidence"],
source_document=doc_id,
supporting_text=supporting or None,
chunk_id=chunk_id,
)
edges_added += 1
return {
"doc_id": doc_id,
"chunk_count": len(chunks),
"nodes_added": nodes_added,
"edges_added": edges_added,
}
def query_with_evidence(self, query: str) -> dict:
"""
Query the graph and return the subgraph together with supporting
source documents and chunks (evidence chain).
Returns:
{
"query": str,
"subgraph": {"nodes": {...}, "edges": [...]},
"supporting_documents": [
{
"doc_id": str,
"doc_title": str,
"supporting_chunks": [{"chunk_id": str, "text": str}, ...]
}
],
"evidence_chain": str # human-readable summary path
}
"""
subgraph = self.query(query)
if not subgraph["nodes"]:
return {
"query": query,
"subgraph": subgraph,
"supporting_documents": [],
"evidence_chain": "No matching nodes found.",
}
# Collect all provenance pointers from nodes and edges
docs_chunks: dict[str, list[str]] = {} # doc_id -> [chunk_ids]
for node in subgraph["nodes"].values():
doc_id = node.get("source_document")
if doc_id:
docs_chunks.setdefault(doc_id, [])
docs_chunks[doc_id].extend(node.get("source_chunks") or [])
for edge in subgraph["edges"]:
doc_id = edge.get("source_document")
if doc_id:
docs_chunks.setdefault(doc_id, [])
if edge.get("chunk_id"):
docs_chunks[doc_id].append(edge["chunk_id"])
# Resolve chunk texts from documents_store
supporting_documents = []
for doc_id, chunk_ids in docs_chunks.items():
doc = documents_store.get_document(doc_id)
if doc is None:
continue
seen = set()
chunks_out = []
for cid in chunk_ids:
if cid in seen:
continue
seen.add(cid)
chunk = documents_store.get_chunk(cid)
if chunk:
chunks_out.append({"chunk_id": cid, "text": chunk["text"]})
if chunks_out:
supporting_documents.append({
"doc_id": doc_id,
"doc_title": doc["title"],
"supporting_chunks": chunks_out,
})
# Build a simple evidence chain string
chain_parts = []
for edge in subgraph["edges"]:
src_node = subgraph["nodes"].get(edge["source"], {})
tgt_node = subgraph["nodes"].get(edge["target"], {})
src_name = src_node.get("name", edge["source"])
tgt_name = tgt_node.get("name", edge["target"])
chain_parts.append(f"{src_name} --[{edge['type']}]--> {tgt_name}")
evidence_chain = " | ".join(chain_parts) if chain_parts else "No edges in subgraph."
return {
"query": query,
"subgraph": subgraph,
"supporting_documents": supporting_documents,
"evidence_chain": evidence_chain,
}
@@ -0,0 +1,198 @@
"""
template_agent_workflow.py — Template agent script for ingesting + querying the context graph.
This script demonstrates the complete workflow an agent should follow:
1. Read markdown guidance files
2. Extract entities/relations via LLM reasoning
3. Call Python methods to persist
4. Query the graph
5. Handle errors gracefully
Copy and adapt this template for your agent implementation.
"""
import json
import sys
from pathlib import Path
# Add tools to path
sys.path.insert(0, str(Path(__file__).parent))
from contextgraph import ContextGraphSkill
def ingest_document(skill: ContextGraphSkill, document: str) -> dict:
"""
Step 1: Agent reads ingestion.md and ontology.md
Step 2: Agent uses LLM to extract entities and relations
Step 3: Call Python methods to persist (mimicked here with static extraction)
In a real agent, replace the static extraction with LLM calls.
"""
print(f"\n[INGEST] Processing document:\n{document}\n")
# --- STEP 1 & 2: LLM EXTRACTION PHASE (Guided by ingestion.md + ontology.md) ---
# In a real agent, this would use LLM reasoning.
# For now, we'll mock an extraction result:
extraction_result = {
"entities": [
{"name": "memory leak", "type": "issue"},
{"name": "system crash", "type": "issue"},
{"name": "object", "type": "component"},
],
"relations": [
{
"source": "memory leak",
"target": "system crash",
"type": "causes",
"confidence": 1.0,
},
{
"source": "object",
"target": "memory leak",
"type": "contributes to",
"confidence": 0.9,
},
],
}
print(f"[LLM] Extracted entities + relations:")
print(json.dumps(extraction_result, indent=2))
# --- STEP 3: PERSIST PHASE (Call Python methods) ---
    errors = []
    added_nodes = {}
    edges_added = 0
for entity in extraction_result["entities"]:
try:
node_id = skill.add_node(entity["name"], entity["type"])
added_nodes[entity["name"]] = node_id
print(f" ✓ Added node: {entity['name']} (id: {node_id}, type: {entity['type']})")
except Exception as e:
errors.append(f"Failed to add node {entity['name']}: {e}")
print(f" ✗ Error adding node {entity['name']}: {e}")
    for relation in extraction_result["relations"]:
        # Validate both endpoints exist
        if relation["source"] not in added_nodes or relation["target"] not in added_nodes:
            error_msg = "Cannot add edge: source or target missing"
            errors.append(error_msg)
            print(f" ✗ Skip edge {relation['source']} → {relation['target']}: {error_msg}")
            continue
        # Validate confidence threshold
        if relation["confidence"] < 0.6:
            error_msg = f"Confidence {relation['confidence']} < 0.6 (minimum threshold)"
            errors.append(error_msg)
            print(f" ✗ Skip edge {relation['source']} → {relation['target']}: {error_msg}")
            continue
        try:
            skill.add_edge(
                source_name=relation["source"],
                target_name=relation["target"],
                relation=relation["type"],
                confidence=relation["confidence"],
            )
            edges_added += 1
            print(
                f" ✓ Added edge: {relation['source']} "
                f"--[{relation['type']}]→ {relation['target']} "
                f"(confidence: {relation['confidence']})"
            )
        except Exception as e:
            errors.append(f"Failed to add edge {relation['source']} → {relation['target']}: {e}")
            print(f" ✗ Error adding edge: {e}")
    return {
        "success": len(errors) == 0,
        "nodes_added": len(added_nodes),
        "edges_added": edges_added,
        "errors": errors,
    }
def query_graph(skill: ContextGraphSkill, query: str) -> dict:
"""
Query the graph for context to answer the user's question.
Step 1: Read retrieval.md
Step 2: Call skill.query() which internally handles BFS + subgraph extraction
Step 3: Return structured context
"""
print(f"\n[QUERY] {query}\n")
try:
subgraph = skill.query(query)
if not subgraph["nodes"]:
print(" No relevant entities found in graph.")
return {
"success": True,
"query": query,
"subgraph": subgraph,
"nodes_found": 0,
"edges_found": 0,
}
print(f" ✓ Retrieved subgraph with {len(subgraph['nodes'])} nodes, {len(subgraph['edges'])} edges")
print(f"\n Nodes:")
for node_id, node in subgraph["nodes"].items():
print(f" - {node['name']} (type: {node['type']}, id: {node_id})")
print(f"\n Edges:")
for edge in subgraph["edges"]:
source_name = subgraph["nodes"][edge["source"]]["name"]
target_name = subgraph["nodes"][edge["target"]]["name"]
print(
f" - {source_name} --[{edge['type']}]→ {target_name} "
f"(confidence: {edge['confidence']})"
)
return {
"success": True,
"query": query,
"subgraph": subgraph,
"nodes_found": len(subgraph["nodes"]),
"edges_found": len(subgraph["edges"]),
}
except Exception as e:
error_msg = f"Query failed: {e}"
print(f"{error_msg}")
return {"success": False, "query": query, "error": error_msg}
def main():
"""Demo: ingest a document, then query the graph."""
skill = ContextGraphSkill()
# ===== INGESTION =====
document = """
System crashes due to memory leaks.
Memory leaks occur when objects are not released.
"""
result = ingest_document(skill, document)
print(f"\n[INGEST RESULT] Nodes added: {result['nodes_added']}, " f"Edges added: {result['edges_added']}")
if result["errors"]:
print(f"Errors: {result['errors']}")
# ===== RETRIEVAL =====
queries = [
"Why does the system crash?",
"What causes memory leaks?",
]
for query in queries:
result = query_graph(skill, query)
if result["success"]:
print(f" Nodes found: {result['nodes_found']}, Edges found: {result['edges_found']}")
else:
print(f" Error: {result['error']}")
if __name__ == "__main__":
main()
@@ -0,0 +1,191 @@
"""
documents_store.py — Persistent storage for raw documents and chunks (RAG layer).
Inspired by Karpathy's LLM Wiki pattern: raw sources are immutable and stored
as the ground truth. Chunks are the retrieval unit; provenance links tie graph
nodes/edges back to specific chunks.
Handles:
- Storing raw documents with metadata
- Chunking documents into overlapping text windows
- Retrieving chunks by id or by keyword search
- Persisting to data/documents.json
"""
from __future__ import annotations
import json
import os
import re
import sys
import uuid
from datetime import datetime, timezone
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))
import config
_DATA_DIR = Path(os.environ.get("MINI_CONTEXT_GRAPH_DATA_DIR", str(config.DATA_DIR)))
_DOCS_FILE = _DATA_DIR / "documents.json"
_CHUNK_SIZE = 500 # characters per chunk
_CHUNK_OVERLAP = 100 # overlap between consecutive chunks
_STOPWORDS = frozenset([
"a", "an", "the", "is", "are", "was", "were", "be", "been", "being",
"have", "has", "had", "do", "does", "did", "will", "would", "could",
"should", "may", "might", "shall", "can", "to", "of", "in", "on",
"at", "by", "for", "with", "from", "and", "or", "but", "not", "it",
"its", "this", "that", "these", "those", "i", "you", "he", "she",
"we", "they", "what", "which", "who", "how", "why", "when", "where",
])
def _load() -> dict:
if _DOCS_FILE.exists():
with open(_DOCS_FILE, "r") as f:
return json.load(f)
return {"documents": {}}
def _save(store: dict) -> None:
_DATA_DIR.mkdir(parents=True, exist_ok=True)
with open(_DOCS_FILE, "w") as f:
json.dump(store, f, indent=2)
def _tokenize(text: str) -> list[str]:
tokens = re.findall(r"[a-z0-9]+", text.lower())
return [t for t in tokens if t not in _STOPWORDS and len(t) > 1]
def _chunk_text(content: str, chunk_size: int = _CHUNK_SIZE, overlap: int = _CHUNK_OVERLAP) -> list[str]:
"""Split content into overlapping character windows."""
chunks = []
start = 0
while start < len(content):
end = start + chunk_size
chunks.append(content[start:end].strip())
if end >= len(content):
break
start += chunk_size - overlap
return [c for c in chunks if c]
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
def add_document(
doc_id: str,
title: str,
source: str,
content: str,
) -> dict:
"""
Store a raw document and auto-generate chunks.
Args:
doc_id: Caller-supplied stable identifier (e.g. "doc_001" or a filename).
title: Human-readable title.
source: Origin path/URL (immutable provenance pointer).
content: Full raw text to store and chunk.
Returns:
The stored document dict including generated chunk_ids.
"""
store = _load()
# Idempotent: return existing doc if already stored
if doc_id in store["documents"]:
return store["documents"][doc_id]
raw_chunks = _chunk_text(content)
chunks = []
for i, text in enumerate(raw_chunks):
chunks.append({
"chunk_id": f"{doc_id}_chunk_{i:03d}",
"index": i,
"text": text,
})
doc = {
"id": doc_id,
"title": title,
"source": source,
"content": content,
"chunks": chunks,
"ingestion_date": datetime.now(timezone.utc).isoformat(),
}
store["documents"][doc_id] = doc
_save(store)
return doc
def get_document(doc_id: str) -> dict | None:
"""Return the full document record or None if not found."""
store = _load()
return store["documents"].get(doc_id)
def get_chunk(chunk_id: str) -> dict | None:
"""Return a specific chunk by its chunk_id (searches across all documents)."""
store = _load()
for doc in store["documents"].values():
for chunk in doc["chunks"]:
if chunk["chunk_id"] == chunk_id:
return chunk
return None
def get_chunks_for_document(doc_id: str) -> list[dict]:
"""Return all chunks for a document."""
doc = get_document(doc_id)
if doc is None:
return []
return doc["chunks"]
def search_chunks(query: str, top_k: int = 5) -> list[dict]:
"""
Keyword search over chunk text. Returns top_k matching chunks sorted by
term overlap (simple TF-style scoring, no embeddings required).
Returns list of dicts with keys: chunk_id, doc_id, score, text.
"""
store = _load()
query_tokens = set(_tokenize(query))
if not query_tokens:
return []
scored: list[tuple[float, dict]] = []
for doc in store["documents"].values():
for chunk in doc["chunks"]:
chunk_tokens = set(_tokenize(chunk["text"]))
overlap = len(query_tokens & chunk_tokens)
if overlap > 0:
score = overlap / len(query_tokens)
scored.append((score, {
"chunk_id": chunk["chunk_id"],
"doc_id": doc["id"],
"doc_title": doc["title"],
"score": round(score, 4),
"text": chunk["text"],
}))
scored.sort(key=lambda x: x[0], reverse=True)
return [item for _, item in scored[:top_k]]
def list_documents() -> list[dict]:
"""Return a summary list of all stored documents (no content, no chunks)."""
store = _load()
return [
{
"id": doc["id"],
"title": doc["title"],
"source": doc["source"],
"chunk_count": len(doc["chunks"]),
"ingestion_date": doc["ingestion_date"],
}
for doc in store["documents"].values()
]
@@ -0,0 +1,202 @@
"""
graph_store.py — Persistent storage for graph nodes and edges.
Handles:
- Adding/deduplicating nodes
- Adding edges with confidence
- Fetching neighbors
- Persisting to graph.json
"""
from __future__ import annotations
import json
import os
import sys
import uuid
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))
import config
_DATA_DIR = Path(os.environ.get("MINI_CONTEXT_GRAPH_DATA_DIR", str(config.DATA_DIR)))
_GRAPH_FILE = _DATA_DIR / "graph.json"
def _load() -> dict:
if _GRAPH_FILE.exists():
with open(_GRAPH_FILE, "r") as f:
return json.load(f)
return {"nodes": {}, "edges": []}
def _save(graph: dict) -> None:
_DATA_DIR.mkdir(parents=True, exist_ok=True)
with open(_GRAPH_FILE, "w") as f:
json.dump(graph, f, indent=2)
def add_node(
name: str,
node_type: str,
source_document: str | None = None,
source_chunks: list[str] | None = None,
) -> str:
"""
Add a node if it doesn't exist. Returns node_id.
Args:
source_document: doc_id from documents_store (provenance pointer).
source_chunks: list of chunk_ids that mention this entity.
"""
graph = _load()
name_lower = name.strip().lower()
# Deduplication: search by normalized name
for node_id, node in graph["nodes"].items():
if node["name"] == name_lower:
# Merge provenance if new info provided
changed = False
if source_document and node.get("source_document") is None:
node["source_document"] = source_document
changed = True
if source_chunks:
existing = set(node.get("source_chunks") or [])
merged = list(existing | set(source_chunks))
if merged != list(existing):
node["source_chunks"] = merged
changed = True
if changed:
_save(graph)
return node_id
node_id = str(uuid.uuid4())[:8]
graph["nodes"][node_id] = {
"name": name_lower,
"type": node_type.strip().lower(),
"source_document": source_document,
"source_chunks": source_chunks or [],
}
_save(graph)
return node_id
def add_edge(
source_id: str,
target_id: str,
relation: str,
confidence: float,
source_document: str | None = None,
supporting_text: str | None = None,
chunk_id: str | None = None,
) -> None:
"""
Add a directed edge between two nodes.
Args:
source_document: doc_id from documents_store (provenance pointer).
supporting_text: The exact text span that supports this relation.
chunk_id: The specific chunk_id the supporting text came from.
"""
graph = _load()
# Deduplicate edges by source + target + relation
relation_lower = relation.strip().lower()
for edge in graph["edges"]:
if (
edge["source"] == source_id
and edge["target"] == target_id
and edge["type"] == relation_lower
):
changed = False
if confidence > edge["confidence"]:
edge["confidence"] = confidence
changed = True
if source_document and edge.get("source_document") is None:
edge["source_document"] = source_document
changed = True
if supporting_text and edge.get("supporting_text") is None:
edge["supporting_text"] = supporting_text
changed = True
if chunk_id and edge.get("chunk_id") is None:
edge["chunk_id"] = chunk_id
changed = True
if changed:
_save(graph)
return
graph["edges"].append({
"source": source_id,
"target": target_id,
"type": relation_lower,
"confidence": confidence,
"source_document": source_document,
"supporting_text": supporting_text,
"chunk_id": chunk_id,
})
_save(graph)
def get_neighbors(node_id: str, min_confidence: float = 0.0) -> list[str]:
"""Return node_ids of all neighbors reachable from node_id."""
graph = _load()
neighbors = []
for edge in graph["edges"]:
if edge["confidence"] < min_confidence:
continue
if edge["source"] == node_id:
neighbors.append(edge["target"])
elif edge["target"] == node_id:
neighbors.append(edge["source"])
return list(set(neighbors))
def get_node(node_id: str) -> dict | None:
"""Fetch a single node by ID."""
graph = _load()
return graph["nodes"].get(node_id)
def get_subgraph(node_ids: list[str]) -> dict:
"""Return nodes and edges induced by the given node_ids."""
graph = _load()
node_id_set = set(node_ids)
nodes = {nid: graph["nodes"][nid] for nid in node_ids if nid in graph["nodes"]}
edges = [
e
for e in graph["edges"]
if e["source"] in node_id_set and e["target"] in node_id_set
]
return {"nodes": nodes, "edges": edges}
def find_node_by_name(name: str) -> str | None:
"""Return node_id for a given normalized name, or None."""
graph = _load()
name_lower = name.strip().lower()
for node_id, node in graph["nodes"].items():
if node["name"] == name_lower:
return node_id
return None
def link_node_to_source(node_id: str, doc_id: str, chunk_ids: list[str]) -> None:
"""Attach provenance (doc_id + chunk_ids) to an existing node."""
graph = _load()
if node_id not in graph["nodes"]:
return
node = graph["nodes"][node_id]
node["source_document"] = doc_id
existing = set(node.get("source_chunks") or [])
node["source_chunks"] = list(existing | set(chunk_ids))
_save(graph)
def get_node_sources(node_id: str) -> dict:
"""Return provenance info (source_document + source_chunks) for a node."""
graph = _load()
node = graph["nodes"].get(node_id, {})
return {
"source_document": node.get("source_document"),
"source_chunks": node.get("source_chunks", []),
}
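Taken together, the store exposes a small provenance-aware API. A minimal usage sketch follows; the doc and chunk IDs are hypothetical, the `tools.graph_store` import path is assumed from the layout used by the other modules in this skill, and `add_node` is assumed to take the name and type as its first two arguments, as its body above suggests.

```python
from tools import graph_store  # assumed package layout

# Hypothetical provenance identifiers, for illustration only.
leak_id = graph_store.add_node(
    "Memory Leak", "issue",
    source_document="doc_42", source_chunks=["doc_42_chunk_3"],
)
svc_id = graph_store.add_node("Billing Service", "software")

graph_store.add_edge(
    svc_id, leak_id, "causes", confidence=0.8,
    source_document="doc_42",
    supporting_text="the billing service leaks memory under sustained load",
    chunk_id="doc_42_chunk_3",
)

assert graph_store.find_node_by_name("memory leak") == leak_id
print(graph_store.get_neighbors(svc_id, min_confidence=0.5))  # [leak_id]
print(graph_store.get_subgraph([svc_id, leak_id]))            # induced nodes + edges
print(graph_store.get_node_sources(leak_id))                  # provenance pointers
```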
@@ -0,0 +1,90 @@
"""
index_store.py - Maintains entity and keyword indexes for fast lookup.
Handles:
- Entity index: name → [node_ids]
- Keyword index: token → [node_ids]
- Persist to index.json
"""
from __future__ import annotations
import json
import os
import re
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))
import config
_DATA_DIR = Path(os.environ.get("MINI_CONTEXT_GRAPH_DATA_DIR", str(config.DATA_DIR)))
_INDEX_FILE = _DATA_DIR / "index.json"
_STOPWORDS = frozenset(
[
"a", "an", "the", "is", "are", "was", "were", "be", "been", "being",
"have", "has", "had", "do", "does", "did", "will", "would", "could",
"should", "may", "might", "shall", "can", "to", "of", "in", "on",
"at", "by", "for", "with", "from", "and", "or", "but", "not", "it",
"its", "this", "that", "these", "those", "i", "you", "he", "she",
"we", "they", "what", "which", "who", "how", "why", "when", "where",
]
)
def _load() -> dict:
if _INDEX_FILE.exists():
with open(_INDEX_FILE, "r") as f:
return json.load(f)
return {"entity_index": {}, "keyword_index": {}}
def _save(index: dict) -> None:
_DATA_DIR.mkdir(parents=True, exist_ok=True)
with open(_INDEX_FILE, "w") as f:
json.dump(index, f, indent=2)
def _tokenize(text: str) -> list[str]:
"""Split text into lowercase tokens, removing stopwords and short tokens."""
tokens = re.findall(r"[a-z0-9]+", text.lower())
return [t for t in tokens if t not in _STOPWORDS and len(t) > 1]
def add_entity(name: str, node_id: str) -> None:
"""Register an entity name → node_id in both entity and keyword indexes."""
index = _load()
name_lower = name.strip().lower()
# Entity index
if name_lower not in index["entity_index"]:
index["entity_index"][name_lower] = []
if node_id not in index["entity_index"][name_lower]:
index["entity_index"][name_lower].append(node_id)
# Keyword index
for token in _tokenize(name_lower):
if token not in index["keyword_index"]:
index["keyword_index"][token] = []
if node_id not in index["keyword_index"][token]:
index["keyword_index"][token].append(node_id)
_save(index)
def search(query: str) -> list[str]:
"""Search for node_ids matching the query via entity name or keywords."""
index = _load()
query_lower = query.strip().lower()
matched_ids: set[str] = set()
# Exact entity name match
if query_lower in index["entity_index"]:
matched_ids.update(index["entity_index"][query_lower])
# Keyword match
for token in _tokenize(query_lower):
if token in index["keyword_index"]:
matched_ids.update(index["keyword_index"][token])
return list(matched_ids)
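A small usage sketch of the index; the node IDs are hypothetical and the import path is assumed to follow the same `tools` package layout as the other modules.

```python
from tools import index_store  # assumed package layout

index_store.add_entity("Memory Leak", "a1b2c3d4")      # hypothetical node_id
index_store.add_entity("Billing Service", "e5f6a7b8")  # hypothetical node_id

# Exact entity-name match:
print(index_store.search("memory leak"))                    # ["a1b2c3d4"]
# Keyword match ("leak" survives the stopword/length filter):
print(index_store.search("where does the leak come from"))  # ["a1b2c3d4"]
```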
@@ -0,0 +1,175 @@
"""
ontology_store.py - Tracks entity types and relation types.
Handles:
- Registering types and relations with usage counts
- Normalizing types and relations via synonym mapping
- Persisting to ontology.json
NOTE: No LLM logic here. Normalization is rule-based (lowercase + synonym map).
"""
import json
import os
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))
import config
_DATA_DIR = Path(os.environ.get("MINI_CONTEXT_GRAPH_DATA_DIR", str(config.DATA_DIR)))
_ONTOLOGY_FILE = _DATA_DIR / "ontology.json"
# Synonym maps — lowercase variants map to canonical forms
_ENTITY_TYPE_MAP: dict[str, str] = {
"component": "component",
"module": "component",
"class": "component",
"function": "component",
"method": "component",
"bug": "issue",
"defect": "issue",
"fault": "issue",
"error": "issue",
"failure": "issue",
"problem": "issue",
"crash": "issue",
"server": "infrastructure",
"host": "infrastructure",
"machine": "infrastructure",
"node": "infrastructure",
"user": "actor",
"person": "actor",
"operator": "actor",
"admin": "actor",
"administrator": "actor",
"actor": "actor",
"app": "software",
"application": "software",
"service": "software",
"program": "software",
"software": "software",
"database": "storage",
"datastore": "storage",
"db": "storage",
"storage": "storage",
"api": "interface",
"endpoint": "interface",
"interface": "interface",
"connection": "interface",
"event": "event",
"incident": "event",
"occurrence": "event",
"trigger": "event",
"concept": "concept",
"idea": "concept",
"principle": "concept",
"theory": "concept",
"process": "process",
"thread": "process",
"task": "process",
"job": "process",
"workflow": "process",
"object": "component",
"resource": "component",
"memory": "resource",
"cpu": "resource",
"system": "system",
"platform": "system",
"framework": "system",
"library": "software",
"package": "software",
}
_RELATION_TYPE_MAP: dict[str, str] = {
"causes": "causes",
"triggers": "causes",
"leads to": "causes",
"results in": "causes",
"produces": "causes",
"is part of": "contains",
"belongs to": "contains",
"lives in": "contains",
"sits in": "contains",
"contains": "contains",
"depends on": "depends on",
"requires": "depends on",
"needs": "depends on",
"uses": "uses",
"calls": "uses",
"invokes": "uses",
"consumes": "uses",
"affects": "affects",
"impacts": "affects",
"influences": "affects",
"creates": "creates",
"instantiates": "creates",
"spawns": "creates",
"connects to": "connects to",
"links to": "connects to",
"references": "connects to",
"inherits from": "extends",
"extends": "extends",
"subclasses": "extends",
"reads from": "reads from",
"queries": "reads from",
"fetches": "reads from",
"writes to": "writes to",
"stores in": "writes to",
"persists to": "writes to",
"contributes to": "contributes to",
"allocated by": "allocated by",
"released by": "released by",
"not released": "not released",
}
def _load() -> dict:
if _ONTOLOGY_FILE.exists():
with open(_ONTOLOGY_FILE, "r") as f:
return json.load(f)
return {"entity_types": {}, "relation_types": {}}
def _save(ontology: dict) -> None:
_DATA_DIR.mkdir(parents=True, exist_ok=True)
with open(_ONTOLOGY_FILE, "w") as f:
json.dump(ontology, f, indent=2)
def normalize_type(type_name: str) -> str:
"""Return the canonical form of an entity type."""
key = type_name.strip().lower().replace("-", " ").replace("_", " ")
return _ENTITY_TYPE_MAP.get(key, key)
def normalize_relation(relation_name: str) -> str:
"""Return the canonical form of a relation type."""
key = relation_name.strip().lower().replace("-", " ").replace("_", " ")
return _RELATION_TYPE_MAP.get(key, key)
def add_type(type_name: str) -> None:
"""Register an entity type, incrementing its usage count."""
ontology = _load()
canonical = normalize_type(type_name)
ontology["entity_types"][canonical] = ontology["entity_types"].get(canonical, 0) + 1
_save(ontology)
def add_relation(relation_name: str) -> None:
"""Register a relation type, incrementing its usage count."""
ontology = _load()
canonical = normalize_relation(relation_name)
ontology["relation_types"][canonical] = ontology["relation_types"].get(canonical, 0) + 1
_save(ontology)
def get_all_types() -> dict[str, int]:
"""Return all registered entity types with counts."""
return _load()["entity_types"]
def get_all_relations() -> dict[str, int]:
"""Return all registered relation types with counts."""
return _load()["relation_types"]
@@ -0,0 +1,58 @@
"""
retrieval_engine.py - BFS-based graph traversal for context retrieval.
Input: seed node_ids + depth
Output: list of node_ids within traversal depth filtered by min_confidence
"""
from __future__ import annotations
import sys
from pathlib import Path
from collections import deque
# Allow imports from parent package
sys.path.insert(0, str(Path(__file__).parent.parent))
from tools import graph_store
import config
def retrieve(
seed_node_ids: list[str],
depth: int = config.MAX_GRAPH_DEPTH,
min_confidence: float = config.MIN_CONFIDENCE,
max_nodes: int = config.MAX_NODES,
) -> list[str]:
"""
BFS from seed nodes up to `depth` hops.
Returns a list of node_ids (including seeds) within the traversal,
filtered by min_confidence on edges and capped at max_nodes.
"""
visited: set[str] = set()
# Queue items: (node_id, current_depth)
queue: deque[tuple[str, int]] = deque()
for seed in seed_node_ids:
if seed not in visited:
visited.add(seed)
queue.append((seed, 0))
while queue:
if len(visited) >= max_nodes:
break
node_id, current_depth = queue.popleft()
if current_depth >= depth:
continue
neighbors = graph_store.get_neighbors(node_id, min_confidence=min_confidence)
for neighbor in neighbors:
if neighbor not in visited:
visited.add(neighbor)
queue.append((neighbor, current_depth + 1))
if len(visited) >= max_nodes:
break
return list(visited)
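In practice the traversal is seeded from index hits and the result is materialized back through the graph store. A minimal sketch, with import paths assumed from the layout the modules above imply (retrieval_engine is assumed importable alongside `config`):

```python
from tools import graph_store, index_store  # assumed package layout
import retrieval_engine                      # assumed to sit on sys.path next to config

seeds = index_store.search("memory leak")
node_ids = retrieval_engine.retrieve(seeds, depth=2, min_confidence=0.5, max_nodes=50)

context = graph_store.get_subgraph(node_ids)
for nid, node in context["nodes"].items():
    print(nid, node["name"], node["type"])
for edge in context["edges"]:
    print(edge["source"], f'--{edge["type"]}-->', edge["target"])
```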
@@ -0,0 +1,294 @@
"""
wiki_store.py - Manages the persistent wiki layer.
Inspired by Karpathy's LLM Wiki pattern: the wiki is a directory of LLM-generated
markdown pages that the agent writes and maintains. This module provides the
deterministic file I/O and index/log management so the agent can focus on
reasoning, not bookkeeping.
Wiki structure (relative to project root):
wiki/
    index.md    - content-oriented catalog of all pages
    log.md      - chronological, append-only operation log
    entities/   - one page per entity (person, concept, system, etc.)
    summaries/  - source document summary pages
    topics/     - cross-cutting synthesis and topic pages
The agent WRITES pages; this module handles the filesystem + index + log.
"""
from __future__ import annotations
import os
import re
import sys
from datetime import datetime, timezone
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))
import config
_WIKI_DIR = Path(os.environ.get("MINI_CONTEXT_GRAPH_WIKI_DIR", str(config.WIKI_DIR)))
_INDEX_FILE = _WIKI_DIR / "index.md"
_LOG_FILE = _WIKI_DIR / "log.md"
_CATEGORY_DIRS = {
"entity": _WIKI_DIR / "entities",
"summary": _WIKI_DIR / "summaries",
"topic": _WIKI_DIR / "topics",
}
# ---------------------------------------------------------------------------
# Internal helpers
# ---------------------------------------------------------------------------
def _ensure_dirs() -> None:
_WIKI_DIR.mkdir(parents=True, exist_ok=True)
for d in _CATEGORY_DIRS.values():
d.mkdir(parents=True, exist_ok=True)
def _now_iso() -> str:
return datetime.now(timezone.utc).strftime("%Y-%m-%d")
def _slug(title: str) -> str:
"""Convert a title to a filesystem-safe slug."""
slug = title.lower().strip()
slug = re.sub(r"[^a-z0-9]+", "-", slug)
return slug.strip("-")
def _page_path(category: str, slug: str) -> Path:
base = _CATEGORY_DIRS.get(category, _WIKI_DIR)
return base / f"{slug}.md"
# ---------------------------------------------------------------------------
# Index management
# ---------------------------------------------------------------------------
def _load_index() -> list[dict]:
"""Parse index.md into a list of entry dicts."""
if not _INDEX_FILE.exists():
return []
entries = []
for line in _INDEX_FILE.read_text().splitlines():
# Expected table row: | [[slug]] | category | summary | date |
if line.startswith("| [["):
parts = [p.strip() for p in line.split("|") if p.strip()]
if len(parts) >= 3:
link = parts[0] # [[slug]]
category = parts[1] if len(parts) > 1 else ""
summary = parts[2] if len(parts) > 2 else ""
date = parts[3] if len(parts) > 3 else ""
slug = re.sub(r"\[\[|\]\]", "", link)
entries.append({
"slug": slug,
"category": category,
"summary": summary,
"date": date,
})
return entries
def _save_index(entries: list[dict]) -> None:
"""Rewrite index.md from the entries list."""
_ensure_dirs()
lines = [
"# Wiki Index\n",
"_Auto-managed by wiki_store. Do not edit the table manually._\n\n",
"| Page | Category | Summary | Date |\n",
"|------|----------|---------|------|\n",
]
for e in entries:
lines.append(
f"| [[{e['slug']}]] | {e['category']} | {e['summary']} | {e['date']} |\n"
)
_INDEX_FILE.write_text("".join(lines))
def _append_log(operation: str, detail: str) -> None:
"""Append a timestamped entry to log.md."""
_ensure_dirs()
    timestamp = _now_iso()
entry = f"\n## [{timestamp}] {operation} | {detail}\n"
with open(_LOG_FILE, "a") as f:
f.write(entry)
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
def write_page(
category: str,
title: str,
content: str,
summary: str = "",
) -> str:
"""
Write (or overwrite) a wiki page.
    The agent provides the full markdown content. This function handles:
    - Writing the .md file to the appropriate category subfolder.
    - Updating index.md with a one-line entry.
    - Appending an entry to log.md.
Args:
category: One of "entity", "summary", "topic".
title: Human-readable page title (used for slug + index).
content: Full markdown content the agent wrote.
summary: One-line summary for the index (optional; auto-extracted if empty).
Returns:
Relative path from wiki root (e.g. "entities/memory-leak.md").
"""
_ensure_dirs()
slug = _slug(title)
path = _page_path(category, slug)
# Auto-extract first non-heading, non-empty line as summary if not provided
if not summary:
for line in content.splitlines():
stripped = line.strip()
if stripped and not stripped.startswith("#"):
summary = stripped[:100]
break
path.write_text(content)
# Update index
entries = _load_index()
existing = next((e for e in entries if e["slug"] == slug), None)
if existing:
existing["summary"] = summary
existing["date"] = _now_iso()
else:
entries.append({
"slug": slug,
"category": category,
"summary": summary,
"date": _now_iso(),
})
_save_index(entries)
_append_log("write", title)
return str(path.relative_to(_WIKI_DIR))
def read_page(category: str, title: str) -> str | None:
"""Read a wiki page's content. Returns None if not found."""
slug = _slug(title)
path = _page_path(category, slug)
if not path.exists():
return None
return path.read_text()
def read_page_by_slug(slug: str) -> str | None:
"""Read a wiki page by slug, searching across all categories."""
for d in list(_CATEGORY_DIRS.values()) + [_WIKI_DIR]:
path = d / f"{slug}.md"
if path.exists():
return path.read_text()
return None
def search_wiki(query: str) -> list[dict]:
"""
Simple keyword search over all wiki pages.
    Returns a list of {slug, category, path, score, snippet} dicts sorted by
    score (descending).
"""
query_tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
if not query_tokens:
return []
results = []
for category, base_dir in _CATEGORY_DIRS.items():
if not base_dir.exists():
continue
for page_path in base_dir.glob("*.md"):
content = page_path.read_text().lower()
content_tokens = set(re.findall(r"[a-z0-9]+", content))
overlap = len(query_tokens & content_tokens)
if overlap > 0:
# Extract a short snippet around first match
first_token = next(iter(query_tokens & content_tokens), "")
idx = content.find(first_token)
snippet = content[max(0, idx - 30):idx + 80].replace("\n", " ").strip()
results.append({
"slug": page_path.stem,
"category": category,
"path": str(page_path.relative_to(_WIKI_DIR)),
"score": overlap,
"snippet": snippet,
})
results.sort(key=lambda x: x["score"], reverse=True)
return results
def list_pages(category: str | None = None) -> list[dict]:
"""List all wiki pages, optionally filtered by category."""
entries = _load_index()
if category:
return [e for e in entries if e["category"] == category]
return entries
def get_log(last_n: int = 20) -> list[str]:
"""Return the last N log entries from log.md."""
if not _LOG_FILE.exists():
return []
lines = _LOG_FILE.read_text().splitlines()
    entries = [line for line in lines if line.startswith("## [")]
return entries[-last_n:]
def lint_wiki() -> dict:
"""
Health-check the wiki as described in Karpathy's LLM Wiki pattern.
Checks for:
- Orphan pages (in directory but not in index)
- Missing pages (in index but file deleted)
- Broken wikilinks ([[slug]] pointing to non-existent file)
- Pages with no wikilinks (isolated pages)
Returns:
{
"orphan_pages": [...],
"missing_pages": [...],
"broken_wikilinks": {slug: [broken_links]},
"isolated_pages": [...],
}
"""
index_entries = {e["slug"] for e in _load_index()}
file_slugs: dict[str, Path] = {}
for d in _CATEGORY_DIRS.values():
if d.exists():
for p in d.glob("*.md"):
file_slugs[p.stem] = p
orphans = [s for s in file_slugs if s not in index_entries]
missing = [s for s in index_entries if s not in file_slugs]
broken_wikilinks: dict[str, list[str]] = {}
isolated: list[str] = []
all_slugs = set(file_slugs.keys())
for slug, path in file_slugs.items():
content = path.read_text()
links = re.findall(r"\[\[([^\]]+)\]\]", content)
if not links:
isolated.append(slug)
broken = [lnk for lnk in links if _slug(lnk) not in all_slugs]
if broken:
broken_wikilinks[slug] = broken
return {
"orphan_pages": orphans,
"missing_pages": missing,
"broken_wikilinks": broken_wikilinks,
"isolated_pages": isolated,
}
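End to end, the wiki layer is used roughly as follows; the page content is whatever markdown the agent produced, and the `tools.wiki_store` import path is again an assumption based on the layout of the other modules.

```python
from tools import wiki_store  # assumed package layout

rel_path = wiki_store.write_page(
    "entity",
    "Memory Leak",
    "# Memory Leak\n\nObserved in the billing service; see [[billing-service]].\n",
)
print(rel_path)                                    # "entities/memory-leak.md"
print(wiki_store.list_pages("entity"))             # index entry for the new page
print(wiki_store.search_wiki("billing"))           # keyword-overlap ranked hits
print(wiki_store.lint_wiki()["broken_wikilinks"])  # flags [[billing-service]] until that page exists
```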