Vaultex

Vaultex is a GraphRAG-powered MCP server that transforms an Obsidian vault into a semantically queryable knowledge graph. It indexes your notes at the atomic proposition level, weaves propositions into a typed knowledge graph, and exposes the result over the Model Context Protocol so any MCP-compatible AI client can search and reason over your vault.

Unlike file-access bridges that let a model read raw markdown, Vaultex breaks every note into self-contained facts, embeds them, and connects them through three complementary edge types — making retrieval both semantically precise and structurally aware of your wikilink graph.


How It Works

1. Ingestion

When you run vaultex ingest, Vaultex walks your vault and processes each .md file:

  1. Split the note into sections by Markdown header hierarchy.
  2. Extract atomic propositions from each section using Claude Haiku. Every proposition is self-contained — pronouns are resolved, context is included, one fact per entry.
  3. Embed all propositions with text-embedding-3-small in one batched API call.
  4. Store the vectors in LanceDB and record metadata in SQLite.
  5. Build three types of graph edges (see below).

Content hashes are stored per-note, so incremental re-ingestion only processes changed files.
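The hash check behind incremental re-ingestion can be sketched like this. This is a minimal illustration, not Vaultex's actual code: `needs_reingest` and the in-memory `registry` dict are hypothetical stand-ins for the SQLite-backed note registry.

```python
import hashlib

def content_hash(text: str) -> str:
    """Hash a note's content so unchanged files can be skipped."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reingest(note_path: str, text: str, registry: dict) -> bool:
    """Compare the stored hash against the current content; update on change."""
    new_hash = content_hash(text)
    if registry.get(note_path) == new_hash:
        return False  # unchanged: skip this note
    registry[note_path] = new_hash
    return True

registry = {}
assert needs_reingest("a.md", "# Note", registry) is True    # first ingest
assert needs_reingest("a.md", "# Note", registry) is False   # unchanged: skipped
assert needs_reingest("a.md", "# Edited", registry) is True  # content changed
```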

2. The Knowledge Graph

Every proposition is a node. Edges connect them in three ways:

| Edge Type | Weight | When Created |
| --- | --- | --- |
| SAME_NOTE | 2.0 | Between every pair of propositions extracted from the same note |
| HARD_LINK | 1.0 | When note A has a [[wikilink]] to note B, connecting A's propositions to B's |
| SOFT_LINK | cosine similarity | Cross-note pairs whose embedding similarity exceeds an adaptive percentile threshold |

Graph traversal scores paths by the product of edge weights along the route. SAME_NOTE → SAME_NOTE = 4.0 (certain), SOFT_LINK → SOFT_LINK = 0.36 (speculative).
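The scoring rule above reduces to a one-line product, using the edge weights from the table:

```python
from math import prod

def path_score(weights: list[float]) -> float:
    """Cumulative score of a traversal path: the product of its edge weights."""
    return prod(weights)

assert path_score([2.0, 2.0]) == 4.0               # SAME_NOTE -> SAME_NOTE
assert path_score([2.0, 1.0]) == 2.0               # SAME_NOTE -> HARD_LINK
assert abs(path_score([0.6, 0.6]) - 0.36) < 1e-9   # SOFT_LINK -> SOFT_LINK at 0.6 similarity
```

Because every weight except SAME_NOTE is at most 1.0, longer paths through soft links decay quickly, which is what makes the `min_path_score` cutoff in `get_graph_neighborhood` effective.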

3. The MCP Server

vaultex serve starts an MCP server. Any MCP-compatible client (Claude Desktop, Claude Code, Cursor, etc.) can then call six tools — from raw semantic search up to a fully automated multi-step research pipeline.

4. Live Watching

With watching enabled (the default), Vaultex monitors your vault with a debounced file watcher. Saving a note re-ingests it within a few seconds. Deleting a note removes all its propositions from the index immediately.
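The debounce behavior can be illustrated with a small sketch. This `Debouncer` class is hypothetical, a stand-in for whatever Vaultex's watcher actually does internally; it shows the idea that rapid saves coalesce into a single re-ingest once activity settles.

```python
class Debouncer:
    """Coalesce rapid save events: each touch pushes the deadline back,
    and a path is only released once its deadline has passed."""

    def __init__(self, delay: float):
        self.delay = delay
        self.deadline: dict[str, float] = {}

    def touch(self, path: str, now: float) -> None:
        self.deadline[path] = now + self.delay

    def due(self, now: float) -> list[str]:
        ready = [p for p, t in self.deadline.items() if now >= t]
        for p in ready:
            del self.deadline[p]
        return ready

d = Debouncer(delay=5.0)
d.touch("note.md", now=0.0)
d.touch("note.md", now=2.0)            # second save pushes the deadline to t=7
assert d.due(now=5.0) == []            # still within the debounce window
assert d.due(now=7.0) == ["note.md"]   # released once, despite two saves
```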


Configuration

Vaultex looks for configuration in four places, with later sources overriding earlier ones:

  1. A .env file in the working directory
  2. System environment variables
  3. .vaultex/config.yaml inside the vault
  4. VAULTEX_* prefixed environment variables (highest priority)
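The precedence order amounts to a layered dictionary merge, where later layers win. A minimal sketch (the layer names are illustrative, not Vaultex's internal API):

```python
def resolve_config(dotenv: dict, env: dict, yaml_cfg: dict, prefixed: dict) -> dict:
    """Merge config layers in priority order; later sources override earlier ones."""
    merged: dict = {}
    for layer in (dotenv, env, yaml_cfg, prefixed):
        merged.update(layer)
    return merged

cfg = resolve_config(
    dotenv={"debounce_seconds": 5},
    env={"debounce_seconds": 3},
    yaml_cfg={"soft_link_percentile": 95},
    prefixed={"debounce_seconds": 10},  # VAULTEX_* prefixed: highest priority
)
assert cfg == {"debounce_seconds": 10, "soft_link_percentile": 95}
```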

Required

| Variable | Description |
| --- | --- |
| OPENROUTER_API_KEY | Routes calls to both the Anthropic and OpenAI models via openrouter.ai |

Full Configuration Reference

These can be set as environment variables or in .vaultex/config.yaml:

# Core
exclude_patterns:               # Glob patterns to skip during ingestion
  - "templates/**"              # default
  - "_archive/**"               # default
  - ".obsidian/**"              # default
debounce_seconds: 5             # File watcher debounce interval (seconds)

# API concurrency
max_concurrent_api_calls: 5     # Parallel LLM extraction threads per note

# Models (all routed via OpenRouter)
openai_model: text-embedding-3-small
embedding_dimensions: 1536
haiku_model: anthropic/claude-haiku-4-5
deep_research_sonnet_model: anthropic/claude-sonnet-4.6

# Soft-link thresholding
soft_link_percentile: 95              # Top N% most similar cross-note pairs get an edge
soft_link_recalc_growth_trigger: 0.20 # Skip recalc if proposition count grew < 20%

# Storage
data_dir: .vaultex              # Relative to vault root, or absolute path

Deep research defaults:

deep_research_default_depth: standard
deep_research_standard_top_k: 10
deep_research_thorough_top_k: 20
deep_research_standard_walks: 3
deep_research_thorough_walks: 5
deep_research_standard_note_budget: 1500   # chars of note context per note
deep_research_thorough_note_budget: 3000
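One plausible way to compute the `soft_link_percentile` threshold, sketched for illustration (Vaultex's actual implementation may differ): sort the cross-note pair similarities and pick the value at the configured percentile, so only the top slice of pairs receives a SOFT_LINK edge.

```python
def percentile_threshold(similarities: list[float], percentile: float) -> float:
    """Return the similarity value at the given percentile; pairs scoring
    above it would receive a SOFT_LINK edge."""
    ranked = sorted(similarities)
    idx = int(len(ranked) * percentile / 100)
    return ranked[min(idx, len(ranked) - 1)]

sims = [i / 100 for i in range(100)]          # 0.00 .. 0.99
assert percentile_threshold(sims, 95) == 0.95  # top 5% of pairs get an edge
```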

Running Vaultex

Initial Ingestion

Run this once before starting the server. On a vault of a few hundred notes it takes a couple of minutes.

vaultex ingest --vault /path/to/your/vault

Force a full re-ingest (ignores content hashes):

vaultex ingest --vault /path/to/your/vault --force

Ingest a single note (useful after editing one file while the server isn't running):

vaultex ingest-note "projects/api-rewrite.md" --vault /path/to/your/vault

WSL / Windows vaults: If your vault is on a Windows filesystem (/mnt/c/...), LanceDB cannot write there. Use --data-dir to store the index on the Linux filesystem:

vaultex ingest --vault /mnt/c/Users/you/vault --data-dir ~/.vaultex/my-vault

Starting the MCP Server

For Claude Desktop (client spawns this process over stdio):

vaultex serve --vault /path/to/your/vault

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "vaultex": {
      "command": "vaultex",
      "args": ["serve", "--vault", "/path/to/your/vault"],
      "env": {
        "OPENROUTER_API_KEY": "sk-or-..."
      }
    }
  }
}

For Claude Code or any HTTP-based client (server runs as a standalone process):

vaultex serve --vault /path/to/your/vault --transport sse --port 8765

Then add http://127.0.0.1:8765/sse as a custom MCP connector.

Serve without live watching (read-only, no file monitoring):

vaultex serve --vault /path/to/your/vault --no-watch

Other Commands

Check ingestion status (note count, propositions, errors):

vaultex status --vault /path/to/your/vault

Recompute the soft-link similarity threshold after a large ingest:

vaultex recalculate-threshold --vault /path/to/your/vault --percentile 95

Run an end-to-end smoke test against a random sample of notes (non-destructive, uses a temp directory):

vaultex smoke-test --vault /path/to/your/vault --sample 20

Launch the graph explorer web UI:

vaultex explore --vault /path/to/your/vault --port 7333

MCP Tools Reference

Once the server is running, these six tools are available to any connected client.


semantic_search

Embed a query and return the most similar propositions from the vault.

semantic_search(query: str, top_k: int = 10) → list[dict]
| Parameter | Default | Description |
| --- | --- | --- |
| query | (required) | Natural language query |
| top_k | 10 | Number of results (max 50) |

Returns a list of:

{
  "id": "sha256-hex",
  "text": "The proposition text.",
  "source_note": "projects/api-rewrite.md",
  "source_section": "## Timeline",
  "similarity_score": 0.51
}

Scores from text-embedding-3-small cluster in the 0.3–0.6 range. A score of 0.45 or above is a strong match. Focus on relative ranking, not absolute values.
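Since SOFT_LINK weights are described as cosine similarities, `similarity_score` is presumably the standard cosine measure; a plain-Python sketch of why relative ranking is the useful signal:

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of two vectors divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Compare candidates against the same query vector: the ordering is what
# matters, not whether a score crosses some universal cutoff.
query  = [1.0, 0.0, 0.5]
strong = [0.9, 0.1, 0.45]
weak   = [0.0, 1.0, 0.2]
assert cosine_similarity(query, strong) > cosine_similarity(query, weak)
```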


get_graph_neighborhood

Traverse the knowledge graph outward from a proposition, following typed edges.

get_graph_neighborhood(
  proposition_id: str,
  max_hops: int = 2,
  max_results: int = 20,
  min_path_score: float = 0.3,
  edge_types: list[str] | None = None
) → dict
| Parameter | Default | Description |
| --- | --- | --- |
| proposition_id | (required) | ID from a semantic_search result |
| max_hops | 2 | Traversal depth (1–3) |
| max_results | 20 | Max neighbors to return |
| min_path_score | 0.3 | Minimum cumulative path score (product of edge weights) |
| edge_types | all | Restrict to ["SAME_NOTE"], ["HARD_LINK"], ["SOFT_LINK"], or any combination |

Returns:

{
  "origin": {"id": "...", "text": "...", "source_note": "..."},
  "neighbors": [
    {
      "id": "...",
      "text": "...",
      "source_note": "...",
      "path_score": 2.0,
      "path": [{"edge_type": "SAME_NOTE", "weight": 2.0, "via_node_id": "..."}]
    }
  ]
}

For focused lookups, pass edge_types=["SAME_NOTE", "HARD_LINK"] to follow structural connections only. Include "SOFT_LINK" for exploratory queries to surface latent connections.


read_full_note

Read the full markdown content of a note, including parsed frontmatter and outgoing links.

read_full_note(note_path: str) → dict

Returns:

{
  "path": "projects/api-rewrite.md",
  "content": "# API Rewrite\n...",
  "frontmatter": {"status": "in-progress", "tags": ["engineering"]},
  "outgoing_links": ["people/marc.md", "projects/graphql.md"],
  "proposition_count": 14
}

get_note_propositions

Retrieve every proposition extracted from a specific note.

get_note_propositions(note_path: str) → list[dict]

Returns:

[
  {"id": "...", "text": "Marc is the lead engineer on the API rewrite.", "source_section": "## Team"},
  ...
]

Useful as a coverage check — if a central note appears in search results but you want all its indexed facts, use this.


get_ingestion_status

Query the health and coverage of the current index.

get_ingestion_status() → dict

Returns:

{
  "status": "idle",
  "total_notes": 342,
  "total_propositions": 4821,
  "total_graph_edges": 18234,
  "last_ingestion": "2026-03-24T18:45:12+00:00",
  "soft_link_threshold": 0.6143,
  "errors_last_24h": 0
}

deep_research

Automated multi-step retrieval pipeline — query expansion, semantic search, graph traversal, note reading, and LLM synthesis in one call.

deep_research(query: str, depth: str = "standard") → dict
| Parameter | Default | Description |
| --- | --- | --- |
| query | (required) | Natural language question about the vault |
| depth | "standard" | "standard" or "thorough" |

standard — Haiku synthesis, 10 search hits, 3 graph walks, up to 3 notes read. Fast and cheap. Best for factual lookups, status checks, and routine questions.

thorough — Sonnet synthesis, 20 search hits, 5 graph walks, up to 5 notes read. Slower and more expensive. Best for nuanced analysis, cross-cutting themes, or questions that require judgment.

Returns:

{
  "answer": "Marc is the lead engineer on the API rewrite project...",
  "confidence": "high",
  "factlets": [
    {
      "text": "Marc joined the API rewrite team in January 2026.",
      "source_note": "people/marc.md",
      "source_section": "## Background",
      "discovery_method": "graph_walk"
    }
  ],
  "notes_consulted": ["people/marc.md", "projects/api-rewrite.md"],
  "retrieval_stats": {
    "search_hits": 10,
    "graph_nodes_visited": 43,
    "notes_read": 2,
    "synthesis_model": "anthropic/claude-haiku-4-5",
    "depth": "standard"
  }
}

discovery_method is one of "search", "graph_walk", or "note_read". Facts corroborated by multiple discovery methods carry higher confidence in the synthesized answer.


Deep Search Procedure (Manual)

When you want direct control over retrieval, use the individual tools in sequence. The server also exposes this procedure as a built-in MCP prompt (deep_search_guide).

Step 0 — Expand the query. Short keywords produce weak embeddings.

| User query | Search with |
| --- | --- |
| "Marc" | "Marc, the engineer who works on GraphQL and the API rewrite" |
| "Q3 timeline" | "Q3 deadline and project timeline for the current quarter" |
| "what happened today?" | "daily note, tasks completed, meetings, and updates from today" |

Step 1 — Semantic search (wide net).

semantic_search(query=<expanded_query>, top_k=10)

Note the source_note values of the top results.

Step 2 — Graph walk (top 3–5 hits).

get_graph_neighborhood(proposition_id=<id>, max_hops=2, min_path_score=0.3)

This is where the real value is. HARD_LINK edges surface connections that embeddings alone miss — the link from "project note" to "person note" can only be found by following [[wikilinks]] in the graph. Deduplicate results across walks.

Step 3 — Read source notes.

read_full_note(note_path=<source_note>)

Gets original prose context, frontmatter tags and fields, and outgoing wikilinks.

Step 4 — Coverage check (optional).

get_note_propositions(note_path=<source_note>)

Pull all propositions from a central note if the first steps didn't surface many from it.

Step 5 — Synthesize. Assemble the answer, citing source notes. Prefer facts corroborated by multiple paths.

A typical deep search takes about 8–10 tool calls total.
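Steps 1–3 above can be sketched as a small client-side loop. The `client` object and `StubClient` are hypothetical; a real MCP client would dispatch these as tool calls to the running server.

```python
def deep_search(client, query: str) -> dict:
    """Manual deep-search loop: search wide, walk the graph from the top
    hits, then read every source note that surfaced along the way."""
    hits = client.semantic_search(query=query, top_k=10)          # step 1
    neighbors = []
    for hit in hits[:3]:                                          # step 2
        walk = client.get_graph_neighborhood(
            proposition_id=hit["id"], max_hops=2, min_path_score=0.3)
        neighbors.extend(walk["neighbors"])
    note_paths = ({h["source_note"] for h in hits[:3]}
                  | {n["source_note"] for n in neighbors})
    notes = [client.read_full_note(note_path=p)                   # step 3
             for p in sorted(note_paths)]
    return {"hits": hits, "neighbors": neighbors, "notes": notes}

class StubClient:
    """Stand-in for a real MCP client, for illustration only."""
    def semantic_search(self, query, top_k):
        return [{"id": "p1", "source_note": "a.md"}]
    def get_graph_neighborhood(self, proposition_id, max_hops, min_path_score):
        return {"neighbors": [{"id": "p2", "source_note": "b.md"}]}
    def read_full_note(self, note_path):
        return {"path": note_path, "content": "..."}

result = deep_search(StubClient(), "Marc and the API rewrite")
assert [n["path"] for n in result["notes"]] == ["a.md", "b.md"]
```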


The .vaultex-ignore File

Place a .vaultex-ignore file in your vault root to exclude files from indexing and watching. It uses fnmatch glob patterns, one per line:

# Personal notes — never index these
personal/**
diary/*.md

# Specific file types
*.excalidraw.md
*.canvas

# Temporary / draft files
daily/private-*.md

Blank lines and lines beginning with # are treated as comments. The watcher detects changes to .vaultex-ignore and reloads patterns live — no server restart needed.
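The parsing and matching described above map directly onto the standard library's `fnmatch` module; a minimal sketch (the helper names are illustrative):

```python
import fnmatch

def load_patterns(text: str) -> list[str]:
    """Parse .vaultex-ignore contents: skip blank lines and '#' comments."""
    return [ln.strip() for ln in text.splitlines()
            if ln.strip() and not ln.strip().startswith("#")]

def is_ignored(path: str, patterns: list[str]) -> bool:
    """True if the vault-relative path matches any ignore pattern."""
    return any(fnmatch.fnmatch(path, p) for p in patterns)

patterns = load_patterns("# personal\n\npersonal/**\n*.canvas\n")
assert is_ignored("personal/journal.md", patterns)
assert is_ignored("board.canvas", patterns)
assert not is_ignored("projects/api.md", patterns)
```

Note that unlike gitignore syntax, `fnmatch`'s `*` also matches across `/`, so `*.canvas` excludes canvas files in subdirectories too.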

The exclude_patterns config option applies the same logic but is configured in config.yaml or environment variables, making it suitable for patterns shared across multiple vaults or enforced at a system level.


Data Directory

By default, all Vaultex data is written to .vaultex/ inside the vault:

<vault>/.vaultex/
├── vaultex.db          # SQLite: note registry, config store, ingestion log
├── lancedb/            # LanceDB vector database
├── graph.pkl           # Serialized proposition graph (rustworkx, pickle)
└── config.yaml         # Per-vault config overrides (optional)

The SQLite database has three tables:

  • note_registry — One row per ingested note: path, content hash, last-processed timestamp, proposition count.
  • config — Key/value store for computed values: soft_link_threshold, soft_link_prop_count.
  • ingestion_log — Append-only event log: processed, skipped, error, deleted, threshold_recalc events with timestamps and details.
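A plausible shape for these three tables, sketched in SQLite. The column names are inferred from the descriptions above and are illustrative, not Vaultex's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE note_registry (
    path TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    last_processed TEXT NOT NULL,
    proposition_count INTEGER NOT NULL
);
CREATE TABLE config (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
CREATE TABLE ingestion_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event TEXT NOT NULL,       -- processed / skipped / error / deleted / threshold_recalc
    timestamp TEXT NOT NULL,
    details TEXT
);
""")

# The config table holds computed values such as the soft-link threshold.
conn.execute("INSERT INTO config VALUES (?, ?)", ("soft_link_threshold", "0.6143"))
row = conn.execute("SELECT value FROM config WHERE key = ?",
                   ("soft_link_threshold",)).fetchone()
assert row[0] == "0.6143"
```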

Override the data directory with --data-dir or the data_dir config option. This is particularly useful on WSL where the vault may be on a Windows filesystem but LanceDB requires a native Linux path.
