UltraSearch is a high‑performance, memory‑efficient desktop search engine for Windows. It combines NTFS MFT enumeration (Everything‑style instant filename search) with full‑text content indexing, all in Rust, with a multi‑process architecture that keeps the always‑on service tiny while isolating heavy work in short‑lived worker processes.
What it is
- A Windows desktop search stack: background service + indexing workers + GPU‑accelerated UI.
- Filename search is Everything‑fast, backed by NTFS MFT enumeration and USN journal tailing.
- Content search is full‑text over extracted file contents using Tantivy, with Extractous/IFilter/OCR backends.
Why it’s great
- Instant filename search even on millions of files (MFT + USN; no recursive crawlers).
- Deep content search with proper full‑text indexing instead of “grep‑style” one‑offs.
- Tiny idle footprint: the service runs in tens of MB instead of hundreds.
- Background‑respectful: content indexing only when the machine is truly idle and not under load.
- Transparent internals: status/metrics APIs, structured logs, and clear scheduler behavior.
Why not just use Windows Search / Everything / yet another crawler?
- Windows Search
  - Pro: good coverage and content indexing.
  - Con: slow initial builds, opaque state, heavy resident memory, intrusive background activity.
- Everything
  - Pro: fantastic filename search via the USN journal and MFT.
  - Con: no content indexing at all.
- Typical recursive crawlers
  - Con: walk directory trees, opening handles and reading metadata one directory at a time.
  - Con: scale poorly with depth and can’t fully exploit NTFS internals.
UltraSearch goal: Combine Everything‑style filename speed with Windows Search‑style content coverage, but with a service that behaves like a well‑designed daemon (low memory, cooperative background behavior, strong observability) instead of a monolithic desktop app.
This section is for people who just want to install and use UltraSearch. The rest of the README is a deep architectural dive.
Download (Windows)
Get the latest UltraSearch installer from the Releases page: https://github.com/Dicklesworthstone/ultrasearch/releases/latest
When releases are published, you’ll see a .exe installer asset there. Download that file (e.g. UltraSearch-Setup-x.y.z.exe).
- Double‑click the downloaded installer.
- Follow the on‑screen prompts:
  - Install the UltraSearch Service (runs in the background).
  - Install the UltraSearch UI (start menu + tray).
- The installer will:
  - Register the Windows service.
  - Create start menu entries.
  - Optionally configure “Start with Windows” for the UI tray app.
Once installation completes:
- Start the UI:
  - From the Start menu (UltraSearch), or
  - From the system tray (if auto‑start enabled).
- Trigger Spotlight‑style search with Alt+Space:
  - A floating palette appears.
  - Type to search filenames and, once indexing is ready, content.
- Hit Enter to open the selected file/folder, or use context actions from the result row.
- Initial metadata build runs quickly using direct MFT enumeration.
- Content indexing runs only when the system is idle and not busy:
  - Short‑lived worker processes extract and index content.
  - The always‑on service remains small and responsive.
Use standard Windows mechanisms:
- Settings → Apps → Installed apps → UltraSearch → Uninstall, or
- Control Panel → Programs → Programs and Features → UltraSearch → Uninstall
The uninstaller removes the service, UI, and associated program files. Index/state directories under %PROGRAMDATA%\UltraSearch may be preserved for faster re‑installs, depending on installer options.
Recent UI/UX work focuses on making UltraSearch feel like a modern, lightweight desktop companion rather than “a big app you sometimes open”.
- Spotlight‑style Quick Search (Alt+Space)
  - Floating palette with:
    - Search‑as‑you‑type filename search via the meta index.
    - Inline query highlighting in results.
    - Recent history surfaced for quick re‑runs.
  - Fully keyboard‑driven: arrows to navigate, Enter to open, Esc to dismiss.
- Keyboard shortcuts overlay / Help panel (F1, Ctrl+/, Cmd+/)
  - Shows a grouped cheatsheet of global + in‑app shortcuts.
  - Includes tray / update tips and power‑user tricks.
  - Accessible both via shortcuts and a header “Help” chip in the UI.
  - Dismissible via Esc or clicking outside.
- Tray awareness
  - Tray tooltip reflects Idle / Indexing / Update available / Offline.
  - Update panel supports:
    - Opt‑in update checks.
    - Check → download → restart flow.
    - Release notes surfaced inline.
- GraalVM guard for Extractous
  - content-extractor’s build.rs enforces GraalVM CE 23.x when the extractous_backend feature is enabled.
  - Setup details live in docs/GRAALVM_SETUP.md.
- IPC self‑healing
  - Named‑pipe client uses retry with backoff.
  - Handles:
    - Service restarts.
    - Temporary pipe‑busy states.
    - Connection races during boot / upgrade.
- Status metrics
  - Queue depth, active workers, content jobs enqueued/dropped are surfaced via status/metrics responses.
  - The UI and external tools can query these to show progress and diagnose issues.
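The retry-with-backoff behaviour of the named-pipe client can be sketched as a small generic helper. This is an illustration only: `retry_with_backoff` and its `max_attempts`, `base_delay`, and `max_delay` parameters are hypothetical names, not UltraSearch's actual IPC code or settings.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_attempts` times, doubling the delay after each failure.
/// All parameters are illustrative, not real UltraSearch configuration keys.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    max_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                // Exponential growth, capped so a long outage never produces huge waits.
                delay = (delay * 2).min(max_delay);
                attempt += 1;
            }
        }
    }
}
```

A pipe client would pass its connect call as `op`, so service restarts and pipe-busy states turn into a few short waits rather than an immediate failure.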
For a concise feature overview with links to docs and setup instructions, see docs/FEATURES.md.
Most desktop search tools are monolithic: one big process that discovers files, indexes them, extracts content, answers queries, and renders UI. UltraSearch deliberately splits this into specialized pieces:
- A tiny Windows service that:
  - Enumerates NTFS MFTs and tails USN journals.
  - Maintains the metadata index for instant filename search.
  - Manages scheduling and job queues.
- Short‑lived worker processes that:
  - Extract content with Extractous/IFilter/OCR.
  - Write to the content index.
  - Exit immediately after committing, releasing all heavy allocations.
- A GPU‑accelerated UI process that:
  - Speaks to the service over named pipes.
  - Renders virtualized result lists and previews.
  - Provides a modern, responsive UX.
The design goal: service stays tiny and predictable, while heavy work is isolated, bounded, and disposable.
Desktop search tools have historically forced a choice between speed and completeness.
Windows Search (built‑in indexing) offers comprehensive content coverage but has issues:
- Slow initial index builds (hours or days on large filesystems).
- High idle memory usage (hundreds of MB resident).
- Background activity that can visibly affect system responsiveness.
- Opaque indexing state; debugging “why didn’t it find X?” is hard.
- Limited query expressiveness and ranking control.
Everything shows that instant filename search is possible:
- Sub‑10ms latency for filename queries even on millions of files.
- Very low memory footprint (tens of MB).
- Real‑time change detection via the USN journal.
…but it intentionally doesn’t do content indexing.
Many “custom search” tools rely on recursive directory crawlers:
- Walk directory trees using FindFirstFile/FindNextFile.
- Cannot exploit NTFS’s centralized MFT structure.
- Miss files inaccessible via standard directory APIs.
- Scale poorly: complexity proportional to directory depth and breadth.
UltraSearch bridges these gaps by designing for NTFS and Windows specifically, rather than pretending all filesystems are the same:
- NTFS‑native indexing
  - Direct MFT enumeration (via usn-journal-rs) instead of recursive traversal.
  - Achieves 100k–1M files/sec enumeration on modern SSDs.
- Process isolation
  - Heavy content extraction and indexing occur in short‑lived worker processes.
  - Workers exit after finishing a batch, returning the system to a minimal baseline.
- Unified search engine
  - Both metadata and content indices are powered by Tantivy, a Rust search engine.
  - Consistent scoring, query expressiveness, and field semantics across modes.
- Intelligent scheduling
  - Background indexing respects user idle time and system load.
  - Content jobs only run when the machine can spare the resources.
- Memory efficiency
  - Memory‑mapped Tantivy segments.
  - Zero‑copy serialization for state.
  - Bounded allocations for writers and extractors.
Result: Everything‑style speed for filenames plus full‑text content search, with a background service that behaves like a good citizen.
UltraSearch uses a multi‑process architecture: one service, many short‑lived workers, plus a UI process.
The diagram below shows the major processes and their responsibilities.
graph TB
subgraph "Service Process (searchd)"
direction TB
SVC[Service Main<br/>Windows Service Host]
VOL[Volume Discovery<br/>NTFS Volume Enumeration]
MFT[MFT Enumerator<br/>Master File Table Scanner]
USN[USN Watcher<br/>Change Journal Tailer]
SCHED[Scheduler Runtime<br/>Idle Detection & Job Queue]
IPC[IPC Server<br/>Named Pipe Endpoint]
META_IDX[Meta Index Reader<br/>Tantivy IndexReader]
STATUS[Status Provider<br/>Metrics & Health]
end
subgraph "Worker Process (search-index-worker)"
direction TB
WORKER[Worker Main<br/>Batch Processor]
EXTRACT[Content Extractor<br/>Extractous/IFilter/OCR]
CONTENT_IDX[Content Index Writer<br/>Tantivy IndexWriter]
LIMITS[Resource Limits<br/>Memory & CPU Caps]
end
subgraph "UI Process (search-ui)"
direction TB
UI[GPUI Application<br/>Desktop Client]
UI_IPC[IPC Client<br/>Async Pipe Handler]
RESULTS[Results Table<br/>Virtualized List]
PREVIEW[Preview Pane<br/>Syntax Highlighting]
end
subgraph "Storage Layer"
direction LR
META_DISK[(Meta Index<br/>Tantivy Segments)]
CONTENT_DISK[(Content Index<br/>Tantivy Segments)]
STATE[Volume State<br/>rkyv Archives]
CONFIG[Configuration<br/>TOML Files]
end
VOL -->|Discovers| MFT
MFT -->|Streams FileMeta| META_IDX
USN -->|Publishes Events| SCHED
SCHED -->|Spawns| WORKER
EXTRACT -->|Writes Docs| CONTENT_IDX
IPC -->|Queries| UI_IPC
META_IDX -->|Reads| META_DISK
CONTENT_IDX -->|Writes| CONTENT_DISK
MFT -->|Persists| STATE
USN -->|Updates| STATE
STATUS -->|Exposes| IPC
Key points:
- The service owns all NTFS handles, indexing decisions, and the metadata index.
- Workers are spun up and torn down by the scheduler to process content jobs.
- The UI talks only to the service via IPC; it never touches the filesystem directly for search.
This diagram separates the initial bootstrapping from normal operation.
graph LR
subgraph "Startup Phase"
A[Service Starts] --> B[Discover Volumes]
B --> C[Enumerate MFT]
C --> D[Build Meta Index]
D --> E[Start USN Watcher]
E --> F[Start Scheduler]
F --> G[Ready for Queries]
end
subgraph "Runtime Phase"
G --> H{User Query?}
H -->|Yes| I[IPC Handler]
I --> J[Search Handler]
J --> K[Query Index]
K --> L[Return Results]
G --> M{File Changed?}
M -->|Yes| N[USN Event]
N --> O{Scheduler Check}
O -->|Idle| P[Spawn Worker]
P --> Q[Extract Content]
Q --> R[Update Content Index]
R --> S[Worker Exits]
end
- Startup: full MFT enumeration + metadata index build, USN watcher, scheduler.
- Runtime:
  - Queries are handled immediately using the metadata/content indices.
  - File system changes stream in via USN; scheduler decides when to spawn workers.
sequenceDiagram
participant NTFS as NTFS Volume
participant MFT as MFT Enumerator
participant USN as USN Watcher
participant SCHED as Scheduler
participant WORKER as Index Worker
participant META_IDX as Meta Index
participant CONTENT_IDX as Content Index
Note over NTFS,META_IDX: Initial Build Phase
NTFS->>MFT: Enumerate MFT Records
MFT->>META_IDX: Stream FileMeta batches
META_IDX->>META_IDX: Build Tantivy docs
META_IDX->>META_IDX: Commit segments
Note over USN,CONTENT_IDX: Incremental Update Phase
NTFS->>USN: USN Journal Records
USN->>SCHED: FileEvent (Created/Modified/Deleted)
SCHED->>SCHED: Check Idle State & System Load
SCHED->>WORKER: Spawn worker with job batch
WORKER->>WORKER: Extract content (Extractous/IFilter)
WORKER->>CONTENT_IDX: Write extracted documents
WORKER->>CONTENT_IDX: Commit index
WORKER->>WORKER: Exit (release memory)
The metadata index is built in a single streaming pass over the MFT; the content index is incrementally updated based on USN change events and the scheduler’s decisions.
sequenceDiagram
participant UI as UI Client
participant IPC as IPC Server
participant HANDLER as Search Handler
participant META as Meta Index
participant CONTENT as Content Index
UI->>IPC: SearchRequest (query AST, mode, limit)
IPC->>HANDLER: Dispatch request
alt SearchMode::NameOnly
HANDLER->>META: Build query (name/path fields)
META->>HANDLER: Search results (DocKey, score)
else SearchMode::Content
HANDLER->>CONTENT: Build query (content field)
CONTENT->>HANDLER: Search results (DocKey, score)
else SearchMode::Hybrid
par Parallel Execution
HANDLER->>META: Query metadata index
HANDLER->>CONTENT: Query content index
end
META->>HANDLER: Meta results
CONTENT->>HANDLER: Content results
HANDLER->>HANDLER: Merge by DocKey (max score)
HANDLER->>HANDLER: Sort by score, apply limit
end
HANDLER->>IPC: SearchResponse (hits, total, took_ms)
IPC->>UI: Return response
UI->>UI: Render results table
Modes:
- NameOnly – Everything‑style filename search via the metadata index.
- Content – pure full‑text search via the content index.
- Hybrid – both indices are queried and the results merged by DocKey, keeping the higher score when a file matches in both.
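The Hybrid merge step shown in the sequence diagram (merge by DocKey with max score, sort by score, apply limit) could look roughly like this. It is a minimal sketch: `Hit` and `merge_hybrid` are hypothetical names, and a raw `u64` stands in for the real DocKey type.

```rust
use std::collections::HashMap;

/// Simplified hit: (DocKey as raw u64, score). A stand-in for the real hit type.
pub type Hit = (u64, f32);

/// Merge hits from both indices, keeping the higher score per DocKey,
/// then sort by descending score and apply the limit.
pub fn merge_hybrid(meta: &[Hit], content: &[Hit], limit: usize) -> Vec<Hit> {
    let mut best: HashMap<u64, f32> = HashMap::new();
    for &(key, score) in meta.iter().chain(content.iter()) {
        best.entry(key)
            .and_modify(|s| *s = (*s).max(score))
            .or_insert(score);
    }
    let mut merged: Vec<Hit> = best.into_iter().collect();
    // total_cmp gives a total order on f32; ties break on DocKey for determinism.
    merged.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    merged.truncate(limit);
    merged
}
```

Taking the maximum is the simplest fusion rule; it never ranks a file lower for matching in both indices, but it also gives no bonus for doing so.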
stateDiagram-v2
[*] --> Active: Service Start
Active --> WarmIdle: Idle >= 15s
WarmIdle --> Active: User Input
WarmIdle --> DeepIdle: Idle >= 60s
DeepIdle --> Active: User Input
DeepIdle --> DeepIdle: Content Indexing Allowed
Active --> Active: Critical Updates Only
WarmIdle --> WarmIdle: Metadata Updates
DeepIdle --> DeepIdle: Full Indexing
- Active – user interacting: only cheap, critical updates.
- WarmIdle – short idle: metadata maintenance allowed.
- DeepIdle – sustained idle: full content indexing allowed, subject to system load constraints.
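The state machine above reduces to a threshold check on how long the user has been idle. The sketch below uses the 15 s and 60 s thresholds from the diagram; `classify_idle`, `content_indexing_allowed`, and the CPU-load parameters are hypothetical names introduced for illustration.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IdleState {
    Active,
    WarmIdle,
    DeepIdle,
}

/// Thresholds taken from the state diagram.
pub const WARM_IDLE_AFTER: Duration = Duration::from_secs(15);
pub const DEEP_IDLE_AFTER: Duration = Duration::from_secs(60);

/// Map "time since last user input" to a scheduler state.
pub fn classify_idle(idle_for: Duration) -> IdleState {
    if idle_for >= DEEP_IDLE_AFTER {
        IdleState::DeepIdle
    } else if idle_for >= WARM_IDLE_AFTER {
        IdleState::WarmIdle
    } else {
        IdleState::Active
    }
}

/// Content indexing needs DeepIdle *and* acceptable system load.
/// `cpu_busy_pct` and `max_cpu_busy_pct` are assumed inputs, not real config keys.
pub fn content_indexing_allowed(state: IdleState, cpu_busy_pct: u8, max_cpu_busy_pct: u8) -> bool {
    state == IdleState::DeepIdle && cpu_busy_pct <= max_cpu_busy_pct
}
```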
UltraSearch’s data model is designed around efficient storage, fast queries, and stable correlation between metadata and content indices.
Identifiers are chosen to align with NTFS and to pack efficiently into 64‑bit keys.
A small integer assigned at runtime that maps to NTFS volume GUIDs and drive letters.
┌─────────────────────────────────────────────────────────┐
│ VolumeId: u16 │
├─────────────────────────────────────────────────────────┤
│ Range: 0..65535 │
│ Assignment: Sequential at service startup │
│ Persistence: Stored in volume state files │
│ Use Cases: Per-volume filtering, isolation │
└─────────────────────────────────────────────────────────┘
Design rationale
- 16 bits → 65,536 possible volumes, far beyond practical needs.
- Tiny type → efficient packing into composite keys.
- Sequential assignment → simple volume bookkeeping.
- Runtime assignment → supports dynamic volume discovery.
The 64‑bit NTFS File Reference Number (FRN) from the MFT.
┌─────────────────────────────────────────────────────────┐
│ FileId: u64 │
├─────────────────────────────────────────────────────────┤
│ Bits 0-47: File Reference Number (FRN) │
│ Bits 48-63: Sequence Number │
│ Stability: Persists across renames (same volume) │
│ Detection: Sequence number detects stale references │
└─────────────────────────────────────────────────────────┘
Design rationale
- FRN is NTFS’s native file identifier; it directly indexes into the MFT.
- Sequence number lets us detect deleted or reused entries.
- 48‑bit FRN supports absurdly many files per volume (281T theoretical, ~4B practical).
- Stable across renames → index updates don’t require path walks.
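The bit layout above can be made concrete with a few helpers. This is a sketch under the stated layout (48-bit record number in the low bits, 16-bit sequence number in the high bits); the function names are hypothetical.

```rust
const RECORD_BITS: u32 = 48;
const RECORD_MASK: u64 = (1u64 << RECORD_BITS) - 1;

/// Split a 64-bit NTFS file reference into (record number, sequence number).
pub fn split_file_id(file_id: u64) -> (u64, u16) {
    (file_id & RECORD_MASK, (file_id >> RECORD_BITS) as u16)
}

/// Rebuild a file reference from its two parts.
pub fn make_file_id(record: u64, sequence: u16) -> u64 {
    ((sequence as u64) << RECORD_BITS) | (record & RECORD_MASK)
}

/// A stored reference is stale when the MFT slot is the same but the
/// sequence number has moved on, i.e. the record was deleted and reused.
pub fn is_stale(stored: u64, current: u64) -> bool {
    let (stored_record, stored_seq) = split_file_id(stored);
    let (current_record, current_seq) = split_file_id(current);
    stored_record == current_record && stored_seq != current_seq
}
```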
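A composite key that packs (VolumeId, FileId) into a single 64‑bit value. Only the low 48 bits of the FileId (the FRN proper) fit next to the 16‑bit VolumeId, so the FileId’s sequence number is not part of the key.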
┌─────────────────────────────────────────────────────────┐
│ DocKey: u64 │
├─────────────────────────────────────────────────────────┤
│ Bits 0-47: Low 48 bits of FileId (FRN, no sequence) │
│ Bits 48-63: VolumeId │
│ Format: "{volume}:0x{frn_hex}" for debugging │
│ Use: Primary key for all index operations │
└─────────────────────────────────────────────────────────┘
Implementation
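impl DocKey {
    /// Pack the 16-bit VolumeId into the high bits and the low 48 bits of the FileId below it.
    pub const fn from_parts(volume: VolumeId, file: FileId) -> Self {
        // The mask drops the FileId's 16-bit sequence number; only the 48-bit FRN is kept.
        let packed = ((volume as u64) << 48) | (file & 0x0000_FFFF_FFFF_FFFF);
        DocKey(packed)
    }

    /// Split the key again. The returned FileId has its sequence bits zeroed.
    pub const fn into_parts(self) -> (VolumeId, FileId) {
        let volume = (self.0 >> 48) as VolumeId;
        let file = self.0 & 0x0000_FFFF_FFFF_FFFF;
        (volume, file)
    }
}
Design rationale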
- Single 64‑bit key → efficient storage, comparison, and hashing.
- Facilitates fast field filters in Tantivy (e.g. per‑volume filtering).
- Human‑readable formatting helps debugging/logging.
- Enables range queries scoped to volume ranges.
Trade‑offs considered
- Separate fields – more storage and more complex queries. Rejected.
- String keys – huge memory and comparison overhead. Rejected.
- 128‑bit UUIDs – bigger and lose volume locality. Rejected.
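Because the VolumeId sits in the top 16 bits, every key for one volume falls into one contiguous numeric range, which is what makes volume-scoped range queries possible. The helpers below sketch that range and the debug format from the layout box; `volume_key_range` and `format_doc_key` are hypothetical names operating on the raw `u64`.

```rust
use std::ops::RangeInclusive;

const FILE_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;

/// All DocKey values that belong to `volume`, as an inclusive numeric range.
pub fn volume_key_range(volume: u16) -> RangeInclusive<u64> {
    let start = (volume as u64) << 48;
    start..=(start | FILE_MASK)
}

/// Debug formatting in the "{volume}:0x{frn_hex}" style.
pub fn format_doc_key(key: u64) -> String {
    format!("{}:0x{:x}", key >> 48, key & FILE_MASK)
}
```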
UltraSearch maintains two indices:
- Metadata index (for filenames and attributes).
- Content index (for full‑text content).
Each has its own schema and update cadence.
Optimized for filename and attribute queries with minimal storage overhead.
┌─────────────────────────────────────────────────────────┐
│ Metadata Document Schema │
├─────────────────────────────────────────────────────────┤
│ doc_key: u64 FAST + STORED Primary key │
│ volume: u16 FAST Per-volume filtering │
│ name: TEXT Indexed Filename search │
│ path: TEXT Indexed Path search │
│ ext: STRING FAST Extension filter │
│ size: u64 FAST Size range queries │
│ created: i64 FAST Creation time │
│ modified: i64 FAST Modification time │
│ flags: u64 FAST Attribute bitfield │
└─────────────────────────────────────────────────────────┘
Field details
- doc_key – primary key; FAST field for efficient deletes/updates and filtering.
- volume – FAST field for per‑volume scoping.
- name
  - Tokenized on [\ / . - _].
  - Supports partial and prefix matches on filenames.
- path
  - Tokenized on directory separators.
  - Optional because path can be reconstructed, but useful for ranking/snippets.
- ext
  - STRING + FAST for exact extension filters (ext:pdf, ext:rs).
- size
  - u64 for size range queries up to multi‑exabyte files.
- created / modified
  - Unix timestamps; FAST fields for range filters (e.g. “modified last 7 days”).
- flags
  - Bitfield encoding attributes like IS_DIR, HIDDEN, SYSTEM, ARCHIVE, REPARSE, OFFLINE, TEMPORARY.
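NTFS stores timestamps as Windows FILETIME values (100 ns ticks since 1601‑01‑01 UTC), so filling the created / modified fields implies a conversion to Unix time. The sketch below assumes the fields hold whole seconds, which the schema does not state; the function names are hypothetical.

```rust
/// FILETIME counts 100 ns ticks, so there are 10 million per second.
const FILETIME_TICKS_PER_SEC: i64 = 10_000_000;
/// Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
const EPOCH_DIFF_SECS: i64 = 11_644_473_600;

/// Convert a Windows FILETIME tick count to Unix seconds.
pub fn filetime_to_unix_secs(filetime: i64) -> i64 {
    filetime.div_euclid(FILETIME_TICKS_PER_SEC) - EPOCH_DIFF_SECS
}

/// Lower bound for a "modified in the last N days" range filter.
pub fn modified_since_cutoff(now_unix_secs: i64, days: i64) -> i64 {
    now_unix_secs - days * 86_400
}
```

A "modified last 7 days" filter then becomes a range query on the modified field from `modified_since_cutoff(now, 7)` upward.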
Typical document size
- ~100–500 bytes per file (path length dominates).
Update frequency
- Updated immediately on file system changes via the USN journal.
Full‑text content extracted from files, updated only in background.
┌─────────────────────────────────────────────────────────┐
│ Content Document Schema │
├─────────────────────────────────────────────────────────┤
│ doc_key: u64 FAST + STORED Correlates with meta │
│ volume: u16 FAST Per-volume filtering │
│ name: TEXT Indexed Name boost │
│ path: TEXT STORED Snippet context │
│ ext: STRING FAST Format filter │
│ size: u64 FAST Size filter │
│ modified: i64 FAST Recency boost │
│ content_lang: STRING STORED Analyzer selection │
│ content: TEXT Indexed Full-text search │
└─────────────────────────────────────────────────────────┘
Field details
- doc_key – correlation key to metadata index.
- content – main full‑text field, with language‑appropriate analyzer.
- content_lang – ISO‑639‑1 language code; used to pick analyzers.
- name / path – used to boost ranking for filename/path matches.
Typical document size
- Varies widely; content truncated at configurable limits (~16–32 MiB, ~100–200k chars).
Update frequency
- Only during background indexing, when the scheduler decides conditions are safe.
This is one of the key design decisions.
Alternative 1 – single unified index
- Pros:
  - Single index, single query path.
- Cons:
  - Content updates touch a large index.
  - Service must keep content reader open (higher memory baseline).
  - Can’t “turn off” content indexing without impacting metadata.
- Verdict: Rejected due to mismatched update patterns and memory impact.
Alternative 2 – three indices (meta, content, combined)
- Pros:
  - Flexible routing/optimizations.
- Cons:
  - Data duplication and higher storage cost.
  - Consistency challenges across three indices.
  - Complex update logic.
- Verdict: Rejected as over‑complex.
Chosen approach – two separate indices
Benefits:
- Update frequency mismatch
  - Metadata changes often (renames, moves, flags).
  - Content changes less often; sometimes never.
  - Separate indices let us update metadata cheaply without touching content.
- Query patterns
  - Filename searches are much more common than content searches.
  - A lean metadata index keeps common queries extremely fast.
- Memory footprint
  - Service can operate with only the meta index loaded by default.
  - Content index readers are opened lazily as needed.
- Indexing strategy
  - Metadata indexing is cheap and can run in the service.
  - Content extraction is heavy and moved into worker processes.
- Failure isolation
  - Content index issues don’t impair filename search.
  - Rebuilding content index doesn’t require touching metadata.
Trade‑offs:
- Result merging – hybrid queries must merge results from two indices (by DocKey).
- Consistency – two stores can diverge; mitigated by DocKey as the single source of truth and atomic per‑index updates.
UltraSearch is deeply NTFS‑specific by design.
At service startup (and when volumes change), UltraSearch discovers NTFS volumes.
Simplified algorithm
// Pseudocode, not exact implementation:
1. GetLogicalDrives() → bitmask of drive letters
2. For each drive letter:
a. GetVolumeInformationW() → filesystem type
b. Filter to "NTFS"
c. GetVolumeNameForVolumeMountPointW() → volume GUID path
d. Track mapping: drive letters ↔ GUID
3. Assign VolumeId sequentially
4. Persist mapping in volume state files
Volume GUID paths vs drive letters
UltraSearch uses volume GUID paths like \\?\Volume{GUID}\ instead of C:\:
- Stable across reboots and letter reassignments.
- Support mount points where multiple letters map to a single volume.
- Volumes may exist without any drive letter.
- GUIDs are globally unique.
Example:
VolumeId: 1
GUID Path: \\?\Volume{12345678-1234-1234-1234-123456789abc}\
Drive Letters: ["C:", "D:"]
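Step 1 of the discovery algorithm hinges on decoding the `GetLogicalDrives()` bitmask, where bit 0 stands for A:, bit 1 for B:, and so on. The helper below shows only that decoding step as plain Rust, with no Win32 calls; `drive_letters_from_mask` is a hypothetical name.

```rust
/// Decode a GetLogicalDrives()-style bitmask into drive-letter strings.
/// Bit 0 is "A:", bit 1 is "B:", ... bit 25 is "Z:".
pub fn drive_letters_from_mask(mask: u32) -> Vec<String> {
    (0..26u8)
        .filter(|&bit| mask & (1u32 << bit) != 0)
        .map(|bit| format!("{}:", (b'A' + bit) as char))
        .collect()
}
```

Each returned letter would then go through the filesystem-type check and GUID lookup from steps 2a to 2d.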
The MFT is NTFS’s global table of file records. UltraSearch enumerates it directly via usn-journal-rs.
Highly simplified MFT record layout
┌─────────────────────────────────────────────────────────┐
│ MFT Record (≈1024 bytes) │
├─────────────────────────────────────────────────────────┤
│ Header: Record number, sequence, flags │
│ Attributes: │
│ - $STANDARD_INFORMATION: timestamps, attributes │
│ - $FILE_NAME: filename, parent FRN │
│ - $DATA: file data / streams │
│ - $BITMAP: allocation info │
└─────────────────────────────────────────────────────────┘
Enumeration steps
- Open a volume handle:
  CreateFileW(volume_guid_path, FILE_READ_ATTRIBUTES | FILE_READ_DATA | FILE_LIST_DIRECTORY, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, ...)
- Use usn-journal-rs to iterate MFT records:
  - Extract FRN, parent FRN.
  - Read file name, flags, size, timestamps.
  - Filter out system/inaccessible entries per configuration.
  - Resolve FRN→path via parent graph with a small LRU cache.
- Stream FileMeta into the metadata index builder:
  - No “whole filesystem in memory” structure.
  - Periodic commits (e.g. every 100k docs or 30s).
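Resolving an FRN to a path via the parent graph amounts to walking parent links up to the volume root and reusing any ancestor path that is already cached. The sketch below is illustrative only: `Node`, `resolve_path`, and `MAX_DEPTH` are hypothetical, and a plain `HashMap` stands in for the LRU cache.

```rust
use std::collections::HashMap;

/// Minimal per-record info needed for path resolution (hypothetical type).
pub struct Node {
    pub parent: u64,
    pub name: String,
}

/// Guard against cycles in a corrupted parent graph (illustrative limit).
const MAX_DEPTH: usize = 4096;

/// Build the volume-relative path for `frn` by walking up to `root`.
/// Returns None if a parent is missing or the chain is too deep.
pub fn resolve_path(
    frn: u64,
    root: u64,
    nodes: &HashMap<u64, Node>,
    cache: &mut HashMap<u64, String>,
) -> Option<String> {
    let mut names: Vec<&str> = Vec::new();
    let mut current = frn;
    let mut prefix = String::new();
    while current != root {
        // A cached ancestor (or the entry itself) short-circuits the walk.
        if let Some(cached) = cache.get(&current) {
            prefix = cached.clone();
            break;
        }
        if names.len() >= MAX_DEPTH {
            return None;
        }
        let node = nodes.get(&current)?;
        names.push(node.name.as_str());
        current = node.parent;
    }
    let mut path = prefix;
    for name in names.iter().rev() {
        path.push('\\');
        path.push_str(name);
    }
    // Only the requested entry is cached; directories get cached when resolved themselves.
    cache.insert(frn, path.clone());
    Some(path)
}
```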
Performance characteristics
- Enumeration rate: ~100k–1M files/sec (disk‑dependent).
- Memory usage: low; the design is fully streaming.
- Path resolution: assisted by usn-journal-rs helpers + LRU cache.
Why MFT enumeration instead of directory traversal?
| Approach | Complexity | Speed | Coverage | Memory |
|---|---|---|---|---|
| MFT enumeration | O(n files) | 100k–1M/s | Complete (even system/inaccessible) | Very low |
| Directory traversal | O(n × depth) | 1k–10k/s | Incomplete (miss hidden/system) | Higher (trees) |
- MFT access is designed for exactly this type of global scan.
- Directory walkers are convenient but fundamentally limited.
The USN Change Journal is NTFS’s append‑only record of filesystem changes. UltraSearch tails it per volume.
Journal metadata (simplified)
┌─────────────────────────────────────────────────────────┐
│ USN Journal Data │
├─────────────────────────────────────────────────────────┤
│ UsnJournalID: unique journal ID │
│ FirstUsn: oldest record │
│ NextUsn: next USN to be assigned │
│ LowestValidUsn: oldest valid record │
│ MaxUsn: maximum possible USN │
└─────────────────────────────────────────────────────────┘
Architecture
- One watcher thread per volume in the service.
- Uses FSCTL_QUERY_USN_JOURNAL / FSCTL_READ_USN_JOURNAL via usn-journal-rs.
- Stores (journal_id, last_usn) per volume in VolumeState.
Event model
pub enum FileEvent {
Created { usn: u64, frn: u64, name: String, parent_frn: u64 },
Deleted { usn: u64, frn: u64 },
Modified { usn: u64, frn: u64 },
Renamed { usn: u64, frn: u64, old_name: String, new_name: String, parent_frn: u64 },
BasicInfoChanged { usn: u64, frn: u64 }, // attrs, timestamps, ACLs
}
Handling journal gaps
On startup, UltraSearch compares persisted state to current journal:
- If journal_id changed → journal recreated (e.g. volume reformat).
- If last_usn is outside [FirstUsn, NextUsn] → journal wrapped.
- In both cases, the volume is marked stale and scheduled for a rescan, since the missed changes can no longer be replayed from the journal.
Why USN instead of ReadDirectoryChangesW?
| Feature | USN Journal | ReadDirectoryChangesW |
|---|---|---|
| Completeness | Every change, until the journal wraps | Can lose events (buffer overflow) |
| Persistence | Survives reboot | Volatile |
| Handles | Single per volume | One per watched directory tree |
| Events while the service is down | Recoverable | Permanently lost |
| High change rates | Robust | Buffer overflow, dropped notifications |
USN is simply the correct primitive for robust incremental indexing.
UltraSearch uses Tantivy 0.24.x as its search engine, chosen after comparing Rust and non‑Rust alternatives.
Comparison snapshot:
| Engine | Language | Performance | Memory | Rust integration | Notes |
|---|---|---|---|---|---|
| Tantivy | Rust | Excellent | Efficient | Native | Actively maintained |
| Lucene | Java | Excellent | Higher | FFI required | De‑facto standard |
| Bleve | Go | Good | Moderate | FFI required | Mature Go option |
| Meilisearch | Rust | Good | Higher | Native | Bundled server |
Decision factors:
-
Rust‑native implementation
- No FFI boundary, single toolchain, shared ecosystem.
-
Performance
- Competitive with Lucene in many workloads.
-
Memory efficiency
- CompactDoc format is well‑suited to large indices.
-
Feature set
- Fast fields, scoring models, analyzers, and range queries.
-
Active maintenance
- Modern Rust style; regularly updated.
Both indices use CompactDoc and fast fields aggressively.
doc_key: u64 // FAST | STORED
volume: u16 // FAST
name: TEXT // custom tokenizer for filenames
path: TEXT // optional, tokenized by separators
ext: STRING // FAST
size: u64 // FAST
created: i64 // FAST
modified: i64 // FAST
flags: u64 // FAST
Field types
- FAST – columnar storage; used heavily for filters and range queries.
- STORED – retrievable fields, not indexed; mostly for IDs.
- TEXT – tokenized, full‑text searchable.
- STRING – raw keyword, exact match.
Filename tokenization
Example:
Input: "my-document_v2.final.pdf"
Tokens: ["my", "document", "v2", "final", "pdf", "my-document_v2.final.pdf"]
This supports:
- Prefix and partial matches ("doc" matches "document").
- Whole‑filename queries (last token).
- Better recall without bloating the index.
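The tokenization in the example can be approximated with a plain string split. This is a sketch of the idea rather than the actual Tantivy tokenizer: `tokenize_filename` is a hypothetical function, and lowercasing the tokens is an assumption the text does not state.

```rust
/// Split a filename on the separator set [\ / . - _] and append the whole
/// name as a final token so exact-filename queries still match.
pub fn tokenize_filename(name: &str) -> Vec<String> {
    let mut tokens: Vec<String> = name
        .split(|c: char| matches!(c, '\\' | '/' | '.' | '-' | '_'))
        .filter(|part| !part.is_empty())
        .map(|part| part.to_lowercase()) // assumption: case-insensitive matching
        .collect();
    // Skip the whole-name token when it would only duplicate a single part.
    let whole = name.to_lowercase();
    if !tokens.contains(&whole) {
        tokens.push(whole);
    }
    tokens
}
```

Matching "doc" against "document" additionally needs a prefix query at search time; the tokenizer only guarantees that "document" exists as its own term.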
doc_key: u64 // FAST | STORED
volume: u16 // FAST
name: TEXT // boosts ranking
path: TEXT // stored; snippet context
ext: STRING // FAST
size: u64 // FAST
modified: i64 // FAST
content_lang: STRING // stored
content: TEXT // main full-text field
Analyzers
- Default: English analyzer (tokenization + stopwords + stemming).
- Future: per‑language analyzers based on content_lang.
- Further future: document‑type‑aware analyzers (code, logs, etc.).
The indexing stack is engineered to keep memory usage predictable and low.
Tantivy segments and certain state blobs use memmap2:
use memmap2::{Mmap, MmapOptions};
use std::fs::File;
use std::io::Result;
use std::path::Path;

pub struct MappedIndex {
    mmap: Mmap,
    // ...
}

impl MappedIndex {
    pub fn from_file(path: &Path) -> Result<Self> {
        let file = File::open(path)?;
        // SAFETY: the mapping is only sound while no other process truncates or rewrites the file.
        let mmap = unsafe { MmapOptions::new().map(&file)? };
        Ok(Self { mmap })
    }
}
Benefits:
- OS‑level page cache decides what stays resident.
- Zero‑copy reads.
- Memory usage scales with working set, not total index size.
- Natural behavior under memory pressure.
Volume state and snapshots are stored using rkyv:
use rkyv::{Archive, Deserialize, Serialize};
#[derive(Archive, Serialize, Deserialize)]
pub struct VolumeState {
pub volume_id: VolumeId,
pub last_usn: u64,
pub journal_id: u64,
// ...
}
// Serialize
let bytes = rkyv::to_bytes::<_, 256>(&state)?;
// Zero-copy read
let archived = unsafe { rkyv::archived_root::<VolumeState>(&bytes[..]) };
Characteristics:
- 2–5x smaller than JSON.
- 10–100x faster to deserialize.
- Zero‑copy reads (no reallocation).
The content index writer lives only in the worker process:
let writer_config = WriterConfig {
heap_size: 64 * 1024 * 1024, // 64 MB
num_threads: 2,
// ...
};
let writer = index.writer_with_config(writer_config)?;
Benefits:
- Worker exit ⇒ all writer allocations vanish.
- Bound per‑worker memory via heap_size.
- Multiple workers can run without starving the system.
Trade‑off: more frequent commits/merges, but far more predictable memory behavior.
The service maintains lightweight readers:
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::Manual)
.num_warmers(0)
.docstore_cache_size(10 * 1024 * 1024)
.build()?;
Rationale:
- Manual reload – avoid automatic reload storms; service explicitly reloads on commits.
- Zero warmers – minimal overhead at startup; rely on OS cache.
- Small docstore cache – keep reader memory small and rely on mmap.
WriterConfig {
heap_size: 512 * 1024 * 1024, // up to 512 MB
num_threads: min(8, num_cpus()),
merge_policy: LogMergePolicy {
target_segment_size: 256 * 1024 * 1024,
max_merged_segment_size: 1024 * 1024 * 1024,
},
}
- Aggressively parallelized and memory‑heavy only during initial build.
- Large segments reduce merge overhead later.
WriterConfig {
heap_size: 64 * 1024 * 1024, // 64–256 MB configurable
num_threads: 2, // small, predictable
merge_policy: LogMergePolicy {
target_segment_size: 128 * 1024 * 1024,
max_merged_segment_size: 256 * 1024 * 1024,
},
}
Why bounded heap?
- Prevents a single worker from blowing up memory.
- Supports multiple concurrent workers safely.
- Makes worst‑case behavior more predictable.
UltraSearch’s content extractor stack is designed to be pluggable, ordered, and resource‑bounded.
Core traits:
pub trait Extractor: Send + Sync {
fn name(&self) -> &'static str;
fn supports(&self, ctx: &ExtractContext) -> bool;
fn extract(&self, ctx: &ExtractContext, key: DocKey) -> Result<ExtractedContent>;
}
pub struct ExtractorStack {
backends: Vec<Box<dyn Extractor + Send + Sync>>,
}
Flow:
- Iterate backends in order.
- First backend where supports() returns true handles the file.
- If extraction fails, error bubbles up (no cross‑backend fallback cascade).
- Return text, language hints, truncation flags, bytes processed.
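The flow above can be condensed into a self-contained sketch. The trait here is deliberately simplified compared with the one shown earlier (no `DocKey`, a `String` error, an extension-only context), and `StubExtractor` is a hypothetical backend that exists only to make the selection rule visible.

```rust
/// Simplified context: only the file extension matters for this sketch.
pub struct ExtractContext<'a> {
    pub ext: &'a str,
}

/// Reduced version of the Extractor trait for illustration.
pub trait Extractor {
    fn name(&self) -> &'static str;
    fn supports(&self, ctx: &ExtractContext<'_>) -> bool;
    fn extract(&self, ctx: &ExtractContext<'_>) -> Result<String, String>;
}

pub struct ExtractorStack {
    backends: Vec<Box<dyn Extractor>>,
}

impl ExtractorStack {
    pub fn new(backends: Vec<Box<dyn Extractor>>) -> Self {
        Self { backends }
    }

    /// The first backend whose supports() is true handles the file.
    /// Its error is returned as-is; later backends are never tried.
    pub fn extract(&self, ctx: &ExtractContext<'_>) -> Result<(&'static str, String), String> {
        let backend = self
            .backends
            .iter()
            .find(|b| b.supports(ctx))
            .ok_or_else(|| format!("no extractor for .{}", ctx.ext))?;
        let text = backend.extract(ctx)?;
        Ok((backend.name(), text))
    }
}

/// Hypothetical backend used only to demonstrate ordering and failure behaviour.
pub struct StubExtractor {
    pub name: &'static str,
    pub exts: &'static [&'static str],
    pub fail: bool,
}

impl Extractor for StubExtractor {
    fn name(&self) -> &'static str {
        self.name
    }
    fn supports(&self, ctx: &ExtractContext<'_>) -> bool {
        self.exts.contains(&ctx.ext)
    }
    fn extract(&self, _ctx: &ExtractContext<'_>) -> Result<String, String> {
        if self.fail {
            Err(format!("{} failed", self.name))
        } else {
            Ok(format!("text from {}", self.name))
        }
    }
}
```

Because selection stops at the first supporting backend, the order of the vector is the whole policy: more specific or cheaper extractors belong at the front.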
Why ordered fallback?
- Try cheaper/faster extractors first (plain text, lightweight parsers).
- Reserve heavyweight Extractous/IFilter/OCR for when really needed.
- Let highly specific extractors win over more generic ones.
Extractous is the primary backend; it exposes Apache Tika functionality via Rust.
Supported families
- Documents: pdf, docx, xlsx, pptx, rtf.
- Web/markup: html, xml, xhtml, md, rst.
- Archives: zip (with internal whitelist).
- Data: csv, json, jsonl.
- E‑books: epub.
- Misc: many generic text formats.
Usage sketch:
let engine = ExtractousEngine::new()
.set_extract_string_max_length(max_chars as i32);
let (text, metadata) = engine.extract_file_to_string(path)?;
Why Extractous?
- Rust‑native: no separate Java runtime or FFI bridging.
- Tika‑level format coverage.
- Strong performance and memory characteristics vs Python/CLI‑based chains.
On Windows, system‑installed IFilters can be used for certain file types.
When:
- Extractous doesn’t support the format.
- A proprietary IFilter offers better fidelity (e.g., some Office variants).
- Users explicitly enable ifilter support.
Sketch:
#[cfg(windows)]
use windows::Win32::System::Com::{IPersistFile, IFilter};
pub struct IFilterExtractor { /* COM wrappers */ }
impl Extractor for IFilterExtractor {
fn extract(&self, ctx: &ExtractContext, key: DocKey) -> Result<ExtractedContent> {
let filter = LoadIFilter(ctx.path)?;
// Iterate chunks via GetChunk/GetText...
Ok(ExtractedContent { /* ... */ })
}
}
Trade‑offs:
- Pros: leverages existing system filters.
- Cons: COM lifetime rules, STA threading, platform‑specific complexity.
- Exposed behind feature flags, not mandatory.
For image‑only PDFs or actual image files, UltraSearch can optionally use Tesseract OCR.
Strategy:
- Attempt normal extraction first.
- If result is empty/near‑empty but file type suggests a scan:
  - Run Tesseract for up to `ocr_max_pages`.
  - Merge recognized text into `content`.
  - Set `content_lang` based on OCR detection.
Sketch:
pub struct OCRExtractor {
    engine: TesseractEngine,
    max_pages: u64,
}
impl Extractor for OCRExtractor {
    fn extract(&self, ctx: &ExtractContext, key: DocKey) -> Result<ExtractedContent> {
        if self.is_image_only(ctx.path)? {
            let text = self.engine.extract_pages(ctx.path, self.max_pages)?;
            // Detect the language before `text` is moved into the result.
            let content_lang = Some(self.engine.detect_language(&text)?);
            Ok(ExtractedContent {
                text,
                content_lang,
                // ...
            })
        } else {
            Err(ExtractError::Unsupported("not image-only".into()))
        }
    }
}
All OCR work is subject to global resource limits (see below).
To keep workers bounded:
- Per‑document limits
  - `max_bytes_per_file` (default: 16–32 MiB).
  - `max_chars_per_file` (default: 100–200k).
- Behavior when exceeded
  - Truncate on UTF‑8 boundaries.
  - Flag document as truncated.
- Archive policies
  - Only index whitelisted formats inside archives.
  - Cap recursion depth.
  - Skip encrypted archives.
- Streaming
  - Where possible, process data in streaming fashion rather than slurping everything into RAM.
Motivation:
- Prevent single pathological files from destabilizing workers.
- Make worst‑case memory usage predictable.
- Accept a controlled amount of truncation for massive files.
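Truncating on UTF‑8 boundaries, as described above, can be sketched with the standard library alone. The names `Truncated`, `truncate_utf8`, and `truncate_chars` are illustrative and not taken from the codebase.

```rust
/// Text after a cap has been applied, plus the "truncated" flag.
#[derive(Debug, PartialEq)]
pub struct Truncated<'a> {
    pub text: &'a str,
    pub truncated: bool,
}

/// Byte cap (the `max_bytes_per_file` idea): never splits a multi-byte character.
pub fn truncate_utf8<'a>(text: &'a str, max_bytes: usize) -> Truncated<'a> {
    if text.len() <= max_bytes {
        return Truncated { text, truncated: false };
    }
    // Walk back to the nearest char boundary; index 0 is always a boundary.
    let mut end = max_bytes;
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    Truncated { text: &text[..end], truncated: true }
}

/// Character cap (the `max_chars_per_file` idea): keeps at most `max_chars` chars.
pub fn truncate_chars<'a>(text: &'a str, max_chars: usize) -> Truncated<'a> {
    match text.char_indices().nth(max_chars) {
        // `byte_idx` is where the first excluded character starts.
        Some((byte_idx, _)) => Truncated { text: &text[..byte_idx], truncated: true },
        None => Truncated { text, truncated: false },
    }
}
```

Both helpers return a borrowed slice, so applying a cap does not copy the document text.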
The scheduler decides when and how much work gets done, based on UI idle time and system load.
UltraSearch uses GetLastInputInfo to derive “how long since the last user input”.
Sketch:
pub struct IdleTracker {
warm_idle: Duration,
deep_idle: Duration,
// ...
}
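impl IdleTracker {
    pub fn sample(&mut self) -> IdleSample {
        // If the idle time cannot be read, fall back to zero (treated as Active).
        let idle_for = self.get_idle_duration().unwrap_or_default();
        let state = self.classify_idle(idle_for);
        // Hypothetical helper: time spent in the current state, kept in the elided fields.
        let since_state_change = self.time_in_state(state);
        IdleSample { state, idle_for, since_state_change }
    }
    fn get_idle_duration(&self) -> Option<Duration> {
        #[cfg(windows)]
        {
            // Pseudocode: the real call fills a LASTINPUTINFO whose dwTime is a 32-bit tick count.
            let last_input: u32 = GetLastInputInfo()?;
            // Compare against the 32-bit GetTickCount and subtract with wrap-around.
            let now: u32 = GetTickCount();
            Some(Duration::from_millis(now.wrapping_sub(last_input) as u64))
        }
        #[cfg(not(windows))]
        { None }
    }
}
Default thresholds: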
- `Active`: < 15s.
- `WarmIdle`: 15–60s.
- `DeepIdle`: > 60s.
These are configurable.
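A minimal sketch of the classification step, using the default thresholds listed above. Two details are assumptions here: exactly 60 s counts as `WarmIdle`, and the tick subtraction wraps because `LASTINPUTINFO.dwTime` is a 32‑bit millisecond tick count. `IdleThresholds` and `idle_millis` are illustrative names.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IdleState {
    Active,
    WarmIdle,
    DeepIdle,
}

pub struct IdleThresholds {
    pub warm_idle: Duration,
    pub deep_idle: Duration,
}

impl Default for IdleThresholds {
    fn default() -> Self {
        Self {
            warm_idle: Duration::from_secs(15),
            deep_idle: Duration::from_secs(60),
        }
    }
}

/// Milliseconds since the last input, given two 32-bit tick counts.
/// `wrapping_sub` keeps the result correct across the ~49.7-day tick wrap.
pub fn idle_millis(now_ticks: u32, last_input_ticks: u32) -> u32 {
    now_ticks.wrapping_sub(last_input_ticks)
}

/// Active: < warm; WarmIdle: warm..=deep; DeepIdle: > deep.
pub fn classify_idle(idle_for: Duration, t: &IdleThresholds) -> IdleState {
    if idle_for > t.deep_idle {
        IdleState::DeepIdle
    } else if idle_for >= t.warm_idle {
        IdleState::WarmIdle
    } else {
        IdleState::Active
    }
}
```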
Alternative heuristics considered:
- Windows background mode (`PROCESS_MODE_BACKGROUND_BEGIN`): rejected due to aggressive working‑set trimming causing paging.
- CPU‑only heuristics: insufficient; user can be idle while other workloads are active.
- Screen savers: too coarse and not reliable in modern Windows.
The scheduler also samples system load (via sysinfo and Windows performance counters).
Example metrics:
pub struct SystemLoad {
pub cpu_percent: f32,
pub mem_used_percent: f32,
pub disk_busy: bool,
pub disk_bytes_per_sec: u64,
pub sample_duration: Duration,
}
Typical thresholds:
- CPU < 20% → content indexing OK (in DeepIdle).
- CPU 20–50% → metadata‑only work.
- CPU > 50% → pause all heavy indexing.
- Disk busy flag based on measured `Disk Bytes/sec`.
Disk I/O sampling often uses PDH counters like `\PhysicalDisk(_Total)\Disk Bytes/sec`.
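The CPU thresholds above map naturally onto a small decision function. In this sketch the limits mirror the `[scheduler]` defaults; treating a busy disk as "metadata only" is an assumption, since the text says only that content indexing needs low disk load. The idle state is deliberately left out and would be combined with this result by the caller.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkLevel {
    /// Content indexing is acceptable (still subject to DeepIdle).
    Content,
    /// Only metadata work should run.
    MetadataOnly,
    /// All heavy indexing is paused.
    Paused,
}

pub struct LoadLimits {
    pub cpu_soft_limit_pct: f32,
    pub cpu_hard_limit_pct: f32,
    pub disk_busy_bytes_per_s: u64,
}

impl Default for LoadLimits {
    fn default() -> Self {
        Self {
            cpu_soft_limit_pct: 20.0,
            cpu_hard_limit_pct: 50.0,
            disk_busy_bytes_per_s: 50_000_000,
        }
    }
}

pub fn work_level(cpu_percent: f32, disk_bytes_per_sec: u64, limits: &LoadLimits) -> WorkLevel {
    let disk_busy = disk_bytes_per_sec >= limits.disk_busy_bytes_per_s;
    if cpu_percent > limits.cpu_hard_limit_pct {
        WorkLevel::Paused
    } else if cpu_percent >= limits.cpu_soft_limit_pct || disk_busy {
        // Assumption: a busy disk demotes content work to metadata-only.
        WorkLevel::MetadataOnly
    } else {
        WorkLevel::Content
    }
}
```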
Jobs are split into distinct queues with different policies.
1. Critical updates (cheap)
- Delete events, basic renames, attribute changes.
- Always processed, even in Active state.
- Sub‑millisecond per job.
2. Metadata rebuilds (moderate)
- Volume rescan after USN gap or journal reset.
- Directory subtree reindex after config change.
- Allowed in WarmIdle+.
3. Content indexing (expensive)
- New/changed files needing extraction.
- Allowed only in DeepIdle and low CPU/disk load.
- Per‑batch worker processes.
Scheduler tick:
pub async fn tick(&mut self) {
    let idle_sample = self.idle.sample();
    let load = self.load.sample();
    self.update_status(idle_sample, load);
    let allow_content = self.allow_content_jobs(idle_sample, load);
    if allow_content && !self.content_jobs.is_empty() {
        let batch = self.pop_batch(self.config.content_batch_size);
        // `tick` returns `()`, so a failed spawn is logged here rather than propagated with `?`.
        if let Err(err) = self.spawn_worker(batch).await {
            tracing::warn!(error = %err, "failed to spawn content worker");
        }
    }
}
Why separate queues?
- Different SLAs per job type.
- Critical updates must not be starved by heavy jobs.
- Content jobs can be deferred without user‑visible regressions in filename search.
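The queue policies above amount to a gated priority pick. The following is a sketch under simplifying assumptions: jobs are bare `u64` IDs, `load_ok` stands for "low CPU/disk load", and `JobQueues` is an illustrative type, not the scheduler's real one.

```rust
use std::collections::VecDeque;

/// Declaration order gives Active < WarmIdle < DeepIdle via `Ord`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum IdleState {
    Active,
    WarmIdle,
    DeepIdle,
}

#[derive(Debug, PartialEq)]
pub enum Job {
    Critical(u64),
    Metadata(u64),
    Content(u64),
}

#[derive(Default)]
pub struct JobQueues {
    critical: VecDeque<u64>,
    metadata: VecDeque<u64>,
    content: VecDeque<u64>,
}

impl JobQueues {
    pub fn push(&mut self, job: Job) {
        match job {
            Job::Critical(id) => self.critical.push_back(id),
            Job::Metadata(id) => self.metadata.push_back(id),
            Job::Content(id) => self.content.push_back(id),
        }
    }

    /// Picks the next runnable job, or `None` if everything queued is gated.
    pub fn next_job(&mut self, idle: IdleState, load_ok: bool) -> Option<Job> {
        // 1. Critical updates always run, even while the user is active.
        if let Some(id) = self.critical.pop_front() {
            return Some(Job::Critical(id));
        }
        // 2. Metadata rebuilds need WarmIdle or deeper.
        if idle >= IdleState::WarmIdle {
            if let Some(id) = self.metadata.pop_front() {
                return Some(Job::Metadata(id));
            }
        }
        // 3. Content indexing needs DeepIdle and low system load.
        if idle == IdleState::DeepIdle && load_ok {
            if let Some(id) = self.content.pop_front() {
                return Some(Job::Content(id));
            }
        }
        None
    }
}
```

Because critical jobs are checked first on every pick, a long backlog of content jobs cannot starve them.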
Service process
- Runs at normal process priority.
- USN watcher and scheduler threads typically at `BELOW_NORMAL` thread priority.
- Avoids Windows background process mode due to aggressive working‑set clamping.
Worker processes
- Run at `IDLE` or `BELOW_NORMAL` priority.
- May adjust I/O priority per file handle.
- Can be placed in a job object to enforce limits.
Rationale: the service must stay responsive; workers are pure background.
UltraSearch uses Windows named pipes for IPC between the service, UI, and CLI.
- Named pipes via `tokio::net::windows::named_pipe` on the server side.
- Length‑prefixed messages: `[u32 length][payload bytes]`.
- Multiple concurrent client connections.
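The `[u32 length][payload bytes]` framing can be sketched over any `Read`/`Write` pair, which also covers a blocking pipe handle. The byte order (little‑endian) and the `MAX_FRAME_LEN` cap are assumptions made for this example; the README does not specify either.

```rust
use std::io::{self, Read, Write};

/// Assumed safety cap so a corrupt header cannot trigger a huge allocation.
pub const MAX_FRAME_LEN: usize = 16 * 1024 * 1024;

/// Writes one frame: a 4-byte little-endian length followed by the payload.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    let len = payload.len() as u32; // safe: bounded by MAX_FRAME_LEN above
    w.write_all(&len.to_le_bytes())?;
    w.write_all(payload)
}

/// Reads one frame; fails on a short read or an oversized length header.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    r.read_exact(&mut header)?;
    let len = u32::from_le_bytes(header) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

The payload itself would be the serialized request or response; the framing layer does not need to know its format.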
Why named pipes?
| Transport | Latency | Overhead | Complexity | Local Windows support |
|---|---|---|---|---|
| Named pipes | Low | Low | Simple | First‑class |
| TCP sockets | Higher | Higher | Medium | Good, but needs ports |
| Shared memory | Lowest | Very low | Complex | Manual protocols |
| Temp files | High | High | Simple | Awkward for IPC |
Named pipes are the natural fit for local Windows IPC: fast, well‑integrated, and firewall‑agnostic.
SearchRequest
pub struct SearchRequest {
pub id: Uuid,
pub query: QueryExpr, // parsed AST
pub limit: u32,
pub mode: SearchMode, // NameOnly, Content, Hybrid, Auto
pub timeout: Option<Duration>,
pub offset: u32, // pagination offset
}
SearchResponse
pub struct SearchResponse {
pub id: Uuid, // matches request
pub hits: Vec<SearchHit>,
pub total: u64,
pub truncated: bool,
pub took_ms: u32,
pub served_by: Option<String>, // host identity for debugging
}
Query AST
pub enum QueryExpr {
Term(TermExpr),
Range(RangeExpr),
Not(Box<QueryExpr>),
And(Vec<QueryExpr>),
Or(Vec<QueryExpr>),
}
pub struct TermExpr {
pub field: Option<FieldKind>, // name, path, ext, content, etc.
pub value: String,
pub modifier: TermModifier, // prefix, fuzzy, etc.
}
Serialization:
- Uses `bincode` for a compact binary representation.
- Typical payload sizes:
  - Requests: ~100–500 bytes.
  - Responses: ~1–10 KB (depends on `limit` and snippet sizes).
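To make the query AST's composition rules concrete, here is a deliberately simplified evaluator. It is an illustration only: the real service translates the AST into Tantivy queries, whereas this sketch drops `Range` and term modifiers and matches directly against a hypothetical `FileMeta` record.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldKind {
    Name,
    Ext,
}

/// Reduced version of the README's `QueryExpr` (no `Range`, no modifiers).
#[derive(Debug, Clone)]
pub enum QueryExpr {
    Term { field: FieldKind, value: String },
    Not(Box<QueryExpr>),
    And(Vec<QueryExpr>),
    Or(Vec<QueryExpr>),
}

/// Hypothetical record standing in for an indexed document.
pub struct FileMeta<'a> {
    pub name: &'a str,
    pub ext: &'a str,
}

/// Recursively evaluates the expression tree against one file.
pub fn eval(expr: &QueryExpr, file: &FileMeta) -> bool {
    match expr {
        QueryExpr::Term { field, value } => match field {
            // Case-insensitive substring match on the file name.
            FieldKind::Name => file.name.to_lowercase().contains(&value.to_lowercase()),
            // Case-insensitive exact match on the extension.
            FieldKind::Ext => file.ext.eq_ignore_ascii_case(value),
        },
        QueryExpr::Not(inner) => !eval(inner, file),
        QueryExpr::And(parts) => parts.iter().all(|p| eval(p, file)),
        QueryExpr::Or(parts) => parts.iter().any(|p| eval(p, file)),
    }
}
```

For example, "PDF files whose name does not contain draft" is `And([Term(Ext, "pdf"), Not(Term(Name, "draft"))])`.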
Service side
- Accepts multiple named pipe connections.
- One async task per client.
- Each task:
  - Reads length‑prefixed messages.
  - Executes queries against Tantivy readers.
  - Writes responses back.
UI side
- Hidden IPC thread with blocking I/O.
- Posts results into the GPUI application context.
- Keeps UI code free from heavy async concerns.
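The UI‑side pattern (a hidden thread doing blocking I/O and handing results back) can be sketched with standard channels. Here `roundtrip` is a stand‑in for the blocking pipe write and read, and `IpcThread` is an illustrative name; how results are actually posted into the GPUI context is not shown.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

pub struct IpcThread {
    requests: Sender<String>,
    results: Receiver<String>,
    handle: JoinHandle<()>,
}

impl IpcThread {
    /// Spawns the worker thread. `roundtrip` performs one blocking
    /// request/response exchange (a placeholder for the pipe I/O).
    pub fn spawn<F>(roundtrip: F) -> Self
    where
        F: Fn(&str) -> String + Send + 'static,
    {
        let (req_tx, req_rx) = mpsc::channel::<String>();
        let (res_tx, res_rx) = mpsc::channel::<String>();
        let handle = thread::spawn(move || {
            // Blocks until a request arrives; ends when the sender is dropped.
            for query in req_rx {
                if res_tx.send(roundtrip(&query)).is_err() {
                    break;
                }
            }
        });
        Self { requests: req_tx, results: res_rx, handle }
    }

    /// Called from the UI: queues a query without blocking.
    pub fn submit(&self, query: &str) {
        let _ = self.requests.send(query.to_string());
    }

    /// Non-blocking poll, suitable for a UI update callback.
    pub fn poll(&self) -> Option<String> {
        self.results.try_recv().ok()
    }

    /// Blocking wait, mainly useful in tests.
    pub fn wait(&self) -> Option<String> {
        self.results.recv().ok()
    }

    /// Closes the request channel and joins the worker thread.
    pub fn shutdown(self) {
        drop(self.requests);
        let _ = self.handle.join();
    }
}
```

The UI thread only ever calls `submit` and `poll`, so it never blocks on pipe I/O.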
UltraSearch exposes three main modes: NameOnly, Content, and Hybrid.
NameOnly is the default “Everything‑style” mode.
Query building:
fn build_meta_query(&self, expr: &QueryExpr) -> Result<Box<dyn Query>> {
match expr {
QueryExpr::Term(t) => {
let name_query = QueryParser::for_index(index, vec![fields.name])
.parse_query(&t.value)?;
let path_query = QueryParser::for_index(index, vec![fields.path])
.parse_query(&t.value)?;
Ok(Box::new(BooleanQuery::new(vec![
(Occur::Should, name_query),
(Occur::Should, path_query),
])))
}
// other cases...
}
}
Execution:
- Single Tantivy query against the metadata index.
- Sorted by relevance (BM25).
- Typical latency: < 10ms for common queries.
Use cases:
- “Locate file X by name.”
- Filters by extension, size, or date.
- Path‑scoped queries for specific trees.
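Content mode searches extracted file contents, for when you care about what is inside the files. In the sketch below, filename matches also contribute as an optional (`Should`) clause.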
fn build_content_query(&self, expr: &QueryExpr) -> Result<Box<dyn Query>> {
match expr {
QueryExpr::Term(t) => {
let name_query = QueryParser::for_index(index, vec![fields.name])
.parse_query(&t.value)?;
let content_query = QueryParser::for_index(index, vec![fields.content])
.parse_query(&t.value)?;
Ok(Box::new(BooleanQuery::new(vec![
(Occur::Should, name_query),
(Occur::Should, content_query),
])))
}
// ...
}
}
- Single Tantivy query against the content index.
- Latency: typically 50–200ms, depending on index size and filters.
Use cases:
- “Find the document that mentions `FooBarBaz`.”
- Source code search.
- Document content exploration.
Hybrid runs both queries and merges results by DocKey.
fn search_hybrid(&self, req: &SearchRequest) -> SearchResponse {
    // Over-fetch from each index before merging.
    let fetch_limit = req.limit * 2;
    // `meta_req` / `content_req`: construction elided in this sketch
    // (assumed to be `req` with `limit = fetch_limit`).
    let meta_resp = self.search_meta(&meta_req);
    let content_resp = self.search_content(&content_req);
    let mut hits_map: HashMap<DocKey, SearchHit> = HashMap::new();
    for hit in meta_resp.hits {
        hits_map.insert(hit.key, hit);
    }
    for hit in content_resp.hits {
        hits_map.entry(hit.key)
            .and_modify(|e| {
                e.score = e.score.max(hit.score); // max strategy
                if e.snippet.is_none() {
                    e.snippet = hit.snippet.clone();
                }
            })
            .or_insert(hit);
    }
    let mut merged: Vec<SearchHit> = hits_map.into_values().collect();
    // `total_cmp` is a total order, so a NaN score cannot panic the sort.
    merged.sort_by(|a, b| b.score.total_cmp(&a.score));
    // apply offset and limit
    // ...
}
Score strategy:
- Take `max(score_meta, score_content)` to avoid double‑counting.
- Alternative (sum): rewards documents that match in both indices, but can overweight them and is harder to reason about for some query types.
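The max‑score fusion can be exercised in isolation. This sketch assumes `DocKey` is a plain `u64` and adds a tie‑break on the key so the output order is deterministic; neither detail comes from the README.

```rust
use std::collections::HashMap;

pub type DocKey = u64; // assumption: the real DocKey may be a richer type

#[derive(Debug, Clone, PartialEq)]
pub struct SearchHit {
    pub key: DocKey,
    pub score: f32,
    pub snippet: Option<String>,
}

/// Merges metadata and content hits by key, keeping the higher score and
/// the first available snippet, then applies pagination.
pub fn merge_hits(
    meta: Vec<SearchHit>,
    content: Vec<SearchHit>,
    offset: usize,
    limit: usize,
) -> Vec<SearchHit> {
    let mut by_key: HashMap<DocKey, SearchHit> = HashMap::new();
    for hit in meta.into_iter().chain(content) {
        if let Some(existing) = by_key.get_mut(&hit.key) {
            existing.score = existing.score.max(hit.score); // max strategy
            if existing.snippet.is_none() {
                existing.snippet = hit.snippet;
            }
        } else {
            by_key.insert(hit.key, hit);
        }
    }
    let mut merged: Vec<SearchHit> = by_key.into_values().collect();
    // Highest score first; ties broken by key for a stable order.
    merged.sort_by(|a, b| b.score.total_cmp(&a.score).then(a.key.cmp(&b.key)));
    merged.into_iter().skip(offset).take(limit).collect()
}
```

Swapping `max` for addition in the marked line is all it takes to try the sum strategy with this sketch.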
Planned improvements:
- Parallel execution of meta+content queries.
- Configurable score fusion strategies.
- Better snippet selection.
UltraSearch uses TOML configuration with environment variable overrides.
Paths
[paths]
meta_index = "%PROGRAMDATA%\\UltraSearch\\index\\meta"
content_index = "%PROGRAMDATA%\\UltraSearch\\index\\content"
log_dir = "%PROGRAMDATA%\\UltraSearch\\logs"
state_dir = "%PROGRAMDATA%\\UltraSearch\\state"
jobs_dir = "%PROGRAMDATA%\\UltraSearch\\jobs"
Logging
[logging]
level = "info" # trace, debug, info, warn, error
format = "json" # json or text
file_enabled = true
file_rotation = "daily" # daily, size, never
max_size_mb = 100
retain = 7 # days
Scheduler
[scheduler]
idle_warm_seconds = 15
idle_deep_seconds = 60
cpu_soft_limit_pct = 20
cpu_hard_limit_pct = 50
disk_busy_bytes_per_s = 50_000_000
content_batch_size = 1000
max_records_per_tick = 10_000
usn_chunk_bytes = 1_048_576
Indexing
[indexing]
max_bytes_per_file = 16_777_216 # 16 MiB
max_chars_per_file = 100_000
extractous_enabled = true
ocr_enabled = false
ocr_max_pages = 10
Volumes
[volumes.1] # VolumeId 1
include_paths = ["C:\\Users", "C:\\Projects"]
exclude_paths = ["C:\\Users\\AppData"]
content_indexing = true
Feature flags
[features]
multi_tier_index = false
delta_index = false
adaptive_scheduler = false
doc_type_analyzers = false
semantic_search = false
plugin_system = false
log_dataset_mode = false
mem_opt_tuning = false
auto_tuning = false
Config reload
- Service exposes a control path to reload config.
- UI offers a “Reload config” action.
- Configs are validated on load.
  - Invalid configs are rejected.
  - Service continues running with the previous valid config.
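The "reject invalid, keep previous" reload behavior is a validate‑then‑swap pattern. In this sketch the field names mirror the `[scheduler]` table, while the specific validation rules and the `ConfigHolder` type are assumptions chosen for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct SchedulerConfig {
    pub idle_warm_seconds: u64,
    pub idle_deep_seconds: u64,
    pub cpu_soft_limit_pct: u8,
    pub cpu_hard_limit_pct: u8,
}

impl SchedulerConfig {
    /// Assumed sanity rules: thresholds must be ordered and percentages valid.
    pub fn validate(&self) -> Result<(), String> {
        if self.idle_warm_seconds >= self.idle_deep_seconds {
            return Err("idle_warm_seconds must be below idle_deep_seconds".into());
        }
        if self.cpu_soft_limit_pct >= self.cpu_hard_limit_pct || self.cpu_hard_limit_pct > 100 {
            return Err("cpu limits must satisfy soft < hard <= 100".into());
        }
        Ok(())
    }
}

pub struct ConfigHolder {
    active: SchedulerConfig,
}

impl ConfigHolder {
    pub fn new(initial: SchedulerConfig) -> Result<Self, String> {
        initial.validate()?;
        Ok(Self { active: initial })
    }

    pub fn active(&self) -> &SchedulerConfig {
        &self.active
    }

    /// Validates first, swaps second: on error the active config is untouched.
    pub fn reload(&mut self, candidate: SchedulerConfig) -> Result<(), String> {
        candidate.validate()?;
        self.active = candidate;
        Ok(())
    }
}
```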
Why TOML?
- Human‑friendly and supports comments.
- Hierarchical structures map well to UltraSearch’s config needs.
- Strong library support in Rust.
UltraSearch uses tracing for structured logs.
- Levels: `trace`, `debug`, `info`, `warn`, `error`.
- Formats:
  - JSON (file logs, machine‑readable).
  - Text (console, human‑readable).
Rotations:
- Daily by default.
- Optional size‑based rotation.
- Archives compressed and retained according to `retain` policy.
What gets logged:
- Volume discovery and mapping.
- USN journal state and gap handling.
- Index commits and merges.
- Worker spawn/exit/failure events.
- Per‑file extraction errors (with context).
- Query latencies (optionally).
- IPC connection lifecycle events.
An optional HTTP `/metrics` endpoint exposes Prometheus‑style metrics, e.g.:
# Search latency
ultrasearch_search_latency_ms_bucket{le="5"} 1234
ultrasearch_search_latency_ms_bucket{le="10"} 5678
ultrasearch_search_latency_ms_sum 123450
ultrasearch_search_latency_ms_count 10000
# Workers
ultrasearch_worker_cpu_percent 15.5
ultrasearch_worker_mem_bytes 67108864
# Queues
ultrasearch_queue_depth 42
ultrasearch_active_workers 2
# Index sizes
ultrasearch_indexed_files_total{type="meta"} 1000000
ultrasearch_indexed_files_total{type="content"} 500000
ultrasearch_index_size_bytes{type="meta"} 1073741824
This format integrates out of the box with Prometheus, Grafana, and similar tools.
The IPC protocol also exposes a status call that returns:
- Volume indexing status (indexed count, pending count, last USN).
- Index statistics (size, docs, recent commit times).
- Scheduler state (idle classification, CPU load, queue depths).
- Worker stats (active workers, last exit codes).
Used by:
- The UI for progress bars / status banners.
- External diagnostics and health checks.
Service
- Runs as `LocalSystem` or a dedicated service account.
- Needs backup/restore privileges for raw NTFS/USN access.
UI
- Runs as the current user.
- Communicates with the service only via named pipes.
- Does not require elevation.
Hardening:
- Tight ACLs on program and data directories to avoid DLL hijacking.
- Workers can be launched with restricted tokens.
- By default the service exposes no network listener and is entirely local; the optional HTTP `/metrics` endpoint is the one exception when enabled.
Index corruption
- Tantivy commits are atomic; on crash, you get either old or new state.
- On startup, a corrupt index is renamed to `*.broken` and rebuilt.
- Worst case: time spent re‑indexing; no silent data corruption.
USN journal wrap/recreation
- Detected using `journal_id` and USN range checks.
- Automatically schedules a volume rescan.
- No manual intervention needed.
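The journal checks above can be expressed as a pure function over a saved checkpoint and the journal's current state. The field names follow the Windows `USN_JOURNAL_DATA` structure (`UsnJournalID`, `FirstUsn`, `NextUsn`); `Checkpoint`, `Resume`, and `plan_resume` are illustrative names, and the exact rules are an assumption about what "USN range checks" means here.

```rust
/// What the service persisted after the last processed record.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Checkpoint {
    pub journal_id: u64,
    pub next_usn: i64,
}

/// Current journal state as reported by the volume.
pub struct JournalInfo {
    pub journal_id: u64,
    pub first_usn: i64,
    pub next_usn: i64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Resume {
    /// Continue tailing from this USN.
    From(i64),
    /// Incremental state is unusable; schedule a full volume rescan.
    Rescan(&'static str),
}

pub fn plan_resume(saved: Option<Checkpoint>, journal: &JournalInfo) -> Resume {
    let cp = match saved {
        Some(cp) => cp,
        None => return Resume::Rescan("no checkpoint"),
    };
    if cp.journal_id != journal.journal_id {
        // The journal was deleted and recreated: old USNs are meaningless.
        Resume::Rescan("journal recreated")
    } else if cp.next_usn < journal.first_usn {
        // Records between the checkpoint and FirstUsn were purged (wrap).
        Resume::Rescan("checkpoint older than journal start")
    } else if cp.next_usn > journal.next_usn {
        // A checkpoint ahead of the journal indicates a reset.
        Resume::Rescan("checkpoint ahead of journal")
    } else {
        Resume::From(cp.next_usn)
    }
}
```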
Power loss
- Writes are append‑then‑commit; partial batches are simply retried.
- No risk of partial commit corruption.
Worker crashes
- Service monitors worker exit status.
- Files that repeatedly cause crashes can be backoff‑blacklisted.
- Crash info is logged with context for debugging.
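One way to implement backoff blacklisting is a per‑file crash counter with exponentially growing retry delays. This is a sketch only: the `CrashTracker` type, the doubling schedule, and the crash limit are all assumptions, since the README does not describe the actual policy.

```rust
use std::collections::HashMap;
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Retry the file, but not before this delay has passed.
    Retry { after: Duration },
    /// Too many crashes: skip the file from now on.
    Blacklisted,
}

pub struct CrashTracker {
    max_crashes: u32,
    base_delay: Duration,
    crashes: HashMap<String, u32>,
}

impl CrashTracker {
    pub fn new(max_crashes: u32, base_delay: Duration) -> Self {
        Self { max_crashes, base_delay, crashes: HashMap::new() }
    }

    /// Records one worker crash attributed to `path` and decides what to do.
    pub fn record_crash(&mut self, path: &str) -> Verdict {
        let count = self.crashes.entry(path.to_string()).or_insert(0);
        *count += 1;
        if *count >= self.max_crashes {
            Verdict::Blacklisted
        } else {
            // Delay doubles per crash: 1x, 2x, 4x ... (shift capped at 16).
            let factor = 1u32 << (*count - 1).min(16);
            Verdict::Retry { after: self.base_delay * factor }
        }
    }

    pub fn is_blacklisted(&self, path: &str) -> bool {
        self.crashes.get(path).map_or(false, |c| *c >= self.max_crashes)
    }
}
```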
Initial metadata build
- Rate: 100k–1M files/sec, depending on disk.
- Memory: < 100MB during build.
- Example: 1M files → on the order of seconds on a modern SSD.
Incremental updates
- Per event: < 1ms typical.
- Throughput: 10k–100k events/sec possible.
Content extraction
- Plain text: ~100–1000 files/sec.
- PDF: ~10–100 files/sec.
- Office docs: ~5–50 files/sec.
- Bounded by worker heap and `max_bytes_per_file`.
Index commits
- Latency: 100–500ms per commit (batch‑dependent).
- Frequency: every N docs or fixed time windows for metadata; per‑batch for content.
Filename / NameOnly
- Latency: p50 < 10ms; p95 < 20ms.
- Throughput: 1000+ QPS on modest hardware.
Content
- Latency: 50–200ms typical; p95 < 500ms (index‑size dependent).
- Throughput: 100+ QPS on multi‑core.
Hybrid
- Latency roughly meta+content combined (future: parallelized).
- Throughput depends on fusion strategy and limits.
Service
- Idle: ~20–50MB RSS.
- Under load: ~30–80MB (index readers + caches).
Worker
- Heap: 64–256MB (configurable).
- Extractors, scratch buffers, etc. within that bound.
System‑wide
- Idle (no workers): < 100MB total.
- With active workers: target < 500MB.
- Multiple concurrent workers are bounded via writer heap + job object limits.
A few guiding principles show up everywhere in UltraSearch:
- The service is designed to be permanently running but small.
- Anything heavy (writers, extractors, OCR, big heaps) lives in workers that exit promptly.
- Use NTFS MFT and USN journal.
- Avoid naive directory walkers.
- Use Windows APIs where they provide the right primitive (named pipes, PDH counters, etc.).
- All search‑relevant data lives in Tantivy indices keyed by `DocKey`.
- No hidden SQL databases or bespoke stores.
- Small auxiliary mappings only where necessary (e.g., VolumeId ↔ GUID).
- Index only when user and system are idle.
- Use low process/thread/I/O priorities.
- Avoid clever but dangerous modes (like background working‑set clamping).
- Prefer memory‑mapped files over large in‑process heaps.
- Use zero‑copy serialization.
- Configure hard caps wherever possible.
This section is for developers building UltraSearch from source or hacking on it.
- Rust nightly toolchain (see `rust-toolchain.toml`).
- Windows SDK.
- Cargo (ships with Rust).
- For content extraction with Extractous: GraalVM CE 23.x (see `docs/GRAALVM_SETUP.md`).
# Build everything
cargo build --release
# Build specific binaries
cargo build --release -p service
cargo build --release -p index-worker
cargo build --release -p ui
cargo build --release -p cli
# Tests
cargo test --all-targets
# Lints
cargo clippy --all-targets -- -D warnings
# Formatting
cargo fmt --check
A Justfile contains common tasks:
# Run all quality gates
just
# Or individually:
just fmt # formatting
just lint # clippy
just test # tests
just check # compile check
just build # release build
- All code must pass:
  - `cargo fmt --check`
  - `cargo clippy --all-targets -- -D warnings`
- Run “UBS” (Ultimate Bug Scanner) on changed files before committing.
- Follow patterns in `RUST_BEST_PRACTICES_GUIDE.md`.
- Use workspace‑level dependency management; wildcards only within agreed policy.
ultrasearch/
├── Cargo.toml # Workspace root
├── rust-toolchain.toml # Nightly pin
├── Justfile # Development commands
├── AGENTS.md # Development guidance / meta
├── PROGRESS_REPORT.md
├── IMPLEMENTATION_STATUS.md
├── MODERN_UX_PLAN.md
├── PLAN_TO_BUILD_RUST_WINDOWS_FILE_EXPLORER_TOOL.md
├── UI_FIXES_NEEDED.md
├── UI_IMPLEMENTATION_FINAL_STATUS.md
├── COMPLETE_FIX_PLAN.md
├── COMPREHENSIVE_PLAN_TO_FIX_ALL_REMAINING_UI_UX_ISSUES_AND_FULLY_LEVERAGE_GPUI.md
├── RUST_BEST_PRACTICES_GUIDE.md
├── build_installer.ps1 # Script for building the Windows installer
├── docs/
│ ├── FEATURES.md
│ ├── GRAALVM_SETUP.md
│ └── ADVANCED_FEATURES.md
└── ultrasearch/
├── Cargo.toml # Nested workspace (if used)
└── crates/
├── core-types/ # Shared types, IDs, config
├── core-serialization/ # rkyv/bincode wrappers
├── ntfs-watcher/ # MFT + USN integration
├── meta-index/ # Metadata Tantivy index
├── content-index/ # Content Tantivy index
├── content-extractor/ # Extractous/IFilter/OCR stack
├── scheduler/ # Idle + load heuristics
├── service/ # Windows service host
├── index-worker/ # Batch worker binary
├── ipc/ # Named pipe client/server
├── ui/ # GPUI application
├── cli/ # CLI interface
└── semantic-index/ # Vector/semantic index (advanced)
Planned and experimental features are tracked in docs/ADVANCED_FEATURES.md. Highlights include:
- Multi‑tier index layout – hot/warm/cold tiers for both meta and content.
- In‑memory delta indices – ultra‑hot data in RAM layered over on‑disk segments.
- Document‑type‑aware analyzers – specialized analyzers for code, logs, docs.
- Query planner – AST rewrites, filter pushdown, smarter execution plans.
- Adaptive scheduler – feedback‑driven job scheduling and concurrency control.
- Hybrid semantic search – add vector search for semantic similarity over the existing BM25 stack.
- Plugin architecture – custom extractors and transforms at index time.
- Log file specialization – tailored handling for large append‑only logs.
- Memory optimization work – allocator choices, per‑component footprint tuning.
- Auto‑tuning – runtime heuristics that nudge config toward stable optima.
All of these are intended to be additive and opt‑in, and should not regress existing behavior.
UltraSearch is licensed under:
- MIT License (`LICENSE-MIT`, http://opensource.org/licenses/MIT)
Source code, issues, and discussions live here: