Skip to content

Hemant/rebuild updated#172

Open
hemant-endee wants to merge 9 commits intomasterfrom
hemant/rebuild_updated
Open

Hemant/rebuild updated#172
hemant-endee wants to merge 9 commits intomasterfrom
hemant/rebuild_updated

Conversation

@hemant-endee
Copy link
Copy Markdown
Collaborator

No description provided.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 9, 2026

VectorDB Benchmark — Dense — Passed

Triggered by @omnish-endee · Commit 4a3982f

Step Status
Provision Servers Up
Deploy Endee Server Done
Run Benchmark Passed
Results See below
Teardown Done ✓

No result file found on benchmark server.

@hemant-endee
Copy link
Copy Markdown
Collaborator Author

/correctness_benchmarking dense

3 similar comments
@omnish-endee
Copy link
Copy Markdown
Contributor

/correctness_benchmarking dense

@omnish-endee
Copy link
Copy Markdown
Contributor

/correctness_benchmarking dense

@omnish-endee
Copy link
Copy Markdown
Contributor

/correctness_benchmarking dense

Comment thread src/main.cpp
Comment thread src/main.cpp Outdated
Comment thread src/main.cpp Outdated
Comment thread src/core/ndd.hpp Outdated
Comment thread src/core/ndd.hpp
Comment thread src/core/ndd.hpp Outdated
Comment thread src/core/rebuild.hpp Outdated
Comment thread src/core/rebuild.hpp
Comment thread src/core/ndd.hpp Outdated
Comment thread src/core/ndd.hpp Outdated
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

VectorDB Benchmark - Ready To Run

CI Passed ([lint + unit tests] (https://github.com/endee-io/endee/actions/runs/24928018536)) - benchmark options unlocked.

Post one of the command below. Only members with write access can trigger runs.


Available Modes

Mode Command What runs
Dense /correctness_benchmarking dense HNSW insert throughput · query P50/P95/P99 · recall@10 · concurrent QPS
Hybrid /correctness_benchmarking hybrid Dense + sparse BM25 fusion · same suite + fusion latency overhead

Infrastructure

Server Role Instance
Endee Server Endee VectorDB — code from this branch t2.large
Benchmark Server Benchmark runner t3a.large

Both servers start on demand and are always terminated after the run — pass or fail.


How Correctness Benchmarking Works

1. Post /correctness_benchmarking <mode>
2. Endee Server Create  →  this branch's code deployed  →  Endee starts in chosen mode
3. Benchmark Server Create  →  benchmark suite transferred
4. Benchmark Server runs correctness benchmarking against Endee Server
5. Results posted back here  →  pass/fail + full metrics table
6. Both servers terminated   →  always, even on failure

After a new push, CI must pass again before this menu reappears.

@hemant-endee
Copy link
Copy Markdown
Collaborator Author

/correctness_benchmarking dense

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

VectorDB Benchmark — Dense — Failed

Triggered by @hemant-endee · Commit 117a81f

Step Status
Provision Servers Up
Deploy Endee Server Done
Run Benchmark Failed
Results See reason below
Teardown Done ✓

Benchmark job status: skipped

Comment thread src/core/rebuild.hpp Outdated
Comment on lines 19 to 24
struct RebuildResult {
bool success;
int http_code;
std::string message;
};

Copy link
Copy Markdown
Collaborator

@shaleenji shaleenji Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use unsigned char and std::string. declare a generic return struct which will be used everywhere in the codebase eventually.

for each function returning this struct, at the comments above the function definition will define what each return code means. This can be later conglomerated as ENUMS.

Comment thread src/main.cpp Outdated
int code = (message.find("already in progress") != std::string::npos) ? 409 : 400;
return json_error(code, message);
if (!result.success) {
return json_error(result.http_code, result.message);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not couple http codes to the core source. decide the http_code to be returned based on the return code of the function,.

Comment thread src/core/rebuild.hpp Outdated
Comment on lines 29 to 30
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont use atomics in status variables

Comment thread src/core/rebuild.hpp Outdated
Comment on lines 74 to 75
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove jthread from here

Comment thread src/core/rebuild.hpp Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont use string comparisons for this.

Comment thread src/core/rebuild.hpp Outdated
} catch (const std::exception&) {
// Silently ignore cleanup errors on startup
} catch (const std::exception& e) {
LOG_WARN(2053, "rebuild", "Failed to cleanup temp files on startup: " << e.what());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use 1800 series for rebuild logs. add it to docs/logs.md

@hemant-endee hemant-endee force-pushed the hemant/rebuild_updated branch from 117a81f to 6f4083e Compare April 23, 2026 16:09
…y + IndexManager to executeJob instead of unpacking fields
Comment thread src/core/rebuild.cpp
Comment on lines +29 to +50
void Rebuild::cleanupTempFiles(const std::string& data_dir) {
if (!std::filesystem::exists(data_dir)) {
return;
}
try {
std::string temp_filename = std::string(settings::DEFAULT_SUBINDEX) + ".idx.temp";
std::string ts_prefix = std::string(settings::DEFAULT_SUBINDEX) + ".idx.";
for (const auto& entry : std::filesystem::recursive_directory_iterator(data_dir)) {
if (!entry.is_regular_file()) continue;
const std::string fname = entry.path().filename().string();
bool is_temp = (fname == temp_filename);
bool is_ts = fname.size() > ts_prefix.size()
&& fname.substr(0, ts_prefix.size()) == ts_prefix
&& std::all_of(fname.begin() + ts_prefix.size(), fname.end(), ::isdigit);
if (is_temp || is_ts) {
std::filesystem::remove(entry.path());
}
}
} catch (const std::exception& e) {
LOG_WARN(1803, "rebuild", "Failed to cleanup temp files on startup: " << e.what());
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not throw error if there are not cleanup files left in the directory.

Comment thread src/core/rebuild.cpp
Comment on lines +236 to +237
std::filesystem::copy_file(p.timestamped_path, p.index_path,
std::filesystem::copy_options::overwrite_existing);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is renaming the index file not enough here ?

Comment thread src/core/rebuild.cpp
std::filesystem::copy_file(p.timestamped_path, p.index_path,
std::filesystem::copy_options::overwrite_existing);

// Cannot call reloadIndex() here — we hold operation_mutex and reloadIndex acquires
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

st.stop_requested() is not checked in phase 3. makes the shutdown time unbounded for large indexes.

Comment thread src/core/rebuild.cpp
Comment on lines +235 to +243
new_alg->saveIndex(p.timestamped_path);
std::filesystem::copy_file(p.timestamped_path, p.index_path,
std::filesystem::copy_options::overwrite_existing);

// Cannot call reloadIndex() here — we hold operation_mutex and reloadIndex acquires
// indices_mutex_, while deleteIndex holds indices_mutex_ then acquires operation_mutex.
// Calling reloadIndex here would deadlock with a concurrent delete on the same index.
auto fresh_alg = std::make_unique<hnswlib::HierarchicalNSW<float>>(p.index_path, 0);
IndexManager::wireVectorFetchers(fresh_alg.get(), entry->vector_storage);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_alg already has all the data and was just persisted.

can use entry->alg = std::move(new_alg) directly.

doesnt seem useful to re load again

Comment thread docs/rebuild.md
@@ -0,0 +1,124 @@
# Index Rebuild
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things that are missing in the doc. make it more comprehensive

Disk space: rebuild needs 2× the index size on disk (timestamped + canonical co-exist briefly).

Memory: 2× the graph in RAM during Phase 2 (old + new).

Expected duration: rough order of magnitude per million vectors helps capacity planning.

What "manually restart" means after a crash — operationally how does an operator know a rebuild was incomplete? (There's no persisted state.)

Comment thread src/core/rebuild.cpp
Comment on lines +160 to +162
std::string Rebuild::formatTime(std::chrono::system_clock::time_point tp) {
return timeToISO8601(tp);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no one is using this function

Comment thread src/core/ndd.hpp
rebuild_.setActiveRebuild(username, index_id, current_count);

// Spawn thread — lambda calls rebuild_.executeJob directly (execution lives in Rebuild)
std::jthread t([this, params = std::move(params)](std::stop_token st) mutable {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here mutable is not required. executeJob takes const RebuildJobParams&, so the lambda doesn't actually mutate

Comment thread src/core/rebuild.cpp
case RebuildStatus::IN_PROGRESS: return "in_progress";
case RebuildStatus::COMPLETED: return "completed";
case RebuildStatus::FAILED: return "failed";
default: return "unknown";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can never reach this line since RebuildStatus has only three defined values

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants