
Add an option to build components in parallel #6190

Merged
gaiksaya merged 3 commits into opensearch-project:main from gaiksaya:parallel on May 14, 2026

Conversation

@gaiksaya
Member

@gaiksaya gaiksaya commented May 7, 2026

Description

As of today, all components are built sequentially on Jenkins for OpenSearch, which takes approximately 2 hours. Ref

The depends_on in the manifest (example) is only utilized during incremental builds and NOT during regular builds.

This PR refactors the build logic to use depends_on to construct a dependency graph (DAG) and run builds in parallel on multiple threads, controlled by --parallel.
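
The sketch below illustrates the scheduling idea: a minimal, hypothetical Python implementation (function and parameter names are illustrative assumptions, not the PR's actual code) that derives a DAG from each component's depends_on list and submits a component to the thread pool as soon as all of its dependencies have completed.

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, List, Set

def build_all(depends_on: Dict[str, Set[str]],
              build_fn: Callable[[str], None],
              threads: int = 4) -> List[str]:
    # Invert the edges so that finishing a component can unlock its dependents.
    dependents: Dict[str, Set[str]] = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(name)

    unmet = {name: len(deps) for name, deps in depends_on.items()}
    failed: List[str] = []

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Seed the pool with every component that has no dependencies.
        running = {pool.submit(build_fn, name): name
                   for name, count in unmet.items() if count == 0}
        while running:
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                if future.exception() is not None:
                    failed.append(name)  # skipping of its dependents elided for brevity
                for child in dependents[name]:
                    unmet[child] -= 1
                    if unmet[child] == 0:
                        running[pool.submit(build_fn, child)] = child
    return failed

The resulting dependency graph: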

graph TD
    OpenSearch[/"OpenSearch"/]:::core

    %% Level 1 - No dependencies
    common-utils:::level1
    opensearch-learning-to-rank-base:::level1
    opensearch-remote-metadata-sdk:::level1
    job-scheduler:::level1
    security:::level1
    custom-codecs:::level1
    performance-analyzer:::level1
    query-insights:::level1
    opensearch-system-templates:::level1
    user-behavior-insights:::level1

    %% Level 2
    k-NN:::level2
    geospatial:::level2
    cross-cluster-replication:::level2
    ml-commons:::level2
    notifications-core:::level2
    notifications:::level2
    opensearch-observability:::level2
    opensearch-reports:::level2
    asynchronous-search:::level2
    anomaly-detection:::level2
    alerting:::level2
    index-management:::level2

    %% Level 3
    neural-search:::level3
    sql:::level3
    security-analytics:::level3

    %% Level 4
    flow-framework:::level4
    skills:::level4
    search-relevance:::level4

    %% Implicit deps from core
    OpenSearch -.-> common-utils
    OpenSearch -.-> opensearch-learning-to-rank-base
    OpenSearch -.-> opensearch-remote-metadata-sdk
    OpenSearch -.-> job-scheduler
    OpenSearch -.-> security
    OpenSearch -.-> custom-codecs
    OpenSearch -.-> performance-analyzer
    OpenSearch -.-> query-insights
    OpenSearch -.-> opensearch-system-templates
    OpenSearch -.-> user-behavior-insights

    %% Explicit dependencies
    common-utils --> k-NN
    security --> k-NN
    custom-codecs --> k-NN

    job-scheduler --> geospatial

    common-utils --> cross-cluster-replication

    common-utils --> ml-commons
    job-scheduler --> ml-commons
    opensearch-remote-metadata-sdk --> ml-commons
    security --> ml-commons

    ml-commons --> neural-search
    k-NN --> neural-search

    common-utils --> notifications-core
    common-utils --> notifications
    common-utils --> opensearch-observability

    common-utils --> opensearch-reports
    job-scheduler --> opensearch-reports

    ml-commons --> sql
    geospatial --> sql
    job-scheduler --> sql

    common-utils --> asynchronous-search

    common-utils --> anomaly-detection
    job-scheduler --> anomaly-detection

    common-utils --> alerting

    common-utils --> security-analytics
    alerting --> security-analytics
    job-scheduler --> security-analytics

    common-utils --> index-management
    job-scheduler --> index-management

    common-utils --> flow-framework
    opensearch-remote-metadata-sdk --> flow-framework
    ml-commons --> flow-framework
    k-NN --> flow-framework
    neural-search --> flow-framework

    job-scheduler --> skills
    anomaly-detection --> skills
    sql --> skills
    ml-commons --> skills

    job-scheduler --> search-relevance
    neural-search --> search-relevance
    k-NN --> search-relevance
    ml-commons --> search-relevance
    user-behavior-insights --> search-relevance

    %% Styles
    classDef core fill:#FFD700,stroke:#333,font-weight:bold
    classDef level1 fill:#90EE90,stroke:#333
    classDef level2 fill:#87CEEB,stroke:#333
    classDef level3 fill:#DDA0DD,stroke:#333
    classDef level4 fill:#FFA07A,stroke:#333

It defaults to 4 parallel threads to start with, and --parallel remains optional in the build process for now.
Below are the build stats for 3.6.0 builds when run as a sequential build, a parallel build with 4 threads, and a parallel build with 8 threads on Jenkins.

[Image: build-time comparison of sequential, 4-thread, and 8-thread builds]

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Sayali Gaikawad <gaiksaya@amazon.com>
@github-actions
Contributor

github-actions Bot commented May 7, 2026

PR Reviewer Guide 🔍

(Review updated until commit 0cfda32)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

When continue_on_error=False and a non-required component fails, the exception is re-raised at line 77, but the executor is not shut down before raising. This leaves worker threads running and can cause resource leaks or hanging processes. The shutdown logic at line 91 only executes when future.result() raises, not when the exception is raised directly from run_component.

if not continue_on_error or name in required_components:
    raise
return name
Possible Issue

The failed_plugins list is referenced at line 113 but is only initialized inside the if args.parallel branch at line 92 or within the else branch at line 107. If neither branch executes (which should not happen in normal flow but is structurally possible), this will raise NameError: name 'failed_plugins' is not defined.

if len(failed_plugins) > 0:
    logging.error(f"Failed plugins are {failed_plugins}")
Thread Safety

build_recorder is shared across all threads in _build_parallel without synchronization. If builder.build(build_recorder) or builder.export_artifacts(build_recorder) at lines 156-157 modify shared state in build_recorder, concurrent writes from multiple threads can corrupt data or cause race conditions.

builder.build(build_recorder)
builder.export_artifacts(build_recorder)

@github-actions
Contributor

github-actions Bot commented May 7, 2026

PR Code Suggestions ✨

Latest suggestions up to 0cfda32

Explore these optional code suggestions:

Category | Suggestion | Impact
General
Fix inconsistent failure tracking state

When a required component fails, the exception is re-raised but the component is
already added to the failed list. This could cause inconsistent state where the
component appears in failed even though execution was aborted. Move the
failed.append(name) inside the conditional to only track failures that are being
suppressed.

src/build_workflow/build_graph.py [72-78]

 except Exception:
+    completed_event[name].set()
+    if not continue_on_error or name in required_components:
+        with lock:
+            failed.append(name)
+        raise
     with lock:
         failed.append(name)
-        completed_event[name].set()
-    if not continue_on_error or name in required_components:
-        raise
     return name
Suggestion importance[1-10]: 8


Why: This identifies a real bug where failed.append(name) happens before the exception is re-raised for required components, causing inconsistent state. The suggested fix correctly moves the append inside the conditional to only track suppressed failures.

Medium
Ensure Python version compatibility

The cancel_futures=True keyword may not be available in all Python versions (it was
added in 3.9). Since the code doesn't specify a minimum Python version requirement, this
could cause runtime errors. Consider using wait=True instead or adding version checks.

src/build_workflow/build_graph.py [90-92]

 except Exception:
-    executor.shutdown(wait=False, cancel_futures=True)
+    executor.shutdown(wait=True)
     raise
Suggestion importance[1-10]: 7


Why: The cancel_futures=True parameter was added in Python 3.9, and using it without version checks could cause compatibility issues. However, changing to wait=True alters the shutdown behavior (waiting for all futures vs canceling them), which may not be the desired fix. A better approach would be version checking or documenting the Python requirement.

Medium
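
If the cancel-on-failure behavior is worth keeping, a version guard is another option. A hedged sketch (only the cancel_futures availability fact is from the Python docs; the surrounding names are assumed from the suggestion above):

import sys

try:
    ...
except Exception:
    if sys.version_info >= (3, 9):
        # cancel_futures was added to Executor.shutdown() in Python 3.9
        executor.shutdown(wait=False, cancel_futures=True)
    else:
        executor.shutdown(wait=False)
    raise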
Optimize dependency lookup performance

The dependency check if dep in self.components uses a list lookup which is O(n) per
dependency. For large component graphs, this becomes inefficient. Convert
self.components to a set for O(1) lookups, or maintain a separate set for tracking
added components.

src/build_workflow/build_graph.py [27-33]

 def add_component(self, name: str, depends_on: Optional[List[str]] = None) -> None:
     self.components.append(name)
     if depends_on:
+        component_set = set(self.components)
         for dep in depends_on:
-            if dep in self.components:
+            if dep in component_set:
                 self.dependencies[name].add(dep)
                 self.dependents[dep].add(name)
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies an O(n) list lookup inefficiency in if dep in self.components. Converting to a set lookup improves performance for large component graphs. However, the impact is moderate since most build graphs are relatively small.

Low

Previous suggestions

Suggestions up to commit 3960b14
Category | Suggestion | Impact
Possible issue
Protect checkout from race conditions

The builder.checkout(work_dir) call is not protected by any lock, which could cause
race conditions if multiple threads attempt to checkout to the same directory
simultaneously. Consider adding synchronization around checkout operations or using
component-specific work directories.

src/run_build.py [152-161]

 def build_component(name: str) -> None:
     component = component_map[name]
     logging.info(f"Building {component.name}")
 
     builder = Builders.builder_from(component, target)
-    builder.checkout(work_dir)
+    with recorder_lock:
+        builder.checkout(work_dir)
     builder.build(build_recorder)
     with recorder_lock:
         builder.export_artifacts(build_recorder)
     logging.info(f"Successfully built {component.name}")
Suggestion importance[1-10]: 8


Why: This identifies a critical concurrency issue where builder.checkout(work_dir) could cause race conditions when multiple threads checkout to the same directory. The suggested lock protection is important for thread safety in parallel builds.

Medium
General
Remove redundant executor shutdown call

Calling executor.shutdown(wait=False, cancel_futures=True) inside the context
manager is redundant and may cause issues. The context manager's exit will
handle cleanup automatically. Remove the explicit shutdown call to avoid potential
race conditions or double-cleanup errors.

src/build_workflow/build_graph.py [90-92]

 except Exception:
-    executor.shutdown(wait=False, cancel_futures=True)
     raise
Suggestion importance[1-10]: 7


Why: The explicit executor.shutdown() call inside the context manager is redundant since the context manager handles cleanup. Removing it prevents potential race conditions and follows best practices for context manager usage.

Medium
Optimize dependency lookup performance

The dependency check if dep in self.components uses a list lookup which is O(n) for
each dependency. For large component graphs, this creates O(n*m) complexity. Convert
self.components to a set for O(1) lookups, or check against self.dependencies.keys()
instead.

src/build_workflow/build_graph.py [27-33]

 def add_component(self, name: str, depends_on: Optional[List[str]] = None) -> None:
     self.components.append(name)
     if depends_on:
+        existing = set(self.components)
         for dep in depends_on:
-            if dep in self.components:
+            if dep in existing:
                 self.dependencies[name].add(dep)
                 self.dependents[dep].add(name)
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies a performance issue with O(n) list lookups in self.components. Converting to a set for lookups improves efficiency, though the impact is moderate unless dealing with very large component graphs.

Low
Suggestions up to commit fc0d08e
Category | Suggestion | Impact
Possible issue
Ensure dependency events are always set to prevent deadlock

The wait_for_deps function only checks if the immediate dependency failed, but does
not propagate failure if a transitive dependency failed and was skipped (added to
failed without raising). A component whose direct dependency was skipped (not failed
itself but skipped due to its own dep failure) will still be marked failed in
failed, so this works transitively — but the logic relies on skipped components also
being added to failed. This is correct as written, but the function name and comment
should clarify that "failed" includes skipped components to avoid future confusion.
More critically, if completed_event[dep] is never set (e.g., due to an unhandled
exception path), wait() will block forever. Ensure all code paths in run_component
always call completed_event[name].set().

src/build_workflow/build_graph.py [52-58]

-def wait_for_deps(name: str) -> bool:
-    for dep in self.dependencies[name]:
-        completed_event[dep].wait()
-        with lock:
-            if dep in failed:
-                return False
-    return True
+def run_component(name: str) -> str:
+    try:
+        if not wait_for_deps(name):
+            with lock:
+                failed.append(name)
+            logging.info(f"Skipping {name} because a dependency failed.")
+            return name
 
+        try:
+            build_fn(name)
+            return ""
+        except Exception:
+            with lock:
+                failed.append(name)
+            if not continue_on_error and name in critical_components:
+                raise
+            return name
+    finally:
+        completed_event[name].set()
+
Suggestion importance[1-10]: 8


Why: This is a valid and important bug fix — if an unhandled exception escapes run_component, completed_event[name].set() may never be called, causing dependent components to block forever. Using a finally block to guarantee the event is set prevents potential deadlocks.

Medium
Potential race condition in shared build recorder access

The builder.checkout and builder.build calls in build_component are not protected by
recorder_lock, but they share the same work_dir path for all parallel components. If
checkout writes to a shared location or build reads/writes shared state in
build_recorder, this can cause race conditions. Consider also protecting
builder.checkout and builder.build with the lock, or ensure each component uses an
isolated working directory.

src/run_build.py [158-161]

 builder.checkout(work_dir)
 builder.build(build_recorder)
 with recorder_lock:
     builder.export_artifacts(build_recorder)
+    # Note: If build_recorder is mutated during build(), protect that call too.
Suggestion importance[1-10]: 3


Why: The suggestion points out a potential race condition with build_recorder during builder.build(), but the improved_code is essentially the same as existing_code with just a comment added. The actual fix (adding a lock around builder.build) is not implemented in the improved code, making this suggestion low-value.

Low
General
Fragile core component name derivation may silently skip core build

If core_name is not in component_map (e.g., for a Dashboards-only build), the core
is silently skipped and all components are built in parallel. However,
remaining_components will include all selected components, which may be correct. But
if core_name is in selected_names but not component_map, this could be a silent bug.
More importantly, the core_name derivation using replace(" ", "-") is fragile — for
example, "OpenSearch Dashboards" becomes "OpenSearch-Dashboards", which may or may
not match the actual component name. Consider using a more robust lookup or an
explicit configuration.

src/run_build.py [129-145]

 core_name = manifest.build.name.replace(" ", "-")
+if core_name not in component_map:
+    logging.warning(f"Core component '{core_name}' not found in selected components; skipping sequential core build.")
+
 if core_name in component_map:
-    core_component = component_map[core_name]
     ...
-    raise
 
-# Now build remaining components in parallel
-remaining_components = [c for c in selected_components if c.name != core_name]
-
Suggestion importance[1-10]: 4


Why: The concern about fragile core_name derivation using replace(" ", "-") is valid, but the improved_code only adds a warning log without actually fixing the fragility. The suggestion is more of a style/robustness improvement rather than a critical fix.

Low
Critical failures may be silently omitted from returned failure list

When continue_on_error=False and a critical component fails inside execute_parallel,
the exception is re-raised and propagates out of execute_parallel directly — so the
post-execution check for critical_failures is only reached when
continue_on_error=True. However, when continue_on_error=True, critical component
failures are silently swallowed by execute_parallel (since continue_on_error=True
suppresses the raise for all components including critical ones in
execute_parallel). The _build_parallel function then raises after the fact, but the
failed list returned to the caller will not include critical failures (they are
filtered out in the return statement). This means the caller's failed_plugins list
will be incomplete. Consider returning all failures or handling this more
explicitly.

src/run_build.py [170-175]

 if not args.continue_on_error:
     critical_failures = [f for f in failed if f in CRITICAL_COMPONENTS]
     if critical_failures:
         raise Exception(f"Critical component(s) failed: {critical_failures}")
 
+# Return all non-critical failures; critical failures already raised above if not continue_on_error
 return [f for f in failed if f not in CRITICAL_COMPONENTS]
Suggestion importance[1-10]: 4


Why: The analysis of the control flow is partially correct — when continue_on_error=True, critical failures are included in failed but filtered out of the return value. However, the improved_code is identical to existing_code with only a comment added, so no actual fix is provided.

Low

@codecov

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 99.34426% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.66%. Comparing base (4321f76) to head (0cfda32).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
src/run_build.py | 98.18% | 1 Missing ⚠️
tests/test_run_build.py | 98.70% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6190      +/-   ##
==========================================
+ Coverage   96.62%   96.66%   +0.04%     
==========================================
  Files         405      407       +2     
  Lines       19039    19329     +290     
==========================================
+ Hits        18397    18685     +288     
- Misses        642      644       +2     

☔ View full report in Codecov by Sentry.

Member

@peterzhuamazon peterzhuamazon left a comment


Hi @gaiksaya
Generally happy with this change but with a few comments.

  1. The internet might throttle, especially on a home network; downloading multiple repos and then pulling dependencies from Maven/npm puts pressure on both the host machine and the remote repositories (we have already seen Maven/npm block requests multiple times due to too many parallel connections from the same IP).
  2. Whether the host has enough space/CPU to run this. In particular, there is a side effect when used with --keep that can silently grow the temp folder without the user knowing.
  3. Suggest making the default 2 threads when using --parallel; it is good that the default is still non-parallel, so this does not break existing behavior.

Thanks.

Comment thread src/run_build.py Outdated
from system import console
from system.temporary_directory import TemporaryDirectory

CRITICAL_COMPONENTS = ['OpenSearch', 'job-scheduler', 'common-utils', 'OpenSearch-Dashboards']
Member


We should add opensearch-remote-metadata-sdk as a critical component.

Member Author

@gaiksaya gaiksaya May 7, 2026


I did not touch what we already had, but it looks like only 2 components rely on it, so I'm not sure it should be critical.

Member


Is the purpose of this list that these components must succeed to create a build even if --continue-on-error is passed in? Might be worth adding a comment, and/or renaming them to "required components"

Member Author


Yes. Sure let me rename the variable.

@gaiksaya
Member Author

gaiksaya commented May 7, 2026

  1. The internet might throttle, especially on a home network; downloading multiple repos and then pulling dependencies from Maven/npm puts pressure on both the host machine and the remote repositories (we have already seen Maven/npm block requests multiple times due to too many parallel connections from the same IP).

Most of our own dependencies are pulled from local Maven. Maybe we can monitor for some time before making the call to move to 2? At least for all test runs and local builds, I did not see an issue so far.

  2. Whether the host has enough space/CPU to run this. In particular, there is a side effect when used with --keep that can silently grow the temp folder without the user knowing.

We are not using --keep in CI/CD, so I'm focusing on that. It's the same temp dir regardless of sequential or parallel, but I could add a warning if you like.

  3. Suggest making the default 2 threads when using --parallel; it is good that the default is still non-parallel, so this does not break existing behavior.

I would like to test with 4 and decrease/increase as required. Running 2 in parallel does not provide much value IMO.


Comment thread src/run_build.py
return 0


def _build_parallel(manifest: InputManifest, target: BuildTarget, build_recorder: BuildRecorder, work_dir: str, args: BuildArgs, components: list) -> list:
Member


My AI reviewing partner tells me that BuildRecorder is not thread safe but is being mutated concurrently here.

Member Author

@gaiksaya gaiksaya May 7, 2026


Looks like the diff analyzer has the same review. Removed the lock on BuildRecorder altogether.
record_component() writes to components_hash[name] and record_artifact() appends to components_hash[name]["artifacts"]. Each parallel thread only operates on its own component name as the key, so there's no shared mutable state between threads; the data is naturally partitioned by component name.
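
A hedged illustration of that partitioning argument (not the actual BuildRecorder code): each worker thread writes only under its own component key, and in CPython a single dict assignment or list append is atomic under the GIL.

class Recorder:
    def __init__(self) -> None:
        self.components_hash: dict = {}

    def record_component(self, name: str, data: dict) -> None:
        # `name` is unique per worker thread, so no two threads write the same key.
        self.components_hash[name] = {**data, "artifacts": []}

    def record_artifact(self, name: str, artifact: str) -> None:
        # Appends only to this component's own artifact list, created above.
        self.components_hash[name]["artifacts"].append(artifact)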

Comment thread src/run_build.py Outdated
if critical_failures:
    raise Exception(f"Critical component(s) failed: {critical_failures}")

return [f for f in failed if f not in CRITICAL_COMPONENTS]
Member


What happens if --continue-on-error is specified and one of the critical components fails?

Member Author


It hard fails; --continue-on-error does not matter at this point. Looks like there was a bug with this, fixed in a recent commit: when a required component fails, the exception is raised no matter what. For other components, it is only raised when --continue-on-error is false.
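
A condensed sketch of that policy (names assumed, not the committed code):

try:
    build_fn(name)
except Exception:
    failed.append(name)
    # Required components always abort the build; others abort only
    # when --continue-on-error was not passed.
    if name in REQUIRED_COMPONENTS or not continue_on_error:
        raise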

Signed-off-by: Sayali Gaikawad <gaiksaya@amazon.com>
@github-actions
Contributor

github-actions Bot commented May 7, 2026

Persistent review updated to latest commit 3960b14

Signed-off-by: Sayali Gaikawad <gaiksaya@amazon.com>
@github-actions
Contributor

github-actions Bot commented May 7, 2026

Persistent review updated to latest commit 0cfda32

@gaiksaya gaiksaya enabled auto-merge (squash) May 12, 2026 18:44
@gaiksaya gaiksaya disabled auto-merge May 14, 2026 03:13
@gaiksaya gaiksaya merged commit 03d2dca into opensearch-project:main May 14, 2026
17 checks passed
@gaiksaya gaiksaya deleted the parallel branch May 14, 2026 03:13
@github-project-automation github-project-automation Bot moved this from 👀 In Review to ✅ Done in Engineering Effectiveness Board May 14, 2026

Projects

Status: ✅ Done
