
Add an option to build components in parallel #6190

Merged
gaiksaya merged 3 commits into opensearch-project:main from gaiksaya:parallel on May 14, 2026

Conversation

@gaiksaya
Member

@gaiksaya gaiksaya commented May 7, 2026

Description

As of today, all components are built sequentially on Jenkins for OpenSearch, which takes approximately 2 hours. Ref

The depends_on in the manifest (example) is only utilized during incremental builds and NOT during regular builds.

This PR refactors the build logic to use depends_on to construct a dependency graph (DAG) and run builds in parallel on multiple threads, controlled by --parallel.
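
The sketch below illustrates the scheduling idea: a minimal, hypothetical Python implementation (function and parameter names are illustrative assumptions, not the PR's actual code) that derives a DAG from each component's depends_on list and submits a component to the thread pool as soon as all of its dependencies have completed.

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, List, Set

def build_all(depends_on: Dict[str, Set[str]],
              build_fn: Callable[[str], None],
              threads: int = 4) -> List[str]:
    # Invert the edges so that finishing a component can unlock its dependents.
    dependents: Dict[str, Set[str]] = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(name)

    unmet = {name: len(deps) for name, deps in depends_on.items()}
    failed: List[str] = []

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Seed the pool with every component that has no dependencies.
        running = {pool.submit(build_fn, name): name
                   for name, count in unmet.items() if count == 0}
        while running:
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                if future.exception() is not None:
                    failed.append(name)  # skipping of its dependents elided for brevity
                for child in dependents[name]:
                    unmet[child] -= 1
                    if unmet[child] == 0:
                        running[pool.submit(build_fn, child)] = child
    return failed

The resulting dependency graph: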

graph TD
    OpenSearch[/"OpenSearch"/]:::core

    %% Level 1 - No dependencies
    common-utils:::level1
    opensearch-learning-to-rank-base:::level1
    opensearch-remote-metadata-sdk:::level1
    job-scheduler:::level1
    security:::level1
    custom-codecs:::level1
    performance-analyzer:::level1
    query-insights:::level1
    opensearch-system-templates:::level1
    user-behavior-insights:::level1

    %% Level 2
    k-NN:::level2
    geospatial:::level2
    cross-cluster-replication:::level2
    ml-commons:::level2
    notifications-core:::level2
    notifications:::level2
    opensearch-observability:::level2
    opensearch-reports:::level2
    asynchronous-search:::level2
    anomaly-detection:::level2
    alerting:::level2
    index-management:::level2

    %% Level 3
    neural-search:::level3
    sql:::level3
    security-analytics:::level3

    %% Level 4
    flow-framework:::level4
    skills:::level4
    search-relevance:::level4

    %% Implicit deps from core
    OpenSearch -.-> common-utils
    OpenSearch -.-> opensearch-learning-to-rank-base
    OpenSearch -.-> opensearch-remote-metadata-sdk
    OpenSearch -.-> job-scheduler
    OpenSearch -.-> security
    OpenSearch -.-> custom-codecs
    OpenSearch -.-> performance-analyzer
    OpenSearch -.-> query-insights
    OpenSearch -.-> opensearch-system-templates
    OpenSearch -.-> user-behavior-insights

    %% Explicit dependencies
    common-utils --> k-NN
    security --> k-NN
    custom-codecs --> k-NN

    job-scheduler --> geospatial

    common-utils --> cross-cluster-replication

    common-utils --> ml-commons
    job-scheduler --> ml-commons
    opensearch-remote-metadata-sdk --> ml-commons
    security --> ml-commons

    ml-commons --> neural-search
    k-NN --> neural-search

    common-utils --> notifications-core
    common-utils --> notifications
    common-utils --> opensearch-observability

    common-utils --> opensearch-reports
    job-scheduler --> opensearch-reports

    ml-commons --> sql
    geospatial --> sql
    job-scheduler --> sql

    common-utils --> asynchronous-search

    common-utils --> anomaly-detection
    job-scheduler --> anomaly-detection

    common-utils --> alerting

    common-utils --> security-analytics
    alerting --> security-analytics
    job-scheduler --> security-analytics

    common-utils --> index-management
    job-scheduler --> index-management

    common-utils --> flow-framework
    opensearch-remote-metadata-sdk --> flow-framework
    ml-commons --> flow-framework
    k-NN --> flow-framework
    neural-search --> flow-framework

    job-scheduler --> skills
    anomaly-detection --> skills
    sql --> skills
    ml-commons --> skills

    job-scheduler --> search-relevance
    neural-search --> search-relevance
    k-NN --> search-relevance
    ml-commons --> search-relevance
    user-behavior-insights --> search-relevance

    %% Styles
    classDef core fill:#FFD700,stroke:#333,font-weight:bold
    classDef level1 fill:#90EE90,stroke:#333
    classDef level2 fill:#87CEEB,stroke:#333
    classDef level3 fill:#DDA0DD,stroke:#333
    classDef level4 fill:#FFA07A,stroke:#333

It defaults to 4 parallel threads to start with, and --parallel remains optional in the build process for now.
Below are the build stats for 3.6.0 builds when run as a sequential build, a parallel build with 4 threads, and a parallel build with 8 threads on Jenkins.

[Image: build-time comparison of sequential, 4-thread, and 8-thread builds]

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Sayali Gaikawad <gaiksaya@amazon.com>
@github-actions
Contributor

github-actions Bot commented May 7, 2026

PR Reviewer Guide 🔍

(Review updated until commit 0cfda32)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

When continue_on_error=False and a non-required component fails, the exception is re-raised at line 77, but the executor is not shut down before raising. This leaves worker threads running and can cause resource leaks or hanging processes. The shutdown logic at line 91 only executes when future.result() raises, not when the exception is raised directly from run_component.

if not continue_on_error or name in required_components:
    raise
return name
Possible Issue

The failed_plugins list is referenced at line 113 but is only initialized inside the if args.parallel branch at line 92 or within the else branch at line 107. If neither branch executes (which should not happen in normal flow but is structurally possible), this will raise NameError: name 'failed_plugins' is not defined.

if len(failed_plugins) > 0:
    logging.error(f"Failed plugins are {failed_plugins}")
Thread Safety

build_recorder is shared across all threads in _build_parallel without synchronization. If builder.build(build_recorder) or builder.export_artifacts(build_recorder) at lines 156-157 modify shared state in build_recorder, concurrent writes from multiple threads can corrupt data or cause race conditions.

builder.build(build_recorder)
builder.export_artifacts(build_recorder)

@github-actions
Contributor

github-actions Bot commented May 7, 2026

PR Code Suggestions ✨

Latest suggestions up to 0cfda32

Explore these optional code suggestions:

Category | Suggestion | Impact
General
Fix inconsistent failure tracking state

When a required component fails, the exception is re-raised but the component is
already added to the failed list. This could cause inconsistent state where the
component appears in failed even though execution was aborted. Move the
failed.append(name) inside the conditional to only track failures that are being
suppressed.

src/build_workflow/build_graph.py [72-78]

 except Exception:
+    completed_event[name].set()
+    if not continue_on_error or name in required_components:
+        with lock:
+            failed.append(name)
+        raise
     with lock:
         failed.append(name)
-        completed_event[name].set()
-    if not continue_on_error or name in required_components:
-        raise
     return name
Suggestion importance[1-10]: 8


Why: This identifies a real bug where failed.append(name) happens before the exception is re-raised for required components, causing inconsistent state. The suggested fix correctly moves the append inside the conditional to only track suppressed failures.

Medium
Ensure Python version compatibility

The cancel_futures=True keyword may not be available in all Python versions (it was
added in 3.9). Since the code doesn't specify a minimum Python version requirement, this
could cause runtime errors. Consider using wait=True instead or adding version checks.

src/build_workflow/build_graph.py [90-92]

 except Exception:
-    executor.shutdown(wait=False, cancel_futures=True)
+    executor.shutdown(wait=True)
     raise
Suggestion importance[1-10]: 7


Why: The cancel_futures=True parameter was added in Python 3.9, and using it without version checks could cause compatibility issues. However, changing to wait=True alters the shutdown behavior (waiting for all futures vs canceling them), which may not be the desired fix. A better approach would be version checking or documenting the Python requirement.

Medium
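
If the cancel-on-failure behavior is worth keeping, a version guard is another option. A hedged sketch (only the cancel_futures availability fact is from the Python docs; the surrounding names are assumed from the suggestion above):

import sys

try:
    ...
except Exception:
    if sys.version_info >= (3, 9):
        # cancel_futures was added to Executor.shutdown() in Python 3.9
        executor.shutdown(wait=False, cancel_futures=True)
    else:
        executor.shutdown(wait=False)
    raise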
Optimize dependency lookup performance

The dependency check if dep in self.components uses a list lookup which is O(n) per
dependency. For large component graphs, this becomes inefficient. Convert
self.components to a set for O(1) lookups, or maintain a separate set for tracking
added components.

src/build_workflow/build_graph.py [27-33]

 def add_component(self, name: str, depends_on: Optional[List[str]] = None) -> None:
     self.components.append(name)
     if depends_on:
+        component_set = set(self.components)
         for dep in depends_on:
-            if dep in self.components:
+            if dep in component_set:
                 self.dependencies[name].add(dep)
                 self.dependents[dep].add(name)
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies an O(n) list lookup inefficiency in if dep in self.components. Converting to a set lookup improves performance for large component graphs. However, the impact is moderate since most build graphs are relatively small.

Low

Previous suggestions

Suggestions up to commit 3960b14
Category | Suggestion | Impact
Possible issue
Protect checkout from race conditions

The builder.checkout(work_dir) call is not protected by any lock, which could cause
race conditions if multiple threads attempt to checkout to the same directory
simultaneously. Consider adding synchronization around checkout operations or using
component-specific work directories.

src/run_build.py [152-161]

 def build_component(name: str) -> None:
     component = component_map[name]
     logging.info(f"Building {component.name}")
 
     builder = Builders.builder_from(component, target)
-    builder.checkout(work_dir)
+    with recorder_lock:
+        builder.checkout(work_dir)
     builder.build(build_recorder)
     with recorder_lock:
         builder.export_artifacts(build_recorder)
     logging.info(f"Successfully built {component.name}")
Suggestion importance[1-10]: 8


Why: This identifies a critical concurrency issue where builder.checkout(work_dir) could cause race conditions when multiple threads checkout to the same directory. The suggested lock protection is important for thread safety in parallel builds.

Medium
General
Remove redundant executor shutdown call

Calling executor.shutdown(wait=False, cancel_futures=True) inside the context
manager is redundant and may cause issues. The context manager's exit will
handle cleanup automatically. Remove the explicit shutdown call to avoid potential
race conditions or double-cleanup errors.

src/build_workflow/build_graph.py [90-92]

 except Exception:
-    executor.shutdown(wait=False, cancel_futures=True)
     raise
Suggestion importance[1-10]: 7


Why: The explicit executor.shutdown() call inside the context manager is redundant since the context manager handles cleanup. Removing it prevents potential race conditions and follows best practices for context manager usage.

Medium
Optimize dependency lookup performance

The dependency check if dep in self.components uses a list lookup which is O(n) for
each dependency. For large component graphs, this creates O(n*m) complexity. Convert
self.components to a set for O(1) lookups, or check against self.dependencies.keys()
instead.

src/build_workflow/build_graph.py [27-33]

 def add_component(self, name: str, depends_on: Optional[List[str]] = None) -> None:
     self.components.append(name)
     if depends_on:
+        existing = set(self.components)
         for dep in depends_on:
-            if dep in self.components:
+            if dep in existing:
                 self.dependencies[name].add(dep)
                 self.dependents[dep].add(name)
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies a performance issue with O(n) list lookups in self.components. Converting to a set for lookups improves efficiency, though the impact is moderate unless dealing with very large component graphs.

Low
Suggestions up to commit fc0d08e
Category | Suggestion | Impact
Possible issue
Ensure dependency events are always set to prevent deadlock

The wait_for_deps function only checks if the immediate dependency failed, but does
not propagate failure if a transitive dependency failed and was skipped (added to
failed without raising). A component whose direct dependency was skipped (not failed
itself but skipped due to its own dep failure) will still be marked failed in
failed, so this works transitively — but the logic relies on skipped components also
being added to failed. This is correct as written, but the function name and comment
should clarify that "failed" includes skipped components to avoid future confusion.
More critically, if completed_event[dep] is never set (e.g., due to an unhandled
exception path), wait() will block forever. Ensure all code paths in run_component
always call completed_event[name].set().

src/build_workflow/build_graph.py [52-58]

-def wait_for_deps(name: str) -> bool:
-    for dep in self.dependencies[name]:
-        completed_event[dep].wait()
-        with lock:
-            if dep in failed:
-                return False
-    return True
+def run_component(name: str) -> str:
+    try:
+        if not wait_for_deps(name):
+            with lock:
+                failed.append(name)
+            logging.info(f"Skipping {name} because a dependency failed.")
+            return name
 
+        try:
+            build_fn(name)
+            return ""
+        except Exception:
+            with lock:
+                failed.append(name)
+            if not continue_on_error and name in critical_components:
+                raise
+            return name
+    finally:
+        completed_event[name].set()
+
Suggestion importance[1-10]: 8


Why: This is a valid and important bug fix — if an unhandled exception escapes run_component, completed_event[name].set() may never be called, causing dependent components to block forever. Using a finally block to guarantee the event is set prevents potential deadlocks.

Medium
Potential race condition in shared build recorder access

The builder.checkout and builder.build calls in build_component are not protected by
recorder_lock, but they share the same work_dir path for all parallel components. If
checkout writes to a shared location or build reads/writes shared state in
build_recorder, this can cause race conditions. Consider also protecting
builder.checkout and builder.build with the lock, or ensure each component uses an
isolated working directory.

src/run_build.py [158-161]

 builder.checkout(work_dir)
 builder.build(build_recorder)
 with recorder_lock:
     builder.export_artifacts(build_recorder)
+    # Note: If build_recorder is mutated during build(), protect that call too.
Suggestion importance[1-10]: 3


Why: The suggestion points out a potential race condition with build_recorder during builder.build(), but the improved_code is essentially the same as existing_code with just a comment added. The actual fix (adding a lock around builder.build) is not implemented in the improved code, making this suggestion low-value.

Low
General
Fragile core component name derivation may silently skip core build

If core_name is not in component_map (e.g., for a Dashboards-only build), the core
is silently skipped and all components are built in parallel. However,
remaining_components will include all selected components, which may be correct. But
if core_name is in selected_names but not component_map, this could be a silent bug.
More importantly, the core_name derivation using replace(" ", "-") is fragile — for
example, "OpenSearch Dashboards" becomes "OpenSearch-Dashboards", which may or may
not match the actual component name. Consider using a more robust lookup or an
explicit configuration.

src/run_build.py [129-145]

 core_name = manifest.build.name.replace(" ", "-")
+if core_name not in component_map:
+    logging.warning(f"Core component '{core_name}' not found in selected components; skipping sequential core build.")
+
 if core_name in component_map:
-    core_component = component_map[core_name]
     ...
-    raise
 
-# Now build remaining components in parallel
-remaining_components = [c for c in selected_components if c.name != core_name]
-
Suggestion importance[1-10]: 4


Why: The concern about fragile core_name derivation using replace(" ", "-") is valid, but the improved_code only adds a warning log without actually fixing the fragility. The suggestion is more of a style/robustness improvement rather than a critical fix.

Low
Critical failures may be silently omitted from returned failure list

When continue_on_error=False and a critical component fails inside execute_parallel,
the exception is re-raised and propagates out of execute_parallel directly — so the
post-execution check for critical_failures is only reached when
continue_on_error=True. However, when continue_on_error=True, critical component
failures are silently swallowed by execute_parallel (since continue_on_error=True
suppresses the raise for all components including critical ones in
execute_parallel). The _build_parallel function then raises after the fact, but the
failed list returned to the caller will not include critical failures (they are
filtered out in the return statement). This means the caller's failed_plugins list
will be incomplete. Consider returning all failures or handling this more
explicitly.

src/run_build.py [170-175]

 if not args.continue_on_error:
     critical_failures = [f for f in failed if f in CRITICAL_COMPONENTS]
     if critical_failures:
         raise Exception(f"Critical component(s) failed: {critical_failures}")
 
+# Return all non-critical failures; critical failures already raised above if not continue_on_error
 return [f for f in failed if f not in CRITICAL_COMPONENTS]
Suggestion importance[1-10]: 4


Why: The analysis of the control flow is partially correct — when continue_on_error=True, critical failures are included in failed but filtered out of the return value. However, the improved_code is identical to existing_code with only a comment added, so no actual fix is provided.

Low

@codecov

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 99.34426% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.66%. Comparing base (4321f76) to head (0cfda32).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
src/run_build.py | 98.18% | 1 Missing ⚠️
tests/test_run_build.py | 98.70% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6190      +/-   ##
==========================================
+ Coverage   96.62%   96.66%   +0.04%     
==========================================
  Files         405      407       +2     
  Lines       19039    19329     +290     
==========================================
+ Hits        18397    18685     +288     
- Misses        642      644       +2     

☔ View full report in Codecov by Sentry.

Member

@peterzhuamazon peterzhuamazon left a comment


Hi @gaiksaya
Generally happy with this change but with a few comments.

  1. The internet might throttle, especially on a home network; downloading multiple repos and then pulling dependencies from Maven/npm puts pressure on both the host machine and the remote repositories (we have already seen Maven/npm block requests multiple times due to too many parallel connections from the same IP).
  2. Whether the host has enough space/CPU to run this. In particular, there is a side effect when used with --keep that can silently grow the temp folder without the user knowing.
  3. Suggest making the default 2 threads when using --parallel; it is good that the default is still non-parallel, so this does not break existing behavior.

Thanks.

Comment thread src/run_build.py Outdated
from system import console
from system.temporary_directory import TemporaryDirectory

CRITICAL_COMPONENTS = ['OpenSearch', 'job-scheduler', 'common-utils', 'OpenSearch-Dashboards']
Member


We should add opensearch-remote-metadata-sdk as a critical component.

Member Author

@gaiksaya gaiksaya May 7, 2026


I did not touch what we already had, but it looks like only 2 components rely on it, so I'm not sure it should be critical.

Member


Is the purpose of this list that these components must succeed to create a build even if --continue-on-error is passed in? Might be worth adding a comment, and/or renaming them to "required components"

Member Author


Yes. Sure let me rename the variable.

@gaiksaya
Member Author

gaiksaya commented May 7, 2026

  1. The internet might throttle, especially on a home network; downloading multiple repos and then pulling dependencies from Maven/npm puts pressure on both the host machine and the remote repositories (we have already seen Maven/npm block requests multiple times due to too many parallel connections from the same IP).

Most of our own dependencies are pulled from local Maven. Maybe we can monitor for some time before making the call to move to 2? At least for all test runs and local builds, I did not see an issue so far.

  2. Whether the host has enough space/CPU to run this. In particular, there is a side effect when used with --keep that can silently grow the temp folder without the user knowing.

We are not using --keep in CI/CD, so I'm focusing on that. It's the same temp dir regardless of sequential or parallel, but I could add a warning if you like.

  3. Suggest making the default 2 threads when using --parallel; it is good that the default is still non-parallel, so this does not break existing behavior.

I would like to test with 4 and decrease/increase as required. Running 2 in parallel does not provide much value IMO.


Comment thread src/run_build.py
return 0


def _build_parallel(manifest: InputManifest, target: BuildTarget, build_recorder: BuildRecorder, work_dir: str, args: BuildArgs, components: list) -> list:
Member


My AI reviewing partner tells me that BuildRecorder is not thread safe but is being mutated concurrently here.

Member Author

@gaiksaya gaiksaya May 7, 2026


Looks like the diff analyzer has the same review. Removed the lock on BuildRecorder altogether.
record_component() writes to components_hash[name] and record_artifact() appends to components_hash[name]["artifacts"]. Each parallel thread only operates on its own component name as the key, so there's no shared mutable state between threads; the data is naturally partitioned by component name.
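
A hedged illustration of that partitioning argument (not the actual BuildRecorder code): each worker thread writes only under its own component key, and in CPython a single dict assignment or list append is atomic under the GIL.

class Recorder:
    def __init__(self) -> None:
        self.components_hash: dict = {}

    def record_component(self, name: str, data: dict) -> None:
        # `name` is unique per worker thread, so no two threads write the same key.
        self.components_hash[name] = {**data, "artifacts": []}

    def record_artifact(self, name: str, artifact: str) -> None:
        # Appends only to this component's own artifact list, created above.
        self.components_hash[name]["artifacts"].append(artifact)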

Comment thread src/run_build.py Outdated
if critical_failures:
    raise Exception(f"Critical component(s) failed: {critical_failures}")

return [f for f in failed if f not in CRITICAL_COMPONENTS]
Member


What happens if --continue-on-error is specified and one of the critical components fails?

Member Author


It hard fails; --continue-on-error does not matter at this point. Looks like there was a bug with this, fixed in a recent commit: when a required component fails, the exception is raised no matter what. For other components, it is only raised when --continue-on-error is false.
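
A condensed sketch of that policy (names assumed, not the committed code):

try:
    build_fn(name)
except Exception:
    failed.append(name)
    # Required components always abort the build; others abort only
    # when --continue-on-error was not passed.
    if name in REQUIRED_COMPONENTS or not continue_on_error:
        raise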

Signed-off-by: Sayali Gaikawad <gaiksaya@amazon.com>
@github-actions
Contributor

github-actions Bot commented May 7, 2026

Persistent review updated to latest commit 3960b14

Signed-off-by: Sayali Gaikawad <gaiksaya@amazon.com>
@github-actions
Contributor

github-actions Bot commented May 7, 2026

Persistent review updated to latest commit 0cfda32

@gaiksaya gaiksaya enabled auto-merge (squash) May 12, 2026 18:44
@gaiksaya gaiksaya disabled auto-merge May 14, 2026 03:13
@gaiksaya gaiksaya merged commit 03d2dca into opensearch-project:main May 14, 2026
17 checks passed
@gaiksaya gaiksaya deleted the parallel branch May 14, 2026 03:13
@github-project-automation github-project-automation Bot moved this from 👀 In Review to ✅ Done in Engineering Effectiveness Board May 14, 2026

Projects

Status: ✅ Done
