Add an option to build components in parallel#6190
Conversation
Signed-off-by: Sayali Gaikawad <gaiksaya@amazon.com>
PR Reviewer Guide 🔍(Review updated until commit 0cfda32)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 0cfda32 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit 3960b14
Suggestions up to commit fc0d08e
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6190 +/- ##
==========================================
+ Coverage 96.62% 96.66% +0.04%
==========================================
Files 405 407 +2
Lines 19039 19329 +290
==========================================
+ Hits 18397 18685 +288
- Misses 642 644 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
peterzhuamazon
left a comment
There was a problem hiding this comment.
Hi @gaiksaya
Generally happy with this change but with a few comments.
- Internet might be throttling especially for home network, downloading multiple repos while later pulling dependencies from maven/npm push pressure on both host machine and remote repo (already see multiple times maven block/npm block due to too many connections in parallel from same ip)
- whether the host has enough space / cpu to run this. Especially it has a side effect when using with --keep that can silently increase temp folder size without knowing
- Suggest make default threads to 2 when using --parallel, good that default is still non-parallel so this does not break behavior.
Thanks.
| from system import console | ||
| from system.temporary_directory import TemporaryDirectory | ||
|
|
||
| CRITICAL_COMPONENTS = ['OpenSearch', 'job-scheduler', 'common-utils', 'OpenSearch-Dashboards'] |
There was a problem hiding this comment.
We should add opensearch-remote-metadata-sdk as a critical component.
There was a problem hiding this comment.
I did not touch what we already had but looks like only 2 components reply on it so not sure if it should be critical.
There was a problem hiding this comment.
Is the purpose of this list that these components must succeed to create a build even if --continue-on-error is passed in? Might be worth adding a comment, and/or renaming them to "required components"
There was a problem hiding this comment.
Yes. Sure let me rename the variable.
Most of our own dependencies are pulled from local maven. Maybe we can monitor for sometime before taking the call or using 2? Atleast for all test runs and local building I did not see an issue for now.
We are not using --keep in the CI/CD so focusing on that. It's the same temp dir regardless of seq or parallel. But I could add a warning if you like.
I would like to test with 4 and decrease/increase if required. Running 2 in parallel does not provide much of a value IMO. |
| from system import console | ||
| from system.temporary_directory import TemporaryDirectory | ||
|
|
||
| CRITICAL_COMPONENTS = ['OpenSearch', 'job-scheduler', 'common-utils', 'OpenSearch-Dashboards'] |
There was a problem hiding this comment.
Is the purpose of this list that these components must succeed to create a build even if --continue-on-error is passed in? Might be worth adding a comment, and/or renaming them to "required components"
| return 0 | ||
|
|
||
|
|
||
| def _build_parallel(manifest: InputManifest, target: BuildTarget, build_recorder: BuildRecorder, work_dir: str, args: BuildArgs, components: list) -> list: |
There was a problem hiding this comment.
My AI reviewing partner tells me that BuildRecorder is not thread safe but is being mutated concurrently here.
There was a problem hiding this comment.
Looks like diff analyzer has same review. Removed the lock on buildRecorder altogerther.
record_component() writes to components_hash[name] and record_artifact() appends to components_hash[name]["artifacts"]. Basically each parallel thread only operates on its own component name as the key. There's no shared mutable state between threads since the data is naturally partitioned by component name.
| if critical_failures: | ||
| raise Exception(f"Critical component(s) failed: {critical_failures}") | ||
|
|
||
| return [f for f in failed if f not in CRITICAL_COMPONENTS] |
There was a problem hiding this comment.
What happens if --continue-on-error is specified and one the critical components fails?
There was a problem hiding this comment.
If hard fails. --continue-on-error does not matter as this point. Looks like there was bug with this. Fixed in recent commit. When required component fails, exception is raised no matter what. For others, only raised when continue-on-error is false.
Signed-off-by: Sayali Gaikawad <gaiksaya@amazon.com>
|
Persistent review updated to latest commit 3960b14 |
Signed-off-by: Sayali Gaikawad <gaiksaya@amazon.com>
|
Persistent review updated to latest commit 0cfda32 |
Description
As of today, all components are build sequentially on jenkins for OpenSearch which takes approximately 2hours. Ref
The
depends_onin the manifest (example) is only utilized during incremental builds and NOT during regular builds.This PR refactors the building logic to utilize the depends_on to build a dependency graph (DAG) and run builds in parallel using multiple threads controlled by
--parallelgraph TD OpenSearch[/"OpenSearch"/]:::core %% Level 1 - No dependencies common-utils:::level1 opensearch-learning-to-rank-base:::level1 opensearch-remote-metadata-sdk:::level1 job-scheduler:::level1 security:::level1 custom-codecs:::level1 performance-analyzer:::level1 query-insights:::level1 opensearch-system-templates:::level1 user-behavior-insights:::level1 %% Level 2 k-NN:::level2 geospatial:::level2 cross-cluster-replication:::level2 ml-commons:::level2 notifications-core:::level2 notifications:::level2 opensearch-observability:::level2 opensearch-reports:::level2 asynchronous-search:::level2 anomaly-detection:::level2 alerting:::level2 index-management:::level2 %% Level 3 neural-search:::level3 sql:::level3 security-analytics:::level3 %% Level 4 flow-framework:::level4 skills:::level4 search-relevance:::level4 %% Implicit deps from core OpenSearch -.-> common-utils OpenSearch -.-> opensearch-learning-to-rank-base OpenSearch -.-> opensearch-remote-metadata-sdk OpenSearch -.-> job-scheduler OpenSearch -.-> security OpenSearch -.-> custom-codecs OpenSearch -.-> performance-analyzer OpenSearch -.-> query-insights OpenSearch -.-> opensearch-system-templates OpenSearch -.-> user-behavior-insights %% Explicit dependencies common-utils --> k-NN security --> k-NN custom-codecs --> k-NN job-scheduler --> geospatial common-utils --> cross-cluster-replication common-utils --> ml-commons job-scheduler --> ml-commons opensearch-remote-metadata-sdk --> ml-commons security --> ml-commons ml-commons --> neural-search k-NN --> neural-search common-utils --> notifications-core common-utils --> notifications common-utils --> opensearch-observability common-utils --> opensearch-reports job-scheduler --> opensearch-reports ml-commons --> sql geospatial --> sql job-scheduler --> sql common-utils --> asynchronous-search common-utils --> anomaly-detection job-scheduler --> anomaly-detection common-utils --> alerting common-utils --> security-analytics alerting --> security-analytics job-scheduler --> security-analytics common-utils --> index-management job-scheduler --> index-management common-utils --> flow-framework opensearch-remote-metadata-sdk --> flow-framework ml-commons --> flow-framework k-NN --> flow-framework neural-search --> flow-framework job-scheduler --> skills anomaly-detection --> skills sql --> skills ml-commons --> skills job-scheduler --> search-relevance neural-search --> search-relevance k-NN --> search-relevance ml-commons --> search-relevance user-behavior-insights --> search-relevance %% Styles classDef core fill:#FFD700,stroke:#333,font-weight:bold classDef level1 fill:#90EE90,stroke:#333 classDef level2 fill:#87CEEB,stroke:#333 classDef level3 fill:#DDA0DD,stroke:#333 classDef level4 fill:#FFA07A,stroke:#333Defaults to 4 parallel threads to start with and adding parallel is optional in the build process for now.
Below are the "build" stats for 3.6.0 builds when ran using sequential build, parallel builds with 4 threads and parallel builds with 8 threads on jenkins.
Issues Resolved
List any issues this PR will resolve, e.g. Closes [...].
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.