Adding HPE DRAGON as execution backend #5
base: main
Conversation
Summary of Changes

Hello @AymenFJA, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates the HPE DRAGON distributed computing framework as a new execution backend. The integration enables efficient execution of diverse computational tasks, from single-process functions to multi-rank MPI executables, by leveraging Dragon's native process management and shared-memory capabilities. The new backend streamlines task submission, result collection, and resource management within a distributed environment.

Highlights
Code Review
This pull request introduces a new, comprehensive execution backend for HPE Dragon. The implementation is well-structured, covering single and multi-rank tasks for both functions and executables, and it effectively utilizes Dragon's DDict and Queues for inter-process communication. My review focuses on improving robustness and configurability. I've identified several instances of overly broad exception handling that could mask bugs, including a critical issue in the function worker that might cause tasks to hang. Additionally, several key parameters like task timeouts and DDict memory are hardcoded, which limits the backend's flexibility. Addressing these points will significantly enhance the reliability and usability of the new Dragon backend.
```python
        except Exception:
            pass

        # Detach from DDict
        try:
            d.detach()
        except Exception:
            pass
```
The except Exception: pass blocks in the finally clause are dangerous as they silently ignore all errors. An error during result storage (lines 689-690) is particularly critical, as it will cause the task to hang indefinitely because its completion status is never recorded. All exceptions in this block should be logged to stderr to aid debugging and prevent silent failures.
Suggested change:

```diff
-        except Exception:
-            pass
-        # Detach from DDict
-        try:
-            d.detach()
-        except Exception:
-            pass
+        except Exception as e:
+            print(f"Dragon worker failed to store result for {task_uid}_rank_{rank}: {e}", file=sys.stderr)
+        # Detach from DDict
+        try:
+            d.detach()
+        except Exception as e:
+            print(f"Dragon worker failed to detach from DDict for {task_uid}_rank_{rank}: {e}", file=sys.stderr)
```
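The pattern the review asks for can be sketched as a stand-alone helper (the names `finish_task`, `store_result`, and `detach` are illustrative, not the PR's actual code): cleanup errors are logged to stderr rather than swallowed, so a lost result is at least visible in the logs instead of leaving the task hanging silently.

```python
import sys

def finish_task(store_result, detach, task_uid="task", rank=0):
    # Illustrative cleanup: log failures instead of swallowing them.
    try:
        store_result()
    except Exception as e:
        print(f"failed to store result for {task_uid}_rank_{rank}: {e}",
              file=sys.stderr)
    # Detach even if storing the result failed.
    try:
        detach()
    except Exception as e:
        print(f"failed to detach for {task_uid}_rank_{rank}: {e}",
              file=sys.stderr)
```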
```python
            try:
                self._result_queue.get(block=False)
            except:
                break
```
```python
import dill
import cloudpickle
```
```python
            except Exception:
                # Queue empty
                pass
```
The `except Exception:` block is too broad. It's intended to catch the exception raised when the queue is empty, but it will also catch and silence any other unexpected error during `self.result_queue.get()`. It's safer to catch the specific empty-queue exception (`queue.Empty` in the standard library) to avoid masking other potential bugs.
Suggested change:

```diff
-            except Exception:
-                # Queue empty
-                pass
+            except queue.Empty:
+                # Queue empty
+                pass
```
```python
        backend_kwargs = task.get('task_backend_specific_kwargs', {})
        ranks = int(backend_kwargs.get("ranks", 2))

        # FIXME: pass Policy and other attrs to Process/Group init
```
```python
            cwd=working_dir,
            capture_output=True,
            text=True,
            timeout=3600  # 1 hour timeout
```
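The review summary flags hardcoded task timeouts like the `timeout=3600` above. One way to make it tunable is a per-task override read from the backend kwargs; this is a sketch, and the `timeout` key in `task_backend_specific_kwargs` is hypothetical, not part of the PR:

```python
DEFAULT_TIMEOUT_S = 3600  # the one-hour default currently hardcoded

def task_timeout(backend_kwargs):
    # Let callers override the subprocess timeout per task, falling
    # back to the existing one-hour default.
    return float(backend_kwargs.get("timeout", DEFAULT_TIMEOUT_S))
```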
```python
        if not self._ddict:
            self._ddict = DDict(
                n_nodes=nnodes,
                total_mem=nnodes * int(4 * 1024 * 1024 * 1024),  # 4GB per node
```
The total memory for the DDict is calculated using a hardcoded value of 4GB per node. This should be made configurable, for instance, by reading a ddict_mem_per_node value from the resources dictionary provided to the backend's constructor. This would allow users to tune resource usage for their specific hardware and workload.
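A minimal sketch of that suggestion, assuming a `resources` dict is passed to the backend's constructor (the `ddict_mem_per_node` key and `ddict_total_mem` helper are hypothetical, chosen here for illustration):

```python
DEFAULT_MEM_PER_NODE = 4 * 1024**3  # the 4 GB/node currently hardcoded

def ddict_total_mem(resources, nnodes):
    # Size the DDict from user-supplied resources, defaulting to the
    # current behaviour when no override is given.
    per_node = int(resources.get("ddict_mem_per_node", DEFAULT_MEM_PER_NODE))
    return nnodes * per_node
```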