
Conversation

@AymenFJA

Adding Dragon Execution Backend

@AymenFJA AymenFJA self-assigned this Sep 30, 2025
@AymenFJA AymenFJA marked this pull request as draft September 30, 2025 18:53
@gemini-code-assist

Summary of Changes

Hello @AymenFJA, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the HPE Dragon distributed computing framework as a new execution backend. It enables efficient execution of diverse computational tasks, from single-process functions to multi-rank MPI executables, by leveraging Dragon's native process management and shared-memory capabilities. The new backend streamlines task submission, result collection, and resource management in a distributed environment.

Highlights

  • New Execution Backend: Introduced a new DragonExecutionBackend for executing tasks using the HPE Dragon distributed computing framework.
  • Task Types and Execution: The backend supports various task types including single-function, multi-function, single-executable, multi-executable, MPI-function, and MPI-executable tasks, leveraging Dragon's process and process group management.
  • Shared Memory and Data Handling: Implemented a SharedMemoryManager that uses Dragon's Distributed Dictionary (DDict) for efficient cross-process data sharing, passing large values by reference once they exceed a configurable size threshold (see the sketch after this list).
  • Result Collection: A ResultCollector is included to gather results from executable tasks via a global Dragon Queue and from function tasks via DDict, ensuring proper aggregation for multi-rank executions.
  • Robust Task Management: The backend provides comprehensive lifecycle management for tasks, including submission, asynchronous monitoring, and cancellation capabilities for individual and all running tasks, with proper resource cleanup.
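
To make the size-threshold behavior concrete, here is a minimal sketch of the store-by-value versus store-by-reference decision. The function name, descriptor format, and 1 MiB default are illustrative assumptions, not code from this PR:

    import pickle

    SIZE_THRESHOLD = 1024 * 1024  # hypothetical 1 MiB default; the PR makes this configurable

    def put_task_data(ddict, key, value, threshold=SIZE_THRESHOLD):
        """Ship small values inline; stage large ones in the DDict and pass a reference."""
        blob = pickle.dumps(value)
        if len(blob) <= threshold:
            return {"inline": value}   # small enough to travel with the task itself
        ddict[key] = blob              # DDict supports dict-style assignment across processes
        return {"ref": key}            # workers resolve the payload via ddict[key]
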
@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new, comprehensive execution backend for HPE Dragon. The implementation is well-structured, covering single and multi-rank tasks for both functions and executables, and it effectively utilizes Dragon's DDict and Queues for inter-process communication. My review focuses on improving robustness and configurability. I've identified several instances of overly broad exception handling that could mask bugs, including a critical issue in the function worker that might cause tasks to hang. Additionally, several key parameters like task timeouts and DDict memory are hardcoded, which limits the backend's flexibility. Addressing these points will significantly enhance the reliability and usability of the new Dragon backend.

Comment on lines +691 to +698
except Exception:
    pass

# Detach from DDict
try:
    d.detach()
except Exception:
    pass


critical

The except Exception: pass blocks in the finally clause are dangerous as they silently ignore all errors. An error during result storage (lines 689-690) is particularly critical, as it will cause the task to hang indefinitely because its completion status is never recorded. All exceptions in this block should be logged to stderr to aid debugging and prevent silent failures.

Suggested change

Before:
    except Exception:
        pass
    # Detach from DDict
    try:
        d.detach()
    except Exception:
        pass

After:
    except Exception as e:
        print(f"Dragon worker failed to store result for {task_uid}_rank_{rank}: {e}", file=sys.stderr)
    # Detach from DDict
    try:
        d.detach()
    except Exception as e:
        print(f"Dragon worker failed to detach from DDict for {task_uid}_rank_{rank}: {e}", file=sys.stderr)

(Note that the suggestion assumes sys is imported in the worker module.)

Comment on lines +1209 to +1212
try:
    self._result_queue.get(block=False)
except:
    break


high

The except: clause is too broad and will catch any error, not just the one for an empty queue. This can mask problems during shutdown. You should catch the specific exception for an empty queue (e.g., Queue.Empty) to ensure other errors are not silently ignored.

Suggested change

    except Queue.Empty:
        break
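
For context, here is a self-contained version of that drain loop. It assumes Dragon's native Queue follows the multiprocessing convention of raising the standard-library queue.Empty on a non-blocking get, which is worth confirming against the Dragon docs:

    import queue

    def drain(result_queue):
        # Discard leftover results during shutdown. Only the empty-queue
        # condition ends the loop, so genuine errors still propagate.
        while True:
            try:
                result_queue.get(block=False)
            except queue.Empty:
                break
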

Comment on lines +6 to +7
import dill
import cloudpickle


medium

The dill and cloudpickle modules are imported but do not appear to be used in this file. They can be removed to keep the imports clean and reduce unnecessary dependencies.

Comment on lines +240 to +242
except Exception:
    # Queue empty
    pass


medium

The except Exception: block is too broad. It's intended to catch an exception when the queue is empty, but it will also catch and silence any other unexpected errors during self.result_queue.get(). It's safer to explicitly catch the specific exception for an empty queue (e.g., Queue.Empty) to avoid masking other potential bugs.

Suggested change

Before:
    except Exception:
        # Queue empty
        pass

After:
    except Queue.Empty:
        # Queue empty
        pass

backend_kwargs = task.get('task_backend_specific_kwargs', {})
ranks = int(backend_kwargs.get("ranks", 2))

# FIXME: pass Policy and other attrs to Process/Group init


medium

This FIXME comment indicates that Policy and other attributes are not being passed to the ProcessGroup constructor. This suggests an incomplete implementation. It's best to either implement this functionality or create a ticket to track it and remove the FIXME from the code.
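
As one possible resolution, here is a sketch of wiring a per-task Policy into the group setup. It assumes Dragon's native API where ProcessTemplate and ProcessGroup accept a policy argument; the "policy" kwarg key and the placement choice are hypothetical, and executable, args, and working_dir stand in for values the backend already has in scope, so signatures should be verified against the installed Dragon release:

    from dragon.infrastructure.policy import Policy
    from dragon.native.process import ProcessTemplate
    from dragon.native.process_group import ProcessGroup

    backend_kwargs = task.get('task_backend_specific_kwargs', {})
    ranks = int(backend_kwargs.get("ranks", 2))
    # Hypothetical per-task key: callers may hand in a ready-made Policy.
    policy = backend_kwargs.get("policy") or Policy(placement=Policy.Placement.DEFAULT)

    template = ProcessTemplate(target=executable, args=args,
                               cwd=working_dir, policy=policy)
    group = ProcessGroup(restart=False, policy=policy)
    group.add_process(nproc=ranks, template=template)
    group.init()
    group.start()
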

cwd=working_dir,
capture_output=True,
text=True,
timeout=3600 # 1 hour timeout


medium

The timeout for subprocess.run is hardcoded to 3600 seconds (1 hour). This lacks flexibility. It would be better to allow this timeout to be configured on a per-task basis, for example, through task_backend_specific_kwargs.
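
A minimal sketch of that suggestion, with the current one-hour value kept as the default and a hypothetical "timeout" key read from task_backend_specific_kwargs (cmd and working_dir stand in for values the backend already has in scope):

    import subprocess

    DEFAULT_TASK_TIMEOUT = 3600  # seconds; today's hardcoded value becomes the fallback

    backend_kwargs = task.get('task_backend_specific_kwargs', {})
    timeout = float(backend_kwargs.get("timeout", DEFAULT_TASK_TIMEOUT))

    result = subprocess.run(
        cmd,                  # command assembled by the backend
        cwd=working_dir,
        capture_output=True,
        text=True,
        timeout=timeout,      # raises subprocess.TimeoutExpired when exceeded
    )
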

if not self._ddict:
self._ddict = DDict(
n_nodes=nnodes,
total_mem=nnodes * int(4 * 1024 * 1024 * 1024), # 4GB per node


medium

The total memory for the DDict is calculated using a hardcoded value of 4GB per node. This should be made configurable, for instance, by reading a ddict_mem_per_node value from the resources dictionary provided to the backend's constructor. This would allow users to tune resource usage for their specific hardware and workload.
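
A sketch of the suggested change: ddict_mem_per_node is the reviewer's proposed key, and self._resources stands in for the resources mapping passed to the backend's constructor, so both names are assumptions rather than existing options:

    GIB = 1024 ** 3
    mem_per_node = int(self._resources.get("ddict_mem_per_node", 4 * GIB))  # 4 GiB default, as today

    if not self._ddict:
        self._ddict = DDict(
            n_nodes=nnodes,
            total_mem=nnodes * mem_per_node,  # same kwargs as the current call
        )
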
