fix: agent initialization and added resilience mechanisms #970

seoeaa wants to merge 11 commits into agentscope-ai:main from
Conversation
feat(console): add Russian language support
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the stability and robustness of the agent framework by addressing critical initialization bugs and integrating comprehensive resilience mechanisms. It ensures that both LLM reasoning and MCP tool interactions can gracefully recover from transient network errors and other temporary failures, leading to a more reliable and fault-tolerant system. The changes also introduce configurable parameters, allowing for fine-tuning of these resilience features.

Highlights
Activity
Code Review
This pull request introduces important resilience mechanisms by adding retry logic for MCP client registration and LLM reasoning calls. The configuration is also updated to allow tuning these new mechanisms. The changes are well-implemented and include new tests to verify the resilience of LLM calls.
My review includes a few suggestions for improvement:
- A performance improvement to avoid reloading configuration on every LLM reasoning call.
- A suggestion to make the retry delay for MCP clients configurable for consistency with the LLM retry configuration. This involves changes across three files.
- A recommendation to use exponential backoff for LLM retries, consistent with the strategy used for MCP clients, to make it more robust.
```python
client_name = getattr(client, "name", f"Client#{i}")
rebuild_info = getattr(client, "_copaw_rebuild_info", {})
max_retries = rebuild_info.get("retries", 3)
base_delay = 1.0
```
```python
return await super()._reasoning(tool_choice=tool_choice)
config = load_config()
```
Calling `load_config()` within `_reasoning` can lead to performance issues, as it performs file I/O on every call. This configuration is unlikely to change during the agent's reasoning process. It would be more efficient to load the configuration once during the `CoPawAgent`'s initialization in `__init__` and cache the `llm_retries` and `llm_retry_delay` values as instance attributes.
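A minimal sketch of this suggestion, with `config_loader` and the dict-based config standing in for the project's real `load_config()` (the actual signatures may differ): the loader runs once in `__init__`, and later reasoning steps reuse the cached attributes.

```python
# Hypothetical sketch: cache retry settings at construction time instead of
# calling the config loader inside _reasoning on every call.

class CoPawAgent:
    def __init__(self, config_loader):
        # config_loader stands in for the project's load_config(); it is
        # invoked exactly once here rather than per reasoning step.
        config = config_loader()
        self._llm_retries = config.get("llm_retries", 3)
        self._llm_retry_delay = config.get("llm_retry_delay", 1.0)

calls = []

def fake_load_config():
    calls.append(1)  # count how many times the "file I/O" happens
    return {"llm_retries": 5, "llm_retry_delay": 0.5}

agent = CoPawAgent(fake_load_config)

# Repeated "reasoning" steps reuse the cached values -- no extra I/O.
for _ in range(10):
    retries, delay = agent._llm_retries, agent._llm_retry_delay

print(len(calls))  # the loader ran only once despite 10 reasoning steps
```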
```python
if self._is_transient_error(e) and attempt < retries:
    logger.warning(
        "LLM reasoning transient error (attempt %d/%d): %s. "
        "Retrying in %.1fs...",
        attempt + 1,
        retries + 1,
        self._get_safe_error_msg(e),
        delay,
    )
    await asyncio.sleep(delay)
    continue
```
The retry logic for LLM reasoning uses a fixed delay. For better resilience against issues like rate limiting, consider implementing exponential backoff, similar to the strategy used in `register_mcp_clients`. This would make the retry strategy more robust and consistent within the agent.
```diff
 if self._is_transient_error(e) and attempt < retries:
+    retry_delay = delay * (2**attempt)
     logger.warning(
         "LLM reasoning transient error (attempt %d/%d): %s. "
         "Retrying in %.1fs...",
         attempt + 1,
         retries + 1,
         self._get_safe_error_msg(e),
-        delay,
+        retry_delay,
     )
-    await asyncio.sleep(delay)
+    await asyncio.sleep(retry_delay)
     continue
```
```python
"env": dict(client_config.env),
"cwd": client_config.cwd or None,
"timeout": client_config.timeout,
"retries": client_config.retries,
```
```python
env: Dict[str, str] = Field(default_factory=dict)
cwd: str = ""
timeout: float = Field(default=10.0, ge=0.0)
retries: int = Field(default=3, ge=0)
```
For consistency with `AgentsRunningConfig`, which has `llm_retry_delay`, consider adding a `retry_delay` field to `MCPClientConfig` to control the backoff delay for MCP client registration retries. The base delay is currently hardcoded in `react_agent.py`.
```diff
 retries: int = Field(default=3, ge=0)
+retry_delay: float = Field(default=1.0, ge=0.0, description="Base delay in seconds between MCP client retry attempts.")
```
|
Hi @seoeaa, thanks for the PR! I tested the branch locally and found a few issues:

Bugs found:

Happy to re-review after these are addressed!
Description
This PR fixes critical initialization errors in `CoPawAgent` and `AgentRunner`, and implements resilience mechanisms (retries and timeouts) for LLM and MCP tool interactions.

Key changes:
- Fixed a `TypeError` in `runner.py` by ensuring `env_context` is correctly passed as a string during `CoPawAgent` initialization.
- Fixed an `AttributeError` in `AgentRunner` by adding the missing `SafeJSONSession` initialization.
- Updated `config.py` with configurable parameters for LLM retries (`llm_retries`, `llm_retry_delay`) and MCP client timeouts/retries.
- Updated `react_agent.py` and `manager.py` with retry logic for transient errors such as SSL failures and timeouts.

Linked Issue: Relates to improving application stability and initialization in `pipx` environments.

Security Considerations: Changes focus on configuration handling and network request resilience. No changes were made to authentication or credential management.
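The "transient errors such as SSL failures and timeouts" classification could look like the sketch below. This is a hypothetical stand-in for the PR's `_is_transient_error`, using only standard-library exception types; the real predicate in `react_agent.py` may check additional cases.

```python
import ssl
import asyncio

# Exception types worth retrying: network hiccups, not logic errors.
# ConnectionError covers ConnectionResetError, ConnectionRefusedError, etc.
TRANSIENT_TYPES = (
    ssl.SSLError,
    asyncio.TimeoutError,  # alias of TimeoutError on Python 3.11+
    TimeoutError,
    ConnectionError,
)

def is_transient_error(exc: BaseException) -> bool:
    """Return True if exc looks like a temporary network failure."""
    return isinstance(exc, TRANSIENT_TYPES)

print(is_transient_error(ssl.SSLError("handshake failed")))  # True
print(is_transient_error(ConnectionResetError()))            # True
print(is_transient_error(ValueError("bad config")))          # False
```

Keeping the predicate type-based (rather than matching error-message strings) avoids retrying genuine bugs like a `ValueError` from a malformed config.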
Type of Change
Component(s) Affected
Checklist
- Ran `pre-commit run --all-files` locally and it passed.
- Ran the application (`trigger_query.py`); initialization is successful.

Testing
- Tested in a `pipx` environment to ensure the `TypeError` is resolved.
- Ran `trigger_query.py`.
- Added `test_llm_resilience.py`.

Proof of Local Verification
`pre-commit run --all-files`
Additional Notes
Skill-specific details are omitted as these fixes address systemic initialization issues affecting the entire agent framework.