Agents

LLM agents run as native DimOS modules. They subscribe to camera, LiDAR, odometry, and spatial memory streams, and control the robot through skills.

Architecture

Human Input ──→ Agent ──→ Skill Calls ──→ Robot
  (text/voice)     │         (RPC)
                   │
          subscribes to streams:
          color_image, odom, spatial_memory

Agent (dimos/agents/agent.py) is a Module with:

  • human_input: In[str]: receives text from humancli, WebInput, or agent-send
  • agent: Out[BaseMessage]: publishes agent responses (text, tool calls, images)
  • agent_idle: Out[bool]: signals when the agent is waiting for input

The agent uses LangGraph with a configurable LLM. The default model is gpt-4o, which requires the OPENAI_API_KEY environment variable to be set. On startup, the agent discovers all @skill-annotated methods across deployed modules via RPC and exposes them as LangChain tools.
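The discovery step can be pictured with a small, self-contained sketch. It does not use the DimOS implementation: the skill decorator below is a stand-in, the _is_skill marker attribute and the discover_skills helper are hypothetical, and the RPC layer is left out entirely. It only illustrates the idea of tagging methods and collecting them by name along with their docstrings.

```python
from typing import Callable, Dict


def skill(func: Callable) -> Callable:
    """Stand-in for @skill: tag the function so it can be found later."""
    func._is_skill = True  # hypothetical marker attribute, not the DimOS one
    return func


class Mover:
    @skill
    def wave_hello(self) -> str:
        """Wave at the nearest person."""
        return "Waving!"

    def _internal_helper(self) -> None:
        """Not a skill, so it must not be exposed."""


def discover_skills(*modules: object) -> Dict[str, Callable]:
    """Collect bound methods tagged by the stand-in decorator, keyed by name."""
    found: Dict[str, Callable] = {}
    for module in modules:
        for name in dir(type(module)):
            attr = getattr(type(module), name)
            if callable(attr) and getattr(attr, "_is_skill", False):
                found[name] = getattr(module, name)  # bound to this instance
    return found


tools = discover_skills(Mover())
# The docstring is what would become the tool description.
descriptions = {name: fn.__doc__ for name, fn in tools.items()}
```

Here tools maps each skill name to a callable bound to its module instance, which is roughly the shape a tool registry needs before handing entries to an LLM framework.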

Skills

Skills are methods decorated with @skill on any Module. The agent discovers them automatically at startup.

from dimos.agents.annotation import skill
from dimos.core.module import Module

class MySkillContainer(Module):
    @skill
    def wave_hello(self) -> str:
        """Wave at the nearest person."""
        # ... robot control logic ...
        return "Waving!"

Rules:

  • Parameters must be JSON-serializable primitives (str, int, float, bool, list, dict).
  • Docstrings become the tool description the LLM sees. Write them clearly so the agent has sufficient context.
  • The function must return a string or an image, which the agent uses to decide what to do next.
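As a rough illustration of these rules, the sketch below checks a candidate function's signature before it is registered. check_skill_signature is a hypothetical helper, not part of DimOS, and the float type hints on relative_move are an assumption (the source lists only the parameter names). The return-type rule is not checked, because the image type is not specified here.

```python
import inspect
from typing import List, get_type_hints

# Bare primitive types taken from the rule above. Parameterised hints such
# as list[int] are rejected by this simple check.
ALLOWED_PARAM_TYPES = (str, int, float, bool, list, dict)


def check_skill_signature(func) -> List[str]:
    """Return a list of rule violations for a candidate skill function."""
    problems: List[str] = []
    hints = get_type_hints(func)
    for name in inspect.signature(func).parameters:
        if name == "self":
            continue
        if hints.get(name) not in ALLOWED_PARAM_TYPES:
            problems.append(f"parameter '{name}' is not a JSON-serializable primitive")
    if not (func.__doc__ or "").strip():
        problems.append("missing docstring")
    return problems


def relative_move(self, forward: float, left: float, degrees: float) -> str:
    """Move the robot relative to its current position."""
    return "done"  # placeholder body for illustration


def bad_skill(self, target: object) -> str:
    return "done"


print(check_skill_signature(relative_move))
print(check_skill_signature(bad_skill))
```

The first call reports no problems; the second flags both the object-typed parameter and the missing docstring.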

Built-in Skills

| Skill | Module | Description |
|---|---|---|
| relative_move(forward, left, degrees) | UnitreeSkillContainer | Move robot relative to current position |
| execute_sport_command(command_name) | UnitreeSkillContainer | Unitree sport commands (sit, stand, flip, etc.) |
| wait(seconds) | UnitreeSkillContainer | Pause execution |
| observe() | GO2Connection | Capture and return current camera frame |
| navigate_with_text(query) | NavigationSkillContainer | Navigate to a location by description |
| tag_location(name) | NavigationSkillContainer | Tag current position for later recall |
| stop_navigation() | NavigationSkillContainer | Cancel current navigation goal |
| follow_person(query) | PersonFollowSkill | Visual servoing to follow a described person |
| stop_following() | PersonFollowSkill | Stop person following |
| speak(text) | SpeakSkill | Text-to-speech through robot speakers |
| where_am_i() | GoogleMapsSkillContainer | Current street/area from GPS |
| get_gps_position_for_queries(queries) | GoogleMapsSkillContainer | Look up GPS coordinates |
| set_gps_travel_points(points) | GPSNavSkill | Navigate via GPS waypoints |
| map_query(query) | OsmSkill | Search OpenStreetMap with VLM |

MCP

There is also an MCP implementation. It replaces the Agent with two modules: McpServer and McpClient.

  • McpServer exposes the methods annotated with @skill as MCP tools. Any external client can connect to the server to use the MCP tools.
  • McpClient has a LangGraph LLM which calls MCP tools from McpServer.

CLI access:

dimos mcp list-tools                                # List available skills
dimos mcp call relative_move --arg forward=0.5      # Call a skill
dimos mcp status                                    # Server status
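To drive the CLI from a script, one option is to build the argument list programmatically. The sketch below only assembles the command shown above; build_mcp_call is a hypothetical helper, and repeating --arg once per parameter is an assumption, since the source shows a single --arg only.

```python
import shlex
from typing import List


def build_mcp_call(skill_name: str, **kwargs) -> List[str]:
    """Build the argv for `dimos mcp call <skill> --arg key=value ...`."""
    argv = ["dimos", "mcp", "call", skill_name]
    for key, value in kwargs.items():
        # Assumption: one --arg flag per parameter.
        argv += ["--arg", f"{key}={value}"]
    return argv


cmd = build_mcp_call("relative_move", forward=0.5)
print(shlex.join(cmd))
```

With a running server, the resulting list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a shell string avoids quoting problems in argument values.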

Input Methods

| Method | How it works |
|---|---|
| humancli | Standalone terminal: type messages, see responses |
| dimos agent-send "text" | One-shot CLI command via LCM |
| WebInput | Web interface at localhost:7779 with optional Whisper STT |

Models

| Config | Model | Notes |
|---|---|---|
| Default | gpt-4o | Best quality, requires OPENAI_API_KEY |
| ollama:llama3.1 | Local Ollama | Requires ollama serve running |
| Custom | Any LangChain-compatible | Set via AgentConfig(model="...") |
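The prerequisites in this table can be checked before launching the agent. The following is a minimal sketch, not DimOS code: startup_notes is a hypothetical helper, and it does not probe whether Ollama is actually reachable.

```python
from typing import List, Mapping


def startup_notes(model: str, env: Mapping[str, str]) -> List[str]:
    """Return human-readable notes about what the chosen model still needs."""
    notes: List[str] = []
    if model == "gpt-4o" and not env.get("OPENAI_API_KEY"):
        notes.append("set OPENAI_API_KEY before starting the agent")
    if model.startswith("ollama:"):
        # Not probed here; this is only a reminder.
        notes.append("make sure `ollama serve` is running")
    return notes


print(startup_notes("gpt-4o", {}))
print(startup_notes("ollama:llama3.1", {}))
```

In practice the second argument would be os.environ; a plain mapping is used here so the check is easy to exercise without touching the real environment.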