LLM agents run as native DimOS modules. They subscribe to camera, LiDAR, odometry, and spatial memory streams and they control the robot through skills.
```
Human Input ──→ Agent ──→ Skill Calls ──→ Robot
(text/voice)     │           (RPC)
                 │
                 └─ subscribes to streams:
                    color_image, odom, spatial_memory
```
`Agent` (`dimos/agents/agent.py`) is a Module with:

- `human_input: In[str]`: receives text from `humancli`, `WebInput`, or `agent-send`
- `agent: Out[BaseMessage]`: publishes agent responses (text, tool calls, images)
- `agent_idle: Out[bool]`: signals when the agent is waiting for input
The agent uses LangGraph with a configurable LLM; the default is gpt-4o, which requires the `OPENAI_API_KEY` environment variable. On startup, the agent discovers all `@skill`-annotated methods across deployed modules via RPC and exposes them as LangChain tools.
Skills are methods decorated with @skill on any Module. The agent discovers them automatically at startup.
```python
from dimos.agents.annotation import skill
from dimos.core.module import Module

class MySkillContainer(Module):
    @skill
    def wave_hello(self) -> str:
        """Wave at the nearest person."""
        # ... robot control logic ...
        return "Waving!"
```

Rules:
- Parameters must be JSON-serializable primitives (`str`, `int`, `float`, `bool`, `list`, `dict`).
- Docstrings become the tool description the LLM sees. Write them clearly so the agent has sufficient context.
- The function must return a string or an image, which the agent uses to decide what to do next.
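To make the discovery mechanism concrete, here is a self-contained sketch of how `@skill`-style marking and lookup can work. The `skill` marker and the `discover_skills` helper below are illustrative stand-ins, not the actual DimOS implementation; only the decorator name and the docstring-as-description behavior come from the text above.

```python
# Illustrative sketch of @skill marking and discovery; not DimOS code.
import inspect

def skill(fn):
    fn._is_skill = True  # marker the discovery pass looks for
    return fn

class MySkillContainer:
    @skill
    def relative_move(self, forward: float, left: float, degrees: float) -> str:
        """Move the robot relative to its current position."""
        return f"moved forward={forward}, left={left}, rotated {degrees} deg"

def discover_skills(module):
    """Collect name -> description for every @skill-marked method."""
    tools = {}
    for name, fn in inspect.getmembers(module, inspect.ismethod):
        if getattr(fn, "_is_skill", False):
            # The cleaned docstring is what the LLM sees as the tool description.
            tools[name] = inspect.getdoc(fn)
    return tools

print(discover_skills(MySkillContainer()))
```

In DimOS the same idea runs over RPC across all deployed modules rather than over a single local object, but the docstring-to-description mapping is the part that matters when writing skills.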
| Skill | Module | Description |
|---|---|---|
| `relative_move(forward, left, degrees)` | `UnitreeSkillContainer` | Move robot relative to current position |
| `execute_sport_command(command_name)` | `UnitreeSkillContainer` | Unitree sport commands (sit, stand, flip, etc.) |
| `wait(seconds)` | `UnitreeSkillContainer` | Pause execution |
| `observe()` | `GO2Connection` | Capture and return current camera frame |
| `navigate_with_text(query)` | `NavigationSkillContainer` | Navigate to a location by description |
| `tag_location(name)` | `NavigationSkillContainer` | Tag current position for later recall |
| `stop_navigation()` | `NavigationSkillContainer` | Cancel current navigation goal |
| `follow_person(query)` | `PersonFollowSkill` | Visual servoing to follow a described person |
| `stop_following()` | `PersonFollowSkill` | Stop person following |
| `speak(text)` | `SpeakSkill` | Text-to-speech through robot speakers |
| `where_am_i()` | `GoogleMapsSkillContainer` | Current street/area from GPS |
| `get_gps_position_for_queries(queries)` | `GoogleMapsSkillContainer` | Look up GPS coordinates |
| `set_gps_travel_points(points)` | `GPSNavSkill` | Navigate via GPS waypoints |
| `map_query(query)` | `OsmSkill` | Search OpenStreetMap with a VLM |
There is also an MCP implementation. It replaces the Agent with two modules: `McpServer` and `McpClient`.

- `McpServer` exposes the `@skill`-annotated methods as MCP tools. Any external client can connect to the server and use them.
- `McpClient` runs a LangGraph LLM that calls the MCP tools exposed by `McpServer`.
CLI access:

```shell
dimos mcp list-tools                              # List available skills
dimos mcp call relative_move --arg forward=0.5    # Call a skill
dimos mcp status                                  # Server status
```

| Method | How it works |
|---|---|
| `humancli` | Standalone terminal: type messages, see responses |
| `dimos agent-send "text"` | One-shot CLI command via LCM |
| `WebInput` | Web interface at `localhost:7779` with optional Whisper STT |
| Config | Model | Notes |
|---|---|---|
| Default | `gpt-4o` | Best quality, requires `OPENAI_API_KEY` |
| `ollama:llama3.1` | Local Ollama | Requires `ollama serve` running |
| Custom | Any LangChain-compatible | Set via `AgentConfig(model="...")` |
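As a sketch of choosing between these configs at startup: the fallback logic below is illustrative, and only the model strings and `AgentConfig(model=...)` come from the table above.

```python
# Illustrative model selection; the helper is not part of DimOS.
import os

def pick_model() -> str:
    """Prefer gpt-4o when an OpenAI key is set, else fall back to local Ollama."""
    if os.environ.get("OPENAI_API_KEY"):
        return "gpt-4o"
    return "ollama:llama3.1"

# The chosen string would then be passed as AgentConfig(model=pick_model()).
print(pick_model())
```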