When running A0 with self-hosted LLMs (llama.cpp on Strix Halo, 124 GB unified VRAM) I encountered many issues, most of them making A0 unusable for various reasons. A summary of what is required:
Context length limitation is completely ignored in all possible ways (this is actually a regression; it was better around 1.4):
By all logic (including the Discord support bot, which I believe runs gemini-flash), the utility LLM should have a small context window (it only gets small pieces of work), but instead, if it gets called first, it receives the entire question plus memory context and tries to summarize it. So if we set the model's context window to 8K and cap usage to 0.5 in the UI, it still overflows immediately, leading to memory errors.
The same situation occurs in the late phase, when memorization is called: the entire chat context is fed into the utility worker regardless of the context settings, and memorizing fails.
The only way to work around this "initial call" is to set the utility worker's context to 256K (64K or 128K works for smaller tasks but still fails in the later phase).
The whole thing works well with commercial APIs, since most of them have huge context windows (which get compacted soon after anyway), but on local LLMs it fails. Proposal: make the context limit cap actually matter; do not feed the entire chat history and prompts to the utility agent - let the chat agent do that work.
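To make the proposal concrete, here is a minimal sketch of what enforcing the cap could look like internally (all names are hypothetical and not A0's actual code): trim the history to the utility model's configured budget before the call, instead of forwarding the whole chat.

```python
# Hypothetical sketch: enforce the configured context cap before calling
# the utility model, instead of forwarding the entire chat history.

def trim_to_budget(messages, max_ctx_tokens, usage_cap=0.5):
    """Keep only the newest messages that fit into usage_cap * max_ctx_tokens.

    Token counts are roughly estimated at ~4 characters per token; a real
    implementation would use the model's tokenizer.
    """
    budget = int(max_ctx_tokens * usage_cap)
    kept, used = [], 0
    for msg in reversed(messages):        # walk newest first
        cost = len(msg["content"]) // 4 + 1
        if used + cost > budget:
            break                         # older messages are dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))           # restore chronological order

history = [
    {"role": "user", "content": "x" * 40000},  # huge old message
    {"role": "user", "content": "summarize the last step"},
]
trimmed = trim_to_budget(history, max_ctx_tokens=8192, usage_cap=0.5)
# Only the short recent message fits into the 4096-token budget.
```

With a rule like this, an 8K utility model with a 0.5 cap would receive at most ~4K tokens, no matter how large the chat has grown.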
There's no way to trace which agent executed what. The UI does not indicate which agent is running a query/job, and the log file (HTML!) does not contain timestamps (to match against litellm or llama.cpp logs). All we get is "something failed somewhere", with no record of the actual query that was executed.
Proposal:
add a symbol to the UI next to each query to indicate which agent is doing the job (for example, "USE A0**-U**: Using tool 'text_editor<...>")
add timestamps to the log file
if an error occurred, display the LLM response or prompt, not only the stack trace (it is in the log file, but again, searching an HTML log file isn't the easiest thing to do)
expose prompt configuration in the UI, or at least make it easier to insert basic instructions. Proposal: add one text-input field to the agent configuration whose contents get prepended to the constructed system prompt (for each agent individually)
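The prepend field could work roughly like this (a sketch with hypothetical names, not A0's actual prompt builder):

```python
# Sketch of the proposed per-agent "extra instructions" field: whatever the
# user types into the UI field is prepended to the constructed system prompt.

def build_system_prompt(base_prompt: str, user_prefix: str = "") -> str:
    """Prepend the agent's user-configured instructions, if any."""
    if user_prefix.strip():
        return user_prefix.strip() + "\n\n" + base_prompt
    return base_prompt

prompt = build_system_prompt(
    base_prompt="You are the utility agent. Summarize the given text.",
    user_prefix="Answer in English. Never exceed 200 words.",
)
```

Since the field is per agent, the utility worker and the chat agent could each get their own basic instructions without touching the prompt files.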
My test case to check how it's working: an LLM box, with a VM nearby running docker containers. Each time I start a new project, I add only the IPs to memories. Test query:
create "wordpress" installation on docker containers:
1 container for database + wordpress webserver
1 container for reverse proxy
expose reverse proxy outside to port 8233, leave other container unexposed.
Keep resources at minimum, we aim for 10 visitors/day.
Wordpress should run at default settings, show its website (not the configuration dialog), show the administrator password in chat
Verify result with curl before returning with success status.
Mandatory tasks:
a) plan task execution and list it in chat. Draw ASCII diagram how everything is connected
b) verify that content is actually served before confirming to user
It's really trivial for hosted models, but currently impossible to do self-hosted - I tested 32 models in total. The best combo that worked on anything smaller so far:
This combo can do most smaller projects, but fails on the query above under various conditions:
if I give the utility worker an 8K context window (on the llama.cpp side and in the GUI), it fails immediately due to an oversized context (feeding memory with all container paths and keys + the task prompt + the prepared system prompt = easy overflow)
if I give the utility worker a 256K context window, it can process all prompts just fine, but fails in later phases due to tool usage (let's call this a technical limitation - the model can't "focus attention" with a huge context window)
it's possible to give a ~64K context window and reduce temperature and top-k to get as far as the last phase (verification), where it fails due to a mistake in the reverse proxy config. Then it enters an endless loop where the utility worker (? - again, it's unclear which agent executes what, due to the missing info) writes a plan and the chat agent rejects it because it's the same as the previous one, and there's not enough temperature to break out of the loop.
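One possible mitigation for that endless plan/reject cycle (a hypothetical sketch, not anything A0 currently does): track the last few plans and bail out, or bump the temperature, once the same plan repeats.

```python
# Sketch of a loop breaker: if the planner keeps emitting the same rejected
# plan, stop (or raise the temperature) instead of spinning forever.
from collections import deque

class LoopBreaker:
    def __init__(self, max_repeats: int = 3):
        self.recent = deque(maxlen=max_repeats)
        self.max_repeats = max_repeats

    def register(self, plan: str) -> bool:
        """Return True once the same plan was seen max_repeats times in a row."""
        self.recent.append(plan.strip())
        return (len(self.recent) == self.max_repeats
                and len(set(self.recent)) == 1)

breaker = LoopBreaker(max_repeats=3)
stuck = [breaker.register("fix reverse proxy config") for _ in range(3)]
# stuck == [False, False, True]: trip on the third identical plan
```

Even a crude check like this would turn the endless loop into a visible failure the user can act on.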
Bot hallucinations about the context question; not sure if this matters: https://discord.com/channels/1255926983745998948/1493854452900429824