feat: criteria evals [SDK-349] #5629
base: main
Conversation
```diff
 if not self._hooks_normalised:
     if self.pre_hooks:
-        self.pre_hooks = normalize_hooks(self.pre_hooks)  # type: ignore
+        self.pre_hooks = normalize_hooks(self.pre_hooks, hook_mode="pre")  # type: ignore
```
Let's rather make a `normalize_pre_hooks` and a `normalize_post_hooks`?
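Something like this, for illustration; it assumes the existing `normalize_hooks(hooks, hook_mode=...)` helper stays around as the shared implementation:

```python
# Hypothetical wrappers over the existing normalize_hooks helper.
def normalize_pre_hooks(hooks):
    """Normalize hooks that run before the agent/team executes."""
    return normalize_hooks(hooks, hook_mode="pre")


def normalize_post_hooks(hooks):
    """Normalize hooks that run after the agent/team executes."""
    return normalize_hooks(hooks, hook_mode="post")
```

Call sites then read `self.pre_hooks = normalize_pre_hooks(self.pre_hooks)` without a mode flag.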
```python
class EvalType(str, Enum):
    ACCURACY = "accuracy"
    CRITERIA = "criteria"
```
I know why `criteria` seems like a nice name, but I think we have to be very explicit here and call it AgentAsJudge. I feel like the evals are too obfuscated as it is, and you can't really tell what they do just from the name.
Maybe one to ask the team's opinion on
```python
class CriteriaJudgeResponse(BaseModel):
    """Response schema for the LLM judge."""

    score: int = Field(..., description="Score between 1 and 10 based on the evaluation criteria.")
```
So I read that a PASS/FAIL score is much more effective. Maybe we should have both?
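One hedged option for "both", keeping the Pydantic response model; the `passed` and `reasoning` field names are illustrative, not part of the PR:

```python
from pydantic import BaseModel, Field


class CriteriaJudgeResponse(BaseModel):
    """Judge response carrying both a binary verdict and a numeric score."""

    passed: bool = Field(..., description="PASS/FAIL verdict against the evaluation criteria.")
    score: int = Field(..., description="Score between 1 and 10 based on the evaluation criteria.")
    reasoning: str = Field(..., description="Brief justification for the verdict.")
```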
```python
# Core evaluation fields
criteria: str = ""
threshold: int = 7
on_fail: Optional[Callable[["CriteriaEvaluation"], None]] = None
```
This is cool
```python
# Evaluation metadata
name: Optional[str] = None
eval_id: str = field(default_factory=lambda: str(uuid4()))
num_iterations: int = 1
```
I don't think we need this, which might simplify things a lot. For an agent-as-judge, I don't think it makes sense to run multiple times over the same input.
```python
eval_id: str = field(default_factory=lambda: str(uuid4()))
num_iterations: int = 1
run_id: Optional[str] = None
result: Optional[CriteriaResult] = None
```
Storing a result here makes the evaluation stateful; it should just be returned. Same with `run_id`.
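Roughly this shape; the `_evaluate` helper and the `CriteriaResult` constructor arguments are assumptions for the sketch:

```python
# Sketch: run() returns a fresh result instead of writing self.result / self.run_id.
def run(self, *, input: str, output: str) -> CriteriaResult:
    run_id = str(uuid4())  # generated per call, never stored on the eval object
    evaluation = self._evaluate(input=input, output=output)  # hypothetical internal helper
    return CriteriaResult(run_id=run_id, evaluations=[evaluation])
```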
```python
print_summary: bool = False
print_results: bool = False
```
This is fine for now to keep the same structure, but in the future we could rather have a print() function instead of options like these.
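For example, something in this direction; the method name and the result field names (`eval_id`, `avg_score`, `passed`) are assumed for the sketch:

```python
# Hypothetical print helper instead of print_summary/print_results constructor flags.
def print(self, result: CriteriaResult, summary_only: bool = False) -> None:
    """Render a result on demand rather than as a side effect of run()."""
    if summary_only:
        print(f"eval {result.eval_id}: avg score {result.avg_score}")
    else:
        for e in result.evaluations:
            print(f"score={e.score} passed={e.passed}")
```

Callers would then do `evaluation.print(result, summary_only=True)` only when they actually want output.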
```python
try:
    from agno.models.openai import OpenAIChat

    model = OpenAIChat(id="gpt-4o-mini")
```
Let's start defaulting to `gpt-5-mini`?
```python
return Agent(
    model=model,
    description=f"""\
```
Rather pass a description and then the below as instructions
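A sketch of that split; the description and instruction strings are placeholders, not the PR's actual prompt:

```python
# Sketch: keep the description short and move the judging steps into instructions.
return Agent(
    model=model,
    description="You are an impartial judge that evaluates an agent's output against explicit criteria.",
    instructions=[
        "Read the input, the output, and the evaluation criteria.",
        "Score the output from 1 to 10 against the criteria.",
        "Be objective and thorough in your evaluation.",
    ],
    output_schema=CriteriaJudgeResponse,
)
```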
```python
additional_guidelines = f"\n## Additional Guidelines\n{guidelines_text}\n"

# Format additional context
additional_context = (
```
Not sure this is needed?
```python
Be objective and thorough in your evaluation.
""",
output_schema=CriteriaJudgeResponse,
structured_outputs=True,
```
Not needed
```python
# Trigger on_fail callback if evaluation failed
if not passed and self.on_fail:
    try:
        self.on_fail(evaluation)
```
on_fail could be an async function. We should support both
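A hedged sketch of handling both in the sync path; it assumes no event loop is already running here (the async `arun()` path would just `await` the callback instead):

```python
import asyncio
import inspect
import logging

# Sketch: accept both sync and async on_fail callbacks in the sync run() path.
if not passed and self.on_fail:
    try:
        maybe_awaitable = self.on_fail(evaluation)
        if inspect.isawaitable(maybe_awaitable):
            # Drive the coroutine to completion; assumes no running event loop.
            asyncio.run(maybe_awaitable)
    except Exception as exc:
        # Don't let a bad callback break the evaluation run.
        logging.getLogger(__name__).warning("on_fail callback raised: %s", exc)
```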
```python
)

result: Optional[CriteriaResult] = evaluation.run(
    input="What is the capital of France?",
```
In the examples, let's try to use inputs with no clear "correct answer"; inputs that do have one would ideally be evaluated with the AccuracyEval?
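For instance, an open-ended example along these lines keeps the judgement subjective rather than fact-checkable; the `agent` argument and the criteria text are illustrative, not the PR's API:

```python
# Illustrative open-ended example: "good" is a matter of judgement, not a single fact.
evaluation = CriteriaEvaluation(
    agent=support_agent,  # hypothetical agent being judged
    criteria="The reply is empathetic, concise, and offers a concrete next step.",
)
result: Optional[CriteriaResult] = evaluation.run(
    input="A customer writes: 'I've been charged twice this month and I'm really frustrated.'",
)
```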
```python
from agno.run.team import TeamRunInput, TeamRunOutput


class BaseEval(ABC):
```
Nice. Let's make the other evals extend this too?
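i.e. something in this direction for the existing evals; the run signatures are placeholders, not the current interfaces:

```python
# Sketch: existing evals inherit the shared base instead of duplicating plumbing.
class AccuracyEval(BaseEval):
    def run(self, *, input: str, expected_output: str) -> Optional[AccuracyResult]: ...


class ReliabilityEval(BaseEval):
    def run(self, *, agent_response: RunOutput) -> Optional[ReliabilityResult]: ...
```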
```python
db: Optional[Union[BaseDb, AsyncBaseDb]] = None
telemetry: bool = True

def get_evaluator_agent(self) -> Agent:
```
We should add an evaluator_agent field for users to pass custom agents to use as evaluators. In that case, we simply use theirs and skip our prompts, etc.
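Roughly; the field name and the `_default_evaluator_agent` helper are suggestions, not existing code:

```python
# Sketch: optional user-supplied judge.
evaluator_agent: Optional[Agent] = None

def get_evaluator_agent(self) -> Agent:
    if self.evaluator_agent is not None:
        # Use the caller's agent as-is and skip the built-in prompt construction.
        return self.evaluator_agent
    return self._default_evaluator_agent()  # hypothetical helper wrapping the current behaviour
```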
```python
def pre_check(self, run_input: Union[RunInput, TeamRunInput]) -> None:
    """Perform sync pre-check to validate input before agent runs."""
    input_str = run_input.input_content_string() if run_input else ""
    output_str = "(Input validation - no output yet)"
```
Should we just not pass output here?
In this kind of run we only want to present the input and judge that; we may be better off with a different run method (instead of self.run()) that uses an input-specific prompt.
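e.g. a dedicated method with an input-only prompt rather than reusing self.run() with a placeholder output; the method name, prompt text, and `response.content` handling are assumptions for the sketch:

```python
# Sketch: judge the input on its own, with a prompt that never mentions an output.
def check_input(self, run_input: Union[RunInput, TeamRunInput]) -> CriteriaJudgeResponse:
    input_str = run_input.input_content_string() if run_input else ""
    judge = self.get_evaluator_agent()
    response = judge.run(
        f"Evaluate whether this input meets the criteria:\n\n{input_str}"
    )
    return response.content  # parsed CriteriaJudgeResponse via the agent's output schema
```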
Summary
(If applicable, issue number: #____)
Type of change
Checklist
- Formatting and validation pass (`./scripts/format.sh` and `./scripts/validate.sh`)
Additional Notes
Add any important context (deployment instructions, screenshots, security considerations, etc.)