INF-314 AI prometheus metrics to set correct "pipeline" and "model_name" labels #3699
Conversation
Force-pushed 20729c6 to 358e217
Force-pushed 0e4b4ce to c3a5874
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #3699 +/- ##
===================================================
- Coverage 31.92117% 31.90880% -0.01237%
===================================================
Files 156 156
Lines 47445 47454 +9
===================================================
- Hits 15145 15142 -3
- Misses 31405 31417 +12
Partials 895 895
Continue to review full report in Codecov by Sentry.
Force-pushed c3a5874 to 93f598f
leszko
left a comment
Looks good. My only concern is that we're assuming all the time that the O is running a single pipeline. It doesn't need to be in this PR, but we can think about how to make it work for multiple pipelines. In other words, how to make GetCapacity() work with the pipeline param.
ai/worker/docker.go
Outdated
monitor.AIContainersInUse(capacity.ContainersInUse, "", "")
monitor.AIContainersIdle(capacity.ContainersIdle, "", "")
monitor.AIGPUsIdle(len(m.gpus) - len(m.gpuContainers)) // Indicates a misconfiguration so we should alert on this
monitor.AIContainersInUse(capacity.ContainersInUse, pipeline)
Note that this will only work if a given Orchestrator is running one single pipeline. It won't work if the O is running 2 separate pipelines. I think it may be ok for now, because in our infra we're using only a single pipeline for an orchestrator. But at least add a comment and some // TODO. We should work on supporting multiple pipelines in the metrics, because I believe that soon we'll have more pipelines.
I think that each pipeline+model would have its own DockerManager reporting metrics separately, right?
No, unfortunately, DockerManager is shared. So you can have 1 O running different pipelines.
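Since the DockerManager is shared, one way to make the metrics multi-pipeline-aware is to aggregate container counts per (pipeline, modelID) pair before reporting, so each pair gets its own metric series. A minimal sketch; the `containerInfo` type, its field names, and `capacityByPipeline` are hypothetical illustrations, not the actual ai-worker API:

```go
package main

import "fmt"

// containerInfo mirrors, hypothetically, what a shared DockerManager
// tracks per running container: which pipeline/model it serves and
// whether it is currently handling a request.
type containerInfo struct {
	Pipeline string
	ModelID  string
	InUse    bool
}

type capacity struct {
	InUse int
	Idle  int
}

// capacityByPipeline groups containers by (pipeline, modelID), so one
// shared manager can report a separate series per pipeline+model
// instead of assuming a single pipeline per orchestrator.
func capacityByPipeline(containers []containerInfo) map[[2]string]capacity {
	out := make(map[[2]string]capacity)
	for _, c := range containers {
		key := [2]string{c.Pipeline, c.ModelID}
		agg := out[key]
		if c.InUse {
			agg.InUse++
		} else {
			agg.Idle++
		}
		out[key] = agg
	}
	return out
}

func main() {
	containers := []containerInfo{
		{"live-video-to-video", "noop", true},
		{"live-video-to-video", "noop", false},
		{"text-to-image", "sd-turbo", true},
	}
	// In real code each entry would feed monitor.AIContainersInUse /
	// monitor.AIContainersIdle with pipeline and model_name labels.
	for key, agg := range capacityByPipeline(containers) {
		fmt.Printf("pipeline=%s model=%s in_use=%d idle=%d\n",
			key[0], key[1], agg.InUse, agg.Idle)
	}
}
```

With this shape, GetCapacity() could also take the pipeline param and return just the matching map entry.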
Force-pushed 93f598f to feb6065
server/ai_mediaserver.go
Outdated
})

clog.V(common.VERBOSE).Infof(ctx, "AI Live video attempt")
monitor.AILiveVideoAttempt(pipeline)
This pipeline is actually our modelID 🤦🏻
Force-pushed 033e559 to 8786b0e
Force-pushed 81d9d25 to 1b3534c
Force-pushed 1b3534c to 3e2dc61
monitor.AIContainersIdle(capacity.ContainersIdle, "", "")
monitor.AIGPUsIdle(len(m.gpus) - len(m.gpuContainers)) // Indicates a misconfiguration so we should alert on this
monitor.AIContainersInUse(capacity.ContainersInUse, pipeline, modelID)
monitor.AIContainersIdle(capacity.ContainersIdle, pipeline, modelID, "")
So here (worker) ai_container_idle is produced with pipeline and model_name, and here (orch.) with just model and orchestratorUri. Should we have separate metrics from O and W?
Now I think it would be better to have a separate metric from the Gateway and a separate one from the Orchestrator, because it's confusing right now.
Anyway, I'm OK if it's done later, as a separate PR.
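One way to realise the split discussed above is to publish distinct metric families from the worker and the orchestrator, so each keeps a single consistent label set. A hedged sketch rendering Prometheus exposition-format lines; the `worker_`/`orch_` prefixes, label values, and `gaugeLine` helper are hypothetical naming, not what this PR ships:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// gaugeLine renders one Prometheus exposition-format sample. Prefixing
// the metric name by source keeps the label set uniform within each
// family, instead of mixing the worker's pipeline/model_name labels
// with the orchestrator's orchestratorUri label on one metric.
func gaugeLine(source, name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s_%s{%s} %g", source, name, strings.Join(pairs, ","), value)
}

func main() {
	// Worker-side series: pipeline + model_name.
	fmt.Println(gaugeLine("worker", "ai_container_idle",
		map[string]string{"pipeline": "live-video-to-video", "model_name": "noop"}, 2))
	// Orchestrator-side series: model_name + orchestratorUri.
	fmt.Println(gaugeLine("orch", "ai_container_idle",
		map[string]string{"model_name": "noop", "orchestratorUri": "https://orch:8935"}, 2))
}
```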
What does this pull request do? Explain your changes. (required)
Normalising AI metrics to have correct pipeline, model_name and orchestratorUri tags.
Specific updates (required)
- ai_container_in_use to set correct model_name label (orchestratorUri got removed)
- ai_container_idle to set correct model_name label
- ai_gpus_idle to set model_name label (orchestratorUri got removed)
- ai_current_live_pipelines to set model_name label (orchestratorUri got removed)
- ai_live_attempt to introduce model_name label
How did you test each of these updates (required)
...
Does this pull request close any open issues?
no
Checklist:
- make runs successfully
- ./test.sh pass