Open
Labels
infra (issues related to infrastructure), initiative (large piece of work covering multiple sprints), performance (work related to performance improvements)
Description
Describe the task. It can be a feature, a set of experiments, documentation, etc.
We currently print some telemetry in the logs: the durations of various training tasks (reading data, running the encoder, the decoder, etc.). We should collect these in MLflow so we can track performance over time. At the moment it is unclear to me where the bottlenecks are.
The goal of this issue:
- propose a few key durations to investigate
- propose the mlflow schema to store them
- implement it in the training pipeline
Guidelines:
- just a few high-level metrics. Going deeper can be done with NVIDIA Nsight + the PyTorch profiler + flamegraphs, etc.
- we already log the config, including all the information such as the number of channels, so there is no need to log that again
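As a starting point for discussion, here is a minimal sketch of what the collection side could look like. Everything in it is an assumption rather than existing pipeline code: the `StageTimer` class, the `perf/<stage>_seconds` key naming, and the stage names are all hypothetical. The logging function is injected, so the timer itself does not depend on MLflow.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Callable, Dict


class StageTimer:
    """Accumulate wall-clock seconds per named stage and flush them as metrics.

    Hypothetical helper: `log_fn` receives (metrics_dict, step).
    """

    def __init__(
        self,
        log_fn: Callable[[Dict[str, float], int], None],
        prefix: str = "perf",
    ) -> None:
        self._log_fn = log_fn
        self._prefix = prefix
        self._totals: Dict[str, float] = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        # Caveat: GPU kernels are launched asynchronously, so timing a GPU
        # stage this way only measures launch time unless the caller
        # synchronizes the device before the block exits.
        start = time.perf_counter()
        try:
            yield
        finally:
            # Repeated entries of the same stage add up until the next flush.
            self._totals[name] += time.perf_counter() - start

    def flush(self, step: int) -> Dict[str, float]:
        """Send the accumulated durations to `log_fn` and reset the totals."""
        metrics = {
            f"{self._prefix}/{name}_seconds": total
            for name, total in self._totals.items()
        }
        if metrics:
            self._log_fn(metrics, step)
        self._totals.clear()
        return metrics
```

In the training loop, one option would be to pass `mlflow.log_metrics` as `log_fn` (it accepts a dict of metrics and a step), wrap each stage in `with timer.stage("data_loading"):` and similar, and call `timer.flush(step)` every N steps. This keeps the number of logged values small, in line with the guideline of tracking only a few high-level metrics.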
Hedgedoc URL, if you are keeping notes, plots, logs in hedgedoc.
No response
URL to the design document
No response
Area
- datasets, data readers, data preparation and transfer
- model
- science
- infrastructure and engineering
- evaluation, export and visualization
- documentation