Skip to content

Track high level performance of training in mlflow #1930

@tjhunter

Description

@tjhunter

Describe the task. Describe the task. It can be a feature, a set of experiments, documentation, etc.

We print in the logs some telemetry bits with the duration of various training tasks (reading data, running the encoder, decoder, etc.). We should collect them in mlflow to track performance over time. The current bottlenecks are unclear to me.

The goal of this issue:

  • propose a few key durations investigate
  • propose the mlflow schema to store them
  • implement it in the training pipeline

Guidelines:

  • just a few high level metrics. Going deeper can be done with NV nsigts + pytorch profiler + flamegraphs etc.
  • we already log the config, including all the information with number of channels etc. no need log that

Hedgedoc URL, if you are keeping notes, plots, logs in hedgedoc.

No response

URL to the design document

No response

Area

  • datasets, data readers, data preparation and transfer
  • model
  • science
  • infrastructure and engineering
  • evaluation, export and visualization
  • documentation

Metadata

Metadata

Assignees

No one assigned

    Labels

    infraIssues related to infrastructureinitiativeLarge piece of work covering multiple sprintperformanceWork related to performance improvements

    Type

    No type

    Projects

    Status

    No status

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions