[cm4mlops] development plan #1023

@gfursin

Description

Based on feedback from the MLCommons TF on automation and reproducibility, we plan to extend CM workflows to support the following MLC projects:

  • check how to add network and multi-node code to MLPerf inference and CM automation (collaboration with MLC Network TF)

    • extend MLPerf inference with Flask code that glues our reference client/server code (Python first, later C++) together with CM wrapping (see the sketch after this list)
    • address suggestions from Nvidia
      • --network-server=IP1,IP2...
      • --network-client
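
A minimal sketch of what the Flask glue could look like, assuming a hypothetical `/predict` endpoint, a placeholder `run_inference()` backend and a JSON payload format; the real CM network code may structure all of this differently:

```python
# Sketch of the proposed Flask glue (endpoint name and payload format
# are assumptions, not a final CM network API).
import itertools

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_inference(samples):
    # Placeholder for the reference/vendor backend call.
    return samples

# --- server side: started on each node passed via --network-server ---
@app.route("/predict", methods=["POST"])
def predict():
    samples = request.get_json()["samples"]
    return jsonify({"results": run_inference(samples)})

# --- client side (--network-client): loadgen issues queries locally
# and forwards them round-robin to the configured servers ---
servers = itertools.cycle(["http://IP1:8000", "http://IP2:8000"])

def issue_query(samples):
    url = next(servers) + "/predict"
    return requests.post(url, json={"samples": samples}, timeout=60).json()

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```
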
  • continue improving unified CM interface to run MLPerf inference implementations from different vendors

    • Optimized MLPerf inference implementations
      • Intel submissions (see Intel docs)
        • Support installation of conda packages in CM
      • Qualcomm submission
        • Add CM scripts to preprocess, calibrate and compile QAIC models for ResNet50, RetinaNet and BERT
        • Test in AWS
        • Test on Thundercomm RB6
          • Automatic model installation from a host device
        • Automatic detection and usage of quantization parameters
      • Nvidia submission
      • Google submission
      • NeuralMagic submission
    • Add the possibility to run any MLPerf implementation, including the reference one (see the sketch below)
    • Add the possibility to change the target device (e.g., a GeForce GPU instead of an A100)
    • Expose batch sizes from all existing MLPerf inference reference implementations (when applicable) in the edge category in a unified way for ONNX, PyTorch and TF via the CM interface; report implementations with hard-wired batch sizes
    • Request from Miro: improve MLPerf inference docs for various backends
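
For illustration, running any implementation on any device through the unified interface could look like this via the CM Python API (`cmind.access` is the existing entry point; the exact set of flag names follows the current CM-MLPerf docs and should be treated as an assumption here):

```python
import cmind

r = cmind.access({
    "action": "run",
    "automation": "script",
    "tags": "run-mlperf,inference,_find-performance",
    "model": "resnet50",
    "implementation": "reference",  # or "nvidia", "intel", "qualcomm", ...
    "device": "cuda",               # e.g. switch from an A100 box to a GeForce one
    "backend": "onnxruntime",       # or "pytorch", "tf"
    # a unified batch-size input would be exposed here too (see item above)
    "quiet": True,
})
if r["return"] > 0:
    raise RuntimeError(r["error"])
```
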
  • Develop a universal CM-MLPerf Docker container to run any implementation with a local dataset and model (similar to the Nvidia and Intel containers, but with a unified CM interface)

  • Prototype a new universal CM workflow to run any app on any target (via C++/Android/SSH), as sketched below
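
A minimal sketch of the SSH path of such a workflow, using paramiko; the host name and user are placeholders that CM would supply:

```python
import paramiko

# Placeholder target; in a real workflow CM would supply host/credentials.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("target-host", username="user")

# Probe the target before deploying/running the app.
stdin, stdout, stderr = client.exec_command("uname -a && nproc")
print(stdout.read().decode())
client.close()
```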

  • Add support for testing any ONNX model with loadgen, including tuning (already prototyped); see the sketch below
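
A rough sketch of what such a generic ONNX+loadgen harness could look like, using the `mlperf_loadgen` Python bindings; the model path, input shape and sample count are placeholders that a real harness would detect:

```python
import numpy as np
import onnxruntime as ort
import mlperf_loadgen as lg

# Placeholder model and input shape; a generic harness would detect these.
sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name
samples = np.random.rand(64, 3, 224, 224).astype(np.float32)

def issue_queries(query_samples):
    responses = []
    for qs in query_samples:
        sess.run(None, {input_name: samples[qs.index:qs.index + 1]})
        responses.append(lg.QuerySampleResponse(qs.id, 0, 0))
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(len(samples), len(samples),
                      lambda idx: None, lambda idx: None)
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```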

  • Improve CM docs (basic CM message and tutorials/notes for "users" and "developers")

  • Update and improve the list of all reusable, portable and tech-agnostic CM-MLOps scripts

  • Improve CM logging (stdout and stderr)
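
One possible direction, shown only as a sketch and not the current CM behaviour: route the stdout/stderr of wrapped scripts through Python's `logging` module so both streams get consistent levels and formatting:

```python
import logging
import subprocess

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(levelname)s %(message)s")
log = logging.getLogger("cm.script")

# Run a wrapped script and forward its output with proper levels.
proc = subprocess.run(["echo", "hello from a CM script"],
                      capture_output=True, text=True)
for line in proc.stdout.splitlines():
    log.info(line)
for line in proc.stderr.splitlines():
    log.error(line)
```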

  • Visualize CM script dependencies
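
For example, a dependency graph could be bootstrapped by walking each script's `_cm.yaml` meta and emitting Graphviz DOT (the repo layout is assumed; some scripts keep their meta in `_cm.json` instead):

```python
from pathlib import Path

import yaml

print("digraph cm_deps {")
for meta in Path("script").glob("*/_cm.yaml"):
    name = meta.parent.name
    deps = yaml.safe_load(meta.read_text()).get("deps", [])
    for dep in deps:
        # Each dep is resolved by tags rather than a fixed script name.
        print(f'  "{name}" -> "{dep.get("tags", "?")}";')
print("}")
```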

  • Check other suggestions from the SCC'23 student teams

  • Start adding FAQ/notes from Discord/GitHub discussions about CM-MLPerf

  • prototype/reuse the above universal CM workflow with ABTF for:

    • inference
      • support different targets (host, remote embedded, Android)
      • get all info about target
      • add Python and C++ code for loadgen with different backends (PyTorch, ONNX, TF, TFLite, QAIC)
      • add object detection with COCO and the trained model from Rod (without accuracy checks for now)
      • connect with training CM workflow
    • training (https://github.com/mlcommons/abtf-ssd-pytorch)
      • present CM-MLPerf at Croissant TF and discuss possible collaboration (doc)
      • add a CM script to get Croissant
      • add datasets via Croissant (see the sketch after this list)
      • train and save model in CM cache to be loaded to inference
      • test with Rod
    • present prototype progress in next ABTF meeting (Grigori)
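
A sketch of the Croissant step mentioned above, using the mlcroissant package; the dataset URL and record-set name are placeholders:

```python
import mlcroissant as mlc

# Placeholder URL and record-set name.
ds = mlc.Dataset(jsonld="https://example.org/dataset/croissant.json")
for i, record in enumerate(ds.records(record_set="default")):
    print(record)
    if i >= 2:  # peek at a few records only
        break
```
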
  • unify experiments and visualization

    • prepare high-level meta to run the whole experiment
    • aggregate and visualize results
    • if an MLPerf run is very short, we need to calibrate it, for example by multiplying N by 10, similar to what I did in CK (see the sketch below)
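
A sketch of that calibration idea: keep multiplying the query count by 10 until the measured run exceeds a minimum duration (the threshold and the `run_once`/`calibrated_run` names are illustrative):

```python
import time

MIN_SECONDS = 60.0  # illustrative threshold

def calibrated_run(run_once, n=1):
    """Repeat the benchmark with 10x more queries until it runs long enough."""
    while True:
        start = time.time()
        run_once(n)
        if time.time() - start >= MIN_SECONDS:
            return n
        n *= 10  # too short: multiply N by 10 and retry
```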
