Improvement: `NotFoundError`: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for `best_dev_checkpoint` #2338

I am trying to optimize my LM, but `lm_optimizer.py` throws a `NotFoundError`. The log first warns that the environment has CuDNN disabled:
> Checkpoint loading failed due to missing tensors, retrying with --load_cudnn true - You should specify this flag whenever loading a checkpoint that was created with --train_cudnn true in an environment that has CuDNN disabled.

*I want to use my GPU --'*

> FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.

*Related?*
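Probably not: this is a `FutureWarning`, and the trial in the log below fails on the `NotFoundError`, not on the Optuna call. Silencing it should only require swapping `suggest_uniform` for `suggest_float`, as the warning itself recommends. A minimal sketch, assuming the two calls in `lm_optimize.py` can be changed in place; `suggest_lm_params` and `MidpointTrial` are hypothetical names used only for illustration, not part of Coqui STT or Optuna:

```python
def suggest_lm_params(trial, lm_alpha_max, lm_beta_max):
    """Sample lm_alpha and lm_beta using suggest_float instead of suggest_uniform."""
    # Same name and bounds as the deprecated suggest_uniform calls in the log.
    lm_alpha = trial.suggest_float("lm_alpha", 0, lm_alpha_max)
    lm_beta = trial.suggest_float("lm_beta", 0, lm_beta_max)
    return lm_alpha, lm_beta


class MidpointTrial:
    """Hypothetical stand-in for an Optuna trial so the sketch runs without Optuna."""

    def suggest_float(self, name, low, high):
        # Always return the midpoint of the range; a real trial samples from it.
        return (low + high) / 2


if __name__ == "__main__":
    # Bounds taken from --lm_alpha_max 2 and --lm_beta_max 4 in the command.
    print(suggest_lm_params(MidpointTrial(), 2, 4))
```

With a real study, the `trial` object passed to the objective function would take the place of `MidpointTrial`.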

> NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/checkpoints/best_dev-221133

*I have a bad feeling about this one.*
```
+ python -u /home/trainer/lm_optimizer.py --show_progressbar true --train_cudnn true --alphabet_config_path /mnt/models/fr/alphabet.txt --scorer_path /mnt/lm/fr/kenlm.scorer --feature_cache /mnt/sources/fr/feature_cache --test_files /mnt/extracted/fr/data/Assistant/train_test.csv --test_batch_size 64 --n_hidden 2048 --lm_alpha_max 2 --lm_beta_max 4 --n_trials 50 --checkpoint_dir /transfer-checkpoint
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
[I 2023-01-22 23:18:04,503] A new study created in memory with name: no-name-0f421b63-297c-468c-b30d-8aa59857a843
/home/trainer/stt/training/coqui_stt_training/util/lm_optimize.py:30: FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.
  Config.lm_alpha = trial.suggest_uniform("lm_alpha", 0, Config.lm_alpha_max)
/home/trainer/stt/training/coqui_stt_training/util/lm_optimize.py:31: FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.
  Config.lm_beta = trial.suggest_uniform("lm_beta", 0, Config.lm_beta_max)
I Loading best validating checkpoint from /mnt/checkpoints/best_dev-221133
W Checkpoint loading failed due to missing tensors, retrying with --load_cudnn true - You should specify this flag whenever loading a checkpoint that was created with --train_cudnn true in an environment that has CuDNN disabled.
[W 2023-01-22 23:18:05,201] Trial 0 failed with parameters: {'lm_alpha': 0.26985826312830485, 'lm_beta': 1.3371065634850314} because of the following error: NotFoundError().
Traceback (most recent call last):
  File "/home/trainer/stt/training/coqui_stt_training/util/checkpoints.py", line 121, in _load_checkpoint
    return _load_checkpoint_impl(
  File "/home/trainer/stt/training/coqui_stt_training/util/checkpoints.py", line 21, in _load_checkpoint_impl
    ckpt = tfv1.train.load_checkpoint(checkpoint_path)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 66, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 873, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 885, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/checkpoints/best_dev-221133
```
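One thing stands out in the log: the command passes `--checkpoint_dir /transfer-checkpoint`, yet the loader looks for `/mnt/checkpoints/best_dev-221133`. A possible explanation (an assumption, not confirmed here) is that the `best_dev_checkpoint` index file in the checkpoint directory still records the path where the checkpoint was originally saved. The sketch below checks this without TensorFlow. It assumes the index file uses TensorFlow's usual text format with a `model_checkpoint_path: "..."` line; `read_checkpoint_prefix` and `diagnose` are hypothetical helper names.

```python
import glob
import os
import re


def read_checkpoint_prefix(checkpoint_dir, index_name="best_dev_checkpoint"):
    """Return the model_checkpoint_path recorded in the index file, or None."""
    index_path = os.path.join(checkpoint_dir, index_name)
    if not os.path.isfile(index_path):
        return None
    with open(index_path, encoding="utf-8") as handle:
        text = handle.read()
    # Anchored at line start so "all_model_checkpoint_paths" is not matched.
    match = re.search(r'^model_checkpoint_path:\s*"([^"]*)"', text, re.MULTILINE)
    return match.group(1) if match else None


def diagnose(checkpoint_dir, index_name="best_dev_checkpoint"):
    """Report whether the recorded prefix, or a relocated copy, has files on disk."""
    prefix = read_checkpoint_prefix(checkpoint_dir, index_name)
    if prefix is None:
        return {"prefix": None, "files": [], "relocated_files": []}
    # Assumption: relative prefixes are resolved against the checkpoint directory.
    resolved = prefix if os.path.isabs(prefix) else os.path.join(checkpoint_dir, prefix)
    files = sorted(glob.glob(glob.escape(resolved) + ".*"))
    # Same base name, but looked up next to the index file itself.
    local = os.path.join(checkpoint_dir, os.path.basename(prefix))
    relocated = sorted(glob.glob(glob.escape(local) + ".*"))
    return {"prefix": prefix, "files": files, "relocated_files": relocated}


if __name__ == "__main__":
    # Directory taken from --checkpoint_dir in the command above.
    print(diagnose("/transfer-checkpoint"))
```

If `files` is empty while `relocated_files` is not, the checkpoint data was likely moved or mounted somewhere other than where the index file says, which would produce exactly this `NotFoundError`.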

**To Reproduce**
Steps to reproduce the behavior:
Run the command listed under **Exact command to reproduce** below. [Full logs](https://gist.github.com/wasertech/5e9b453995406f7136f23693f706e1bc)

**Expected behavior**
A study should start on the GPU and run for 50 trials.

**Environment (please complete the following information):** Docker
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Docker
- **TensorFlow installed from (our builds, or upstream TensorFlow)**: 22.02-tf1
- **TensorFlow version (use command below)**: 22.02-tf1
- **Python version**: 3.8
- **Bazel version (if compiling from source)**: 5.0
- **GCC/Compiler version (if compiling from source)**: 10
- **CUDA/cuDNN version**: 11.6.0.021
- **GPU model and memory**: RTX 3060 12 GB
- **Exact command to reproduce**: `python -u /home/trainer/lm_optimizer.py --show_progressbar true --train_cudnn true --alphabet_config_path /mnt/models/fr/alphabet.txt --scorer_path /mnt/lm/fr/kenlm.scorer --feature_cache /mnt/sources/fr/feature_cache --test_files /mnt/extracted/fr/data/Assistant/train_test.csv --test_batch_size 64 --n_hidden 2048 --lm_alpha_max 2 --lm_beta_max 4 --n_trials 50 --checkpoint_dir /transfer-checkpoint`

**Additional context**
Built using the [Training Wizard for STT](https://gitlab.com/waser-technologies/models/fr/stt)

