-
Notifications
You must be signed in to change notification settings - Fork 299
Description
Trying to optimize my LM but lm_optimizer.py throws NotFoundError as environment has CuDNN disabled.
Checkpoint loading failed due to missing tensors, retrying with --load_cudnn true - You should specify this flag whenever loading a checkpoint that was created with --train_cudnn true in an environment that has CuDNN disabled.
I want to use my GPU --'
FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:
~optuna.trial.Trial.suggest_floatinstead.
Related?
NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/checkpoints/best_dev-221133
I have a bad feeling about this one.
+ python -u /home/trainer/lm_optimizer.py --show_progressbar true --train_cudnn true --alphabet_config_path /mnt/models/fr/alphabet.txt --scorer_path /mnt/lm/fr/kenlm.scorer --feature_cache /mnt/sources/fr/feature_cache --test_files /mnt/extracted/fr/data/Assistant/train_test.csv --test_batch_size 64 --n_hidden 2048 --lm_alpha_max 2 --lm_beta_max 4 --n_trials 50 --checkpoint_dir /transfer-checkpoint
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
[I 2023-01-22 23:18:04,503] A new study created in memory with name: no-name-0f421b63-297c-468c-b30d-8aa59857a843
/home/trainer/stt/training/coqui_stt_training/util/lm_optimize.py:30: FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.
Config.lm_alpha = trial.suggest_uniform("lm_alpha", 0, Config.lm_alpha_max)
/home/trainer/stt/training/coqui_stt_training/util/lm_optimize.py:31: FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.
Config.lm_beta = trial.suggest_uniform("lm_beta", 0, Config.lm_beta_max)
I Loading best validating checkpoint from /mnt/checkpoints/best_dev-221133
W Checkpoint loading failed due to missing tensors, retrying with --load_cudnn true - You should specify this flag whenever loading a checkpoint that was created with --train_cudnn true in an environment that has CuDNN disabled.
[W 2023-01-22 23:18:05,201] Trial 0 failed with parameters: {'lm_alpha': 0.26985826312830485, 'lm_beta': 1.3371065634850314} because of the following error: NotFoundError().
Traceback (most recent call last):
File "/home/trainer/stt/training/coqui_stt_training/util/checkpoints.py", line 121, in _load_checkpoint
return _load_checkpoint_impl(
File "/home/trainer/stt/training/coqui_stt_training/util/checkpoints.py", line 21, in _load_checkpoint_impl
ckpt = tfv1.train.load_checkpoint(checkpoint_path)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 66, in load_checkpoint
return pywrap_tensorflow.NewCheckpointReader(filename)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 873, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern))
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 885, in __init__
this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/checkpoints/best_dev-221133
To Reproduce
Steps to reproduce the behavior:
Full logs
Expected behavior
A study should start on the GPU for 50 trails.
Environment (please complete the following information): Docker
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Docker
- TensorFlow installed from (our builds, or upstream TensorFlow): 22.02-tf1
- TensorFlow version (use command below): 22.02-tf1
- Python version: 3.8
- Bazel version (if compiling from source): 5.0
- GCC/Compiler version (if compiling from source):10
- CUDA/cuDNN version:11.6.0.021
- GPU model and memory:RTX 3060 12Gb
- Exact command to reproduce:
python -u /home/trainer/lm_optimizer.py --show_progressbar true --train_cudnn true --alphabet_config_path /mnt/models/fr/alphabet.txt --scorer_path /mnt/lm/fr/kenlm.scorer --feature_cache /mnt/sources/fr/feature_cache --test_files /mnt/extracted/fr/data/Assistant/train_test.csv --test_batch_size 64 --n_hidden 2048 --lm_alpha_max 2 --lm_beta_max 4 --n_trials 50 --checkpoint_dir /transfer-checkpoint
Additional context
Built using the Training Wizard for STT