Error with detecting cached files when running without Internet connection (related to #10067) #10901
Description
Environment info
- transformers version: 4.5.0.dev0
- Platform: Linux-3.10.0-957.5.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
- Python version: 3.7.10
- PyTorch version (GPU?): 1.8.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
@LysandreJik (related to #10235 and #10067)
Information
I'm trying to run
from transformers import BertTokenizer
BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")
from an environment without Internet access. It crashes even though I have all files downloaded and cached. The uncaught exception:
transformers/src/transformers/file_utils.py
Lines 1347 to 1350 in 5f1491d
    raise ValueError(
        "Connection error, and we cannot find the requested files in the cached path."
        " Please try again or make sure your Internet connection is on."
    )
When file_id == 'added_tokens_file', file_path equals https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/added_tokens.json, which does not exist. The files are resolved in this loop:
    for file_id, file_path in vocab_files.items():
As a result, the line
transformers/src/transformers/file_utils.py
Line 1294 in 1a3e0c4
    r = requests.head(url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout)
raises ConnectTimeout, which is caught in transformers/src/transformers/file_utils.py
Line 1313 in 1a3e0c4
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
and then ignored until another exception is raised in
    raise error
which is not caught anywhere.
When fetching the same file with the Internet on, the code behaves differently: the line
transformers/src/transformers/file_utils.py
Line 1295 in 1a3e0c4
    r.raise_for_status()
raises requests.exceptions.HTTPError, which is caught and processed here: transformers/src/transformers/tokenization_utils_base.py
Lines 1674 to 1677 in 1a3e0c4
    except requests.exceptions.HTTPError as err:
        if "404 Client Error" in str(err):
            logger.debug(err)
            resolved_vocab_files[file_id] = None
The rest of the code works just fine after
resolved_vocab_files[file_id] = None
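The asymmetry between the two paths can be reproduced in a small, self-contained simulation. This is only a sketch: FakeHTTPError, fake_cached_path, resolve_vocab_files and the example.invalid URLs used with it are hypothetical stand-ins invented for illustration, not the library's code. It assumes, as described above, that a missing optional file shows up as a 404 when online and as the connection-related ValueError when offline.

```python
class FakeHTTPError(Exception):
    """Hypothetical stand-in for requests.exceptions.HTTPError."""


CONNECTION_ERROR = "Connection error, and we cannot find the requested files in the cached path."


def fake_cached_path(url, cache, remote_files, online):
    """Hypothetical stand-in for the download helper; not the library's code."""
    if online:
        if url not in remote_files:
            # Online, a missing file surfaces as an HTTP 404.
            raise FakeHTTPError(f"404 Client Error: Not Found for url: {url}")
        cache[url] = "/cache/" + url.rsplit("/", 1)[-1]
        return cache[url]
    # Offline, the HEAD request times out, so only the cache can answer.
    if url in cache:
        return cache[url]
    raise ValueError(CONNECTION_ERROR)


def resolve_vocab_files(vocab_files, cache, remote_files, online):
    """Mimic the resolution loop: only the 404 case maps a file to None."""
    resolved = {}
    for file_id, file_path in vocab_files.items():
        try:
            resolved[file_id] = fake_cached_path(file_path, cache, remote_files, online)
        except FakeHTTPError as err:
            if "404 Client Error" not in str(err):
                raise
            resolved[file_id] = None  # optional file that does not exist remotely
    return resolved
```

Running resolve_vocab_files once with online=True fills the cache and maps added_tokens_file to None. Repeating the call with online=False and the same cache raises ValueError on that file, because a file that never existed remotely was never cached.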
Using BertTokenizer.from_pretrained(bert_version, local_files_only=True) works just fine because of this condition:
transformers/src/transformers/tokenization_utils_base.py
Lines 1668 to 1672 in 1a3e0c4
    except FileNotFoundError as error:
        if local_files_only:
            unresolved_files.append(file_id)
        else:
            raise error
The current workaround is to use BertTokenizer.from_pretrained(bert_version, local_files_only=True), but this does not allow using the same code with and without Internet access.
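Until this is fixed, one possible stopgap is a thin wrapper that tries the normal call first and retries with local_files_only=True only when the connection error quoted above is raised. This is a minimal sketch, not part of transformers: load_with_offline_fallback is a hypothetical helper name, and matching on the "Connection error" message text is an assumption based on the ValueError shown in this report.

```python
from typing import Any, Callable


def load_with_offline_fallback(loader: Callable[..., Any], name: str, **kwargs: Any) -> Any:
    """Call loader(name, **kwargs); on the connection ValueError, retry from cache only."""
    try:
        return loader(name, **kwargs)
    except ValueError as err:
        # Retry only for the connection error quoted in this report; re-raise anything else.
        if "Connection error" not in str(err):
            raise
        # Build a new kwargs dict so an explicit local_files_only from the caller is overridden.
        retry_kwargs = {**kwargs, "local_files_only": True}
        return loader(name, **retry_kwargs)


# Intended usage (assumes transformers is installed and the files are already cached):
# from transformers import BertTokenizer
# tokenizer = load_with_offline_fallback(
#     BertTokenizer.from_pretrained, "bert-large-uncased-whole-word-masking"
# )
```

With Internet access the first call succeeds and the fallback never runs. Offline, the first attempt still has to wait for the request to time out before the cached files are used.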
To reproduce
Steps to reproduce the behavior:
Run
from transformers import BertTokenizer
BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")
from an environment without Internet access but with all the required cache files pre-downloaded.
Expected behavior
It should work exactly like
from transformers import BertTokenizer
BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking", local_files_only=True)