Error with detecting cached files when running without Internet connection (related to #10067) #10901

@aosokin

Environment info

  • transformers version: 4.5.0.dev0
  • Platform: Linux-3.10.0-957.5.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@LysandreJik (related to #10235 and #10067)

Information

I'm trying to run

from transformers import BertTokenizer
BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")

from an environment without Internet access. It crashes even though I have all files downloaded and cached. The uncaught exception:

raise ValueError(
    "Connection error, and we cannot find the requested files in the cached path."
    " Please try again or make sure your Internet connection is on."
)

When file_id == 'added_tokens_file', file_path equals https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/added_tokens.json, which does not exist. This happens inside the loop

for file_id, file_path in vocab_files.items():

This results in the line

r = requests.head(url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout)

throwing ConnectTimeout, which is caught in

except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):

and then ignored until another exception, the ValueError shown above, is raised. That exception is not caught anywhere.
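
To illustrate why the offline case and the online case take different paths, here is a minimal sketch that raises the two exception types named in this issue and reports which of the quoted except clauses would catch each. The classify helper and its return strings are made up for illustration; only the requests exception classes are real.

```python
import requests


def classify(exc):
    """Report which of the two except clauses quoted in this issue catches exc.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    try:
        raise exc
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        # ConnectTimeout derives from both ConnectionError and Timeout.
        return "connection-or-timeout handler"
    except requests.exceptions.HTTPError:
        # This is the branch that raise_for_status() leads to on a 404.
        return "HTTPError handler"


offline = classify(requests.exceptions.ConnectTimeout("no network"))
online_404 = classify(requests.exceptions.HTTPError("404 Client Error: Not Found"))
print(offline)
print(online_404)
```

Without a network, the request never gets far enough to learn that added_tokens.json is missing on the server, so the 404 branch cannot run.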

When fetching the same file with an Internet connection, the code behaves differently: the line

r.raise_for_status()
throws requests.exceptions.HTTPError, which is caught and processed here
except requests.exceptions.HTTPError as err:
    if "404 Client Error" in str(err):
        logger.debug(err)
        resolved_vocab_files[file_id] = None

The rest of the code works just fine after resolved_vocab_files[file_id] = None
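
The asymmetry described above can be reproduced with a small self-contained model. This is a simplified sketch, not the actual transformers code: NotFound, fetch, resolve, and the "cache/" paths are hypothetical stand-ins, and the sets remote and cache only model which files exist on the server and on disk.

```python
class NotFound(Exception):
    """Stand-in for the HTTP 404 raised when the server is reachable."""


def fetch(name, remote, cache, online):
    """Toy model of resolving one file; returns a fake cached path."""
    if online:
        if name not in remote:
            raise NotFound(name)  # the server answers: the file does not exist
        return f"cache/{name}"
    if name in cache:
        return f"cache/{name}"
    # Offline and not cached: nothing can tell "optional and absent" from "missing".
    raise ValueError(
        "Connection error, and we cannot find the requested files in the cached path."
    )


def resolve(vocab_files, remote, cache, online):
    resolved = {}
    for file_id, name in vocab_files.items():
        try:
            resolved[file_id] = fetch(name, remote, cache, online)
        except NotFound:
            resolved[file_id] = None  # optional file is absent: fine
    return resolved


vocab_files = {"vocab_file": "vocab.txt", "added_tokens_file": "added_tokens.json"}
remote = {"vocab.txt"}  # added_tokens.json does not exist for this model
cache = {"vocab.txt"}   # everything that does exist is already cached

online_result = resolve(vocab_files, remote, cache, online=True)

try:
    resolve(vocab_files, remote, cache, online=False)
    offline_error = None
except ValueError as err:
    offline_error = str(err)

print(online_result)
print(offline_error)
```

In this model the online run maps the optional file to None, while the offline run fails on the same file even though every file that actually exists is cached.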

Using BertTokenizer.from_pretrained(bert_version, local_files_only=True) works just fine because of this condition:

except FileNotFoundError as error:
    if local_files_only:
        unresolved_files.append(file_id)
    else:
        raise error

The current workaround is to use BertTokenizer.from_pretrained(bert_version, local_files_only=True), but this does not allow using the same code both with and without Internet access.
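
Until this is fixed, one way to keep a single code path might be a small wrapper that tries the normal call first and retries with local_files_only=True when the ValueError above is raised. This is only a sketch: load_with_offline_fallback and fake_loader are hypothetical names, and catching ValueError this broadly could also hide unrelated errors.

```python
def load_with_offline_fallback(loader, *args, **kwargs):
    """Call loader normally; on ValueError, retry with local_files_only=True.

    Hypothetical helper, not part of transformers. loader is expected to be
    something like BertTokenizer.from_pretrained.
    """
    try:
        return loader(*args, **kwargs)
    except ValueError:
        # Merge instead of passing a second keyword, in case the caller
        # already supplied local_files_only.
        retry_kwargs = {**kwargs, "local_files_only": True}
        return loader(*args, **retry_kwargs)


# Demonstration with a fake loader that fails unless told to stay local.
calls = []


def fake_loader(name, local_files_only=False):
    calls.append(local_files_only)
    if not local_files_only:
        raise ValueError("Connection error, and we cannot find the requested files")
    return f"loaded {name} from cache"


result = load_with_offline_fallback(fake_loader, "bert-large-uncased-whole-word-masking")
print(result)

# Intended real usage (not executed here):
# from transformers import BertTokenizer
# tok = load_with_offline_fallback(
#     BertTokenizer.from_pretrained, "bert-large-uncased-whole-word-masking"
# )
```

When offline, the first attempt still has to time out before the retry happens, so this trades some startup time for not having to branch on connectivity.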

To reproduce

Steps to reproduce the behavior:

Run

from transformers import BertTokenizer
BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")

from an environment without Internet access but with all the required cache files pre-downloaded.

Expected behavior

It should work exactly as

from transformers import BertTokenizer
BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking", local_files_only=True)
