web_questions test.jsonl is not Xet in HGF #211

@cmacdonald

Context: terrierteam/pyterrier_rag#47

The test.jsonl file in
https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main/web_questions
cannot be parsed by pandas when read through an hf:// path. I think this is because it's not in Xet format?

This works:

pd.read_json("hf://datasets/RUC-NLPIR/FlashRAG_datasets/web_questions/train.jsonl", lines=True)

This doesn't:

pd.read_json("hf://datasets/RUC-NLPIR/FlashRAG_datasets/web_questions/test.jsonl", lines=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/lib/python3.12/site-packages/pandas/io/json/_json.py", line 791, in read_json
    json_reader = JsonReader(
                  ^^^^^^^^^^^
  File "/opt/miniconda3/lib/python3.12/site-packages/pandas/io/json/_json.py", line 905, in __init__
    self.data = self._preprocess_data(data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/lib/python3.12/site-packages/pandas/io/json/_json.py", line 917, in _preprocess_data
    data = data.read()
           ^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 12: invalid continuation byte
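
To narrow down what the traceback above is complaining about, it may help to look at the raw bytes rather than letting pandas decode them. The following is a minimal sketch, not part of the original report: `first_invalid_utf8` is a hypothetical helper name, and the sample bytes are made up to mimic the error (a 0xe0 byte at offset 12), not taken from the actual file.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding, or None if the whole buffer decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first undecodable byte.
        return exc.start, data[exc.start]
    return None


# Made-up sample: valid JSON-looking prefix, then a stray 0xe0 at offset 12.
sample = b'{"id": "x", \xe0"q"}'
print(first_invalid_utf8(sample))
print(first_invalid_utf8('{"q": "caf\u00e9"}'.encode("utf-8")))
```

If the bytes delivered via hf:// fail this check while the bytes from the resolve/main URL pass, that would suggest the problem is in how the file is served rather than in the file's content. Note that a buffer cut off in the middle of a multi-byte character will also be reported as invalid at its end, so check whole files or ignore a hit on the last few bytes of a partial read.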

Workaround:

pd.read_json("https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/resolve/main/web_questions/test.jsonl", lines=True)
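
Until the hf:// path is fixed, the workaround above could be wrapped so that calling code tries the hf:// path first and falls back to the HTTPS URL. This is only a sketch under assumptions: `read_first_working` is a hypothetical helper, and the reader is passed in by the caller so the function itself does not depend on pandas.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def read_first_working(sources: Iterable[str], reader: Callable[[str], T]) -> T:
    """Try each source in order and return the first successful read."""
    errors: List[str] = []
    for source in sources:
        try:
            return reader(source)
        except (ValueError, OSError) as exc:
            # UnicodeDecodeError is a subclass of ValueError, so the
            # failure shown in the traceback is caught here.
            errors.append(f"{source}: {exc!r}")
    raise RuntimeError("all sources failed:\n" + "\n".join(errors))
```

With pandas, the call would look like `read_first_working([hf_path, https_url], lambda s: pd.read_json(s, lines=True))`, where `hf_path` and `https_url` are the two test.jsonl locations quoted in this issue.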
