web_questions test.jsonl is not Xet in HGF

Context: https://github.com/terrierteam/pyterrier_rag/issues/47

The test.jsonl file in
https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main/web_questions
cannot be parsed by pandas when read through the `hf://` path. I think this is because it's not stored in Xet format?

This works:
```python
pd.read_json("hf://datasets/RUC-NLPIR/FlashRAG_datasets/web_questions/train.jsonl", lines=True)
```

This doesn't:
```python
pd.read_json("hf://datasets/RUC-NLPIR/FlashRAG_datasets/web_questions/test.jsonl", lines=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/lib/python3.12/site-packages/pandas/io/json/_json.py", line 791, in read_json
    json_reader = JsonReader(
                  ^^^^^^^^^^^
  File "/opt/miniconda3/lib/python3.12/site-packages/pandas/io/json/_json.py", line 905, in __init__
    self.data = self._preprocess_data(data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/lib/python3.12/site-packages/pandas/io/json/_json.py", line 917, in _preprocess_data
    data = data.read()
           ^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 12: invalid continuation byte
```

Workaround:
```python
pd.read_json("https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/resolve/main/web_questions/test.jsonl", lines=True)
```
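Until the file is fixed, downstream code could wrap the workaround above in a fallback. This is only a sketch: `hf_to_resolve_url` and `read_jsonl_with_fallback` are hypothetical helper names, and the conversion assumes a plain `hf://datasets/<org>/<repo>/<path>` path on the `main` branch with no `@revision` suffix.

```python
def hf_to_resolve_url(hf_path: str, revision: str = "main") -> str:
    """Map hf://datasets/<org>/<repo>/<path> to the HTTPS resolve URL.

    Hypothetical helper; assumes the path carries no @revision suffix.
    """
    prefix = "hf://datasets/"
    if not hf_path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}: {hf_path!r}")
    # Split into org, repo, and the remaining path inside the repo.
    org, repo, file_path = hf_path[len(prefix):].split("/", 2)
    return f"https://huggingface.co/datasets/{org}/{repo}/resolve/{revision}/{file_path}"


def read_jsonl_with_fallback(hf_path: str):
    """Try the hf:// path first; on a decode error, retry over HTTPS."""
    import pandas as pd  # imported lazily so the URL helper works without pandas

    try:
        return pd.read_json(hf_path, lines=True)
    except UnicodeDecodeError:
        # Same failure as in the traceback above: fall back to the resolve URL.
        return pd.read_json(hf_to_resolve_url(hf_path), lines=True)
```

With this, `read_jsonl_with_fallback("hf://datasets/RUC-NLPIR/FlashRAG_datasets/web_questions/test.jsonl")` would retry via the same URL used in the workaround.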

web_questions test.jsonl is not Xet in HGF #211
