Context: terrierteam/pyterrier_rag#47
The test.jsonl file in
https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main/web_questions
cannot be parsed by Pandas when read through an hf:// path. I think this is because it's not stored in Xet format?
This works:
pd.read_json("hf://datasets/RUC-NLPIR/FlashRAG_datasets/web_questions/train.jsonl", lines=True)
This doesn't:
pd.read_json("hf://datasets/RUC-NLPIR/FlashRAG_datasets/web_questions/test.jsonl", lines=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/miniconda3/lib/python3.12/site-packages/pandas/io/json/_json.py", line 791, in read_json
json_reader = JsonReader(
^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/pandas/io/json/_json.py", line 905, in __init__
self.data = self._preprocess_data(data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/pandas/io/json/_json.py", line 917, in _preprocess_data
data = data.read()
^^^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 12: invalid continuation byte
Workaround:
pd.read_json("https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/resolve/main/web_questions/test.jsonl", lines=True)
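Until the hf:// read is fixed, the workaround could be wrapped in a small fallback helper. This is only a sketch: hf_to_resolve_url and read_jsonl_with_fallback are hypothetical names, not part of pandas or huggingface_hub, and the path handling assumes the hf://datasets/ORG/REPO/FILE layout used above with no @revision suffix.

```python
HF_DATASETS_PREFIX = "hf://datasets/"


def hf_to_resolve_url(hf_path: str, revision: str = "main") -> str:
    """Map hf://datasets/ORG/REPO/FILE to the plain HTTPS 'resolve' URL.

    Assumption: the repo id has the form ORG/REPO and the path carries
    no '@revision' suffix, as in the paths in this issue.
    """
    if not hf_path.startswith(HF_DATASETS_PREFIX):
        raise ValueError(f"not an hf://datasets/ path: {hf_path!r}")
    parts = hf_path[len(HF_DATASETS_PREFIX):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected ORG/REPO/FILE in: {hf_path!r}")
    org, repo, file_path = parts
    return (
        f"https://huggingface.co/datasets/{org}/{repo}"
        f"/resolve/{revision}/{file_path}"
    )


def read_jsonl_with_fallback(hf_path: str, revision: str = "main"):
    """Try the hf:// path first; on the decode error seen above, retry over HTTPS."""
    # Imported here so the URL helper stays usable without pandas installed.
    import pandas as pd

    try:
        return pd.read_json(hf_path, lines=True)
    except UnicodeDecodeError:
        # Same workaround as above: read via the resolve URL instead.
        return pd.read_json(hf_to_resolve_url(hf_path, revision), lines=True)
```

Usage would be read_jsonl_with_fallback("hf://datasets/RUC-NLPIR/FlashRAG_datasets/web_questions/test.jsonl"). The fallback only catches UnicodeDecodeError, so unrelated failures (network, missing file) still surface as usual.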