Customizable tokenizer for RULER #1731

Merged
MaiziXiao merged 2 commits into open-compass:main from changlan:tokenizer
Dec 19, 2024

@changlan changlan commented Dec 3, 2024

This PR adds an optional environment variable, TOKENIZER_MODEL, which controls the tokenizer model used for RULER data generation. With this option, the generated dataset lengths are more precise when evaluating models that do not use the gpt-4 tokenizer.
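
For readers skimming this thread, here is a minimal sketch of the pattern described above: read TOKENIZER_MODEL from the environment and fall back to gpt-4 when it is unset. The helper name `resolve_tokenizer_model` and the constant `DEFAULT_TOKENIZER_MODEL` are hypothetical and used only for illustration; the actual change in this PR may be structured differently.

```python
import os

# Assumed default, based on the PR description (gpt-4 tokenizer when nothing is set).
DEFAULT_TOKENIZER_MODEL = "gpt-4"


def resolve_tokenizer_model(environ=None):
    """Return the tokenizer model name, preferring TOKENIZER_MODEL if it is set.

    `environ` is any mapping of environment variables; it defaults to
    os.environ so the helper can be exercised without touching the real
    process environment. This helper is hypothetical, not part of opencompass.
    """
    environ = os.environ if environ is None else environ
    # An unset or empty variable falls back to the default.
    return environ.get("TOKENIZER_MODEL") or DEFAULT_TOKENIZER_MODEL


if __name__ == "__main__":
    print(resolve_tokenizer_model())
```

In a dataset config, the returned name would then be passed wherever the tokenizer model is currently hard-coded, so that context lengths are counted with the tokenizer of the model under evaluation.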

@MaiziXiao

https://github.com/open-compass/opencompass/blob/main/configs/eval_ruler.py
We already provide a way to use a model's own tokenizer to build model-specific datasets; you can have a look at the config above.

The **_gen.py configurations, on the other hand, are the standard configurations for general evaluations. You are of course welcome to try your own configurations.

changlan commented Dec 4, 2024

Thanks for the review. We generally use opencompass via the CLI: opencompass --models [custom_model_config] --datasets ruler_4k_gen.py ... However, it does not seem possible to specify a tokenizer for --datasets. Do you think this is a reasonable use case?

@MaiziXiao

> Thanks for the review. We generally use opencompass via the CLI: opencompass --models [custom_model_config] --datasets ruler_4k_gen.py ... However, it does not seem possible to specify a tokenizer for --datasets. Do you think this is a reasonable use case?

That sounds like a reasonable use case.

@MaiziXiao MaiziXiao left a comment

LGTM

@MaiziXiao MaiziXiao merged commit d70100c into open-compass:main Dec 19, 2024
stephen-nju pushed a commit to stephen-nju/opencompass that referenced this pull request May 14, 2025
* Customizable tokenizer for RULER

* Relax requirements