2National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, China 
3Inner Mongolia Key Laboratory of Multilingual Artificial Intelligence Technology, China 
* corresponding author
Paper: McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models
Dataset: https://huggingface.co/datasets/Velikaya/McBE
Code: https://github.com/VelikayaScarlet/McBE
McBE is designed to address the scarcity of Chinese-centric bias evaluation resources for large language models (LLMs). It supports multi-faceted bias assessment across 5 evaluation tasks, enabling researchers and developers to:
- Systematically measure biases in LLMs across 12 single bias categories (e.g., gender, region, race) and 82 subcategories rooted in Chinese culture, filling a critical gap in non-English, non-Western contexts.
- Evaluate model fairness from diverse perspectives through 4,077 bias evaluation instances, ensuring comprehensive coverage of real-world scenarios where LLMs may perpetuate stereotypes.
- Facilitate cross-cultural research by providing an evaluation benchmark for analyzing bias expression in LLMs, promoting more equitable and fair model development globally.
Curated by: College of Computer Science and National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian at Inner Mongolia University
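If you only want to inspect the data, the category files can be fetched directly from the Hugging Face Hub. The snippet below is a minimal sketch; the file name `xlsx_files/gender.xlsx` is an assumption based on the category layout described above, so check the dataset's file listing and adjust it to the file you actually need.

```python
# Minimal sketch: download one McBE category file from the Hub and open it
# with pandas. The exact file name below is an assumption; check the
# dataset's file listing on Hugging Face for the real layout.
from huggingface_hub import hf_hub_download
import pandas as pd

xlsx_path = hf_hub_download(
    repo_id="Velikaya/McBE",
    filename="xlsx_files/gender.xlsx",  # hypothetical category file
    repo_type="dataset",
)
df = pd.read_excel(xlsx_path)  # reading .xlsx requires openpyxl
print(df.head())
```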
tqdm
zhipuai
openai
transformers
pandas
itertools
torch
modelscope
openpyxl
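Assuming a standard Python environment, the third-party packages above can be installed with pip (`itertools` is part of the Python standard library and does not need to be installed separately):

```bash
pip install tqdm zhipuai openai transformers pandas torch modelscope openpyxl
```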
- Open `utils.py` and fill in your GLM4-AIR API key on line 9. You can also use other LLMs to serve as the LLM judge (a minimal judge-call sketch is included at the end of this section).
- Open `load_model.py` and replace `model_dir` with the path to your models in lines 6–12.
- Open `eval.py` and update the path parameter to your local directory. If you downloaded the McBE dataset directly from Hugging Face, the path can be set to `"Velikaya/McBE/xlsx_files"`.
- Edit the `categories` list in `eval.py` to specify which bias categories to evaluate:
```python
categories = [
    "test",  # Add categories you want to test
    # Example: "age", "gender", "race", etc.
]
```
- The script loops through the listed categories and evaluates each one with the specified model (e.g., `"qwen2"`). You can modify the model name in the function calls:
```python
for c in categories:
    print(c)
    preference_computation(c, "qwen2")  # Replace "qwen2" with your model
    classification(c, "qwen2")
    scenario_selection(c, "qwen2")
    bias_analysis(c, "qwen2")
    bias_scoring(c, "qwen2")
```
6. Run `eval.py`.
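Assuming the script takes no additional command-line arguments, this is simply:

```bash
python eval.py
```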

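As noted in the setup steps, GLM4-AIR (via the `zhipuai` SDK) serves as the default LLM judge. The snippet below is only a minimal sketch of such a judge call, assuming the zhipuai 2.x client interface; the prompt shown is purely illustrative and is not the prompt used by the benchmark scripts.

```python
# Minimal sketch of an LLM-judge call with the zhipuai SDK (2.x interface).
# The model name and prompt below are illustrative; swap in another
# provider's client here if you prefer a different judge.
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="your-api-key")  # same key as configured in utils.py
response = client.chat.completions.create(
    model="glm-4-air",
    messages=[
        {"role": "user", "content": "Rate the degree of bias in the following response ..."},
    ],
)
print(response.choices[0].message.content)
```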