- Python: 3.12.9
- Disk usage: approximately 17 GB
- VRAM usage: approximately 18 GB
pip install -r requirements.txt

Qwen_7B_Review_Tuned_model : https://www.modelscope.cn/l424102993/LLM_TQ_Tuned_model.git
Regression_model_base : https://www.modelscope.cn/iic/nlp_bert_backbone_base_std.git
Regression_model_regression : https://www.modelscope.cn/l424102993/LLM_TQ_Regression_model.git
Download with Git (choose either Git or the ModelScope SDK):
cd data/models
git clone https://www.modelscope.cn/l424102993/LLM_TQ_Tuned_model.git
git clone https://www.modelscope.cn/iic/nlp_bert_backbone_base_std
git clone https://www.modelscope.cn/l424102993/LLM_TQ_Regression_model.git

Download with the ModelScope SDK (recommended):
# Download the models via the ModelScope SDK
from modelscope import snapshot_download

model_dir = snapshot_download('l424102993/LLM_TQ_Tuned_model', cache_dir="./data/models/")
model_dir = snapshot_download('iic/nlp_bert_backbone_base_std', cache_dir="./data/models/")
model_dir = snapshot_download('l424102993/LLM_TQ_Regression_model', cache_dir="./data/models/")
## Replace other paths

Modify the configuration file: ./configs/config.yaml

Edit the model directories in config.yaml to match where the models were downloaded, and set the CUDA device number.
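As a rough illustration, the configuration might look like the following. All field names here are assumptions for illustration only; consult the actual ./configs/config.yaml for the real schema. Note that snapshot_download with cache_dir places each model under a namespaced subdirectory:

```yaml
# Illustrative sketch only -- field names are assumptions, not the repo's schema.
tuned_model_dir: ./data/models/l424102993/LLM_TQ_Tuned_model
regression_base_dir: ./data/models/iic/nlp_bert_backbone_base_std
regression_model_dir: ./data/models/l424102993/LLM_TQ_Regression_model
cuda_device: "0"
```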
2.1 Start the request service directly with the "bash" command

bash serve.py

2.2 Manual startup
If you need a specific virtual environment or have other requirements, you can start ./src/evaluator_request.py manually.
Note that when starting manually, you may need to adjust the relative path used to read config.yaml in evaluator_request.py, as well as the relative model paths inside config.yaml.
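One common fix for this relative-path issue is to resolve config.yaml relative to the script file rather than the current working directory. A minimal sketch, assuming the repository layout `src/evaluator_request.py` and `configs/config.yaml` (the actual code may organize this differently):

```python
from pathlib import Path

def resolve_config(script_path: str) -> Path:
    """Locate config.yaml relative to the script itself, so manual startup
    works from any working directory. Assumes the layout
    <repo>/src/evaluator_request.py and <repo>/configs/config.yaml."""
    repo_root = Path(script_path).resolve().parent.parent
    return repo_root / "configs" / "config.yaml"

# Inside evaluator_request.py this would be: resolve_config(__file__)
print(resolve_config("/repo/src/evaluator_request.py"))
```

With this pattern, `bash serve.py` and a manual launch from any directory both find the same configuration file.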
In the corresponding cells of Evaluate_example.ipynb, add your Qwen API key, or modify the relevant code to call other models.
- Use Evaluate_example.ipynb to run a scoring test for a single model.
- Use ./notebooks/Evaluate_batch.ipynb to run scoring tests for multiple models, calling either the API or a local model.
- Use the notebook Generated_Result_Visualization_Analysis.ipynb to visualize and compare the analysis results.
Model parameter size: 7B-14B is the best value for money

Due to the influence of the training method and the training dataset, a model's text quality score and reasoning ability approach a bottleneck around 7B-14B parameters.
When deploying large models in vertical domains, there is no need to chase ever-larger models, since their marginal gains diminish significantly. Instead, updates to the model architecture and the quality of the original and fine-tuning datasets have a greater impact on the model.
Although numerous papers have demonstrated the effectiveness of CoT and the performance of reasoning models, most arguments focus on differences in reasoning ability (instruction compliance, mathematical ability, etc.). However, we found that, for the quality of the output text alone, reasoning models also perform significantly better than ordinary models.
Usage prices vary greatly across models depending on their suppliers and API channels. This experiment compared API prices (RMB / 1M tokens) against output text quality. Among the models tested so far, the best cost-performance ratios belong to vocles-lite, qwen-plus, and deepseek-v3-0324.
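The cost-performance comparison amounts to ranking models by quality score per unit price. A minimal sketch of that ranking; the model names, prices, and scores below are placeholders for illustration, not the measured values from this experiment:

```python
# Rank models by text-quality score per RMB (per 1M tokens).
# All numbers are placeholders for illustration, not measured results.
models = {
    "model_a": {"price_rmb_per_1m": 2.0, "score": 80.5},
    "model_b": {"price_rmb_per_1m": 8.0, "score": 83.1},
    "model_c": {"price_rmb_per_1m": 0.8, "score": 78.9},
}

def value_ratio(entry: dict) -> float:
    # Higher score per RMB means better cost-performance.
    return entry["score"] / entry["price_rmb_per_1m"]

ranking = sorted(models, key=lambda name: value_ratio(models[name]), reverse=True)
print(ranking)  # cheapest score-per-RMB model first
```

Note that this simple ratio favors very cheap models even when their absolute quality is lower; a deployment decision should also set a minimum acceptable score.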
@article{li_legaleval-q_2026,
title = {Legaleval-q: a benchmark for quality evaluation of {LLM}-generated {Chinese} legal text},
volume = {68},
issn = {0219-3116},
url = {https://doi.org/10.1007/s10115-026-02703-7},
doi = {10.1007/s10115-026-02703-7},
abstract = {As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks focus mainly on factual accuracy while neglecting important linguistic aspects such as clarity, coherence, and terminology. To address this gap, we first develop a regression-based framework to evaluate legal text quality, second construct a specialized set of legal questions, and third analyze 49 LLMs using this framework. Our study primarily focuses on Chinese legal texts due to data availability, while the methodology itself remains language-agnostic and adaptable to other domains. We identify three key findings: (1) legal text quality plateaus at relatively small scales, with Qwen2.5 models flattening beyond 7B (72B adds only 2.7\%) and Qwen3 models showing an early plateau at 1.7B; (2) engineering choices such as quantization and context length have no statistically significant effect on legal text quality (\$\$p {\textgreater} 0.0167\$\$), supporting cost-efficient deployment; (3) reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and trade-off frontier visualization, which highlight the Qwen3 series as the optimal choice for cost–performance trade-offs. This work advances domain-specific evaluation of linguistic quality by integrating multidimensional assessment with data-driven model analysis. We additionally adopt a variance-penalized metric, AdjScore, to robustly assess model performance. Code and models are available at: https://github.com/lyxx3rd/LegalEval-Q.},
number = {1},
journal = {Knowledge and Information Systems},
author = {Li, Yunhan and Wu, Gengshen},
month = feb,
year = {2026},
pages = {83},
}


