- Python: 3.12.9
- Disk usage: approximately 17 GB
- VRAM usage: approximately 18 GB
pip install -r requirements.txt

Qwen_7B_Review_Tuned_model : https://www.modelscope.cn/l424102993/LLM_TQ_Tuned_model.git
Regression_model_base : https://www.modelscope.cn/iic/nlp_bert_backbone_base_std.git
Regression_model_regression : https://www.modelscope.cn/l424102993/LLM_TQ_Regression_model.git
Download with Git (choose either Git or the ModelScope SDK):
cd data/models
git clone https://www.modelscope.cn/l424102993/LLM_TQ_Tuned_model.git
git clone https://www.modelscope.cn/iic/nlp_bert_backbone_base_std
git clone https://www.modelscope.cn/l424102993/LLM_TQ_Regression_model.git

Download with the ModelScope SDK (recommended):
# Download the models via the ModelScope SDK
from modelscope import snapshot_download

model_dir = snapshot_download('l424102993/LLM_TQ_Tuned_model', cache_dir="./data/models/")
model_dir = snapshot_download('iic/nlp_bert_backbone_base_std', cache_dir="./data/models/")
model_dir = snapshot_download('l424102993/LLM_TQ_Regression_model', cache_dir="./data/models/")
## Replace other paths

Modify the configuration file: ./configs/config.yaml

Edit the model directories in config.yaml to match where the models were downloaded, and set the CUDA device number.
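As a rough illustration, the configuration might look like the following. All field names here are assumptions for illustration only; consult the actual ./configs/config.yaml for the real schema. Note that snapshot_download with cache_dir places each model under a namespaced subdirectory:

```yaml
# Illustrative sketch only -- field names are assumptions, not the repo's schema.
tuned_model_dir: ./data/models/l424102993/LLM_TQ_Tuned_model
regression_base_dir: ./data/models/iic/nlp_bert_backbone_base_std
regression_model_dir: ./data/models/l424102993/LLM_TQ_Regression_model
cuda_device: "0"
```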
2.1 Start the request service directly with the "bash" command

bash serve.py

2.2 Manual startup
If you need a specific virtual environment or have other requirements, you can start ./src/evaluator_request.py manually.
Note that when starting manually, you may need to adjust the relative path used to read config.yaml in evaluator_request.py, as well as the relative model paths inside config.yaml.
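One common fix for this relative-path issue is to resolve config.yaml relative to the script file rather than the current working directory. A minimal sketch, assuming the repository layout `src/evaluator_request.py` and `configs/config.yaml` (the actual code may organize this differently):

```python
from pathlib import Path

def resolve_config(script_path: str) -> Path:
    """Locate config.yaml relative to the script itself, so manual startup
    works from any working directory. Assumes the layout
    <repo>/src/evaluator_request.py and <repo>/configs/config.yaml."""
    repo_root = Path(script_path).resolve().parent.parent
    return repo_root / "configs" / "config.yaml"

# Inside evaluator_request.py this would be: resolve_config(__file__)
print(resolve_config("/repo/src/evaluator_request.py"))
```

With this pattern, `bash serve.py` and a manual launch from any directory both find the same configuration file.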
In the corresponding cells of Evaluate_example.ipynb, add your Qwen API key, or modify the relevant code to call other models.
- Use Evaluate_example.ipynb to run a scoring test for a single model.
- Use ./notebooks/Evaluate_batch.ipynb to run scoring tests for multiple models, calling either the API or a local model.
- Use the notebook Generated_Result_Visualization_Analysis.ipynb to visualize and compare the analysis results.
Model parameter size: 7B-14B is the best value for money

Due to the influence of the training method and the training dataset, a model's text quality score and reasoning ability approach a bottleneck around 7B-14B parameters.
When deploying large models in vertical domains, there is no need to chase ever-larger models, since their marginal gains diminish significantly. Instead, updates to the model architecture and the quality of the original and fine-tuning datasets have a greater impact on the model.
Although numerous papers have demonstrated the effectiveness of CoT and the performance of reasoning models, most arguments focus on differences in reasoning ability (instruction compliance, mathematical ability, etc.). However, we found that, for the quality of the output text alone, reasoning models also perform significantly better than ordinary models.
Usage prices vary greatly across models depending on their suppliers and API channels. This experiment compared API prices (RMB / 1M tokens) against output text quality. Among the models tested so far, the best cost-performance ratios belong to vocles-lite, qwen-plus, and deepseek-v3-0324.
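The cost-performance comparison amounts to ranking models by quality score per unit price. A minimal sketch of that ranking; the model names, prices, and scores below are placeholders for illustration, not the measured values from this experiment:

```python
# Rank models by text-quality score per RMB (per 1M tokens).
# All numbers are placeholders for illustration, not measured results.
models = {
    "model_a": {"price_rmb_per_1m": 2.0, "score": 80.5},
    "model_b": {"price_rmb_per_1m": 8.0, "score": 83.1},
    "model_c": {"price_rmb_per_1m": 0.8, "score": 78.9},
}

def value_ratio(entry: dict) -> float:
    # Higher score per RMB means better cost-performance.
    return entry["score"] / entry["price_rmb_per_1m"]

ranking = sorted(models, key=lambda name: value_ratio(models[name]), reverse=True)
print(ranking)  # cheapest score-per-RMB model first
```

Note that this simple ratio favors very cheap models even when their absolute quality is lower; a deployment decision should also set a minimum acceptable score.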
@article{li_legaleval-q_2026,
title = {Legaleval-q: a benchmark for quality evaluation of {LLM}-generated {Chinese} legal text},
volume = {68},
issn = {0219-3116},
url = {https://doi.org/10.1007/s10115-026-02703-7},
doi = {10.1007/s10115-026-02703-7},
abstract = {As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks focus mainly on factual accuracy while neglecting important linguistic aspects such as clarity, coherence, and terminology. To address this gap, we first develop a regression-based framework to evaluate legal text quality, second construct a specialized set of legal questions, and third analyze 49 LLMs using this framework. Our study primarily focuses on Chinese legal texts due to data availability, while the methodology itself remains language-agnostic and adaptable to other domains. We identify three key findings: (1) legal text quality plateaus at relatively small scales, with Qwen2.5 models flattening beyond 7B (72B adds only 2.7\%) and Qwen3 models showing an early plateau at 1.7B; (2) engineering choices such as quantization and context length have no statistically significant effect on legal text quality (\$\$p {\textgreater} 0.0167\$\$), supporting cost-efficient deployment; (3) reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and trade-off frontier visualization, which highlight the Qwen3 series as the optimal choice for cost–performance trade-offs. This work advances domain-specific evaluation of linguistic quality by integrating multidimensional assessment with data-driven model analysis. We additionally adopt a variance-penalized metric, AdjScore, to robustly assess model performance. Code and models are available at: https://github.com/lyxx3rd/LegalEval-Q.},
number = {1},
journal = {Knowledge and Information Systems},
author = {Li, Yunhan and Wu, Gengshen},
month = feb,
year = {2026},
pages = {83},
}


