We fine-tuned Baichuan2 on a corpus of 14,988 short stories from STORAL, 6,500 news articles from THUCNews, 919 Wikipedia documents, and 27 novels from modern Chinese literature.
$ git clone git@github.com:xgao922/Baichuan2-finetuning.git
$ pip install -r requirements.txt

The training data should be placed under /data. We preprocess the corpus by stripping punctuation and normalizing whitespace.
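The actual cleanup is done by preprocess_corpus.py; a minimal sketch of this kind of preprocessing might look as follows. The punctuation character set and function name here are illustrative assumptions, not the repo's exact rules.

```python
import re

# Common Chinese and ASCII punctuation to strip. This character set is an
# illustrative assumption, not the exact list used by preprocess_corpus.py.
PUNCT = r"[，。！？；：、“”‘’（）《》【】…—,.!?;:'\"()<>\[\]]"

def clean_text(text: str) -> str:
    """Remove punctuation and collapse runs of blanks into single spaces."""
    text = re.sub(PUNCT, "", text)    # strip punctuation
    text = re.sub(r"\s+", " ", text)  # normalize whitespace
    return text.strip()

print(clean_text("你好，  世界！ Hello,   world!"))  # → 你好 世界 Hello world
```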
$ cd /scripts
$ python preprocess_corpus.py
$ cd /fine-tune
$ bash train.sh
$ cd /inference
$ bash run_predict.sh

The metric used for evaluation is top-k accuracy.
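The text does not spell out how top-k is computed; a common reading for language-model evaluation is top-k accuracy, the fraction of positions where the gold token appears among the model's k highest-scoring predictions. A minimal NumPy sketch under that assumption (function and variable names are our own, not from the repo):

```python
import numpy as np

def top_k_accuracy(logits: np.ndarray, targets: np.ndarray, k: int = 5) -> float:
    """Fraction of positions whose gold token id is among the k
    highest-scoring entries of the corresponding logit row.

    logits:  (n_positions, vocab_size) scores
    targets: (n_positions,) gold token ids
    """
    # indices of the k largest logits per row (order within the k is arbitrary)
    top_k = np.argpartition(logits, -k, axis=-1)[:, -k:]
    hits = (top_k == targets[:, None]).any(axis=-1)
    return float(hits.mean())

logits = np.array([[0.1, 0.9, 0.3, 0.2],
                   [0.8, 0.1, 0.05, 0.05]])
targets = np.array([1, 2])
print(top_k_accuracy(logits, targets, k=2))  # → 0.5 (hit in row 0, miss in row 1)
```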