UD Shanghainese-ShUD is the first UD treebank for Shanghainese, a Wu Chinese variety spoken by approximately 14 million people. The treebank is annotated from a corpus of daily-use speech, chosen as a representative sample of contemporary Shanghainese. For details on the annotation method and pipelines, see the paper. Sentences are randomly split into train, test, and dev sets at ratios of 80%, 10%, and 10%, respectively.
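The random 80/10/10 split described above can be illustrated with a minimal sketch. This is not the procedure actually used to build the treebank: the function name `split_sentences`, the fixed seed, and the placeholder sentence IDs are all assumptions made for illustration.

```python
import random


def split_sentences(sentences, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sentences and cut them into train, test, and dev parts.

    The seed is an assumption made here for reproducibility; the README
    does not state how the original split was seeded.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_test = round(n * ratios[1])

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    dev = shuffled[n_train + n_test:]  # dev takes whatever remains
    return train, test, dev


# Hypothetical sentence IDs standing in for CoNLL-U sentence blocks.
sentences = [f"sent-{i}" for i in range(100)]
train, test, dev = split_sentences(sentences)
print(len(train), len(test), len(dev))  # prints: 80 10 10
```

Shuffling a copy with a seeded `random.Random` instance keeps the input order untouched and makes the split repeatable, and giving the remainder to the last part ensures no sentence is dropped when the counts do not divide evenly.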
Shanghainese includes several geographical and historical variants. The focus of this treebank is on Middle and New Period Urban Shanghainese.
The source text comes from the open-source Scripted Chinese Shanghai Dialect Daily-use Speech Corpus by Magic Data, licensed under Creative Commons BY-NC-ND 4.0. Additional permission for derivative research was granted by Beijing Magic Data Technology Co., Ltd.
Qizhen Yang. 2025. ShUD: the First Shanghainese Universal Dependency Treebank. In Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025), pages 186–193, Ljubljana, Slovenia. Association for Computational Linguistics.
- 2025-11-15 v2.17
- Initial release in Universal Dependencies.
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.17
License: CC BY-SA 4.0
Includes text: yes
Parallel: no
Genre: grammar-examples
Lemmas: manual native
UPOS: manual native
XPOS: not available
Features: manual native
Relations: manual native
Contributors: Yang, Qizhen
Contributing: here
Contact: qzyang.main@gmail.com
===============================================================================