UniversalDependencies/UD_Shanghainese-ShUD

Summary

UD Shanghainese-ShUD is the first UD treebank for Shanghainese.

Introduction

UD Shanghainese-ShUD is the first UD treebank for Shanghainese, a variety of Wu Chinese spoken by approximately 14 million people. The treebank is annotated from a daily-use speech corpus that offers a representative sample of contemporary Shanghainese. For details on the annotation method and pipeline, see the paper listed under References. Sentences are randomly split into train, test, and dev sets at a ratio of 80%, 10%, and 10%, respectively.
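The random 80/10/10 split described above can be sketched as follows. This is only an illustration, not the script used to build the treebank: the function name `split_sentences`, the fixed seed, and the rounding rule are all assumptions, and the actual procedure may differ.

```python
import random


def split_sentences(sentences, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sentences and cut them into train/test/dev partitions.

    Hypothetical helper: the name, the seed, and the rounding rule are
    assumptions made for illustration only.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    # Copy first so the caller's list is left untouched.
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_test = round(n * ratios[1])

    # Whatever remains after train and test goes to dev.
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "dev": shuffled[n_train + n_test:],
    }
```

Calling `split_sentences(sentences, seed=1)` returns a dictionary with the three partitions. Fixing the seed makes the split reproducible across runs.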

Shanghainese includes several geographical and historical variants. The focus of this treebank is on Middle and New Period Urban Shanghainese.

Acknowledgments

This treebank uses the open-source Scripted Chinese Shanghai Dialect Daily-use Speech Corpus by Magic Data, licensed under Creative Commons BY-NC-ND 4.0. Beijing Magic Data Technology Co., Ltd. granted additional permission for derivative research.

References

Qizhen Yang. 2025. ShUD: the First Shanghainese Universal Dependency Treebank. In Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025), pages 186–193, Ljubljana, Slovenia. Association for Computational Linguistics.

Changelog

  • 2025-11-15 v2.17
    • Initial release in Universal Dependencies.
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.17
License: CC BY-SA 4.0
Includes text: yes
Parallel: no
Genre: grammar-examples
Lemmas: manual native
UPOS: manual native
XPOS: not available
Features: manual native
Relations: manual native
Contributors: Yang, Qizhen
Contributing: here
Contact: qzyang.main@gmail.com
===============================================================================
