raptorkwok/cantonese-chinese-parallel-corpus
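This is a Cantonese sentence tokenizer based on BART Chinese (OpenMOSS-Team/bart-base-chinese). It can be used together with our Cantonese-Chinese Parallel Corpus (CCPC) dataset, raptorkwok/cantonese-chinese-parallel-corpus.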
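# BART Chinese uses a BERT-style vocabulary, so the tokenizer is loaded with BertTokenizer.
from transformers import BertTokenizer
# Downloads the tokenizer files from the Hugging Face Hub on first use.
tokenizer = BertTokenizer.from_pretrained("raptorkwok/pyctokenizer")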
print(tokenizer.tokenize("ζεε»εε°ζ²εηι«ηεοΌ"))
# Output: ['ζε', 'ε»', 'ε', 'ε°ζ²ε', 'ηι«η', 'ε', 'οΌ']
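The snippet above tokenizes a single sentence. A minimal sketch of applying the same call to several lines (for example, sentences taken from the parallel corpus) might look like the following. The helper names `segment_lines` and `load_tokenize` are hypothetical and not part of this repository; the only library calls assumed are `BertTokenizer.from_pretrained` and `tokenizer.tokenize`, as used above.

```python
from typing import Callable, Iterable, List

# Any function that maps a sentence to a list of tokens.
Tokenize = Callable[[str], List[str]]


def segment_lines(lines: Iterable[str], tokenize: Tokenize, sep: str = " ") -> List[str]:
    """Tokenize each non-blank line and join its tokens with `sep`.

    Hypothetical helper for illustration only.
    """
    segmented = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        segmented.append(sep.join(tokenize(line)))
    return segmented


def load_tokenize() -> Tokenize:
    """Load the tokenizer from the Hub (needs `transformers` and network access)."""
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("raptorkwok/pyctokenizer")
    return tokenizer.tokenize


# Usage (downloads the tokenizer on first call):
#   tokenize = load_tokenize()
#   for row in segment_lines(["<Cantonese sentence 1>", "<Cantonese sentence 2>"], tokenize):
#       print(row)
```

Passing the tokenize function in as an argument keeps the segmentation helper independent of the download step, so it can be tried with any callable that returns a list of tokens.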
Base model
OpenMOSS-Team/bart-base-chinese