SentencePiece
Definition
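SentencePiece: a subword tokenization tool and algorithm library for natural language processing (NLP). It splits text into smaller units (subwords or character fragments) and is commonly used when training machine translation systems, language models, and similar models, in order to mitigate the out-of-vocabulary (OOV) problem. (Common implementations include subword models such as unigram and BPE.)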
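To make the idea of splitting text into subword units concrete, here is a minimal sketch of BPE-style merging in plain Python. It is a toy illustration, not SentencePiece's actual implementation: the function names (`learn_bpe_merges`, `apply_merge`, `encode`, `decode`) and the tiny corpus are hypothetical, and the code splits on whitespace up front for simplicity. The only detail borrowed from SentencePiece is the "▁" (U+2581) marker that makes whitespace visible, so that pieces can be joined back into the original text.

```python
from collections import Counter

# "▁" (U+2581) is the visible marker SentencePiece uses for whitespace.
WORD_BOUNDARY = "\u2581"


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of sentences (toy BPE)."""
    # Each word starts as a tuple of characters, prefixed with the boundary marker.
    word_counts = Counter()
    for sentence in corpus:
        for word in sentence.split():
            word_counts[tuple(WORD_BOUNDARY + word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_counts = Counter()
        for symbols, count in word_counts.items():
            new_counts[apply_merge(symbols, best)] += count
        word_counts = new_counts
    return merges


def encode(text, merges):
    """Split `text` into subword pieces by replaying the learned merges in order."""
    pieces = []
    for word in text.split():
        symbols = tuple(WORD_BOUNDARY + word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        pieces.extend(symbols)
    return pieces


def decode(pieces):
    """Join pieces and turn boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


if __name__ == "__main__":
    corpus = ["low lower lowest", "new newer newest", "low low new"]
    merges = learn_bpe_merges(corpus, num_merges=10)
    pieces = encode("lowest newer", merges)
    print(pieces)
    print(decode(pieces))
```

Because unseen words fall back to smaller pieces, down to single characters, any input built from known characters can still be encoded and decoded, which is how subword tokenization reduces the OOV problem.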
Pronunciation (IPA)
/ˈsɛntənsˌpiːs/
Examples
I trained a SentencePiece model on the dataset.
SentencePiece helps multilingual models handle rare words by breaking text into subword units.
Etymology
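The name combines sentence + piece, literally "a piece of a sentence". It emphasizes that the tool splits running text into "pieces" (subword units) that can be used for modeling, and that it is typically designed as a language-independent tokenization scheme.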
Related Words
Literary / Notable Works
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Taku Kudo & John Richardson, 2018)
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Lan et al., 2019; uses SentencePiece)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5) (Raffel et al., 2020; uses SentencePiece)
- mBART: Multilingual Denoising Pre-training for Neural Machine Translation (Liu et al., 2020; uses SentencePiece)