gobbli package

Module contents

class gobbli.TokenizeMethod

Bases: enum.Enum

Enum describing the different canned tokenization methods gobbli supports. Processes requiring tokenization should generally allow a user to pass in a custom tokenization function if their needs aren’t met by one of these.
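As a minimal sketch of that recommended pattern, the hypothetical preprocess function below (not part of gobbli) accepts either a canned TokenizeMethod or a user-supplied tokenization callable:

from typing import Callable, List, Union

from gobbli import TokenizeMethod

def preprocess(
    texts: List[str],
    tokenize: Union[TokenizeMethod, Callable[[str], List[str]]] = TokenizeMethod.SPLIT,
) -> List[List[str]]:
    if isinstance(tokenize, TokenizeMethod):
        if tokenize == TokenizeMethod.SPLIT:
            # Documented SPLIT behavior: lowercase, split on whitespace.
            return [t.lower().split() for t in texts]
        raise NotImplementedError(f"Method not handled in this sketch: {tokenize}")
    # Otherwise, fall back to the user's custom tokenization function.
    return [tokenize(t) for t in texts]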

SPLIT

Naive tokenization based on whitespace. Probably only useful for testing. Tokens will be lowercased.
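The documented behavior amounts to lowercasing and splitting on runs of whitespace; a plain-Python equivalent:

def split_tokenize(text: str) -> list:
    # Lowercase, then split on whitespace.
    return text.lower().split()

split_tokenize("The quick  brown Fox")
# ['the', 'quick', 'brown', 'fox']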

SPACY

Simple tokenization using spaCy’s English language model. Tokens will be lowercased, and non-alphabetic tokens will be filtered out.
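A rough standalone equivalent using the spacy library directly; the specific model name (en_core_web_sm) and disabled pipeline components are assumptions, since the docstring does not say which English model gobbli loads:

import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def spacy_tokenize(text: str) -> list:
    # Keep only alphabetic tokens, lowercased.
    return [tok.lower_ for tok in nlp(text) if tok.is_alpha]

spacy_tokenize("The 2 dogs barked!")
# ['the', 'dogs', 'barked']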

SENTENCEPIECE

Subword tokenization using SentencePiece. Unlike the other methods, SentencePiece learns its vocabulary from data rather than relying on whitespace or language-specific rules.
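For context, a standalone sketch using the sentencepiece library directly; how gobbli trains and stores the underlying model is not specified here, so the file paths and vocab_size below are illustrative assumptions:

import sentencepiece as spm

# Train a small subword model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",   # assumed path to training text
    model_prefix="sp",    # writes sp.model and sp.vocab
    vocab_size=1000,
)

sp = spm.SentencePieceProcessor(model_file="sp.model")
sp.encode("an example sentence", out_type=str)
# e.g. ['▁an', '▁example', '▁sent', 'ence']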

SENTENCEPIECE = 'sentencepiece'
SPACY = 'spacy'
SPLIT = 'split'
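Since TokenizeMethod is a standard enum.Enum, members can also be looked up by their string value, which is convenient when the method name comes from a config file:

from gobbli import TokenizeMethod

method = TokenizeMethod("spacy")
assert method is TokenizeMethod.SPACY
assert method.value == "spacy"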