gobbli.util module¶
-
gobbli.util.LOGGER= <Logger gobbli.util (INFO)>¶
-
gobbli.util.T1= ~T1¶
-
gobbli.util.T2= ~T2¶
-
class gobbli.util.TokenizeMethod[source]¶
Bases: enum.Enum
Enum describing the different canned tokenization methods gobbli supports. Processes requiring tokenization should generally allow a user to pass in a custom tokenization function if their needs aren’t met by one of these.
-
SPLIT¶ Naive tokenization based on whitespace. Probably only useful for testing. Tokens will be lowercased.
-
SPACY¶ Simple tokenization using spaCy’s English language model. Tokens will be lowercased, and non-alphabetic tokens will be filtered out.
-
SENTENCEPIECE¶ SentencePiece-based tokenization.
-
SENTENCEPIECE= 'sentencepiece'
-
SPACY= 'spacy'
-
SPLIT= 'split'
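As a rough illustration of the SPLIT method described above, the sketch below lowercases each text and splits it on whitespace. The enum here is a stand-in that mirrors the listed values, and `split_tokenize` is a hypothetical helper; neither is gobbli's own code.

```python
from enum import Enum
from typing import List


class TokenizeMethod(Enum):
    # Stand-in mirroring the values documented above, not gobbli's class.
    SPLIT = "split"
    SPACY = "spacy"
    SENTENCEPIECE = "sentencepiece"


def split_tokenize(texts: List[str]) -> List[List[str]]:
    """Hypothetical helper: lowercase each text and split on whitespace."""
    return [text.lower().split() for text in texts]


tokens = split_tokenize(["Hello World", "Naive  tokenization"])
```

Because `str.split()` with no arguments collapses runs of whitespace, repeated spaces do not produce empty tokens.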
-
-
gobbli.util.as_multiclass(y, is_multilabel)[source]¶
- Return type
List[str]
- Returns
Labels in multiclass format, validated against possible mismatch between the multilabel attribute and data format.
-
gobbli.util.as_multilabel(y, is_multilabel)[source]¶
- Return type
List[List[str]]
- Returns
Labels in multilabel format, validated against possible mismatch between the multilabel attribute and data format.
-
gobbli.util.assert_in(name, val, container)[source]¶ Raise an informative error if the passed value isn’t in the given container.
-
gobbli.util.assert_type(name, val, cls)[source]¶ Raise an informative error if the passed value isn’t of the given type.
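A minimal sketch of what such validation helpers might look like. The exception types and message wording are assumptions; the source only says that an informative error is raised.

```python
from typing import Any, Container


def assert_in(name: str, val: Any, container: Container) -> None:
    # Assumption: a ValueError is raised; the real exception type may differ.
    if val not in container:
        raise ValueError(f"{name} must be one of {container!r}; got {val!r}")


def assert_type(name: str, val: Any, cls: type) -> None:
    # Assumption: a TypeError is raised; the real exception type may differ.
    if not isinstance(val, cls):
        raise TypeError(
            f"{name} must be of type {cls.__name__}; got {type(val).__name__}"
        )


assert_in("method", "split", ["split", "spacy"])  # passes silently
assert_type("vocab_size", 2000, int)              # passes silently
```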
-
gobbli.util.blob_to_dir(blob, dir_path)[source]¶ Extract the given blob (assumed to be a compressed directory created by dir_to_blob()) to the given directory.
-
gobbli.util.cleanup(force=False, full=False)[source]¶ Cleans up the gobbli directory, removing files and directories. By default, removes only task input/output. If the full argument is True, removes all data, including downloaded model weights and datasets.
-
gobbli.util.collect_labels(y)[source]¶ Collect the unique labels in the given target variable, which could be in multiclass or multilabel format.
- Parameters
y (Any) – Target variable.
- Return type
List[str]
- Returns
An ordered list of all labels in y.
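One way to picture label collection across both formats is sketched below. It assumes that "ordered" means sorted and that multilabel rows are iterables of strings; the actual implementation may order or validate differently.

```python
from typing import Any, List


def collect_labels(y: Any) -> List[str]:
    """Sketch: gather unique labels from a multiclass or multilabel target."""
    labels = set()
    for item in y:
        if isinstance(item, str):
            labels.add(item)      # multiclass: one label per row
        else:
            labels.update(item)   # multilabel: zero or more labels per row
    # Assumption: "ordered" is taken to mean sorted.
    return sorted(labels)


multiclass_labels = collect_labels(["b", "a", "b"])
multilabel_labels = collect_labels([["a", "c"], [], ["b"]])
```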
-
gobbli.util.copy_file(src_path, dest_path)[source]¶ Copy a file from the source to destination. Be smart – only copy if the source is newer than the destination.
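The "only copy if newer" behaviour can be sketched with modification times. `copy_if_newer` is a hypothetical name, and comparing `st_mtime` is an assumption about how freshness is decided.

```python
import shutil
import tempfile
from pathlib import Path


def copy_if_newer(src_path: Path, dest_path: Path) -> bool:
    """Copy src to dest only when dest is missing or older than src.

    Returns True if a copy happened. Hypothetical helper, not gobbli's code.
    """
    if dest_path.exists() and src_path.stat().st_mtime <= dest_path.stat().st_mtime:
        return False
    # copy2 preserves the modification time, so a repeated call is a no-op.
    shutil.copy2(src_path, dest_path)
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src.txt"
    dest = Path(tmp) / "dest.txt"
    src.write_text("payload")
    first = copy_if_newer(src, dest)   # destination missing, so it copies
    second = copy_if_newer(src, dest)  # timestamps now match, so it skips
```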
-
gobbli.util.default_gobbli_dir()[source]¶
- Return type
Path
- Returns
The default directory to be used to store gobbli data if there’s no user-specified default.
-
gobbli.util.detokenize(method, all_tokens, model_path=None)[source]¶ Detokenize a nested list of tokens into a list of strings, assuming they were created using the given predefined method.
- Parameters
method (TokenizeMethod) – The type of tokenization to reverse.
all_tokens – The nested list of tokens to detokenize.
model_path (Optional[Path]) – Path to load a trained model from. Required if the tokenization method requires training a model; otherwise ignored.
- Return type
List[str]
- Returns
List of texts.
-
gobbli.util.dir_to_blob(dir_path)[source]¶ Archive a directory and save it as a blob in-memory. Useful for storing a directory’s contents in an in-memory object store. Use compression to reduce file size. Extract with blob_to_dir().
- Parameters
dir_path (Path) – Path to the directory to be archived.
- Return type
bytes
- Returns
The compressed directory as a binary buffer.
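The round trip between a directory and an in-memory blob can be sketched with the standard library. Using a gzip-compressed tar archive is an assumption here; the source states only that the blob is a compressed directory.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def dir_to_blob(dir_path: Path) -> bytes:
    # Assumption: gzip-compressed tar; the real archive format may differ.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        archive.add(str(dir_path), arcname=".")
    return buf.getvalue()


def blob_to_dir(blob: bytes, dir_path: Path) -> None:
    # Only extract blobs from a trusted source.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        archive.extractall(str(dir_path))


with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dest_dir:
    (Path(src_dir) / "note.txt").write_text("hello")
    blob = dir_to_blob(Path(src_dir))
    blob_to_dir(blob, Path(dest_dir))
    restored_text = (Path(dest_dir) / "note.txt").read_text()
```

Passing `arcname="."` stores paths relative to the archived directory, so the contents land directly in the target directory on extraction.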
-
gobbli.util.disk_usage()[source]¶
- Return type
int
- Returns
Disk usage (in bytes) of all gobbli data in the current gobbli directory.
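Measuring a directory's total size might look like the sketch below. `dir_size_bytes` is a hypothetical helper that takes the root explicitly, whereas the documented function takes no arguments and reads the current gobbli directory.

```python
import tempfile
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Hypothetical helper: sum the sizes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.bin").write_bytes(b"12345")
    (root / "nested").mkdir()
    (root / "nested" / "b.bin").write_bytes(b"123")
    total = dir_size_bytes(root)
```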
-
gobbli.util.download_archive(archive_url, archive_extract_dir, junk_paths=False, filename=None)[source]¶ Save an archive in the given directory and extract its contents. Automatically retry the download once if the file comes back corrupted (in case it’s left over from a partial download that was cancelled before).
- Parameters
archive_url (str) – URL for the archive.
archive_extract_dir (Path) – Download the archive and extract it to this directory.
junk_paths (bool) – If True, disregard the archive’s internal directory hierarchy and extract all files directly to the output directory.
filename (Optional[str]) – If given, store the downloaded file under this name instead of one automatically inferred from the URL.
- Return type
Path
- Returns
Path to the directory containing the extracted file contents.
-
gobbli.util.download_dir()[source]¶
- Return type
Path
- Returns
The directory used to store gobbli downloaded files.
-
gobbli.util.download_file(url, filename=None)[source]¶ Save a file in the gobbli download directory if it doesn’t already exist there. Stream the download to avoid running out of memory.
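The skip-if-present, streamed download pattern can be sketched as follows. `stream_download` and its explicit `download_dir` parameter are hypothetical, and `urllib` is used only to keep the example dependency-free; gobbli may well use a different HTTP client.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def stream_download(url: str, download_dir: Path, filename: Optional[str] = None) -> Path:
    """Hypothetical helper: save url under download_dir unless already there."""
    # Infer the file name from the URL path when none is given.
    name = filename or Path(urlparse(url).path).name
    dest = download_dir / name
    if dest.exists():
        return dest
    download_dir.mkdir(parents=True, exist_ok=True)
    # copyfileobj moves the body in fixed-size chunks, so the whole file
    # is never held in memory at once.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)
    return dest
```

A production version would likely write to a temporary name first and rename on success, so an interrupted download is not mistaken for a complete file.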
-
gobbli.util.escape_line_delimited_text(text)[source]¶ Convert a single text possibly containing newlines and other troublesome whitespace into a string suitable for writing and reading to a file where newlines will divide up the texts.
- Parameters
text (str) – The text to convert.
- Return type
str
- Returns
The text with newlines and whitespace taken care of.
-
gobbli.util.escape_line_delimited_texts(texts)[source]¶ Convert a list of texts possibly containing newlines and other troublesome whitespace into a string suitable for writing and reading to a file where newlines will divide up the texts.
- Parameters
texts (List[str]) – The list of texts to convert.
- Return type
str
- Returns
The newline-delimited string.
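To make the idea concrete, the sketch below collapses every run of whitespace (including newlines and tabs) to a single space before joining texts with newlines. That escaping rule is an assumption; the source does not say exactly how troublesome whitespace is handled.

```python
import re
from typing import List


def escape_line_delimited_text(text: str) -> str:
    # Assumption: whitespace runs become one space; edges are trimmed.
    return re.sub(r"\s+", " ", text).strip()


def escape_line_delimited_texts(texts: List[str]) -> str:
    # One escaped text per line, so newlines unambiguously separate texts.
    return "\n".join(escape_line_delimited_text(t) for t in texts)


joined = escape_line_delimited_texts(["first\nline", "  second\ttext  "])
```

After escaping, splitting the result on newlines yields exactly one entry per input text.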
-
gobbli.util.extract_archive(archive_path, archive_extract_dir, junk_paths=False)[source]¶ Extract an archive to the given directory.
- Parameters
- Returns
Path to the directory containing the extracted file contents.
-
gobbli.util.format_duration(seconds)[source]¶ Nicely format a given duration in seconds.
- Parameters
seconds (float) – The duration to format, in seconds.
- Return type
str
- Returns
The duration formatted as a string with unit of measurement appended.
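A possible shape for such a formatter is sketched below. The unit thresholds, unit names, and two-decimal precision are all assumptions made for illustration; the source specifies only that a unit of measurement is appended.

```python
def format_duration(seconds: float) -> str:
    """Sketch: pick the largest convenient unit and append its name."""
    # Assumed units and precision; the real output format may differ.
    if seconds < 60:
        return f"{seconds:.2f} sec"
    if seconds < 3600:
        return f"{seconds / 60:.2f} min"
    return f"{seconds / 3600:.2f} hr"


short = format_duration(5)
medium = format_duration(90)
long_run = format_duration(7200)
```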
-
gobbli.util.generate_uuid()[source]¶ Generate a universally unique ID to be used for randomly naming directories, models, tasks, etc.
- Return type
str
-
gobbli.util.gobbli_dir()[source]¶
- Return type
Path
- Returns
The directory used to store gobbli data. Can be overridden using the GOBBLI_DIR environment variable.
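The override logic can be pictured as a simple environment lookup with a fallback. `resolve_data_dir` is a hypothetical helper, and the fallback is passed in explicitly because the source does not state the default location.

```python
import os
from pathlib import Path

ENV_VAR = "GOBBLI_DIR"  # variable name taken from the description above


def resolve_data_dir(default: Path) -> Path:
    """Hypothetical helper: prefer the environment override when it is set."""
    override = os.environ.get(ENV_VAR)
    return Path(override) if override else default
```

To redirect all data for a single run, set the variable before starting the process, for example `GOBBLI_DIR=/data/gobbli python train.py` (the script name is illustrative).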
-
gobbli.util.human_disk_usage()[source]¶
- Return type
str
- Returns
Human-readable string representing disk usage of all gobbli data in the current gobbli directory.
-
gobbli.util.is_archive(filepath)[source]¶
- Parameters
filepath (Path) – Path to a file.
- Return type
bool
- Returns
Whether the file is an archive supported by extract_archive().
-
gobbli.util.is_dir_empty(dir_path)[source]¶ Determine whether a given directory is empty. Assumes the directory exists.
- Parameters
dir_path (Path) – The directory to check for emptiness.
- Return type
bool
- Returns
True if the directory is empty, otherwise False.
-
gobbli.util.is_multilabel(y)[source]¶
- Return type
bool
- Returns
True if the passed dataset is formatted as a multilabel problem, otherwise false (it’s multiclass).
-
gobbli.util.model_dir()[source]¶
- Return type
Path
- Returns
The subdirectory storing all model data and task run input/output.
-
gobbli.util.multiclass_to_multilabel_target(y)[source]¶
- Parameters
y (List[str]) – Multiclass-formatted target variable
- Return type
List[List[str]]
- Returns
The multiclass-formatted variable generalized to a multilabel context
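The relationship between the two target formats can be sketched in a few lines. Detecting multilabel data by inspecting the first element is an assumption about how the check works, not a confirmed detail.

```python
from typing import Any, List


def is_multilabel(y: List[Any]) -> bool:
    # Assumption: a target is multilabel when its rows are lists of labels.
    return len(y) > 0 and isinstance(y[0], list)


def multiclass_to_multilabel_target(y: List[str]) -> List[List[str]]:
    # Each single label becomes a one-element label list.
    return [[label] for label in y]


multiclass_y = ["spam", "ham", "spam"]
multilabel_y = multiclass_to_multilabel_target(multiclass_y)
```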
-
gobbli.util.multilabel_to_indicator_df(y, labels)[source]¶ Convert a list of label lists to a 0/1 indicator dataframe.
-
gobbli.util.pred_prob_to_pred_label(y_pred_proba)[source]¶ Convert a dataframe of predicted probabilities (shape (n_samples, n_classes)) to a list of predicted classes.
- Return type
List[str]
-
gobbli.util.pred_prob_to_pred_multilabel(y_pred_proba, threshold=0.5)[source]¶ Convert a dataframe of predicted probabilities (shape (n_samples, n_classes)) to a dataframe of predicted label indicators (shape (n_samples, n_classes)).
- Return type
DataFrame
-
gobbli.util.read_metadata(file_path)[source]¶ Read JSON-formatted metadata from the given file path.
- Parameters
file_path (Path) – The path to read JSON metadata from.
- Return type
Dict[str, Any]
- Returns
The JSON object read from the file.
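Reading metadata amounts to parsing a JSON file into a dictionary, roughly as sketched below. The UTF-8 encoding and the sample keys are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def read_metadata(file_path: Path) -> Dict[str, Any]:
    # Assumption: the file is UTF-8 and holds a single JSON object.
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "meta.json"
    # Illustrative keys only; the real metadata schema is not shown here.
    meta_path.write_text(json.dumps({"labels": ["a", "b"], "epochs": 3}))
    metadata = read_metadata(meta_path)
```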
-
gobbli.util.shuffle_together(l1, l2, seed=None)[source]¶ Shuffle two lists together, so their order is random but the individual elements still correspond. Ex. shuffle a list of texts and a list of labels so the labels still correspond correctly to the texts. The lists are shuffled in-place.
The lists must be the same length for this to make sense.
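A minimal sketch of shuffling two lists in lockstep is shown below. Pairing the elements, shuffling the pairs, and writing the results back with slice assignment is one common approach; whether gobbli does it this way, or raises on mismatched lengths, is an assumption.

```python
import random
from typing import Any, List, Optional


def shuffle_together(l1: List[Any], l2: List[Any], seed: Optional[int] = None) -> None:
    """Sketch: shuffle both lists in place while keeping elements paired."""
    if len(l1) != len(l2):
        # Assumed behaviour: mismatched lengths are rejected outright.
        raise ValueError("Lists must be the same length.")
    pairs = list(zip(l1, l2))
    random.Random(seed).shuffle(pairs)
    if pairs:
        # Slice assignment mutates the caller's lists instead of rebinding.
        l1[:], l2[:] = map(list, zip(*pairs))


texts = ["a", "b", "c", "d"]
labels = [1, 2, 3, 4]
shuffle_together(texts, labels, seed=0)
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle reproducible without touching the global random state.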
-
gobbli.util.tokenize(method, texts, model_path=None, vocab_size=2000)[source]¶ Tokenize a list of texts using a predefined method.
- Parameters
texts (List[str]) – Texts to tokenize.
method (TokenizeMethod) – The type of tokenization to apply.
model_path (Optional[Path]) – Path to save a trained model to. Required if the tokenization method requires training a model; otherwise ignored. If the model doesn’t exist, it will be trained; if it does, the trained model will be reused.
vocab_size (int) – Number of terms in the vocabulary for tokenization methods with a fixed vocabulary size. You may need to lower this if you get tokenization errors or raise it if your texts have a very diverse vocabulary.
- Return type
List[List[str]]
- Returns
List of tokenized texts.
-
gobbli.util.truncate_text(text, length)[source]¶ Truncate the text to the given length. Append an ellipsis to make it clear the text was truncated.
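Truncation with a visible marker could be sketched as below. Whether the ellipsis counts toward the length, and whether it is three ASCII dots, are assumptions; here the marker is appended after the first `length` characters.

```python
def truncate_text(text: str, length: int) -> str:
    """Sketch: keep at most `length` characters, marking any cut with '...'."""
    if len(text) <= length:
        return text
    # Assumption: the ellipsis is added on top of the kept characters.
    return text[:length] + "..."


cut = truncate_text("hello world", 5)
untouched = truncate_text("hi", 5)
```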