gobbli.util module
-
gobbli.util.LOGGER = <Logger gobbli.util (INFO)>
-
gobbli.util.T1 = ~T1
-
gobbli.util.T2 = ~T2
-
class gobbli.util.TokenizeMethod
Bases: enum.Enum
Enum describing the different canned tokenization methods gobbli supports. Processes requiring tokenization should generally allow a user to pass in a custom tokenization function if their needs aren’t met by one of these.
-
SPLIT = 'split'
Naive tokenization based on whitespace. Probably only useful for testing. Tokens will be lowercased.
-
SPACY = 'spacy'
Simple tokenization using spaCy’s English language model. Tokens will be lowercased, and non-alphabetic tokens will be filtered out.
-
SENTENCEPIECE = 'sentencepiece'
SentencePiece-based tokenization.
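As a rough illustration of the SPLIT behaviour described above (whitespace splitting with lowercased tokens), the following sketch reproduces the idea with the standard library only. It is not gobbli's implementation, and `split_tokenize` is a hypothetical name used for illustration.

```python
from typing import List


def split_tokenize(texts: List[str]) -> List[List[str]]:
    """Lowercase each text and split it on runs of whitespace."""
    return [text.lower().split() for text in texts]


tokens = split_tokenize(["Hello  World", "Naive\ttokenization works"])
print(tokens)
```

Because `str.split()` with no arguments treats any run of whitespace as one separator, repeated spaces and tabs do not produce empty tokens.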
-
-
gobbli.util.as_multiclass(y, is_multilabel)
- Return type
List[str]
- Returns
Labels in multiclass format, validated against possible mismatch between the multilabel attribute and data format.
-
gobbli.util.as_multilabel(y, is_multilabel)
- Return type
List[List[str]]
- Returns
Labels in multilabel format, validated against possible mismatch between the multilabel attribute and data format.
-
gobbli.util.assert_in(name, val, container)
Raise an informative error if the passed value isn’t in the given container.
-
gobbli.util.assert_type(name, val, cls)
Raise an informative error if the passed value isn’t of the given type.
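The two validation helpers above can be pictured with a small sketch like the one below. The exception types (`ValueError`, `TypeError`) and the message wording are assumptions; this page does not say which errors gobbli raises. The `_sketch` names are hypothetical.

```python
from typing import Any, Container


def assert_in_sketch(name: str, val: Any, container: Container) -> None:
    """Raise ValueError naming the argument if val is not in container."""
    if val not in container:
        raise ValueError(f"invalid {name} '{val}'; expected one of {container}")


def assert_type_sketch(name: str, val: Any, cls: type) -> None:
    """Raise TypeError naming the argument if val is not an instance of cls."""
    if not isinstance(val, cls):
        raise TypeError(
            f"{name} must be of type {cls.__name__}; got {type(val).__name__}"
        )


assert_in_sketch("method", "split", ("split", "spacy"))  # passes silently

message = ""
try:
    assert_type_sketch("threshold", "0.5", float)
except TypeError as err:
    message = str(err)
print(message)
```

Passing the argument name separately lets the error message tell the caller which parameter was wrong, not just which value.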
-
gobbli.util.blob_to_dir(blob, dir_path)
Extract the given blob (assumed to be a compressed directory created by dir_to_blob()) to the given directory.
-
gobbli.util.cleanup(force=False, full=False)
Cleans up the gobbli directory, removing files and directories. By default, removes only task input/output. If the full argument is True, removes all data, including downloaded model weights and datasets.
-
gobbli.util.collect_labels(y)
Collect the unique labels in the given target variable, which could be in multiclass or multilabel format.
- Parameters
y (Any) – Target variable
- Return type
List[str]
- Returns
An ordered list of all labels in y.
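A minimal sketch of label collection for both target formats follows. It assumes that "ordered" means sorted and that a multilabel target is a list of lists. Neither detail is confirmed on this page, and `collect_labels_sketch` is a hypothetical name.

```python
from typing import List, Union


def collect_labels_sketch(y: Union[List[str], List[List[str]]]) -> List[str]:
    """Return the sorted unique labels from a multiclass or multilabel target."""
    labels = set()
    for item in y:
        if isinstance(item, list):
            labels.update(item)  # multilabel row: a list of labels
        else:
            labels.add(item)  # multiclass row: a single label
    return sorted(labels)


multiclass_labels = collect_labels_sketch(["spam", "ham", "spam"])
multilabel_labels = collect_labels_sketch([["b", "a"], [], ["c", "a"]])
print(multiclass_labels, multilabel_labels)
```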
-
gobbli.util.copy_file(src_path, dest_path)
Copy a file from the source to the destination, but only if the source is newer than the destination.
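The "copy only if newer" idea can be sketched with modification times, as below. Comparing `st_mtime` is an assumption about how "newer" is decided, and `copy_if_newer` is a hypothetical helper, not gobbli's function.

```python
import shutil
import tempfile
from pathlib import Path


def copy_if_newer(src_path: Path, dest_path: Path) -> bool:
    """Copy src to dest unless dest exists and is at least as new.

    Returns True if a copy was made.
    """
    if dest_path.exists() and src_path.stat().st_mtime <= dest_path.stat().st_mtime:
        return False
    # copy2 preserves the modification time, so a repeat call is a no-op.
    shutil.copy2(str(src_path), str(dest_path))
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src.txt"
    dest = Path(tmp) / "dest.txt"
    src.write_text("v1")
    first = copy_if_newer(src, dest)   # destination missing, so it is copied
    second = copy_if_newer(src, dest)  # same timestamp, so it is skipped
print(first, second)
```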
-
gobbli.util.default_gobbli_dir()
- Return type
Path
- Returns
The default directory to be used to store gobbli data if there’s no user-specified default.
-
gobbli.util.detokenize(method, all_tokens, model_path=None)
Detokenize a nested list of tokens into a list of strings, assuming they were created using the given predefined method.
- Parameters
method (TokenizeMethod) – The type of tokenization to reverse.
all_tokens – The nested list of tokens to detokenize.
model_path (Optional[Path]) – Path to load a trained model from. Required if the tokenization method requires training a model; otherwise ignored.
- Return type
List[str]
- Returns
List of texts.
-
gobbli.util.dir_to_blob(dir_path)
Archive a directory and save it as a blob in-memory. Useful for storing a directory’s contents in an in-memory object store. Use compression to reduce file size. Extract with blob_to_dir().
- Parameters
dir_path (Path) – Path to the directory to be archived.
- Return type
bytes
- Returns
The compressed directory as a binary buffer.
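The round trip between a directory and an in-memory blob can be sketched with `tarfile` and `io.BytesIO`. The gzip-compressed tar format is an assumption chosen for illustration; this page does not state which archive format gobbli uses. The `_sketch` names are hypothetical.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def dir_to_blob_sketch(dir_path: Path) -> bytes:
    """Pack a directory's contents into an in-memory gzip-compressed tar."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for child in dir_path.iterdir():
            # Store paths relative to the directory being archived.
            archive.add(str(child), arcname=child.name)
    return buffer.getvalue()


def blob_to_dir_sketch(blob: bytes, dir_path: Path) -> None:
    """Unpack a blob produced by dir_to_blob_sketch into dir_path."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        # Only extract blobs you created yourself; tar archives can
        # contain unsafe paths.
        archive.extractall(str(dir_path))


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dest:
    (Path(src) / "weights.txt").write_text("model data")
    blob = dir_to_blob_sketch(Path(src))
    blob_to_dir_sketch(blob, Path(dest))
    restored = (Path(dest) / "weights.txt").read_text()
print(restored)
```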
-
gobbli.util.disk_usage()
- Return type
int
- Returns
Disk usage (in bytes) of all gobbli data in the current gobbli directory.
-
gobbli.util.download_archive(archive_url, archive_extract_dir, junk_paths=False, filename=None)
Save an archive in the given directory and extract its contents. Automatically retry the download once if the file comes back corrupted (in case it’s left over from a partial download that was cancelled before).
- Parameters
archive_url (str) – URL for the archive.
archive_extract_dir (Path) – Download the archive and extract it to this directory.
junk_paths (bool) – If True, disregard the archive’s internal directory hierarchy and extract all files directly to the output directory.
filename (Optional[str]) – If given, store the downloaded file under this name instead of one automatically inferred from the URL.
- Return type
Path
- Returns
Path to the directory containing the extracted file contents.
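The retry-once behaviour described above can be sketched as follows. To stay runnable offline, the download is injected as a callable and only zip archives are handled. Both are simplifications, and `fetch_and_extract`, `extract_zip`, and `fake_fetch` are hypothetical names, not part of gobbli.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Callable, List


def extract_zip(archive_path: Path, extract_dir: Path) -> Path:
    """Extract a zip archive; raises zipfile.BadZipFile if it is corrupt."""
    with zipfile.ZipFile(str(archive_path)) as archive:
        archive.extractall(str(extract_dir))
    return extract_dir


def fetch_and_extract(
    fetch: Callable[[Path], None], archive_path: Path, extract_dir: Path
) -> Path:
    """Reuse an existing download, retrying once if it turns out corrupt."""
    if not archive_path.exists():
        fetch(archive_path)
    try:
        return extract_zip(archive_path, extract_dir)
    except zipfile.BadZipFile:
        # Probably a leftover partial download: discard it and retry once.
        archive_path.unlink()
        fetch(archive_path)
        return extract_zip(archive_path, extract_dir)


fetch_calls: List[Path] = []


def fake_fetch(dest: Path) -> None:
    """Stand-in for a real download: writes a small valid zip to dest."""
    fetch_calls.append(dest)
    with zipfile.ZipFile(str(dest), "w") as archive:
        archive.writestr("data.txt", "payload")


with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "archive.zip"
    archive_path.write_bytes(b"partial download")  # simulate a corrupt leftover
    out_dir = fetch_and_extract(fake_fetch, archive_path, Path(tmp) / "out")
    extracted_text = (out_dir / "data.txt").read_text()
print(extracted_text, len(fetch_calls))
```

A second corrupt download is allowed to raise, so the caller sees the failure instead of looping indefinitely.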
-
gobbli.util.download_dir()
- Return type
Path
- Returns
The directory used to store gobbli downloaded files.
-
gobbli.util.download_file(url, filename=None)
Save a file in the gobbli download directory if it doesn’t already exist there. Stream the download to avoid running out of memory.
-
gobbli.util.escape_line_delimited_text(text)
Convert a single text possibly containing newlines and other troublesome whitespace into a string suitable for writing and reading to a file where newlines will divide up the texts.
- Parameters
text (str) – The text to convert.
- Return type
str
- Returns
The text with newlines and whitespace taken care of.
-
gobbli.util.escape_line_delimited_texts(texts)
Convert a list of texts possibly containing newlines and other troublesome whitespace into a string suitable for writing and reading to a file where newlines will divide up the texts.
- Parameters
texts (List[str]) – The list of texts to convert.
- Return type
str
- Returns
The newline-delimited string.
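One way to picture the escaping is to collapse all whitespace inside each text before joining the texts with newlines, so that each text occupies exactly one line. How gobbli actually treats the whitespace is not specified here; collapsing runs to a single space is an assumption, and the `_sketch` names are hypothetical.

```python
from typing import List


def escape_text_sketch(text: str) -> str:
    """Collapse every run of whitespace, newlines included, to one space."""
    return " ".join(text.split())


def escape_texts_sketch(texts: List[str]) -> str:
    """Join the escaped texts so each one occupies exactly one line."""
    return "\n".join(escape_text_sketch(text) for text in texts)


joined = escape_texts_sketch(["first line\nsecond line", "tabbed\t\ttext "])
print(joined)
```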
-
gobbli.util.extract_archive(archive_path, archive_extract_dir, junk_paths=False)
Extract an archive to the given directory.
- Parameters
- Returns
Path to the directory containing the extracted file contents.
-
gobbli.util.format_duration(seconds)
Nicely format a given duration in seconds.
- Parameters
seconds (float) – The duration to format, in seconds.
- Return type
str
- Returns
The duration formatted as a string with unit of measurement appended.
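A duration formatter of this kind might look like the sketch below. The unit thresholds, the two-decimal precision, and the unit labels are all assumptions made for illustration, since this page only says that a unit of measurement is appended.

```python
def format_duration_sketch(seconds: float) -> str:
    """Format a duration using the largest unit that keeps the value >= 1."""
    if seconds < 60:
        return f"{seconds:.2f} sec"
    if seconds < 3600:
        return f"{seconds / 60:.2f} min"
    return f"{seconds / 3600:.2f} hr"


short = format_duration_sketch(5)
medium = format_duration_sketch(90)
long_running = format_duration_sketch(7200)
print(short, medium, long_running)
```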
-
gobbli.util.generate_uuid()
Generate a universally unique ID to be used for randomly naming directories, models, tasks, etc.
- Return type
str
-
gobbli.util.gobbli_dir()
- Return type
Path
- Returns
The directory used to store gobbli data. Can be overridden using the GOBBLI_DIR environment variable.
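The environment-variable override can be sketched as a lookup with a fallback. The helper name `resolve_data_dir` and the example paths are hypothetical. The actual default directory is whatever `default_gobbli_dir()` returns, which this sketch takes as a parameter instead of guessing.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_data_dir(default: Path, environ: Mapping[str, str] = os.environ) -> Path:
    """Return the GOBBLI_DIR override if set and non-empty, else the default."""
    override = environ.get("GOBBLI_DIR")
    return Path(override) if override else default


# Plain dicts stand in for the process environment in this example.
overridden = resolve_data_dir(Path("/fallback"), {"GOBBLI_DIR": "/data/gobbli"})
fallback = resolve_data_dir(Path("/fallback"), {})
print(overridden, fallback)
```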
-
gobbli.util.human_disk_usage()
- Return type
str
- Returns
Human-readable string representing disk usage of all gobbli data in the current gobbli directory.
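Measuring a directory's disk usage and rendering it for humans can be sketched as below. Summing file sizes and using binary units (KiB, MiB) are assumptions. The page does not describe how gobbli computes or formats the value, and both function names are hypothetical.

```python
import tempfile
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def human_size(num_bytes: int) -> str:
    """Render a byte count with one decimal place and a binary unit."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "weights.bin").write_bytes(b"\0" * 1536)
    total = dir_size_bytes(Path(tmp))

readable = human_size(total)
print(total, readable)
```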
-
gobbli.util.is_archive(filepath)
- Parameters
filepath (Path) – Path to a file.
- Return type
bool
- Returns
Whether the file is an archive supported by extract_archive().
-
gobbli.util.is_dir_empty(dir_path)
Determine whether a given directory is empty. Assumes the directory exists.
- Parameters
dir_path (Path) – The directory to check for emptiness.
- Return type
bool
- Returns
True if the directory is empty, otherwise False.
-
gobbli.util.is_multilabel(y)
- Return type
bool
- Returns
True if the passed dataset is formatted as a multilabel problem, otherwise False (it’s multiclass).
-
gobbli.util.model_dir()
- Return type
Path
- Returns
The subdirectory storing all model data and task run input/output.
-
gobbli.util.multiclass_to_multilabel_target(y)
- Parameters
y (List[str]) – Multiclass-formatted target variable
- Return type
List[List[str]]
- Returns
The multiclass-formatted variable generalized to a multilabel context
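Generalizing a multiclass target to a multilabel one amounts to wrapping each label in its own list, as in this sketch. The function name is hypothetical, and this is an illustration of the idea, not gobbli's source.

```python
from typing import List


def multiclass_to_multilabel_sketch(y: List[str]) -> List[List[str]]:
    """Wrap each single label in a one-element list."""
    return [[label] for label in y]


converted = multiclass_to_multilabel_sketch(["spam", "ham"])
print(converted)
```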
-
gobbli.util.multilabel_to_indicator_df(y, labels)
Convert a list of label lists to a 0/1 indicator dataframe.
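The indicator encoding can be illustrated without a dataframe library. In this sketch a list of dictionaries stands in for the dataframe rows, with one key per label column. That substitution and the function name are assumptions made to keep the example standard-library only.

```python
from typing import Dict, List


def multilabel_to_indicator_rows(
    y: List[List[str]], labels: List[str]
) -> List[Dict[str, int]]:
    """Build one row per sample with 1 where the label is present, else 0."""
    return [{label: int(label in row) for label in labels} for row in y]


rows = multilabel_to_indicator_rows([["a", "c"], [], ["b"]], ["a", "b", "c"])
for row in rows:
    print(row)
```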
-
gobbli.util.pred_prob_to_pred_label(y_pred_proba)
Convert a dataframe of predicted probabilities (shape (n_samples, n_classes)) to a list of predicted classes.
- Return type
List[str]
-
gobbli.util.pred_prob_to_pred_multilabel(y_pred_proba, threshold=0.5)
Convert a dataframe of predicted probabilities (shape (n_samples, n_classes)) to a dataframe of predicted label indicators (shape (n_samples, n_classes)).
- Return type
DataFrame
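Both probability conversions above can be sketched in plain Python, using a dictionary per sample in place of a dataframe row. Two assumptions are made: the predicted class is the one with the highest probability, and a label counts as predicted when its probability is strictly greater than the threshold. The function names are hypothetical.

```python
from typing import Dict, List

ProbRow = Dict[str, float]  # one sample: class label -> predicted probability


def to_pred_label(y_pred_proba: List[ProbRow]) -> List[str]:
    """Pick the highest-probability class for each sample."""
    return [max(row, key=row.get) for row in y_pred_proba]


def to_pred_multilabel(
    y_pred_proba: List[ProbRow], threshold: float = 0.5
) -> List[Dict[str, bool]]:
    """Mark each label whose probability exceeds the threshold."""
    return [
        {label: prob > threshold for label, prob in row.items()}
        for row in y_pred_proba
    ]


proba = [
    {"cat": 0.7, "dog": 0.2, "bird": 0.1},
    {"cat": 0.3, "dog": 0.6, "bird": 0.9},
]
pred_labels = to_pred_label(proba)
pred_indicators = to_pred_multilabel(proba)
print(pred_labels)
print(pred_indicators)
```

The second sample shows the difference between the two: the multiclass conversion keeps only the single best class, while the multilabel conversion keeps every class above the threshold.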
-
gobbli.util.read_metadata(file_path)
Read JSON-formatted metadata from the given file path.
- Parameters
file_path (Path) – The path to read JSON metadata from.
- Return type
Dict[str, Any]
- Returns
The JSON object read from the file.
-
gobbli.util.shuffle_together(l1, l2, seed=None)
Shuffle two lists together, so their order is random but the individual elements still correspond. For example, shuffle a list of texts and a list of labels so that each label still corresponds to its text. The lists are shuffled in-place.
The lists must be the same length for this to make sense.
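A paired in-place shuffle can be sketched by zipping the lists, shuffling the pairs, and writing the results back through slice assignment. The explicit length check is an addition of this sketch, and `shuffle_together_sketch` is a hypothetical name, not gobbli's implementation.

```python
import random
from typing import Any, List, Optional


def shuffle_together_sketch(
    l1: List[Any], l2: List[Any], seed: Optional[int] = None
) -> None:
    """Shuffle two equal-length lists in place, keeping pairs aligned."""
    if len(l1) != len(l2):
        raise ValueError("lists must be the same length")
    rng = random.Random(seed)  # a local generator leaves global state alone
    paired = list(zip(l1, l2))
    rng.shuffle(paired)
    # Slice assignment mutates the caller's lists instead of rebinding names.
    l1[:] = [first for first, _ in paired]
    l2[:] = [second for _, second in paired]


texts = ["a", "b", "c", "d"]
labels = ["A", "B", "C", "D"]
shuffle_together_sketch(texts, labels, seed=0)
print(texts, labels)
```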
-
gobbli.util.tokenize(method, texts, model_path=None, vocab_size=2000)
Tokenize a list of texts using a predefined method.
- Parameters
texts (List[str]) – Texts to tokenize.
method (TokenizeMethod) – The type of tokenization to apply.
model_path (Optional[Path]) – Path to save a trained model to. Required if the tokenization method requires training a model; otherwise ignored. If the model doesn’t exist, it will be trained; if it does, the trained model will be reused.
vocab_size (int) – Number of terms in the vocabulary for tokenization methods with a fixed vocabulary size. You may need to lower this if you get tokenization errors or raise it if your texts have a very diverse vocabulary.
- Return type
List[List[str]]
- Returns
List of tokenized texts.
-
gobbli.util.truncate_text(text, length)
Truncate the text to the given length. Append an ellipsis to make it clear the text was truncated.
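Truncation with an ellipsis can be sketched as below. Whether the ellipsis counts toward the length limit is not stated on this page. This sketch assumes it does not, so a truncated result is three characters longer than `length`. The function name is hypothetical.

```python
def truncate_text_sketch(text: str, length: int) -> str:
    """Cut text to `length` characters, appending '...' only if it was cut."""
    if len(text) <= length:
        return text
    return text[:length] + "..."


shortened = truncate_text_sketch("abcdefgh", 5)
unchanged = truncate_text_sketch("abc", 5)
print(shortened, unchanged)
```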