gobbli.util module¶

gobbli.util.LOGGER = <Logger gobbli.util (INFO)>¶

gobbli.util.T1 = ~T1¶

gobbli.util.T2 = ~T2¶

class gobbli.util.TokenizeMethod[source]¶

Bases: enum.Enum

Enum describing the different canned tokenization methods gobbli supports. Processes requiring tokenization should generally allow a user to pass in a custom tokenization function if their needs aren’t met by one of these.

SPLIT¶: Naive tokenization based on whitespace. Probably only useful for testing. Tokens will be lowercased.

SPACY¶: Simple tokenization using spaCy’s English language model. Tokens will be lowercased, and non-alphabetic tokens will be filtered out.

SENTENCEPIECE¶: SentencePiece-based tokenization.

SENTENCEPIECE = 'sentencepiece'

SPACY = 'spacy'

SPLIT = 'split'

gobbli.util.as_multiclass(y, is_multilabel)[source]¶

Return type: List[str]
Returns: Labels in multiclass format, validated against possible mismatch between the multilabel attribute and data format.

gobbli.util.as_multilabel(y, is_multilabel)[source]¶

Return type: List[List[str]]
Returns: Labels in multilabel format, validated against possible mismatch between the multilabel attribute and data format.

gobbli.util.assert_in(name, val, container)[source]¶

Raise an informative error if the passed value isn’t in the given container.

Parameters

name¶ (str) – User-friendly name of the value to be printed in an exception if raised.
val¶ (Any) – The value to check for membership in the container.
container¶ (Container[+T_co]) – The container that should contain the value.
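
A minimal sketch of how a membership check of this kind might be written. This is not gobbli’s implementation; the exception type (ValueError) and the message wording are assumptions made for illustration.

```python
from typing import Any, Container


def assert_in(name: str, val: Any, container: Container) -> None:
    """Raise an error naming the offending value if it isn't in the container."""
    if val not in container:
        # Assumption: ValueError with this message format; the real helper may differ.
        raise ValueError(f"invalid {name}: {val!r}; valid values are {container!r}")


# Passes silently because "split" is one of the allowed values.
assert_in("method", "split", ("split", "spacy"))
```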

gobbli.util.assert_type(name, val, cls)[source]¶

Raise an informative error if the passed value isn’t of the given type.

Parameters

name¶ (str) – User-friendly name of the value to be printed in an exception if raised.
val¶ (Any) – The value to check type of.
cls¶ (Any) – The type or tuple of types to compare the value’s type against.
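
A minimal sketch of a type check along these lines, assuming the comparison is done with isinstance and that a TypeError is raised on failure. Both choices are assumptions; the real helper may behave differently.

```python
from typing import Any


def assert_type(name: str, val: Any, cls: Any) -> None:
    """Raise an error naming the offending value if it isn't an instance of cls."""
    # isinstance accepts a single type or a tuple of types, matching the cls parameter.
    if not isinstance(val, cls):
        # Assumption: TypeError with this message format.
        raise TypeError(f"{name} must be of type {cls!r}; got {type(val).__name__}")


# Passes silently because 3 is an int, one of the accepted types.
assert_type("threshold", 3, (int, float))
```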

gobbli.util.blob_to_dir(blob, dir_path)[source]¶

Extract the given blob (assumed to be a compressed directory created by dir_to_blob()) to the given directory.

Parameters

blob¶ (bytes) – The compressed directory as a binary buffer.
dir_path¶ (Path) – Path to extract the directory to.

gobbli.util.cleanup(force=False, full=False)[source]¶

Cleans up the gobbli directory, removing files and directories. By default, removes only task input/output. If the full argument is True, removes all data, including downloaded model weights and datasets.

Parameters

force¶ (bool) – If True, don’t prompt for user confirmation before cleaning up.
full¶ (bool) – If True, remove all data (including downloaded model weights and datasets).

gobbli.util.collect_labels(y)[source]¶

Collect the unique labels in the given target variable, which could be in multiclass or multilabel format.

Parameters: y¶ (Any) – Target variable
Return type: List[str]
Returns: An ordered list of all labels in y.
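
A minimal sketch of collecting labels from either format. It assumes multiclass targets are lists of strings, multilabel targets are lists of string lists, and "ordered" means sorted; none of these details are confirmed by the reference above.

```python
from typing import Any, List


def collect_labels(y: Any) -> List[str]:
    labels = set()
    for item in y:
        if isinstance(item, str):
            # Multiclass: one label per observation.
            labels.add(item)
        else:
            # Multilabel: a (possibly empty) list of labels per observation.
            labels.update(item)
    # Assumption: the ordering is a plain lexicographic sort.
    return sorted(labels)
```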

gobbli.util.copy_file(src_path, dest_path)[source]¶

Copy a file from the source to the destination, but only if the source is newer than the destination.

Parameters

src_path¶ (Path) – The path to the source file.
dest_path¶ (Path) – The path to the destination where the source should be copied.

Return type

bool

Returns

True if the file was copied, otherwise False if the destination was already up to date.
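
A minimal sketch of the copy-if-newer behavior described above, assuming "newer" is judged by file modification time and that a missing destination always triggers a copy. This is an illustration, not gobbli’s implementation.

```python
import shutil
from pathlib import Path


def copy_file(src_path: Path, dest_path: Path) -> bool:
    # Assumption: compare modification times to decide whether the destination is stale.
    if dest_path.exists() and dest_path.stat().st_mtime >= src_path.stat().st_mtime:
        return False
    # copy2 copies the contents and preserves the source's modification time.
    shutil.copy2(src_path, dest_path)
    return True
```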

gobbli.util.default_gobbli_dir()[source]¶

Return type: Path
Returns: The default directory to be used to store gobbli data if there’s no user-specified default.

gobbli.util.detokenize(method, all_tokens, model_path=None)[source]¶

Detokenize a nested list of tokens into a list of strings, assuming they were created using the given predefined method.

Parameters

method¶ (TokenizeMethod) – The type of tokenization to reverse.
all_tokens¶ – The nested list of tokens to detokenize.
model_path¶ (Optional[Path]) – Path to load a trained model from. Required if the tokenization method requires training a model; otherwise ignored.

Return type

List[str]

Returns

List of texts.

gobbli.util.dir_to_blob(dir_path)[source]¶

Archive a directory and return it as an in-memory blob. Useful for storing a directory’s contents in an in-memory object store. The archive is compressed to reduce its size. Extract it with blob_to_dir().

Parameters: dir_path¶ (Path) – Path to the directory to be archived.
Return type: bytes
Returns: The compressed directory as a binary buffer.
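
A minimal sketch of the round trip between a directory and an in-memory blob. It assumes a gzip-compressed tar archive as the blob format, which the reference above does not specify; the real functions may use a different format.

```python
import io
import tarfile
from pathlib import Path


def dir_to_blob(dir_path: Path) -> bytes:
    buf = io.BytesIO()
    # Assumption: gzip-compressed tar written to an in-memory buffer.
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        for child in dir_path.iterdir():
            # arcname keeps stored paths relative to dir_path.
            archive.add(str(child), arcname=child.name)
    return buf.getvalue()


def blob_to_dir(blob: bytes, dir_path: Path) -> None:
    dir_path.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        archive.extractall(str(dir_path))
```

Only extract blobs from a trusted source in a sketch like this, since tar archives can contain paths that point outside the target directory.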

gobbli.util.disk_usage()[source]¶

Return type: int
Returns: Disk usage (in bytes) of all gobbli data in the current gobbli directory.

gobbli.util.download_archive(archive_url, archive_extract_dir, junk_paths=False, filename=None)[source]¶

Save an archive in the given directory and extract its contents. Automatically retry the download once if the file comes back corrupted (in case it’s left over from a partial download that was cancelled before).

Parameters

archive_url¶ (str) – URL for the archive.
archive_extract_dir¶ (Path) – Download the archive and extract it to this directory.
junk_paths¶ (bool) – If True, disregard the archive’s internal directory hierarchy and extract all files directly to the output directory.
filename¶ (Optional[str]) – If given, store the downloaded file under this name instead of one automatically inferred from the URL.

Return type

Path

Returns

Path to the directory containing the extracted file contents.

gobbli.util.download_dir()[source]¶

Return type: Path
Returns: The directory used to store gobbli downloaded files.

gobbli.util.download_file(url, filename=None)[source]¶

Save a file in the gobbli download directory if it doesn’t already exist there. Stream the download to avoid running out of memory.

Parameters

url¶ (str) – URL for the file.
filename¶ (Optional[str]) – If passed, use this as the filename instead of the best-effort one determined from the URL.

Return type

Path

Returns

The path to the downloaded file.

gobbli.util.escape_line_delimited_text(text)[source]¶

Convert a single text possibly containing newlines and other troublesome whitespace into a string suitable for writing to and reading from a file where newlines separate the texts.

Parameters: text¶ (str) – The text to convert.
Return type: str
Returns: The text with newlines and other troublesome whitespace escaped, so that it fits on a single line.

gobbli.util.escape_line_delimited_texts(texts)[source]¶

Convert a list of texts possibly containing newlines and other troublesome whitespace into a string suitable for writing to and reading from a file where newlines separate the texts.

Parameters: texts¶ (List[str]) – The list of texts to convert.
Return type: str
Returns: The newline-delimited string.
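
A minimal sketch of one way to make texts safe for a line-delimited file. It assumes every run of whitespace is collapsed to a single space; the actual escaping rule used by gobbli is not stated above and may differ.

```python
import re
from typing import List


def escape_line_delimited_text(text: str) -> str:
    # Assumption: collapse each run of whitespace (including newlines) to one space.
    return re.sub(r"\s+", " ", text).strip()


def escape_line_delimited_texts(texts: List[str]) -> str:
    # One escaped text per line, so the file can be split on newlines again.
    return "\n".join(escape_line_delimited_text(t) for t in texts)
```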

gobbli.util.extract_archive(archive_path, archive_extract_dir, junk_paths=False)[source]¶

Extract an archive to the given directory.

Parameters

archive_path¶ (Path) – Path to the archive file.
archive_extract_dir¶ (Path) – Extract the archive to this directory.
junk_paths¶ (bool) – If True, disregard the archive’s internal directory hierarchy and extract all files directly to the output directory.

Returns

Path to the directory containing the extracted file contents.

gobbli.util.format_duration(seconds)[source]¶

Nicely format a given duration in seconds.

Parameters: seconds¶ (float) – The duration to format, in seconds.
Return type: str
Returns: The duration formatted as a string with unit of measurement appended.
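
A minimal sketch of duration formatting with a unit appended. The unit labels, thresholds, and two-decimal precision are assumptions chosen for illustration; the real output format is not documented above.

```python
def format_duration(seconds: float) -> str:
    # Assumption: use the largest unit for which the value is at least 1.
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hr"
    if seconds >= 60:
        return f"{seconds / 60:.2f} min"
    return f"{seconds:.2f} sec"
```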

gobbli.util.generate_uuid()[source]¶

Generate a universally unique ID to be used for randomly naming directories, models, tasks, etc.

Return type: str
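
A minimal sketch using the standard library, assuming a random (version 4) UUID rendered as a hex string. The UUID version and string format actually used are not stated above.

```python
import uuid


def generate_uuid() -> str:
    # Assumption: version 4 UUID as 32 lowercase hex characters, safe for directory names.
    return uuid.uuid4().hex
```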

gobbli.util.gobbli_dir()[source]¶

Return type: Path
Returns: The directory used to store gobbli data. Can be overridden using the GOBBLI_DIR environment variable.
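
A minimal sketch of the environment-variable override described above. The GOBBLI_DIR name comes from the reference; the fallback location shown here is a hypothetical default used only for illustration.

```python
import os
from pathlib import Path


def gobbli_dir() -> Path:
    override = os.environ.get("GOBBLI_DIR")
    if override:
        return Path(override)
    # Assumption: hypothetical default location; see default_gobbli_dir() for the real one.
    return Path.home() / ".gobbli"
```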

gobbli.util.gobbli_version()[source]¶

Return type: str
Returns: The version of gobbli installed.

gobbli.util.human_disk_usage()[source]¶

Return type: str
Returns: Human-readable string representing disk usage of all gobbli data in the current gobbli directory.
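
A minimal sketch of computing disk usage and rendering it for people. The helper names dir_size and humanize_bytes are hypothetical, as are the 1024-based units and one-decimal formatting; unlike the functions above, these take their input as arguments.

```python
from pathlib import Path


def dir_size(root: Path) -> int:
    # Sum the sizes of all regular files below root.
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def humanize_bytes(num_bytes: int) -> str:
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"
```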

gobbli.util.is_archive(filepath)[source]¶

Parameters: filepath¶ (Path) – Path to a file.
Return type: bool
Returns: Whether the file is an archive supported by extract_archive().

gobbli.util.is_dir_empty(dir_path)[source]¶

Determine whether a given directory is empty. Assumes the directory exists.

Parameters: dir_path¶ (Path) – The directory to check for emptiness.
Return type: bool
Returns: True if the directory is empty, otherwise False.
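
A minimal sketch of an emptiness check with pathlib. As the description notes, it assumes the directory exists; this is an illustration rather than gobbli’s implementation.

```python
from pathlib import Path


def is_dir_empty(dir_path: Path) -> bool:
    # any() stops at the first entry, so a large directory is never fully listed.
    return not any(dir_path.iterdir())
```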

gobbli.util.is_multilabel(y)[source]¶

Return type: bool
Returns: True if the passed dataset is formatted as a multilabel problem, otherwise False (it’s multiclass).

gobbli.util.model_dir()[source]¶

Return type: Path
Returns: The subdirectory storing all model data and task run input/output.

gobbli.util.multiclass_to_multilabel_target(y)[source]¶

Parameters: y¶ (List[str]) – Multiclass-formatted target variable
Return type: List[List[str]]
Returns: The multiclass-formatted variable generalized to a multilabel context
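
A minimal sketch of the two target formats and the conversion between them. It assumes multiclass targets are lists of strings, multilabel targets are lists of string lists, and that generalizing means wrapping each label in a one-element list. These are assumptions; the real detection logic may be more thorough.

```python
from typing import Any, List


def is_multilabel(y: Any) -> bool:
    # Assumption: inspect the first observation; a list means multilabel.
    return len(y) > 0 and isinstance(y[0], list)


def multiclass_to_multilabel_target(y: List[str]) -> List[List[str]]:
    # Each observation keeps its single label, now wrapped in a list.
    return [[label] for label in y]
```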

gobbli.util.multilabel_to_indicator_df(y, labels)[source]¶

Convert a list of label lists to a 0/1 indicator dataframe.

Parameters

y¶ (List[List[str]]) – List of label lists
labels¶ (List[str]) – List of all unique labels found in y

Return type

DataFrame

Returns

The dataframe will have a column for each label and a row for each observation, with a 1 if the observation has that label or a 0 if not.

gobbli.util.pred_prob_to_pred_label(y_pred_proba)[source]¶

Convert a dataframe of predicted probabilities (shape (n_samples, n_classes)) to a list of predicted classes.

Return type: List[str]

gobbli.util.pred_prob_to_pred_multilabel(y_pred_proba, threshold=0.5)[source]¶

Convert a dataframe of predicted probabilities (shape (n_samples, n_classes)) to a dataframe of predicted label indicators (shape (n_samples, n_classes)).

Return type: DataFrame

gobbli.util.read_metadata(file_path)[source]¶

Read JSON-formatted metadata from the given file path.

Parameters: file_path¶ (Path) – The path to read JSON metadata from.
Return type: Dict[str, Any]
Returns: The JSON object read from the file.

gobbli.util.shuffle_together(l1, l2, seed=None)[source]¶

Shuffle two lists together, so their order is random but the individual elements still correspond. For example, shuffle a list of texts and a list of labels so the labels still correspond correctly to the texts. The lists are shuffled in-place.

The lists must be the same length for this to make sense.

Parameters

l1¶ (List[~T1]) – The first list to be shuffled.
l2¶ (List[~T2]) – The second list to be shuffled.
seed¶ (Optional[int]) – Seed for the random number generator, if any
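
A minimal sketch of shuffling two lists in tandem and in place. Raising ValueError on a length mismatch is an assumption; the reference only says the lists must be the same length.

```python
import random
from typing import List, Optional, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")


def shuffle_together(l1: List[T1], l2: List[T2], seed: Optional[int] = None) -> None:
    if len(l1) != len(l2):
        # Assumption: fail loudly rather than silently truncating to the shorter list.
        raise ValueError("lists must be the same length")
    pairs = list(zip(l1, l2))
    # A dedicated Random instance avoids touching the global random state.
    random.Random(seed).shuffle(pairs)
    # Slice assignment mutates the caller's lists instead of rebinding local names.
    l1[:] = [a for a, _ in pairs]
    l2[:] = [b for _, b in pairs]
```

Pairing the elements before shuffling is what keeps each text aligned with its label.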

gobbli.util.tokenize(method, texts, model_path=None, vocab_size=2000)[source]¶

Tokenize a list of texts using a predefined method.

Parameters

method¶ (TokenizeMethod) – The type of tokenization to apply.
texts¶ (List[str]) – Texts to tokenize.
model_path¶ (Optional[Path]) – Path to save a trained model to. Required if the tokenization method requires training a model; otherwise ignored. If the model doesn’t exist, it will be trained; if it does, the trained model will be reused.
vocab_size¶ (int) – Number of terms in the vocabulary for tokenization methods with a fixed vocabulary size. You may need to lower this if you get tokenization errors or raise it if your texts have a very diverse vocabulary.

Return type

List[List[str]]

Returns

List of tokenized texts.
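
A minimal sketch of what the SPLIT method amounts to: lowercased whitespace tokenization, plus a matching reversal. The helper names split_tokenize and split_detokenize are hypothetical, and joining tokens with a single space is an assumption about how detokenization works.

```python
from typing import List


def split_tokenize(texts: List[str]) -> List[List[str]]:
    # str.split() with no arguments splits on any run of whitespace.
    return [text.lower().split() for text in texts]


def split_detokenize(all_tokens: List[List[str]]) -> List[str]:
    # Original spacing and casing are lost, so this is only an approximate inverse.
    return [" ".join(tokens) for tokens in all_tokens]
```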

gobbli.util.truncate_text(text, length)[source]¶

Truncate the text to the given length. Append an ellipsis to make it clear the text was truncated.

Parameters

text¶ (str) – The text to truncate.
length¶ (int) – Maximum length of the truncated text.

Return type

str

Returns

The truncated text.
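
A minimal sketch of truncation with an ellipsis. It assumes the ellipsis is three ASCII dots appended after the cut, so the result can be up to three characters longer than length; whether the real function counts the ellipsis toward the limit is not stated above.

```python
def truncate_text(text: str, length: int) -> str:
    if len(text) <= length:
        # Short enough already, so return it unchanged with no ellipsis.
        return text
    # Assumption: the ellipsis is added on top of the `length` kept characters.
    return text[:length] + "..."
```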

gobbli.util.write_metadata(metadata, file_path)[source]¶

Write some JSON-formatted metadata to the given file path, formatted appropriately for human consumption.

Parameters

metadata¶ (Dict[str, Any]) – The valid JSON metadata to write. Nesting is allowed, but ensure all types in the structure are JSON-serializable.
file_path¶ (Path) – The path to write the metadata to.
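
A minimal sketch of writing and reading JSON metadata with the standard library. It assumes "formatted appropriately for human consumption" means indented output; the indent width is an arbitrary choice.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_metadata(metadata: Dict[str, Any], file_path: Path) -> None:
    with file_path.open("w", encoding="utf-8") as f:
        # Assumption: indentation is what makes the file human-readable.
        json.dump(metadata, f, indent=4)


def read_metadata(file_path: Path) -> Dict[str, Any]:
    with file_path.open("r", encoding="utf-8") as f:
        return json.load(f)
```

Values that are not JSON-serializable, such as Path objects or sets, would make json.dump raise a TypeError, which is why the metadata parameter asks for JSON-serializable types.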
