gobbli.util module

gobbli.util.LOGGER = <Logger gobbli.util (INFO)>
gobbli.util.T1 = ~T1
gobbli.util.T2 = ~T2
class gobbli.util.TokenizeMethod[source]

Bases: enum.Enum

Enum describing the different canned tokenization methods gobbli supports. Processes requiring tokenization should generally allow a user to pass in a custom tokenization function if their needs aren’t met by one of these.

SPLIT

Naive tokenization based on whitespace. Probably only useful for testing. Tokens will be lowercased.

SPACY

Simple tokenization using spaCy’s English language model. Tokens will be lowercased, and non-alphabetic tokens will be filtered out.

SENTENCEPIECE

SentencePiece-based tokenization.

SENTENCEPIECE = 'sentencepiece'
SPACY = 'spacy'
SPLIT = 'split'
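
As a rough illustration of what the SPLIT method describes (whitespace splitting plus lowercasing), here is a minimal stdlib-only sketch. The enum below only mirrors the member values listed above, and split_tokenize is a hypothetical helper, not part of gobbli.

```python
from enum import Enum
from typing import List


class TokenizeMethod(Enum):
    # Stand-in that mirrors the documented member values of
    # gobbli.util.TokenizeMethod; it is not the library's class.
    SPLIT = "split"
    SPACY = "spacy"
    SENTENCEPIECE = "sentencepiece"


def split_tokenize(texts: List[str]) -> List[List[str]]:
    """Hypothetical helper: lowercase each text, then split on whitespace."""
    return [text.lower().split() for text in texts]


tokens = split_tokenize(["Hello  World", "Naive\ttokenization"])
```

Because the members carry string values, a method can also be looked up from a configuration string, for example TokenizeMethod("split").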
gobbli.util.as_multiclass(y, is_multilabel)[source]
Return type

List[str]

Returns

Labels in multiclass format, validated against possible mismatch between the multilabel attribute and data format.

gobbli.util.as_multilabel(y, is_multilabel)[source]
Return type

List[List[str]]

Returns

Labels in multilabel format, validated against possible mismatch between the multilabel attribute and data format.

gobbli.util.assert_in(name, val, container)[source]

Raise an informative error if the passed value isn’t in the given container.

Parameters
  • name (str) – User-friendly name of the value to be printed in an exception if raised.

  • val (Any) – The value to check for membership in the container.

  • container (Container[+T_co]) – The container that should contain the value.
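
The membership check described above might look roughly like the following. This is a sketch only: the function name assert_in_sketch, the ValueError exception type, and the message wording are all assumptions, since the exception gobbli raises is not stated here.

```python
from typing import Any, Container


def assert_in_sketch(name: str, val: Any, container: Container) -> None:
    """Raise ValueError (an assumed exception type) if val is not in container."""
    if val not in container:
        raise ValueError(f"{name} must be one of {container!r}; got {val!r}")


assert_in_sketch("method", "split", ("split", "spacy"))  # passes silently

try:
    assert_in_sketch("method", "bpe", ("split", "spacy"))
except ValueError as err:
    # The user-friendly name and the offending value both appear in the message.
    message = str(err)
```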

gobbli.util.assert_type(name, val, cls)[source]

Raise an informative error if the passed value isn’t of the given type.

Parameters
  • name (str) – User-friendly name of the value to be printed in an exception if raised.

  • val (Any) – The value to check type of.

  • cls (Any) – The type or tuple of types to compare the value’s type against.
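
A type check that accepts either a single type or a tuple of types can be built on isinstance(). The sketch below assumes a TypeError is raised; the function name and message format are hypothetical.

```python
from typing import Any


def assert_type_sketch(name: str, val: Any, cls: Any) -> None:
    """Raise TypeError (an assumed exception type) if val is not an instance of cls."""
    # isinstance() accepts both a single type and a tuple of types.
    if not isinstance(val, cls):
        raise TypeError(f"{name} must be of type {cls!r}; got {type(val).__name__}")


assert_type_sketch("vocab_size", 2000, int)         # single type
assert_type_sketch("threshold", 0.5, (int, float))  # tuple of types

try:
    assert_type_sketch("vocab_size", "2000", int)
except TypeError as err:
    message = str(err)
```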

gobbli.util.blob_to_dir(blob, dir_path)[source]

Extract the given blob (assumed to be a compressed directory created by dir_to_blob()) to the given directory.

Parameters
  • blob (bytes) – The compressed directory as a binary buffer.

  • dir_path (Path) – Path to extract the directory to.

gobbli.util.cleanup(force=False, full=False)[source]

Cleans up the gobbli directory, removing files and directories. By default, removes only task input/output. If the full argument is True, removes all data, including downloaded model weights and datasets.

Parameters
  • force (bool) – If True, don’t prompt for user confirmation before cleaning up.

  • full (bool) – If True, remove all data (including downloaded model weights and datasets).

gobbli.util.collect_labels(y)[source]

Collect the unique labels in the given target variable, which could be in multiclass or multilabel format.

Parameters

y (Any) – Target variable.

Return type

List[str]

Returns

An ordered list of all labels in y.
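
To make the two accepted formats concrete, the following sketch gathers unique labels from either a flat list of strings (multiclass) or a list of label lists (multilabel). Sorting alphabetically is an assumption; the documentation above only says the result is ordered. collect_labels_sketch is a hypothetical name.

```python
from typing import List, Union


def collect_labels_sketch(y: Union[List[str], List[List[str]]]) -> List[str]:
    """Return the unique labels in y, sorted alphabetically (assumed ordering)."""
    labels = set()
    for item in y:
        if isinstance(item, str):
            labels.add(item)     # multiclass: one label per observation
        else:
            labels.update(item)  # multilabel: a list of labels per observation
    return sorted(labels)


multiclass_labels = collect_labels_sketch(["b", "a", "b"])
multilabel_labels = collect_labels_sketch([["b"], [], ["a", "c"]])
```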

gobbli.util.copy_file(src_path, dest_path)[source]

Copy a file from the source to the destination, but only if the source is newer than the destination.

Parameters
  • src_path (Path) – The path to the source file.

  • dest_path (Path) – The path to the destination where the source should be copied.

Return type

bool

Returns

True if the file was copied, otherwise False if the destination was already up to date.
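
One plausible way to implement this "copy only when newer" behavior is to compare modification times. The sketch below is an assumption about the approach, not gobbli's actual code; copy_if_newer is a hypothetical name, and a missing destination is treated as out of date.

```python
import shutil
import tempfile
from pathlib import Path


def copy_if_newer(src_path: Path, dest_path: Path) -> bool:
    """Copy src_path to dest_path unless the destination is already up to date."""
    if (
        dest_path.exists()
        and src_path.stat().st_mtime_ns <= dest_path.stat().st_mtime_ns
    ):
        return False
    # copy2 preserves the modification time, so an immediate second call
    # sees equal timestamps and skips the copy.
    shutil.copy2(str(src_path), str(dest_path))
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src.txt"
    dest = Path(tmp) / "dest.txt"
    src.write_text("v1")
    first = copy_if_newer(src, dest)   # destination missing, so the file is copied
    second = copy_if_newer(src, dest)  # destination is now up to date
    copied_text = dest.read_text()
```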

gobbli.util.default_gobbli_dir()[source]
Return type

Path

Returns

The default directory to be used to store gobbli data if there’s no user-specified default.

gobbli.util.detokenize(method, all_tokens, model_path=None)[source]

Detokenize a nested list of tokens into a list of strings, assuming they were created using the given predefined method.

Parameters
  • method (TokenizeMethod) – The type of tokenization to reverse.

  • all_tokens (List[List[str]]) – The nested list of tokens to detokenize.

  • model_path (Optional[Path]) – Path to load a trained model from. Required if the tokenization method requires training a model; otherwise ignored.

Return type

List[str]

Returns

List of texts.

gobbli.util.dir_to_blob(dir_path)[source]

Archive a directory and return it as an in-memory blob. Useful for storing a directory’s contents in an in-memory object store. The archive is compressed to reduce its size. Extract it with blob_to_dir().

Parameters

dir_path (Path) – Path to the directory to be archived.

Return type

bytes

Returns

The compressed directory as a binary buffer.
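
The round trip between a directory and an in-memory blob can be sketched with the standard library. The gzip-compressed tar format used here is an assumption (the archive format gobbli uses is not stated above), and both function names are hypothetical stand-ins for dir_to_blob() and blob_to_dir().

```python
import io
import tarfile
import tempfile
from pathlib import Path


def dir_to_blob_sketch(dir_path: Path) -> bytes:
    """Archive the contents of dir_path into a gzip-compressed tar held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for child in dir_path.iterdir():
            # arcname keeps stored paths relative to the archived directory.
            archive.add(str(child), arcname=child.name)
    return buffer.getvalue()


def blob_to_dir_sketch(blob: bytes, dir_path: Path) -> None:
    """Extract a blob produced by dir_to_blob_sketch into dir_path."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        archive.extractall(str(dir_path))


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dest:
    (Path(src) / "weights.txt").write_text("0.1 0.2")
    blob = dir_to_blob_sketch(Path(src))
    blob_to_dir_sketch(blob, Path(dest))
    restored = (Path(dest) / "weights.txt").read_text()
```

Only blobs from a trusted source should be extracted this way, since tar archives can contain paths that point outside the target directory.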

gobbli.util.disk_usage()[source]
Return type

int

Returns

Disk usage (in bytes) of all gobbli data in the current gobbli directory.

gobbli.util.download_archive(archive_url, archive_extract_dir, junk_paths=False, filename=None)[source]

Save an archive in the given directory and extract its contents. If the downloaded file turns out to be corrupted (for example, because it is left over from a partial download that was cancelled earlier), the download is automatically retried once.

Parameters
  • archive_url (str) – URL for the archive.

  • archive_extract_dir (Path) – Download the archive and extract it to this directory.

  • junk_paths (bool) – If True, disregard the archive’s internal directory hierarchy and extract all files directly to the output directory.

  • filename (Optional[str]) – If given, store the downloaded file under this name instead of one automatically inferred from the URL.

Return type

Path

Returns

Path to the directory containing the extracted file contents.

gobbli.util.download_dir()[source]
Return type

Path

Returns

The directory used to store gobbli downloaded files.

gobbli.util.download_file(url, filename=None)[source]

Save a file in the gobbli download directory if it doesn’t already exist there. Stream the download to avoid running out of memory.

Parameters
  • url (str) – URL for the file.

  • filename (Optional[str]) – If passed, use this as the filename instead of the best-effort one determined from the URL.

Return type

Path

Returns

The path to the downloaded file.

gobbli.util.escape_line_delimited_text(text)[source]

Convert a single text, possibly containing newlines and other troublesome whitespace, into a string suitable for writing to and reading from a file in which newlines separate the texts.

Parameters

text (str) – The text to convert.

Return type

str

Returns

The text with newlines and other troublesome whitespace escaped.

gobbli.util.escape_line_delimited_texts(texts)[source]

Convert a list of texts, possibly containing newlines and other troublesome whitespace, into a single string suitable for writing to and reading from a file in which newlines separate the texts.

Parameters

texts (List[str]) – The list of texts to convert.

Return type

str

Returns

The newline-delimited string.
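
The idea behind a line-delimited format is that each text must occupy exactly one line. How gobbli handles the whitespace is not specified here; the sketch below assumes the simplest approach, collapsing every run of whitespace into a single space, which is lossy. Both function names are hypothetical.

```python
from typing import List


def escape_text_sketch(text: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, etc.) into one space."""
    # str.split() with no arguments splits on any whitespace and drops
    # leading and trailing whitespace.
    return " ".join(text.split())


def escape_texts_sketch(texts: List[str]) -> str:
    """Join the escaped texts so that each one occupies exactly one line."""
    return "\n".join(escape_text_sketch(text) for text in texts)


joined = escape_texts_sketch(["first\nline", "second\t text\r\n"])
```

Splitting the result on newlines then yields one entry per original text.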

gobbli.util.extract_archive(archive_path, archive_extract_dir, junk_paths=False)[source]

Extract an archive to the given directory.

Parameters
  • archive_path (Path) – Path to the archive file.

  • archive_extract_dir (Path) – Extract the archive to this directory.

  • junk_paths (bool) – If True, disregard the archive’s internal directory hierarchy and extract all files directly to the output directory.

Returns

Path to the directory containing the extracted file contents.

gobbli.util.format_duration(seconds)[source]

Nicely format a given duration in seconds.

Parameters

seconds (float) – The duration to format, in seconds.

Return type

str

Returns

The duration formatted as a string with unit of measurement appended.

gobbli.util.generate_uuid()[source]

Generate a universally unique ID to be used for randomly naming directories, models, tasks, etc.

Return type

str

gobbli.util.gobbli_dir()[source]
Return type

Path

Returns

The directory used to store gobbli data. Can be overridden using the GOBBLI_DIR environment variable.
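
An environment-variable override of this kind usually follows a "use the variable if set, otherwise fall back" pattern. The sketch below shows that pattern only; resolve_data_dir and the fallback path are hypothetical, and the real default comes from default_gobbli_dir().

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_data_dir(env: Mapping[str, str], default: Path) -> Path:
    """Prefer the GOBBLI_DIR entry in env; fall back to default if unset or empty."""
    override = env.get("GOBBLI_DIR")
    return Path(override) if override else default


fallback = Path("example-data")  # hypothetical default, for illustration only
from_default = resolve_data_dir({}, fallback)
from_override = resolve_data_dir({"GOBBLI_DIR": "/tmp/gobbli-data"}, fallback)

# In real use, the process environment would be passed as the mapping.
from_environment = resolve_data_dir(os.environ, fallback)
```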

gobbli.util.gobbli_version()[source]
Return type

str

Returns

The version of gobbli installed.

gobbli.util.human_disk_usage()[source]
Return type

str

Returns

Human-readable string representing disk usage of all gobbli data in the current gobbli directory.

gobbli.util.is_archive(filepath)[source]
Parameters

filepath (Path) – Path to a file.

Return type

bool

Returns

Whether the file is an archive supported by extract_archive().

gobbli.util.is_dir_empty(dir_path)[source]

Determine whether a given directory is empty. Assumes the directory exists.

Parameters

dir_path (Path) – The directory to check for emptiness.

Return type

bool

Returns

True if the directory is empty, otherwise False.
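
A common way to perform this check without listing the whole directory is to ask whether its iterator yields anything. This is a sketch under that assumption; is_dir_empty_sketch is a hypothetical name.

```python
import tempfile
from pathlib import Path


def is_dir_empty_sketch(dir_path: Path) -> bool:
    """Return True if dir_path has no entries; raises if it does not exist."""
    # any() stops at the first entry, so large directories are not fully listed.
    return not any(dir_path.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    empty_before = is_dir_empty_sketch(Path(tmp))
    (Path(tmp) / "model.bin").write_bytes(b"")
    empty_after = is_dir_empty_sketch(Path(tmp))
```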

gobbli.util.is_multilabel(y)[source]
Return type

bool

Returns

True if the passed dataset is formatted as a multilabel problem; otherwise False (meaning it is multiclass).

gobbli.util.model_dir()[source]
Return type

Path

Returns

The subdirectory storing all model data and task run input/output.

gobbli.util.multiclass_to_multilabel_target(y)[source]
Parameters

y (List[str]) – Multiclass-formatted target variable.

Return type

List[List[str]]

Returns

The multiclass-formatted variable generalized to a multilabel context.
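
Given the documented input and return types, the natural reading of this conversion is that each single label becomes a one-element list. The sketch below assumes that interpretation; to_multilabel_sketch is a hypothetical name.

```python
from typing import List


def to_multilabel_sketch(y: List[str]) -> List[List[str]]:
    """Wrap each single label in a one-element list."""
    return [[label] for label in y]


multilabel_y = to_multilabel_sketch(["spam", "ham", "spam"])
```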

gobbli.util.multilabel_to_indicator_df(y, labels)[source]

Convert a list of label lists to a 0/1 indicator dataframe.

Parameters
  • y (List[List[str]]) – List of label lists.

  • labels (List[str]) – List of all unique labels found in y.

Return type

DataFrame

Returns

The dataframe will have a column for each label and a row for each observation, with a 1 if the observation has that label or a 0 if not.
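
The indicator layout can be illustrated without pandas by building one dictionary per observation. This is a plain-Python sketch of the described shape, not gobbli's implementation; indicator_rows is a hypothetical helper.

```python
from typing import Dict, List


def indicator_rows(y: List[List[str]], labels: List[str]) -> List[Dict[str, int]]:
    """Build one {label: 0/1} row per observation, with keys in labels order."""
    rows = []
    for observed in y:
        present = set(observed)
        rows.append({label: int(label in present) for label in labels})
    return rows


rows = indicator_rows([["a", "c"], [], ["b"]], labels=["a", "b", "c"])
```

If pandas is available, pandas.DataFrame(rows, columns=labels) turns these rows into a dataframe with one column per label.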

gobbli.util.pred_prob_to_pred_label(y_pred_proba)[source]

Convert a dataframe of predicted probabilities (shape (n_samples, n_classes)) to a list of predicted classes.

Return type

List[str]
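
Converting probabilities to a single predicted class per row is typically an argmax over the class columns. The sketch below represents each dataframe row as a dictionary to stay dependency-free; argmax_labels is a hypothetical helper. With a pandas dataframe, y_pred_proba.idxmax(axis=1).tolist() expresses the same idea.

```python
from typing import Dict, List


def argmax_labels(prob_rows: List[Dict[str, float]]) -> List[str]:
    """Pick, for each row, the label whose probability is largest."""
    return [max(row, key=row.get) for row in prob_rows]


predicted = argmax_labels([
    {"neg": 0.2, "pos": 0.8},
    {"neg": 0.7, "pos": 0.3},
])
```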

gobbli.util.pred_prob_to_pred_multilabel(y_pred_proba, threshold=0.5)[source]

Convert a dataframe of predicted probabilities (shape (n_samples, n_classes)) to a dataframe of predicted label indicators (shape (n_samples, n_classes)).

Return type

DataFrame
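
In the multilabel case, each class is decided independently by comparing its probability against the threshold. Whether gobbli uses a strict or inclusive comparison is not stated above; this sketch assumes a strict one (probability greater than the threshold). threshold_rows is a hypothetical helper operating on dictionaries instead of a dataframe.

```python
from typing import Dict, List


def threshold_rows(
    prob_rows: List[Dict[str, float]], threshold: float = 0.5
) -> List[Dict[str, int]]:
    """Mark a label 1 when its probability exceeds threshold (assumed strict)."""
    return [
        {label: int(prob > threshold) for label, prob in row.items()}
        for row in prob_rows
    ]


indicators = threshold_rows([
    {"a": 0.9, "b": 0.5, "c": 0.1},
    {"a": 0.4, "b": 0.6, "c": 0.7},
])
```

Unlike the multiclass conversion, a row here can end up with several labels or with none.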

gobbli.util.read_metadata(file_path)[source]

Read JSON-formatted metadata from the given file path.

Parameters

file_path (Path) – The path to read JSON metadata from.

Return type

Dict[str, Any]

Returns

The JSON object read from the file.

gobbli.util.shuffle_together(l1, l2, seed=None)[source]

Shuffle two lists together, so their order is random but the individual elements still correspond. Ex. shuffle a list of texts and a list of labels so the labels still correspond correctly to the texts. The lists are shuffled in-place.

The lists must be the same length for this to make sense.

Parameters
  • l1 (List[~T1]) – The first list to be shuffled.

  • l2 (List[~T2]) – The second list to be shuffled.

  • seed (Optional[int]) – Seed for the random number generator, if any.
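
One way to shuffle two lists in place while keeping their elements paired is to shuffle the zipped pairs and write them back. This is a sketch of that technique, not necessarily how gobbli does it; shuffle_together_sketch is a hypothetical name, and raising ValueError on a length mismatch is an assumption.

```python
import random
from typing import List, Optional, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")


def shuffle_together_sketch(
    l1: List[T1], l2: List[T2], seed: Optional[int] = None
) -> None:
    """Shuffle both lists in place using one shared permutation."""
    if len(l1) != len(l2):
        raise ValueError("lists must be the same length")
    pairs = list(zip(l1, l2))
    random.Random(seed).shuffle(pairs)
    # Slice assignment mutates the caller's lists rather than rebinding names.
    l1[:] = [first for first, _ in pairs]
    l2[:] = [second for _, second in pairs]


texts = ["good movie", "bad movie", "fine movie", "great movie"]
labels = ["pos", "neg", "neutral", "pos"]
expected_pairs = set(zip(texts, labels))
shuffle_together_sketch(texts, labels, seed=0)
```

After the call, every text is still aligned with its original label, whatever order the pairs end up in.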

gobbli.util.tokenize(method, texts, model_path=None, vocab_size=2000)[source]

Tokenize a list of texts using a predefined method.

Parameters
  • method (TokenizeMethod) – The type of tokenization to apply.

  • texts (List[str]) – Texts to tokenize.

  • model_path (Optional[Path]) – Path to save a trained model to. Required if the tokenization method requires training a model; otherwise ignored. If the model doesn’t exist, it will be trained; if it does, the trained model will be reused.

  • vocab_size (int) – Number of terms in the vocabulary for tokenization methods with a fixed vocabulary size. You may need to lower this if you get tokenization errors or raise it if your texts have a very diverse vocabulary.

Return type

List[List[str]]

Returns

List of tokenized texts.

gobbli.util.truncate_text(text, length)[source]

Truncate the text to the given length. Append an ellipsis to make it clear the text was truncated.

Parameters
  • text (str) – The text to truncate.

  • length (int) – Maximum length of the truncated text.

Return type

str

Returns

The truncated text.
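
The description leaves open whether the ellipsis counts toward the maximum length. The sketch below assumes it does, so the result never exceeds length characters as long as length is at least as long as the ellipsis. truncate_text_sketch and its ellipsis parameter are hypothetical.

```python
def truncate_text_sketch(text: str, length: int, ellipsis: str = "...") -> str:
    """Shorten text to at most length characters, ending in an ellipsis if cut."""
    if len(text) <= length:
        return text  # short enough already, so nothing is appended
    # Reserve room for the ellipsis (assumes length >= len(ellipsis)).
    keep = max(length - len(ellipsis), 0)
    return text[:keep] + ellipsis


shortened = truncate_text_sketch("hello world", 8)
unchanged = truncate_text_sketch("short", 10)
```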

gobbli.util.write_metadata(metadata, file_path)[source]

Write some JSON-formatted metadata to the given file path, formatted appropriately for human consumption.

Parameters
  • metadata (Dict[str, Any]) – The valid JSON metadata to write. Nesting is allowed, but ensure all types in the structure are JSON-serializable.

  • file_path (Path) – The path to write the metadata to.
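
A write and read round trip for JSON metadata can be sketched with the standard json module. The indentation and key sorting used here are assumptions about what "formatted appropriately for human consumption" means, and both function names are hypothetical stand-ins for write_metadata() and read_metadata().

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_metadata_sketch(metadata: Dict[str, Any], file_path: Path) -> None:
    """Write metadata as indented JSON so it is easy to read by eye."""
    with file_path.open("w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2, sort_keys=True)


def read_metadata_sketch(file_path: Path) -> Dict[str, Any]:
    """Read the JSON object stored at file_path."""
    with file_path.open("r", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.json"
    original = {"model": "example", "params": {"epochs": 3, "labels": ["a", "b"]}}
    write_metadata_sketch(original, path)
    raw_text = path.read_text(encoding="utf-8")
    round_tripped = read_metadata_sketch(path)
```

Values that are not JSON-serializable (for example, Path objects) would make json.dump raise a TypeError, which is why the metadata parameter asks for JSON-serializable types throughout.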