Advanced Usage

gobbli provides some additional features for further customization.

Filesystem Organization

gobbli persists a large amount of data to disk, including datasets, model weights, and output. All data is saved under the gobbli directory, which is ~/.gobbli by default. If you want to keep your gobbli data elsewhere, set the GOBBLI_DIR environment variable to the desired location.
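
If you prefer to do this from Python, here is a minimal sketch (the path is illustrative, and the variable should be set before gobbli creates any files; setting it in your shell works just as well):

import os

# Hypothetical location for gobbli data; set this before gobbli writes anything to disk
os.environ["GOBBLI_DIR"] = "/data/gobbli"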

The default directory hierarchy isolates models and task runs using unique directories named by UUIDs, which aren’t particularly readable. If you need more control over the directory hierarchy, you can use the data_dir argument when creating a model:

from gobbli.model import BERT
from pathlib import Path

clf = BERT(
    data_dir=Path("./my_bert/")
)

This will override the default organization and place all model data under the given directory, which must be empty. For a given task (training, prediction, embedding, etc.), you can also supply your own name to replace the UUID:

clf.train(train_input, train_dir_name='train_batch_128')

The above will store all training input and output files in a directory named train_batch_128 under the model’s data_dir.
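
For context, the call above might look like this end to end. This is a hedged sketch: the TrainInput field names (X_train, y_train, X_valid, y_valid, num_train_epochs) are assumptions that mirror the task metadata example in the Metadata section below, and clf is the BERT model created earlier:

from gobbli.io import TrainInput

# Minimal training input; the field names here are assumed, not prescribed
train_input = TrainInput(
    X_train=["text one", "text two"],
    y_train=["a", "b"],
    X_valid=["text three", "text four"],
    y_valid=["a", "b"],
    num_train_epochs=1,
)

train_output = clf.train(train_input, train_dir_name='train_batch_128')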

The hierarchy generally looks something like this:

GOBBLI_DIR/model/<model_class_name>/<model_data_dir_name>/<task_name>/<task_data_dir_name>/{input,output}/

For example:

GOBBLI_DIR/model/BERT/my_bert/train/train_batch_128/{input,output}/

Housekeeping

If you’ve run a lot of tasks, you may start to fill up your disk. gobbli provides a couple of utilities to manage disk usage in the default gobbli directory hierarchy. You can use gobbli.util.human_disk_usage() to print the current disk usage of the gobbli directory and gobbli.util.cleanup() to remove unused files. The cleanup function removes only task input and output by default, leaving downloaded model weights and datasets. If you want to erase all gobbli data, you can use the full argument.

>>> gobbli.util.human_disk_usage()
'21.3 GB'
>>> gobbli.util.cleanup()
Cleanup will remove all task input/output, including trained models.  Are you sure? [Y/n]y
>>> gobbli.util.human_disk_usage()
'8.8 GB'
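
To erase everything, including downloaded model weights and datasets, pass the full argument (shown here as a keyword argument, which is an assumption about its exact form):

>>> gobbli.util.cleanup(full=True)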

Advanced Experimentation

gobbli experiments are limited to a predetermined workflow but include some options for customization.

  • Parameter Tuning: Experiments accept a param_grid option that lets you specify a grid of parameter settings to try. The grid should be a dictionary with parameter names (strings) as keys and lists of parameter values to try as values. Each parameter combination is trained on the training set and evaluated on the validation set; the best combination is then retrained on the combined training/validation set and evaluated on the test set for the final results. See the sketch following this list for an example.

  • Parallel/Distributed Experimentation: gobbli uses ray under the hood to run multiple training/validation steps in parallel. By default, Ray creates a local cluster from all CPUs on your machine, but it can also be configured to use GPUs or to connect to an existing distributed cluster. Note that ray (and gobbli) must be installed on all worker nodes in the cluster. Experiments accept an optional ray_kwargs option, which is passed directly to ray.init(); use this parameter for more control over the underlying Ray cluster. NOTE: If you’re running an experiment on a single node, gobbli simply passes checkpoints around as file paths, since the Ray master and workers share a filesystem. In a distributed experiment, gobbli cannot rely on file paths being the same between the workers and the master node, so it saves checkpoints as gzip-compressed tar archives in memory and stores them in the Ray object store. This means your object store must be able to hold weights for as many trials as will run in a single experiment, which may require a lot of memory.

  • Enabling GPU support: During experiments, gobbli exposes GPUs to models based on whether they’re made available to the Ray cluster and are required for tasks. To run a GPU-enabled experiment, reserve a nonzero number of GPUs for each task via the task_num_gpus parameter and tell Ray the cluster contains a nonzero number of GPUs via the num_gpus argument to ray.init().
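
Putting these options together, a GPU-enabled experiment with a small parameter grid might look roughly like the following sketch. The ClassificationExperiment class, the dataset argument, the run() method, and the max_seq_length grid key are assumptions rather than anything prescribed above; param_grid, task_num_gpus, and ray_kwargs are the options just described:

from gobbli.model import BERT
from gobbli.experiment import ClassificationExperiment

# Toy dataset; in practice these would be your full texts and labels
X = ["good movie", "bad movie", "great film", "terrible film"]
y = ["positive", "negative", "positive", "negative"]

exp = ClassificationExperiment(
    model_cls=BERT,
    dataset=(X, y),
    # Each combination is trained on the training set and scored on the validation set
    param_grid={"max_seq_length": [64, 128]},
    # Reserve one GPU per training/validation task...
    task_num_gpus=1,
    # ...and tell Ray (via ray.init()) that the cluster contains one GPU
    ray_kwargs={"num_gpus": 1},
)

results = exp.run()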

Metadata

Each model and task writes JSON-formatted metadata, containing parameters and other useful information, to its data directory. The metadata can be read to recall what parameters were used to train a given model, where the checkpoint for a training task is stored, how many embeddings were generated, and so on.

Model metadata is stored in the model’s data directory in a file named gobbli-model-meta.json. The metadata generally contains model parameters that can be used to recreate the same model later (see Re-Initializing Models). See the init() method for classes derived from gobbli.model.base.BaseModel for more info on which keys should be expected in the metadata. Example model metadata:

{
    "max_seq_length": 128
}
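
Because the metadata is plain JSON, it can be read back with the standard library. A minimal sketch using the custom data_dir from earlier and the file name described above:

import json
from pathlib import Path

# Read back the parameters the model was created with
with open(Path("./my_bert/") / "gobbli-model-meta.json") as f:
    model_meta = json.load(f)

print(model_meta["max_seq_length"])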

Task metadata is stored in the task’s directory in a file named gobbli-task-meta.json. For input tasks, the metadata generally contains the task parameters and some summary information about the input. For output tasks, the metadata usually has the locations of any generated artifacts and summary information about the generated output. See the metadata() method for classes derived from gobbli.io.TaskIO for more info on which keys should be expected in the metadata. Example task metadata:

{
    "train_batch_size": 32,
    "valid_batch_size": 8,
    "num_train_epochs": 1,
    "len_X_train": 40,
    "len_y_train": 40,
    "len_X_valid": 10,
    "len_y_valid": 10
}
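
Task metadata can be read the same way. For the training run named earlier, here is a sketch assuming the metadata file sits directly in the task directory laid out in the hierarchy above:

import json
from pathlib import Path

# Summary info for the training task named 'train_batch_128'
task_dir = Path("./my_bert/") / "train" / "train_batch_128"
with open(task_dir / "gobbli-task-meta.json") as f:
    task_meta = json.load(f)

print(task_meta["num_train_epochs"])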

Re-Initializing Models

You can re-initialize a model from the metadata in an existing data directory using the load_existing argument: the model will reload its parameters from the metadata file in that directory, so you don’t have to specify them again. To reload the model created with non-default parameters above in a different session:

from gobbli.model import BERT
from pathlib import Path

clf = BERT(
    data_dir=Path("./my_bert/"),
    load_existing=True
)

Running gobbli Inside a Docker Container

Since gobbli must spawn its own Docker containers, there are some extra complications when trying to run it from inside a Docker container (as opposed to natively on the host machine).

  • You must mount /var/run/docker.sock on the host to the same path in the container. This allows the Docker client inside the container to communicate with the Docker daemon on the host.

  • Any directories containing persistent files (your gobbli directory, custom model directories, etc.) that themselves need to be mounted into spawned containers must be mounted into the main Docker container at the same path they have on the host. This is because the Docker daemon can only see paths on the host, so any path that needs to be mounted into a spawned container must also exist on the host. You can accomplish this with a mount declaration like $(pwd):$(pwd).

See the gobbli-ci service declaration in ci/docker-compose.yml for a working example of how to properly run gobbli inside a Docker container.