FFCV guide

    Introduction

FFCV is a library that speeds up the training of deep learning models. Training commonly follows a simple cycle:

1. the CPU loads the batch from disk
2. the batch is moved to the CUDA device
3. the CUDA device runs the forward and backward pass

This can easily become a bottleneck if either the CPU or the disk is too slow. FFCV addresses this by first storing the dataset in an optimized format and then using asynchronous preloading to remove these bottlenecks.

This guide presents some tips and insights on how to use FFCV efficiently. The focus lies on improving the performance of model training on the HPC, as dataloading bottlenecks commonly occur in the HPC environment, e.g. due to slow disk speeds (the NFS file system), underutilized parallelism or inefficient memory usage.

    Installation

This guide includes a small, fully contained example project in the example directory; integrating FFCV into a larger project works the same way.

    Prerequisites

• conda runtime. On the HPC this can be loaded using module purge && module load python/3.10-anaconda (module purge first unloads all modules, such as the default Python 3.8 module, then loads the Anaconda runtime)
• Around 10 GB of storage for the required packages (by default conda installs them in $HOME, but the location can be changed)

    Installation with Pip

Unfortunately, pip cannot be used to install FFCV. Since pip only supports the installation of Python packages (unlike conda), it cannot install all the necessary non-Python dependencies of FFCV (such as opencv, ffmpeg and jpeg_turbo).

    Installation with Conda

Since the required (C++) libraries are not preinstalled, we have to use conda to install them. The file environment.yml creates an environment with all required dependencies when you run conda env create --name ffcv-guide --file environment.yml. Don't forget to activate the environment afterwards with conda activate ffcv-guide.

    Setup FFCV

As mentioned before, FFCV works by storing the dataset in a custom format (so-called .beton files). This turns thousands of small files (typical for image datasets like imagenet1k) into a single large file. When creating the .beton file, the user can set a compression ratio as well as a maximum resolution.

    Generating the .beton files

The script write_ffcv.py already contains example methods for writing Cifar10 and ImageNet. As you can see, we load the dataset as usual using the torchvision.datasets module, but instead of putting it directly into a DataLoader we use the DatasetWriter to optimize the samples and save them to disk.

If you take a look at the source code of write_cifar10, you will see that we save the images in an RGBImageField and the labels in an IntField. These two classes are essentially the only way to control how the dataset file is generated. For RGB images we can set a maximum resolution, which should definitely be done, as some images can be as big as 4288x2848, which interferes with the optimized compression and storage. You can also choose a so-called write_mode (see the sketch after this list), which is one of:

• raw: save all images uncompressed
• jpg: save all images as JPEG with the desired jpeg_quality
• proportion: save each image with a probability of compress_probability as JPEG in the desired jpeg_quality, otherwise raw
• smart: save images larger than smart_threshold bytes as JPEG in the desired jpeg_quality, otherwise raw
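
A minimal sketch of what such a writer call looks like, mirroring write_cifar10 (the paths and parameter values here are illustrative, not the exact ones used in the example script):

import torchvision
from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField

# Load the dataset as usual via torchvision
dataset = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)

writer = DatasetWriter(
    "cifar10_train.beton",  # output path of the optimized file
    {
        # Cifar10 images are 32x32, so capping the resolution costs nothing here
        "image": RGBImageField(max_resolution=32, write_mode="raw"),
        "label": IntField(),
    },
)
writer.from_indexed_dataset(dataset)  # iterates the dataset and writes the .beton file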

As this process can take quite some time, we should not run it on a communication/head node of the HPC. The tinyfat cluster is perfect for sequential processing like this script. The example script write_ffcv.sh generates optimized files for a randomized subset of ImageNet (100000 images) as well as the full Cifar10 dataset (50000 images). In this case you don't have to worry about downloading the datasets first, as we use the versions of ImageNet and Cifar10 that are already publicly hosted on the HPC (see the imagenet_data_dir and cifar_data_dir constants in the shell script). As usual, the job script can be started with sbatch ./example/write_ffcv.sh. When the job is done, the generated files are copied to the vault ($HPCVAULT/ffcv/).

This process can take a while: the last few measured runs averaged around 15000 items per minute, so the 100000-image ImageNet subset takes roughly 7 minutes and the full ImageNet training set (~1.28 million images) around 85 minutes. Fortunately, this only has to be done once.

The larger the dataset, the more efficiently it can be stored (for tiny datasets like Cifar10 the .beton file can even be larger than the original):

• cifar10_full
  • without FFCV: 178 MB
  • with FFCV (no compression, max_resolution = 32): 394 MB
• imagenet_small
  • without FFCV: no real reference
  • with FFCV (max_resolution = 384, jpeg_quality = 50%, compress_probability = 90%): 11 GB
• imagenet_full
  • without FFCV: 140 GB train, 6.4 GB val
  • with FFCV (max_resolution = 384, jpeg_quality = 40%, compress_probability = 95%): ~30 GB

The parameters have been chosen based on available research. For example, this paper shows that ~40-50% JPEG quality is the threshold range for keeping the loss low while training a network.

    Loading the dataset

Now we will define the functions to load the dataset. Like before, the functions are defined in example/data.py. Generally we start by defining the pipelines, which are similar to transforms in PyTorch.

from typing import List

import numpy as np
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder
from ffcv.pipeline.operation import Operation
from ffcv.transforms import NormalizeImage, Squeeze, ToDevice, ToTensor, ToTorchImage

image_pipeline: List[Operation] = [
      RandomResizedCropRGBImageDecoder((image_size, image_size)),  # loads the image in a desired resolution
      ToTensor(), # Convert from Numpy array to PyTorch Tensor
      ToDevice(device, non_blocking=True),  # Move tensor to device
  ToTorchImage(channels_last=True), # Convert to the PyTorch image format (B x C x H x W), keeping the channels-last memory layout
      NormalizeImage(CIFAR_MEAN, CIFAR_STD, type=np.dtype(np.float16)),  # Normalize and set floating point precision
    ]
    
    label_pipeline: List[Operation] = [
      IntDecoder(),  # decode label as numpy array 
      ToTensor(),
      ToDevice(device),
      Squeeze(), # remove "empty" dimensions (dims of size 1)
    ]
    
    pipelines = {"image": image_pipeline, "label": label_pipeline}

Note that it is still possible to use torchvision transforms. However, for the best performance one should use the optimized FFCV transforms; most torchvision transforms have an equivalent in the library.

    The next step is to define the DataLoader:

from ffcv.loader import Loader, OrderOption

ordering = OrderOption.QUASI_RANDOM
    loader = Loader(
      path,  # path to the .beton file to load (we usually have 1 file per split)
      batch_size=batch_size,
      num_workers=NUM_WORKERS,
  order=ordering, # Either RANDOM (uniform sampling, requires the entire dataset in RAM), QUASI_RANDOM (slightly non-uniform sampling, works for datasets larger than RAM) or SEQUENTIAL
      drop_last=True,  # drop last incomplete batch
      pipelines=pipelines,
      os_cache=False,  # only set to true on small datasets that can fit in system RAM
    )

    Training with FFCV

Now that we have the optimized files, we can load them, put them into a PyTorch-compatible DataLoader (FFCV just calls it Loader) and start training. In the example run_ffcv we define functions for building the Loader. Some other optimizations, such as automatic mixed precision and grad-zeroing, are also included in the example.

When it comes to the training loop, the main difference from vanilla PyTorch is that we no longer have to move the data to the GPU manually. Code like:

    for batch in dataloader:
        images = batch[0].to(device=device, non_blocking=True)
        labels = batch[1].to(device=device)

is no longer required. Even with non_blocking=True, these operations cause serious delays and downtime on the GPU. By replacing the code with for images, labels in dataloader: ..., we let the dataloader take care of placing the next batches in GPU memory.
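
A training step then looks roughly like the following sketch (a minimal illustration of the AMP and grad-zeroing optimizations mentioned above; model, optimizer and criterion are assumed to be set up as in the example):

import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid float16 gradient underflow

for images, labels in loader:  # batches already arrive on the GPU thanks to ToDevice
    optimizer.zero_grad(set_to_none=True)  # grad-zeroing: free gradients instead of overwriting with zeros
    with torch.autocast(device_type="cuda", dtype=torch.float16):  # automatic mixed precision
        output = model(images)
        loss = criterion(output, labels)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales the gradients, then steps the optimizer
    scaler.update()                # adjusts the scale factor for the next iteration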

One other thing you need to be wary of is the memory format. The ToTorchImage operation results in a tensor with the "channels last" memory format. To avoid costly implicit conversions between memory formats, one should set the same format for the network:

import torch
import torchvision

model = torchvision.models.resnet18()
model = model.to(device=torch.device("cuda"), memory_format=torch.channels_last)

For training on the HPC we use run_ffcv.sh. Since we are now running the actual training, the tinygpu cluster is used. The script specifically requests A100 GPUs, as there were some problems with AMP on the other "Turing" GPUs (2080Ti). To avoid frequent and slow reads from the NFS filesystem, we copy the required .beton files to the node-local $TMPDIR before starting the training.

    Metrics

Using the example project from this repository, I will show some metrics and compare them to vanilla PyTorch dataloading. **TODO**

    Caveats

    Data Types

To save GPU memory while training, we often use 16-bit instead of 32-bit floats. The lower precision rarely affects model performance, but we can fit almost twice as many parameters on the GPU. PyTorch has a collection of different float types (most commonly torch.bfloat16 is used, as it offers the best balance between precision and value range). When loading with FFCV, the data is initially represented as NumPy ndarrays, which are incompatible with PyTorch tensors as well as torchvision transforms. To avoid type issues, a pipeline (just a list of transforms) should look as follows:

1. FFCV decoder (such as RandomResizedCropRGBImageDecoder)
2. FFCV transforms (such as RandomHorizontalFlip or Cutout)
3. Necessary operations: ToTensor, ToDevice and ToTorchImage
4. Now that the data is a PyTorch tensor, we have the following options to set the datatype (see the sketch below):
  1. Convert(dtype): dtype can be a numpy or torch type
  2. NormalizeImage(mean, std, type): type must be a numpy type, however most numpy types have an equivalent that can be automatically converted to a PyTorch type # FIXME: is the automatic type conversion really happening?
5. Torchvision transforms
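
A minimal sketch of a pipeline that follows this order (the 32x32 crop size and the mean/std values are illustrative placeholders; note that FFCV decodes pixels in the 0-255 range, so the normalization constants use that scale):

import torch
import torchvision
from ffcv.fields.decoders import RandomResizedCropRGBImageDecoder
from ffcv.transforms import Convert, RandomHorizontalFlip, ToDevice, ToTensor, ToTorchImage

device = torch.device("cuda")

image_pipeline = [
    RandomResizedCropRGBImageDecoder((32, 32)),  # 1. FFCV decoder
    RandomHorizontalFlip(),                      # 2. FFCV transform (operates on numpy data)
    ToTensor(),                                  # 3. necessary operations
    ToDevice(device, non_blocking=True),
    ToTorchImage(),
    Convert(torch.float16),                      # 4. set the datatype
    torchvision.transforms.Normalize(            # 5. torchvision transform on the tensor
        mean=[125.0, 123.0, 114.0], std=[63.0, 62.0, 67.0]
    ),
]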

    Other Notes on performance

    AMP

TODO: use bfloat16.
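
A minimal sketch of a bfloat16 autocast region (assuming A100 GPUs, which support bfloat16 natively; model, criterion, optimizer and the batch come from the earlier examples). Since bfloat16 keeps float32's value range, no GradScaler is needed:

import torch

# bfloat16 keeps float32's exponent range, so loss scaling (GradScaler) is unnecessary
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(images)
    loss = criterion(output, labels)
loss.backward()
optimizer.step()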

    Memory Format

TODO: use channels_last + AMP.

    Notes

Most of the information in this guide comes from the original paper, the official website and personal experience. At the time of writing, the documentation was incomplete, so some information was retrieved by reading the source code.

The examples should cover the most necessary topics, but you can also find another example of integrating FFCV into a larger project in this repo.