# FFCV guide
## Introduction
FFCV is a library that helps increase the training speed of deep learning models. Training commonly follows a simple cycle:
- the CPU loads the batch from the disk
- the batch gets moved to the CUDA device
- the CUDA device runs the computation
This can easily become a bottleneck if either the CPU or the disk is too slow. FFCV solves this problem by first storing the dataset in an optimized format and then using asynchronous preloading to remove any possible bottlenecks.
This guide presents some tips and insights on how to use FFCV efficiently. The focus lies on improving the performance of model training on the HPC, as dataloading bottlenecks commonly occur in the HPC environment, e.g. due to slow disk speeds (such as on the NFS file system), underutilized parallelism or inefficient memory usage.
## Installation
This guide includes a small, fully contained example project in the `example` directory, but integrating FFCV into a larger project works the same way.
### Prerequisites
- A conda runtime. On the HPC this can be loaded using `module purge && module load python/3.10-anaconda` (`module purge` first unloads all modules, such as the default Python 3.8 module, then loads the Anaconda runtime)
- Around 10 GB of storage for installing the required packages (by default conda installs them in `$HOME`, but the location can be changed)
### Installation with Pip
Unfortunately, pip cannot be used to install FFCV. Since pip only supports the installation of Python packages (contrary to conda), it cannot install all necessary dependencies for FFCV (such as opencv, ffmpeg and jpeg_turbo).
### Installation with Conda
Since these (C++) libraries are not preinstalled, we have to use conda to install them.
The file `environment.yml` describes an environment with all required dependencies; create it by calling `conda env create --name ffcv-guide --file environment.yml`. Don't forget to activate the environment afterwards using `conda activate ffcv-guide`.
## Setup FFCV
As mentioned before, FFCV works by storing the dataset in a custom format (so-called `.beton` files). This turns thousands of small files (typical for image datasets like ImageNet-1k) into a single large file. When creating the `.beton` file, the user can set a compression ratio as well as a maximum resolution.
### Generating the `.beton` files
The script `write_ffcv.py` already contains example methods for writing CIFAR-10 and ImageNet. As you can see, we load the dataset as usual using the `torchvision.datasets` module, but instead of putting it directly into a `DataLoader` we use the `DatasetWriter` to optimize and save it to disk.
If you take a look at the source code of `write_cifar10`, you will see that we save the images in an `RGBImageField` and the labels in an `IntField`. These two classes are essentially the only way to control how the dataset file is generated. For RGB images we can set a maximum resolution, which should definitely be done, as some images can be as large as 4288x2848, which interferes with the optimized compression and storage. You can also choose a so-called `write_mode` (a complete writer call is sketched after the list below), which is one of:
- `raw`: save all images uncompressed
- `jpg`: save all images as JPEG with the desired `jpeg_quality`
- `proportion`: save each image with probability `compress_probability` as JPEG with the desired `jpeg_quality`, otherwise raw
- `smart`: save images larger than `smart_threshold` bytes as JPEG with the desired `jpeg_quality`, otherwise raw
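As a minimal sketch of how these pieces fit together (the dataset root and output path here are hypothetical; the real ones are set in `write_ffcv.py` and the shell script):

```python
import torchvision
from ffcv.fields import IntField, RGBImageField
from ffcv.writer import DatasetWriter

# Hypothetical dataset root and output path
dataset = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)

writer = DatasetWriter("./cifar10_train.beton", {
    # max_resolution caps the stored image size; write_mode selects one of the strategies above
    "image": RGBImageField(max_resolution=32, write_mode="raw"),
    "label": IntField(),
})
writer.from_indexed_dataset(dataset)  # iterate the dataset and write the .beton file
```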
As this process can take quite some time, we should not run it on a communication/head node of the HPC. The `tinyfat` cluster is perfect for sequential processing like this script.
The example script `write_ffcv.sh` generates optimized files for a randomized subset of ImageNet (100,000 images) as well as the full CIFAR-10 dataset (50,000 images).
In this case you don't have to worry about downloading the datasets first, as we use the versions of ImageNet and CIFAR-10 that are already publicly hosted on the HPC (see the `imagenet_data_dir` and `cifar_data_dir` constants in the shell script).
As usual, the job script can be started with the command `sbatch ./example/write_ffcv.sh`.
When the job is done, the generated files are copied to the vault (`$HPCVAULT/ffcv/`).
This process can take a while: the last few measured runs averaged around 15,000 items per minute. Fortunately, this only has to be done once.
The larger the dataset, the more efficiently you can store it:

- `cifar10_full`
  - without FFCV: 178 MB
  - with FFCV (no compression, `max_resolution = 32`): 394 MB
- `imagenet_small`
  - without FFCV: no real reference
  - with FFCV (`max_resolution = 384`, `jpeg_quality = 50%`, `compress_probability = 90%`): 11 GB
- `imagenet_full`
  - without FFCV: 140 GB train, 6.4 GB val
  - with FFCV (`max_resolution = 384`, `jpeg_quality = 40%`, `compress_probability = 95%`): ~30 GB
The parameters have been chosen based on available research. For example, this paper shows that a JPEG quality of roughly 40-50% is the threshold range for keeping the loss low while training a network.
## Loading the dataset
Now we will define the functions to load the dataset. Like before, the functions are defined in `example/data.py`.
Generally we start by defining the pipelines, which are similar to transforms in PyTorch:
```python
import numpy as np
from typing import List
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder
from ffcv.pipeline.operation import Operation
from ffcv.transforms import NormalizeImage, Squeeze, ToDevice, ToTensor, ToTorchImage

# image_size, device, CIFAR_MEAN and CIFAR_STD are defined earlier in example/data.py
image_pipeline: List[Operation] = [
    RandomResizedCropRGBImageDecoder((image_size, image_size)),  # load the image at the desired resolution
    ToTensor(),  # convert from numpy array to PyTorch tensor
    ToDevice(device, non_blocking=True),  # move the tensor to the device
    ToTorchImage(channels_last=True),  # change to PyTorch image format (B x C x H x W), stored channels-last
    NormalizeImage(CIFAR_MEAN, CIFAR_STD, type=np.dtype(np.float16)),  # normalize and set floating point precision
]
label_pipeline: List[Operation] = [
    IntDecoder(),  # decode the label as a numpy array
    ToTensor(),
    ToDevice(device),
    Squeeze(),  # remove "empty" dimensions (dims of size 1)
]
pipelines = {"image": image_pipeline, "label": label_pipeline}
```
Note that it is still possible to use torchvision transforms. However, for the best performance one should use the optimized FFCV transforms; most torchvision transforms have an equivalent in the library.
The next step is to define the `Loader`:
```python
from ffcv.loader import Loader, OrderOption

ordering = OrderOption.QUASI_RANDOM
loader = Loader(
    path,  # path to the .beton file to load (we usually have one file per split)
    batch_size=batch_size,
    num_workers=NUM_WORKERS,
    order=ordering,  # RANDOM (uniform sampling, requires the entire dataset in RAM), QUASI_RANDOM (faster, non-uniform sampling, works on large datasets) or SEQUENTIAL
    drop_last=True,  # drop the last incomplete batch
    pipelines=pipelines,
    os_cache=False,  # only set to True for small datasets that fit in system RAM
)
```
## Training with FFCV
Now that we have the optimized files, we can load them, put them into a PyTorch-compatible `DataLoader` (FFCV just calls it `Loader`) and start training.
In the example `run_ffcv` we define functions for building the `Loader`. Some other optimizations, like automatic mixed precision and efficient gradient zeroing, are also included in the example.
When it comes to the training loop, the main difference to vanilla PyTorch is that we don't have to move the data to the GPU manually. Code like:
```python
for batch in dataloader:
    images = batch[0].to(device=device, non_blocking=True)
    labels = batch[1].to(device=device)
```
is no longer required. Even with `non_blocking=True`, these operations cause serious delays and idle time on the GPU. By replacing the code with `for images, labels in dataloader: ...`, we let the dataloader take care of placing the next batches in GPU memory. A sketch of the resulting loop is shown below.
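As a minimal sketch of the resulting training step (assuming a `model`, `criterion` and `optimizer` already exist, with a float16 AMP setup as in the example; the names are illustrative, not the exact code from `run_ffcv`):

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid float16 gradient underflow
for images, labels in loader:  # batches arrive already placed on the GPU
    optimizer.zero_grad(set_to_none=True)  # cheaper than writing zeros into the gradient buffers
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```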
One other thing you need to be wary of is the memory format. The `ToTorchImage` operation results in a tensor in the "channels last" memory format. To avoid memory issues, one should set the same format for the network:
```python
import torch
import torchvision

model = torchvision.models.resnet18()
model = model.to(device=torch.device("cuda"), memory_format=torch.channels_last)
```
When it comes to training on the HPC, we use `run_ffcv.sh`. Since we are now running the training, the `tinygpu` cluster is used. The script specifically requests A100 GPUs, as there were some problems with AMP on the other (Turing) GPUs (2080 Ti).
To avoid frequent and slow reads from the NFS filesystem, we copy the required `.beton` files to the node-local `$TMPDIR` before starting the training.
## Metrics
Using the example project from this repository, I will show some metrics and compare them to vanilla PyTorch dataloading. **TODO**
## Caveats
### Data Types
To save GPU memory while training, we often use 16-bit instead of 32-bit floats. The lower precision rarely affects model performance, but we can fit almost twice as many parameters on the GPU.
PyTorch has a collection of different float types (most commonly `torch.bfloat16` is used, as it offers the best balance between precision and value range).
When loading with FFCV, the data is initially represented as NumPy `ndarray`s, which are of course incompatible with PyTorch tensors as well as torchvision transforms.
To avoid type issues, a pipeline (just a list of transforms) should be ordered as follows (a sketch follows the list):

1. An FFCV decoder (such as `RandomResizedCropRGBImageDecoder`)
2. FFCV transforms (such as `RandomHorizontalFlip` or `Cutout`)
3. The necessary operations `ToTensor`, `ToDevice` and `ToTorchImage`
4. Now that the data is a PyTorch tensor, one of the following options to set the datatype:
   - `Convert(dtype)`: `dtype` can be a numpy or torch type
   - `NormalizeImage(mean, std, type)`: `type` must be a numpy type; most numpy types have a PyTorch equivalent that the resulting tensor is automatically converted to
5. Torchvision transforms (optional)
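As a hedged sketch of this ordering (the 224x224 size and the mean/std values are illustrative; FFCV decodes images in the 0-255 range, so the normalization constants are scaled accordingly):

```python
import torch
import torchvision.transforms as tvt
from ffcv.fields.decoders import RandomResizedCropRGBImageDecoder
from ffcv.transforms import Convert, RandomHorizontalFlip, ToDevice, ToTensor, ToTorchImage

MEAN = [125.307, 122.961, 113.8575]  # illustrative values on the 0-255 scale
STD = [51.5865, 50.847, 51.255]

image_pipeline = [
    RandomResizedCropRGBImageDecoder((224, 224)),       # 1. FFCV decoder
    RandomHorizontalFlip(),                             # 2. FFCV transforms
    ToTensor(),                                         # 3. necessary operations
    ToDevice(torch.device("cuda"), non_blocking=True),
    ToTorchImage(),
    Convert(torch.float16),                             # 4. set the datatype
    tvt.Normalize(MEAN, STD),                           # 5. torchvision transforms
]
```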
### Other Notes on performance
#### AMP
Use `bfloat16`. **TODO**
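As a minimal sketch of the idea (assuming `model`, `criterion`, `optimizer`, `images` and `labels` exist; `bfloat16` autocast requires an Ampere GPU such as the A100):

```python
import torch

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = criterion(model(images), labels)
loss.backward()  # no GradScaler needed: bfloat16 keeps float32's exponent range
optimizer.step()
```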
#### Memory Format
Use `channels_last` together with AMP. **TODO**
## Notes
Most of the information in this guide comes from the original paper, the official website and personal experience. At the time of writing, the documentation was incomplete, so some information was retrieved by reading the source code.
The examples should cover most necessary topics, but you can also find another example of integrating FFCV into a larger project in this repo.