A friend recommended a library from Lightning AI called Lit-GPT for finetuning GPT-style (generative, causal) language models, so I gave it a try. This blog post is about my experience.

Overall, Lit-GPT really is out-of-the-box and end-to-end for both finetuning and inference. You can read the official blog posts about finetuning on standardized datasets and on custom datasets, respectively. It also has really good tutorials.

TL;DR

  • Multi-GPU support is not yet available for lora.py; it raises a NotImplementedError.
  • I was not able to finetune StableLM-3B with any approach or number of GPUs, even though this official doc says it should be doable.

Speed and footprint using different PEFT methods and foundation models

All runs use BF16 precision.
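
BF16 requires an Ampere-or-newer GPU, which the 3090 Ti is. A quick sanity check in plain PyTorch (not lit-gpt-specific):

import torch

# Confirm the build and the GPU both support bfloat16 before finetuning.
# (is_bf16_supported requires a CUDA build of PyTorch.)
print(torch.cuda.is_bf16_supported())        # True on a 3090 Ti
x = torch.randn(2, 2, dtype=torch.bfloat16)  # bf16 tensors also work on CPU
print(x.dtype)                               # torch.bfloat16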

Settings for adapter.py:

batch_size = 4 / devices
micro_batch_size = 1
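
These two settings imply gradient accumulation: each optimizer step accumulates batch_size / micro_batch_size micro-batches. A minimal sketch of the arithmetic (variable names mirror the settings above; I believe lit-gpt derives the accumulation count the same way, but treat this as illustrative):

devices = 1                    # number of GPUs in use
batch_size = 4 // devices      # effective batch size, as in the settings above
micro_batch_size = 1           # samples per forward/backward pass
# Micro-batches to accumulate before each optimizer update:
gradient_accumulation_iters = batch_size // micro_batch_size
print(gradient_accumulation_iters)  # 4 on one GPU, 2 on two GPUs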

Settings for lora.py:

Single-GPU (one 450 W 3090 Ti)

              Adapter                             LoRA
StableLM 3B   Bug Single-Adapter-StableLM-3B      Bug Single-LoRA-StableLM-3B
Falcon 7B     50 ms/step, 23 GiB VRAM, 400 W      50 ms/step, 23 GiB VRAM, 400 W
Dual-GPU (two 3090 Ti)

              Adapter                                                  LoRA
StableLM 3B   Bug Dual-Adapter-StableLM-3B                             NotImplementedError
Falcon 7B     6000 ms/step, 24 GB VRAM per GPU, 150 W/GPU, 184 TFLOPs  NotImplementedError

VRAM: During the validation (inference) stage, per-GPU VRAM occupancy was indeed roughly halved (plus some overhead) with two GPUs. During the training stage, however, each GPU's VRAM was fully occupied. I am not sure what the devices setting in adapter.py actually selects: data parallelism, model parallelism, or something else? I increased batch_size from 4/devices to 10/devices and still saw no OOM errors, so perhaps the model itself takes most of the VRAM. If it is model parallelism, the synchronization overhead is clearly too high, since training is two orders of magnitude slower than on a single GPU.
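
For context, in plain Lightning Fabric the devices argument only sets the GPU count; the parallelism scheme comes from the strategy. A minimal sketch with the standard Fabric API (this is not necessarily what adapter.py does internally, which is exactly the open question):

import lightning as L

# Data parallelism (DDP): every GPU holds a full model replica plus its own
# data shard -- consistent with the full per-GPU VRAM occupancy I observed.
ddp = L.Fabric(accelerator="cuda", devices=2, strategy="ddp", precision="bf16-true")

# Sharded ("model-parallel-ish") training (FSDP): parameters are partitioned
# across GPUs, which would roughly halve per-GPU weight memory instead.
fsdp = L.Fabric(accelerator="cuda", devices=2, strategy="fsdp", precision="bf16-true")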

Speed: During training, the dual-GPU setting is about 120 times slower than single-GPU (6000 ms/step vs. 50 ms/step for Falcon 7B). It also takes extremely long, around 10 minutes, just to enter the training stage.

Bug logs

Bug Single-LoRA-StableLM-3B

../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [644,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [644,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [644,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
... (the same assertion repeats for threads [39,0,0] through [63,0,0])
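
This assertion is how CUDA reports an out-of-range index, most commonly a token id that is >= the embedding table's vocabulary size (e.g., a tokenizer/checkpoint mismatch). A minimal CPU reproduction of the same failure mode, with a made-up vocabulary size:

import torch

emb = torch.nn.Embedding(num_embeddings=1000, embedding_dim=8)  # hypothetical vocab size
input_ids = torch.tensor([1, 2, 1200])  # 1200 >= 1000: out-of-range lookup
try:
    emb(input_ids)  # on CUDA this surfaces as the async device-side assert above
except IndexError as err:
    print("bad token id:", int(input_ids.max()), "-", err)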

Bug Single-Adapter-StableLM-3B

Traceback (most recent call last):
  File "/scratch/bao/repos/lit-gpt/finetune/adapter.py", line 299, in <module>
    CLI(setup)
  File "/home/forrest/.local/lib/python3.10/site-packages/jsonargparse/_cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/forrest/.local/lib/python3.10/site-packages/jsonargparse/_cli.py", line 147, in _run_component
    return component(**cfg)
  File "/scratch/bao/repos/lit-gpt/finetune/adapter.py", line 77, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir)
  File "/home/forrest/.local/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 788, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/forrest/.local/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 870, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/forrest/.local/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 875, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/scratch/bao/repos/lit-gpt/finetune/adapter.py", line 115, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "/scratch/bao/repos/lit-gpt/finetune/adapter.py", line 136, in train
    validate(fabric, model, val_data, tokenizer, longest_seq_length)  # sanity check
  File "/home/forrest/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/scratch/bao/repos/lit-gpt/finetune/adapter.py", line 221, in validate
    logits = model(input_ids)
  File "/home/forrest/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/forrest/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/forrest/.local/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 116, in forward
    output = self._forward_module(*args, **kwargs)
  File "/home/forrest/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/forrest/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/bao/repos/lit-gpt/lit_gpt/adapter.py", line 103, in forward
    x, *_ = block(x, (cos, sin), max_seq_length)
  File "/home/forrest/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/forrest/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/bao/repos/lit-gpt/lit_gpt/adapter.py", line 150, in forward
    h, new_kv_cache, new_adapter_kv_cache = self.attn(
  File "/home/forrest/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/forrest/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/bao/repos/lit-gpt/lit_gpt/adapter.py", line 217, in forward
    k_roped = apply_rope(k[..., :n_elem], cos, sin)
  File "/scratch/bao/repos/lit-gpt/lit_gpt/model.py", line 318, in apply_rope
    roped = (x * cos) + (rotated * sin)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
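
As the message says, the Python traceback can point at the wrong line because CUDA reports kernel errors asynchronously; apply_rope may just be where the earlier indexing assert finally surfaced. Setting CUDA_LAUNCH_BLOCKING=1 before CUDA initializes makes the failing kernel report synchronously:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch  # imported after setting the variable so it takes effect
# ... re-run the failing finetuning code; the traceback now stops at the real kernel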

Bug Dual-Adapter-StableLM-3B

Besides the printout below, the first GPU was not released after the core dump; the process had to be killed manually.

  File "/home/forrest/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 2548, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[E ProcessGroupNCCL.cpp:838] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9942a7c517 in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9942a3889b in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9941f38698 in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f98b68e0e00 in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f98b68e4cb8 in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x11b (0x7f98b68fab2b in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x75 (0x7f98b68faea5 in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc2b3 (0x7f99226b22b3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94b43 (0x7f996e793b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126a00 (0x7f996e825a00 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9942a7c517 in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9942a3889b in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9941f38698 in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f98b68e0e00 in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f98b68e4cb8 in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x11b (0x7f98b68fab2b in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x75 (0x7f98b68faea5 in /home/forrest/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc2b3 (0x7f99226b22b3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94b43 (0x7f996e793b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126a00 (0x7f996e825a00 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)