ROCM SUPPORT(1) ROCM SUPPORT(1)
NAME
criu-amdgpu-plugin - A plugin extension to CRIU to support
checkpoint/restore in userspace for AMD GPUs.
CURRENT SUPPORT
Single and Multi GPU systems (Gfx9)
Checkpoint / Restore on a different system
Checkpoint / Restore inside a Docker container
PyTorch
TensorFlow
Using CRIU Image Streamer
DESCRIPTION
Although criu is a capable tool for checkpointing and restoring running
applications, it has limitations; for example, it cannot handle
applications that have device files open. To support ROCm-based
workloads, criu’s core functionality must be augmented through its
plugin-based extension mechanism. criu-amdgpu-plugin provides the
support criu needs to checkpoint and restore ROCm applications.
DEPENDENCIES
amdkfd support
To snapshot the VRAM and other GPU device state, an updated version of
the amdkfd (amdgpu) driver is required.
OPTIONS
Optional parameters can be passed as environment variables before
executing the criu command.
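As a minimal sketch of passing these variables, the wrapper below places two of the options documented on this page in front of a criu restore invocation, so they apply to that command only. The function name restore_relaxed and the CRIU override variable are hypothetical helpers introduced for illustration; the image directory is whatever directory the checkpoint was written to.

```shell
# CRIU can be overridden (for example CRIU=echo) to preview the command
# without running it. This variable is an illustrative assumption.
CRIU=${CRIU:-criu}

# restore_relaxed IMAGES_DIR
# Hypothetical helper: restores from IMAGES_DIR with the firmware and
# SDMA firmware version checks disabled for this one invocation.
restore_relaxed() {
    KFD_FW_VER_CHECK=0 KFD_SDMA_FW_VER_CHECK=0 \
        "$CRIU" restore --images-dir "$1"
}
```

Because the assignments prefix the command rather than being exported, later criu invocations in the same shell still run with the default checks.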
KFD_FW_VER_CHECK
Enable or disable the firmware version check. If enabled, the firmware
version on the restored GPU must be greater than or equal to the
firmware version on the checkpointed GPU. Default: Enabled
E.g:
KFD_FW_VER_CHECK=0
KFD_SDMA_FW_VER_CHECK
Enable or disable the SDMA firmware version check. If enabled, the SDMA
firmware version on the restored GPU must be greater than or equal to
the SDMA firmware version on the checkpointed GPU. Default: Enabled
E.g:
KFD_SDMA_FW_VER_CHECK=0
KFD_CACHES_COUNT_CHECK
Enable or disable the caches count check. If enabled, the caches count
on the restored GPU must be greater than or equal to the caches count
on the checkpointed GPU. Default: Enabled
E.g:
KFD_CACHES_COUNT_CHECK=0
KFD_NUM_GWS_CHECK
Enable or disable the num_gws check. If enabled, num_gws on the
restored GPU must be greater than or equal to num_gws on the
checkpointed GPU. Default: Enabled
E.g:
KFD_NUM_GWS_CHECK=0
KFD_VRAM_SIZE_CHECK
Enable or disable the VRAM size check. If enabled, the VRAM size on the
restored GPU must be greater than or equal to the VRAM size on the
checkpointed GPU. Default: Enabled
E.g:
KFD_VRAM_SIZE_CHECK=0
KFD_NUMA_CHECK
Enable or disable the NUMA CPU region check. If enabled, the plugin
restores GPUs that belong to one CPU NUMA region to the same CPU NUMA
region. Default: Enabled
E.g:
KFD_NUMA_CHECK=1
KFD_CAPABILITY_CHECK
Enable or disable the capability check. If enabled, the capability on
the restored GPU must be equal to the capability on the checkpointed
GPU. Default: Enabled
E.g:
KFD_CAPABILITY_CHECK=1
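The comparison rules described for the checks above come in two kinds: "greater than or equal" (firmware, SDMA firmware, caches count, num_gws, VRAM size) and "equal" (capability). The sketch below only illustrates those two rules; it is not the plugin's code, and the function names and sample values are hypothetical.

```shell
# check_ge NAME RESTORED CHECKPOINTED
# Succeeds when the restore-side value is >= the checkpoint-side value.
check_ge() {
    if [ "$2" -ge "$3" ]; then
        return 0
    fi
    echo "$1: restored value $2 is lower than checkpointed value $3" >&2
    return 1
}

# check_eq NAME RESTORED CHECKPOINTED
# Succeeds only when both values are identical.
check_eq() {
    if [ "$2" -eq "$3" ]; then
        return 0
    fi
    echo "$1: restored value $2 differs from checkpointed value $3" >&2
    return 1
}

# Made-up values: a restore GPU with more VRAM (in MiB) passes the check.
check_ge vram_size 32768 16384 && echo "vram_size: compatible"
```

Under these rules a checkpoint can move to a GPU with more resources or newer firmware, but not the other way around unless the corresponding check is disabled.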
KFD_MAX_BUFFER_SIZE
On some systems, the VRAM size may exceed the RAM size, so the buffers
used for dumping and restoring VRAM may not fit in memory. Set this to
a nonzero value (in bytes) to limit the plugin’s memory usage.
Default: 0 (Disabled)
E.g:
KFD_MAX_BUFFER_SIZE="2G"
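The description gives the limit in bytes while the example uses a "G" suffix. As a rough aid for sanity-checking a value, the sketch below converts a suffixed size to bytes, assuming the suffixes K, M and G denote binary multiples (1024-based). This page does not state that, so treat it as an assumption; size_to_bytes is a hypothetical helper and not part of the plugin.

```shell
# size_to_bytes VALUE
# Hypothetical helper: prints VALUE in bytes, assuming K/M/G suffixes
# mean multiples of 1024. Values without a suffix are printed unchanged.
size_to_bytes() {
    value=$1
    case "$value" in
        *[kK]) echo $(( ${value%?} * 1024 )) ;;
        *[mM]) echo $(( ${value%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${value%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$value" ;;
    esac
}

size_to_bytes 2G
```

Comparing the result with the RAM available on the host gives a quick idea of whether a chosen limit is realistic.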
AUTHOR
The AMDKFD team.
COPYRIGHT
Copyright (C) 2020-2021, Advanced Micro Devices, Inc. (AMD)
07/29/2025 ROCM SUPPORT(1)