
ROCM SUPPORT(1)                                                 ROCM SUPPORT(1)

NAME

       criu-amdgpu-plugin - A plugin extension to CRIU to support
       checkpoint/restore in userspace for AMD GPUs.

CURRENT SUPPORT

       Single and Multi GPU systems (Gfx9)
       Checkpoint / Restore on a different system
       Checkpoint / Restore inside a Docker container
       PyTorch
       TensorFlow
       Using CRIU Image Streamer

DESCRIPTION

       Although criu is a capable tool for checkpointing and restoring running
       applications, it has limitations; in particular, it cannot handle
       applications that have device files open. To support ROCm-based
       workloads, criu’s core functionality must be augmented through its
       plugin-based extension mechanism. criu-amdgpu-plugin provides the
       support criu needs to checkpoint and restore ROCm applications.

DEPENDENCIES

       amdkfd support
           Snapshotting VRAM and other GPU device state requires an updated
           version of the amdkfd (amdgpu) driver.
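
       Whether the amdkfd driver is loaded at all can be checked before
       attempting a dump. The sketch below assumes the driver exposes its
       usual device node at /dev/kfd; the function name kfd_node_present is
       hypothetical, and the check says nothing about whether the driver
       version is recent enough.

```shell
#!/bin/sh
# Hypothetical pre-flight helper: succeeds if the given path is a
# character device. Defaults to /dev/kfd, the device node normally
# created by the amdkfd driver.
kfd_node_present() {
    [ -c "${1:-/dev/kfd}" ]
}

if kfd_node_present; then
    echo "amdkfd device node found"
else
    echo "amdkfd device node not found; is the amdgpu module loaded?" >&2
fi
```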

OPTIONS

       Optional parameters can be passed as environment variables set before
       executing the criu command.
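
       As a minimal sketch of the two usual ways to set such a variable,
       assuming a POSIX shell: an assignment placed in front of a command
       applies to that command only, while export affects every command run
       afterwards. The probe command below is a stand-in that only prints what
       a child process would see; in real use the child would be criu itself.

```shell
#!/bin/sh
# Start from a clean state so the demonstration is deterministic.
unset KFD_FW_VER_CHECK

# Stand-in for criu: prints the value a child process receives.
probe='printf "%s" "${KFD_FW_VER_CHECK:-unset}"'

# Per-command form: the assignment applies to this one invocation only.
per_command=$(KFD_FW_VER_CHECK=0 sh -c "$probe")

# The variable is not set for later commands.
after=$(sh -c "$probe")

# Session form: every later child process inherits the value.
export KFD_FW_VER_CHECK=0
exported=$(sh -c "$probe")

echo "per-command: $per_command"
echo "afterwards:  $after"
echo "exported:    $exported"
```

       With criu, the per-command form would be written in the same way, for
       example KFD_FW_VER_CHECK=0 followed by the usual criu restore command
       line.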

       KFD_FW_VER_CHECK
           Enable or disable the firmware version check. If enabled, the
           firmware version on the restored GPU must be greater than or equal
           to the firmware version on the checkpointed GPU. Default: Enabled

               E.g:
               KFD_FW_VER_CHECK=0

       KFD_SDMA_FW_VER_CHECK
           Enable or disable the SDMA firmware version check. If enabled, the
           SDMA firmware version on the restored GPU must be greater than or
           equal to the SDMA firmware version on the checkpointed GPU.
           Default: Enabled

               E.g:
               KFD_SDMA_FW_VER_CHECK=0

       KFD_CACHES_COUNT_CHECK
           Enable or disable the caches count check. If enabled, the caches
           count on the restored GPU must be greater than or equal to the
           caches count on the checkpointed GPU. Default: Enabled

               E.g:
               KFD_CACHES_COUNT_CHECK=0

       KFD_NUM_GWS_CHECK
           Enable or disable the num_gws check. If enabled, num_gws on the
           restored GPU must be greater than or equal to num_gws on the
           checkpointed GPU. Default: Enabled

               E.g:
               KFD_NUM_GWS_CHECK=0

       KFD_VRAM_SIZE_CHECK
           Enable or disable the VRAM size check. If enabled, the VRAM size on
           the restored GPU must be greater than or equal to the VRAM size on
           the checkpointed GPU. Default: Enabled

               E.g:
               KFD_VRAM_SIZE_CHECK=0
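
       The firmware, SDMA firmware, caches count, num_gws, and VRAM size
       checks above share one rule: when enabled, the value on the restored
       GPU must be greater than or equal to the value on the checkpointed GPU.
       A minimal sketch of that rule follows; the function name, argument
       order, and sample values are illustrative assumptions, not the plugin's
       implementation.

```shell
#!/bin/sh
# Illustrative only: check_ge ENABLED RESTORED CHECKPOINTED
# Returns 0 (pass) if the check is disabled (ENABLED is 0) or if
# RESTORED >= CHECKPOINTED; returns non-zero otherwise.
check_ge() {
    enabled=$1
    restored=$2
    checkpointed=$3
    [ "$enabled" = "0" ] && return 0
    [ "$restored" -ge "$checkpointed" ]
}

# Hypothetical sample values: VRAM sizes in MiB.
check_ge 1 16384 8192 && echo "larger VRAM on restore: pass"
check_ge 1 8192 16384 || echo "smaller VRAM on restore: fail"
check_ge 0 8192 16384 && echo "check disabled: pass"
```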

       KFD_NUMA_CHECK
           Enable or disable NUMA CPU region check. If enabled, the plugin will
           restore GPUs that belong to one CPU NUMA region to the same CPU NUMA
           region. Default: Enabled

               E.g:
               KFD_NUMA_CHECK=1

       KFD_CAPABILITY_CHECK
           Enable or disable the capability check. If enabled, the capability
           on the restored GPU must be equal to the capability on the
           checkpointed GPU. Default: Enabled

               E.g:
               KFD_CAPABILITY_CHECK=1

       KFD_MAX_BUFFER_SIZE
           On some systems, VRAM size may exceed RAM size, so the buffers used
           for dumping and restoring VRAM may not fit in memory. Set to a
           nonzero value (in bytes; the example below uses a size suffix) to
           limit the plugin’s memory usage. Default: 0 (Disabled)

               E.g:
               KFD_MAX_BUFFER_SIZE="2G"
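
       One way to pick a limit is to derive it from the RAM currently
       available. The sketch below is an assumption about how a value might be
       chosen, not a recommendation from the plugin authors: it reads
       MemAvailable from /proc/meminfo (Linux-specific) and converts half of it
       to bytes. The function name half_kib_to_bytes and the one-half ratio
       are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert half of a size given in KiB to bytes.
half_kib_to_bytes() {
    echo $(( $1 / 2 * 1024 ))
}

# The Linux kernel reports MemAvailable in kB (units of 1024 bytes).
avail_kib=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo 2>/dev/null)

# Export the limit only if the value could be read.
if [ -n "$avail_kib" ]; then
    KFD_MAX_BUFFER_SIZE=$(half_kib_to_bytes "$avail_kib")
    export KFD_MAX_BUFFER_SIZE
    echo "KFD_MAX_BUFFER_SIZE=$KFD_MAX_BUFFER_SIZE"
fi
```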

AUTHOR

       The AMDKFD team.

COPYRIGHT

       Copyright (C) 2020-2021, Advanced Micro Devices, Inc. (AMD)

                                   07/29/2025                   ROCM SUPPORT(1)
