ROCM SUPPORT(1)                                                ROCM SUPPORT(1)

NAME
       amdgpu_plugin - A plugin extension to CRIU to support
       checkpoint/restore in userspace for AMD GPUs.

CURRENT SUPPORT
       •   Single and Multi GPU systems (Gfx9)
       •   Checkpoint / Restore on a different system
       •   Checkpoint / Restore inside a docker container
       •   PyTorch
       •   TensorFlow
       •   Using CRIU Image Streamer

DESCRIPTION
       Although criu is a capable tool for checkpointing and restoring running
       applications, it has limitations; for example, it cannot handle
       applications that have device files open. Supporting ROCm-based
       workloads therefore requires augmenting criu’s core functionality with
       a plugin-based extension mechanism. amdgpu_plugin provides the support
       criu needs to checkpoint and restore ROCm applications.

   Dependencies
       amdkfd support
           Snapshotting VRAM and other GPU device state requires an updated
           version of the amdkfd (amdgpu) driver. The kernel patches are
           currently under review.

       criu 3.16
           This work is rebased on the latest criu release available at the
           time of writing.

OPTIONS
       Optional parameters can be passed as environment variables set before
       executing the criu command.
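
       As a minimal sketch of the mechanism described above, the wrapper below
       sets two of the variables documented in this section as a prefix to a
       single criu invocation. The function name restore_relaxed and the
       CRIU_BIN override are hypothetical and exist only for illustration; the
       restore action and the --images-dir and --shell-job options are
       standard criu options, and --shell-job is only relevant when the task
       was started from a shell.

```shell
#!/bin/sh
# CRIU_BIN is a hypothetical override used only so this sketch can be
# dry-run with a stub; by default it resolves to the real criu binary.
CRIU_BIN=${CRIU_BIN:-criu}

# restore_relaxed IMAGES_DIR  (hypothetical helper name)
# Restores from IMAGES_DIR with both firmware version checks disabled.
# The assignments prefix the command, so they are placed in the
# environment of this one criu invocation.
restore_relaxed() {
    KFD_FW_VER_CHECK=0 KFD_SDMA_FW_VER_CHECK=0 \
        "$CRIU_BIN" restore --images-dir "$1" --shell-job
}
```

       Prefixing the assignments keeps the relaxed checks scoped to one
       command, which may be preferable to exporting them for a whole session.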

       KFD_FW_VER_CHECK
           Enable or disable the firmware version check. If enabled, the
           firmware version on the restored GPU must be greater than or equal
           to the firmware version on the checkpointed GPU. Default: Enabled

               E.g.:
               KFD_FW_VER_CHECK=0

       KFD_SDMA_FW_VER_CHECK
           Enable or disable the SDMA firmware version check. If enabled, the
           SDMA firmware version on the restored GPU must be greater than or
           equal to the SDMA firmware version on the checkpointed GPU.
           Default: Enabled

               E.g.:
               KFD_SDMA_FW_VER_CHECK=0

       KFD_CACHES_COUNT_CHECK
           Enable or disable the caches count check. If enabled, the caches
           count on the restored GPU must be greater than or equal to the
           caches count on the checkpointed GPU. Default: Enabled

               E.g.:
               KFD_CACHES_COUNT_CHECK=0

       KFD_NUM_GWS_CHECK
           Enable or disable the num_gws check. If enabled, num_gws on the
           restored GPU must be greater than or equal to num_gws on the
           checkpointed GPU. Default: Enabled

               E.g.:
               KFD_NUM_GWS_CHECK=0

       KFD_VRAM_SIZE_CHECK
           Enable or disable the VRAM size check. If enabled, the VRAM size on
           the restored GPU must be greater than or equal to the VRAM size on
           the checkpointed GPU. Default: Enabled

               E.g.:
               KFD_VRAM_SIZE_CHECK=0

       KFD_NUMA_CHECK
           Enable or disable the NUMA CPU region check. If enabled, the plugin
           will restore GPUs that belong to one CPU NUMA region to the same
           CPU NUMA region. Default: Enabled

               E.g.:
               KFD_NUMA_CHECK=1

       KFD_CAPABILITY_CHECK
           Enable or disable the capability check. If enabled, the capability
           on the restored GPU must be equal to the capability on the
           checkpointed GPU. Default: Enabled

               E.g.:
               KFD_CAPABILITY_CHECK=1
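
       The comparison rules stated for the options above can be sketched as
       two small shell functions. This is an illustration of the documented
       semantics only, not the plugin’s implementation: the function names are
       hypothetical, and treating an unset or empty variable as enabled and
       only the value 0 as disabled is an assumption based on the defaults and
       examples in this section.

```shell
#!/bin/sh
# min_check_passes FLAG CKPT_VALUE RESTORE_VALUE  (hypothetical helper)
# FLAG is the value of a KFD_*_CHECK variable. Assumption: an unset or
# empty flag means enabled, and only "0" disables the check.
# CKPT_VALUE and RESTORE_VALUE are decimal integers.
min_check_passes() {
    [ "${1:-1}" = "0" ] && return 0   # check disabled: nothing to compare
    [ "$3" -ge "$2" ]                 # restore side must be >= checkpoint side
}

# capability_check_passes FLAG CKPT_VALUE RESTORE_VALUE  (hypothetical helper)
# Same flag handling, but the two values must match exactly.
capability_check_passes() {
    [ "${1:-1}" = "0" ] && return 0   # check disabled: nothing to compare
    [ "$3" = "$2" ]                   # restore side must equal checkpoint side
}
```

       For example, min_check_passes "${KFD_VRAM_SIZE_CHECK:-}" 16384 32768
       succeeds because the restore-side value is the larger of the two,
       whereas swapping the two values makes it fail unless
       KFD_VRAM_SIZE_CHECK=0 is set.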

AUTHOR
       The AMDKFD team.

COPYRIGHT
       Copyright (C) 2020-2021, Advanced Micro Devices, Inc. (AMD)

                                  11/20/2024                   ROCM SUPPORT(1)