StefanG3D Posted March 30

Download now

Fixed Issues

- Resolved a kernel panic on A100 when using both MIG and DCGM.
- Pascal GPU page faults were hitting a NULL pointer dereference in the UVM driver when there was not enough system memory available to handle the faults.
- In the L1C submodule, when the clock is gated, there is a corner case where the BLCG controller was not woken up from its sleep state when an external submodule wanted to use L1C, so the chip could hang until some other event woke up the BLCG FSM. This was fixed by switching the PROD value to disable L1C BLCG, which avoids the hang.
- The NVML memory error counter was refactored to address the negative effects of its low efficiency.
- In the code path that allocates virtual address space, a call to reallocate memory for tracking structures was allocating less memory than needed, resulting in a potential memory trampler. The size of the reallocation is now calculated correctly.
- A MINION error that should not be treated as fatal was causing the GPU to require a reset. These MINION errors are now marked as nonfatal and will not affect the health of the GPU or any workloads running on it.
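
For anyone wondering what the reallocation fix in these notes looks like in practice: the notes do not show the driver source, so the C sketch below only illustrates the general class of bug, not NVIDIA's actual code. All names (`va_range_t`, `va_tracker_t`, `va_tracker_reserve`, `va_tracker_add`) are made up for illustration. It shows a table of tracking entries grown with `realloc`, where the byte size has to be the entry count multiplied by the entry size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tracking entry for one virtual address range. */
typedef struct {
    uint64_t base;
    uint64_t length;
} va_range_t;

/* Hypothetical growable table of tracking entries. */
typedef struct {
    va_range_t *entries;
    size_t count;     /* entries in use */
    size_t capacity;  /* entries allocated */
} va_tracker_t;

/* Grow the table so it can hold at least `needed` entries.
 * Returns 0 on success, -1 on overflow or allocation failure. */
int va_tracker_reserve(va_tracker_t *t, size_t needed)
{
    assert(t != NULL);
    if (needed <= t->capacity)
        return 0;

    /* The size passed to realloc is in bytes: entry count times entry
     * size. Passing `needed` alone would under-allocate, and later
     * writes would trample adjacent memory. */
    if (needed > SIZE_MAX / sizeof(va_range_t))
        return -1; /* the multiplication would overflow */

    va_range_t *grown = realloc(t->entries, needed * sizeof(va_range_t));
    if (grown == NULL)
        return -1; /* the original buffer is still valid */

    t->entries = grown;
    t->capacity = needed;
    return 0;
}

/* Append one range, doubling the table when it is full. */
int va_tracker_add(va_tracker_t *t, uint64_t base, uint64_t length)
{
    assert(t != NULL);
    if (t->count == t->capacity) {
        if (t->capacity > SIZE_MAX / 2)
            return -1; /* doubling would overflow */
        size_t next = t->capacity ? t->capacity * 2 : 4;
        if (va_tracker_reserve(t, next) != 0)
            return -1;
    }
    t->entries[t->count].base = base;
    t->entries[t->count].length = length;
    t->count++;
    return 0;
}
```

A caller would start from a zero-initialized `va_tracker_t`, call `va_tracker_add` for each range, and `free` the `entries` pointer when done. Assigning the `realloc` result to a temporary first keeps the old buffer reachable if the allocation fails.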