
Query returns GPU error: 39 'uncorrectable ECC error encountered'

Incident Synopsis:

Queries fail with GPU error: 39 'uncorrectable ECC error encountered'.

 

Problem/Question:

Some queries, including health check queries, fail with the following error:

[GPUdb]executeSql: Error: 'All requests threw errors, first error: Received error: Host->GPU cudaMemcpy GPU error: 39 'uncorrectable ECC error encountered' (c/GCCc:57) (S/SDc:1178); code:1 'Error' in Job process'

 

Problem Detail:

             Logs: /opt/gpudb/core/logs/gpudb.log (rolling logs on the rank X node)

2020-06-16 13:01:01.136 ERROR (10024,10677,rX/gpudb_wi_c_155) ----- cuda/GaiaCudaCopy.cu:61 - cudaStreamSynchronize 'GPU error: 39 'uncorrectable ECC error encountered' (c/GCCc:61)'
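
When several nodes have to be checked, a small script can pull the affected rank out of each gpudb.log. The following is a minimal sketch, not a Kinetica tool: the regular expression is inferred only from the single sample line above (timestamp, ERROR, then a parenthesised triple whose third field starts with the rank), and the function name find_ecc_ranks is hypothetical. Other Kinetica versions may format log lines differently.

```python
import re

# Assumption: the layout matches the sample line in this article, i.e.
# "<date> <time> ERROR (<id>,<id>,<rank>/<thread>) ... GPU error: 39 ..."
ECC_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \S+) ERROR "
    r"\(\d+,\d+,(?P<rank>[^/)]+)/[^)]*\)"
    r".*GPU error: 39\b"
)


def find_ecc_ranks(lines):
    """Map each rank that logged GPU error 39 to its first timestamp."""
    hits = {}
    for line in lines:
        match = ECC_LINE.search(line)
        if match:
            # setdefault keeps the earliest occurrence per rank
            hits.setdefault(match.group("rank"), match.group("ts"))
    return hits
```

One way to use it is to open /opt/gpudb/core/logs/gpudb.log on each node and pass the file object to find_ecc_ranks; any rank in the returned dictionary logged at least one uncorrectable ECC error.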

            ECC errors reported by nvidia-smi -a on the rank X node:

             ECC Errors
                 Volatile
                     Single Bit
                         Device Memory       : 2
                     Double Bit
                         Device Memory       : 2
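
The double-bit (uncorrectable) counter is the one that matters here, and it can be read out of saved nvidia-smi -a output programmatically. The sketch below assumes the text layout shown above ("Volatile", then "Double Bit", then "Device Memory : N"); newer NVIDIA drivers may label these sections differently, in which case the parser returns nothing. The function name volatile_double_bit_errors is hypothetical.

```python
def volatile_double_bit_errors(smi_text):
    """Return volatile double-bit 'Device Memory' counts, one per GPU found.

    Assumes the section layout shown in this article; values that are not
    plain integers (for example "N/A") are skipped.
    """
    counts = []
    scope = None  # "Volatile" or "Aggregate"
    kind = None   # "Single Bit" or "Double Bit"
    for raw in smi_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in ("Volatile", "Aggregate"):
            scope, kind = line, None
        elif line in ("Single Bit", "Double Bit"):
            kind = line
        elif ":" in line:
            key, _, value = line.partition(":")
            wanted = ("Volatile", "Double Bit", "Device Memory")
            if (scope, kind, key.strip()) == wanted and value.strip().isdigit():
                counts.append(int(value.strip()))
        else:
            # Any other heading (e.g. "ECC Errors") starts a new section
            scope = kind = None
    return counts
```

A non-zero entry in the returned list corresponds to the uncorrectable errors described in this article; single-bit errors are ignored because ECC corrects them.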

  

            Unhealthy Component
            -------------------
            Component ID: GPU card on the rank X node.
            Health: GPU error: 39 'uncorrectable ECC error encountered'.
            Health Reason: Only queries directed to the rank X node are affected.
            Health Recommendation: Hardware replacement.

 

Environment :

Kinetica on-prem 7.0.15.4

 

Cause :

Hardware error: the GPU card on the rank X node reported uncorrectable (double-bit) ECC memory errors.

 

Solution/Answer :

Replace the GPU card on the rank X node and reach out to NVIDIA for diagnostics/RMA of the faulty card.

 

Special Considerations :

When an administrator sees this error, they need to check gpudb.log on each node to identify the rank whose GPU is failing (see the example log entry under Problem Detail), then run nvidia-smi -a on that node to confirm the ECC errors (see the example output under Problem Detail).
