
VRAM tier allocation Out of Memory

Incident/Synopsis

A query failed with the error '[VRAM] allocateInTier OOM: Exceeds Tier Capacity'.

 

Problem Detail

A query job posted by a user failed with the error '[VRAM] allocateInTier OOM: Exceeds Tier Capacity'.
In the gpudb.log file, the problem can be confirmed by WARN lines similar to the following.
Location: /opt/gpudb/core/logs/gpudb.log
2020-08-14 07:24:49.575 WARN  (37302,40881,r8/gpudb_wi_c_219 ) host.domain ResourceManagement/MemoryTierManager.cpp:432 - JobId:4035484; [VRAM] allocateInTier OOM: Exceeds Tier Capacity
2020-08-14 07:24:49.575 WARN  (37302,40881,r8/gpudb_wi_c_219 ) host.domain cuda/DeviceMemoryPtr.cpp:133 - VRAM allocation using cudaHostAlloc, size: 2048000000, attempt: 1
2020-08-14 07:24:53.414 WARN  (37302,40881,r8/gpudb_wi_c_219 ) host.domain ResourceManagement/MemoryTierManager.cpp:837 - JobId:4035484; [VRAM] [Allocate] exceeds capacity, needed: 4136250367 free: 359383650

 

Environment

Kinetica On-prem 7.0.x

 

Cause

The currently running jobs occupy most of the VRAM tier, so newly arriving jobs fail to acquire the allocations they need.
As a result, Kinetica writes a VRAM out-of-memory warning to gpudb.log because it cannot allocate the size the job requires to proceed. In this case, Kinetica also determines that not enough objects can be evicted based on their eviction priority, so the query job fails and does not proceed.
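The failure mode above can be illustrated with a toy model of a fixed-capacity tier with priority-based eviction. This is a hypothetical sketch for intuition only, not Kinetica's actual resource manager; the class, sizes, and priority scheme are all invented for the example:

```python
# Toy model of a memory tier with priority-based eviction (illustrative only;
# not Kinetica's actual resource manager).

class MemoryTier:
    def __init__(self, capacity):
        self.capacity = capacity
        self.objects = {}  # name -> (size, eviction_priority)

    def used(self):
        return sum(size for size, _ in self.objects.values())

    def free(self):
        return self.capacity - self.used()

    def allocate(self, name, size, priority):
        if size > self.capacity:
            raise MemoryError("allocateInTier OOM: Exceeds Tier Capacity")
        # Evict lower-priority objects until the request fits, if possible.
        while self.free() < size:
            evictable = [n for n, (_, p) in self.objects.items() if p < priority]
            if not evictable:
                raise MemoryError(
                    f"[Allocate] exceeds capacity, needed: {size} free: {self.free()}")
            victim = min(evictable, key=lambda n: self.objects[n][1])
            del self.objects[victim]
        self.objects[name] = (size, priority)

tier = MemoryTier(capacity=4_000_000_000)
tier.allocate("running_job_a", 3_600_000_000, priority=9)  # holds most of the tier
try:
    tier.allocate("new_job", 2_048_000_000, priority=5)    # nothing evictable
except MemoryError as e:
    print(e)  # mirrors the "exceeds capacity, needed/free" WARN line
```

The new job fails because the only resident object has a higher eviction priority than the requester, which is the situation the gpudb.log excerpt above describes.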

 

Solution/Answer

  1. Further analysis is needed to check the cardinality of the tables involved in the running jobs. Some columns may be eligible for dictionary encoding, which can save a significant amount of memory.

    To check dictionary-encoding eligibility, go to Table > select the table name > click Stats on the top ribbon. The Stats page shows which columns have low to medium cardinality and are good candidates for dictionary encoding.

    Note that primary key and shard key columns are exceptions to dictionary encoding. Read also: Using Dictionary Encoding to Save Memory and Rank-Fallback-Allocator Alerts.

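As a rough offline check, the idea behind the Stats page can be sketched as follows. This is only an illustration: the `dict_encoding_candidates` helper and the 1% cardinality threshold are invented for the example, not Kinetica's actual logic.

```python
# Sketch: flag dictionary-encoding candidates by estimating column cardinality
# from sampled rows. The threshold is illustrative, not Kinetica's Stats logic.

def dict_encoding_candidates(rows, columns, max_ratio=0.01, excluded=()):
    """Return columns whose distinct/total ratio is at or below max_ratio,
    skipping excluded columns (e.g. primary key or shard key columns)."""
    total = len(rows)
    candidates = []
    for i, col in enumerate(columns):
        if col in excluded:
            continue
        distinct = len({row[i] for row in rows})
        if total and distinct / total <= max_ratio:
            candidates.append(col)
    return candidates

# Example: 'country' repeats heavily; 'order_id' is unique (and a key column).
rows = [(n, "US" if n % 3 else "DE") for n in range(10_000)]
print(dict_encoding_candidates(rows, ["order_id", "country"],
                               excluded={"order_id"}))
# -> ['country']
```

A column like `country` here, with only a handful of distinct values over many rows, is exactly the low-cardinality case the article recommends dictionary encoding for.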
  2. Check the value of the max_concurrent_kernels parameter in the gpudb.conf file. Too high a value can send too many concurrent jobs to the GPU, exhausting VRAM and causing this error.

Location: /opt/gpudb/core/etc/gpudb.conf

For max_concurrent_kernels, start with a value of 2 and increase it iteratively, closely monitoring the system to find the balance between performance and system capacity.
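The edit can be scripted along these lines. The `set_conf_value` helper is a hypothetical sketch assuming gpudb.conf uses simple `key = value` lines; verify the format against your own file before applying anything, and restart Kinetica for the change to take effect. The demo runs against a temporary sample file, not the live config:

```python
# Sketch: update max_concurrent_kernels in a gpudb.conf-style "key = value"
# file. Demonstrated on a temporary sample file; on a real system, point
# conf_path at /opt/gpudb/core/etc/gpudb.conf and restart Kinetica afterwards.
import os
import re
import tempfile

def set_conf_value(conf_path, key, value):
    """Replace (or append) a 'key = value' line in a simple config file."""
    with open(conf_path) as f:
        text = f.read()
    pattern = re.compile(rf"^{re.escape(key)}\s*=.*$", re.MULTILINE)
    new_line = f"{key} = {value}"
    if pattern.search(text):
        text = pattern.sub(new_line, text)
    else:
        text += new_line + "\n"
    with open(conf_path, "w") as f:
        f.write(text)

# Demo on a temporary sample file rather than the live config:
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as tmp:
    tmp.write("persist_directory = /opt/gpudb/persist\n"
              "max_concurrent_kernels = 4\n")
    conf_path = tmp.name

set_conf_value(conf_path, "max_concurrent_kernels", 2)  # start low, raise iteratively
with open(conf_path) as f:
    updated = f.read()
os.remove(conf_path)
print(updated)
```

Because a regex substitution preserves the rest of the file, other settings such as `persist_directory` are left untouched.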

 

References:

Refer to the concept of evictability and the related documentation:

https://www.kinetica.com/docs/rm/concepts.html#evictability

https://support.kinetica.com/hc/en-us/articles/360050488794-rank-fallback-allocator-alerts

https://support.kinetica.com/hc/en-us/articles/360002749874-Using-Dictionary-Encoding-to-Save-Memory

 
