Notes on troubleshooting an online Docker container whose memory usage was too high and not released

Problem description

A Hive statement was run in Jupyter to save table query results to a local directory. Prometheus monitoring showed the container's memory climbing until it hit the 35 GB memory limit, yet running top, ps, or docker stats for the container showed far lower usage. Combined with the OOMKilled cases that had occurred during model deployment and been reported by users, I set out to find the cause of this discrepancy.

Problem analysis

The values differ because each command measures a different notion of memory; see items 1 and 2 below. For the case where memory stays high and is not released, see item 3.

1. Memory values reported by Prometheus and docker stats

  • Prometheus: Without going into the architecture of Prometheus, monitoring of the online K8s cluster relies mainly on two exporters:

    1. Node Exporter collects host hardware and operating-system metrics. It runs as a Pod on each host and can be listed with kubectl get po -n monitoring

      The exporter exposes the collected data to the Prometheus server over HTTP; the metrics can be inspected on the Targets page of the Prometheus UI

    2. cAdvisor collects container-level data and runs on every host as part of the kubelet.

      cAdvisor is a container-monitoring tool developed by Google. The metrics it exposes to Prometheus and their meanings are documented at github.com/cadvisor; the relevant ones here are:

      container_memory_cache: total page cache memory
      container_memory_usage_bytes: current memory usage, including all memory regardless of when it was accessed
      container_memory_working_set_bytes: current working set

      Note that container_memory_usage_bytes, the metric Prometheus uses for monitoring, includes the page cache

  • docker stats

    The official Docker documentation describes this as follows:

On Linux, the Docker CLI reports memory usage by subtracting page cache usage from the total memory usage.

Therefore, the memory reported by the docker stats command is effectively container_memory_usage_bytes - container_memory_cache

[root@TX-220-54-202 ~]# docker stats f29f61770a02 --no-stream
CONTAINER ID   NAME                                                                                      CPU %   MEM USAGE / LIMIT   MEM %   NET I/O   BLOCK I/O        PIDS
f29f61770a02   k8s_notebook_jupyter-ningsheng-0kctiqfg_kubeflow_35a5db12-b7fa-11e9-9fa1-246e967d5d94_2   0.13%   619.9MiB / 8GiB     7.57%   0B / 0B   16.9MB / 520kB   313
[root@jupyter-ningsheng-0kctiqfg notebook]# cat /sys/fs/cgroup/memory/memory.limit_in_bytes
8589934592
[root@jupyter-ningsheng-0kctiqfg notebook]# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
1710297088
[root@jupyter-ningsheng-0kctiqfg notebook]# cat /sys/fs/cgroup/memory/memory.stat
cache 1059909632        # page cache, including tmpfs (shmem), in bytes
rss 546828288           # anonymous and swap cache, not including tmpfs (shmem), in bytes
mapped_file 2973696     # size of memory-mapped files, including tmpfs (shmem), in bytes
inactive_anon 0         # anonymous and swap cache on inactive LRU list, including tmpfs (shmem), in bytes
active_anon 547405824   # anonymous and swap cache on active LRU list, including tmpfs (shmem), in bytes
inactive_file 1048981504 # file-backed memory on inactive LRU list, in bytes
active_file 12390400     # file-backed memory on active LRU list, in bytes
# active_file + inactive_file = cache - size of tmpfs
# active_anon + inactive_anon = anonymous memory + file cache for tmpfs + swap cache

For details about other indicators in the memory.stat file, see resource_management_guide

  • Conclusion

    The container_memory_usage_bytes metric that Prometheus gets from cAdvisor corresponds to the value of /sys/fs/cgroup/memory/memory.usage_in_bytes inside the container, and container_memory_cache corresponds to the cache field of /sys/fs/cgroup/memory/memory.stat inside the container.
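
A quick way to verify this mapping from inside the container (a minimal sketch; the arithmetic simply mirrors the rule quoted from the Docker documentation above):

# Raw cgroup numbers that cAdvisor/Prometheus and docker stats are derived from
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
cache=$(awk '$1=="cache"{print $2}' /sys/fs/cgroup/memory/memory.stat)
echo "Prometheus container_memory_usage_bytes:    $usage"
echo "docker stats MEM USAGE (usage minus cache): $((usage - cache))"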

2. Memory usage as reported by top and ps

Processes use the following memory:

A. Anonymous pages mapped in user space: for example, memory allocated with malloc or with mmap using MAP_ANONYMOUS; when the system runs low on memory, the kernel can swap these pages out.

B. File mappings in user space: for example, mmap of a regular file; the kernel can reclaim these pages when the system runs low on memory.

C. tmpfs mappings in user space: for example, shmem; the kernel can reclaim these pages when the system runs low on memory.

D. Page cache: for example, the buffer cache used when reading files from block devices and file systems.
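
Each of these categories shows up in the per-process statistics; a quick way to look at them (a sketch, assuming a kernel recent enough to provide /proc/[pid]/smaps_rollup):

# Resident, anonymous, and swapped memory of PID 1, summed over all mappings;
# the file-backed share is roughly Rss minus Anonymous
grep -E '^(Rss|Anonymous|Swap):' /proc/1/smaps_rollup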

  • Information read from the cat /proc/[pid]/status file

    VmPeak: 4372 kB Peak virtual memory size of the process

    VmSize: 4372 kB Indicates the virtual address space size of a process

    VmLck: 0 kB Specifies the size of the physical memory that is locked by the process. The locked physical memory cannot be swapped to hard disks

    VmPin: 0 kB

    VmHWM: 656 kB Indicates the peak value of physical memory used by a process

    VmRSS: 656 kB Physical memory size being used by the process

    RssAnon: 84 kB Resident anonymous memory

    RssFile: 572 kB Resident file-backed memory

    RssShmem: 0 kB Resident shared memory (shmem/tmpfs)

  • Output of top / ps

[root@jupyter-ningsheng-0kctiqfg notebook]# ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   4372   656 ?        Ss   Aug14   0:01 tini -- start-singleuser.sh --ip=0.0.0.0 --port=8888 --allow-root
[root@jupyter-ningsheng-0kctiqfg notebook]# top
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0    4372    656    572 S   0.0  0.0   0:01.20 tini

Parameter description:

  1. VSZ & VIRT

Virtual memory used by the process. If the process requests 100 MB of memory but only uses 10 MB, VSZ/VIRT grows by the full 100 MB, not by the actual usage.

  2. RES & RSS

Physical (resident) memory actually used by the process. If the process requests 100 MB of memory but only uses 10 MB, RES/RSS grows by only 10 MB.

VmRSS = RssAnon + RssFile + RssShmem

  • The difference between process RSS and Cgroup RSS

The rss counted in the cgroup's memory.stat only covers anonymous memory (RssAnon plus swap cache); compared with the RSS reported by ps it does not include shmem or mapped files (a quick way to compare the two is sketched below).
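
A quick check of the correspondence (a sketch, run inside the container where PID 1 is tini as in the output above):

# RSS breakdown from /proc/<pid>/status
grep -E 'VmSize|VmRSS|RssAnon|RssFile|RssShmem' /proc/1/status
# ps: VSZ corresponds to VmSize and RSS to VmRSS,
# where VmRSS = RssAnon + RssFile + RssShmem
ps -o pid,vsz,rss,comm -p 1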

3. Memory usage is too high and is not released

  • Phenomenon:

Analysis of the user's code shows that when Hive saves the query results to the local directory (the table data is about 100 GB), memory grows until it reaches the limit.

  • Experimental simulation:

Write a large file with dd if=/dev/zero of=/export/test bs=1024M count=10 and watch the cgroup's page cache grow; then allocate a large amount of memory with malloc (the mem-allocate program below; a rough stand-in is sketched after the transcript) and the cache is quickly released.

[root@jupyter-ningsheng-0kctiqfg notebook]# cat /sys/fs/cgroup/memory/memory.stat |head -n 1
cache 2506752
[root@jupyter-ningsheng-0kctiqfg notebook]# dd if=/dev/zero of=/export/test bs=1024M count=10
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 31.7019 s, 339 MB/s
[root@jupyter-ningsheng-0kctiqfg notebook]# cat /sys/fs/cgroup/memory/memory.stat |head -n 1
cache 6566047744
[root@jupyter-ningsheng-0kctiqfg notebook]# ./mem-allocate
1G memory allocated
2G memory allocated
3G memory allocated
4G memory allocated
5G memory allocated
6G memory allocated
7G memory allocated
[root@jupyter-ningsheng-0kctiqfg notebook]# cat /sys/fs/cgroup/memory/memory.stat |head -n 1
cache 1810432

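mem-allocate is the author's own helper and is not shown in the post; presumably it just mallocs and touches memory a gigabyte at a time. A crude shell stand-in (a sketch, assuming GNU coreutils and enough headroom under the 8 GB cgroup limit) reproduces the same effect: anonymous memory pressure forces the kernel to reclaim the page cache charged to the cgroup.

awk '$1=="cache"{print $2}' /sys/fs/cgroup/memory/memory.stat   # cache before
# tail never sees a newline, so it must buffer the whole 7 GB stream in
# anonymous memory; the kernel reclaims page cache to make room for it
head -c 7G /dev/zero | tail > /dev/null
awk '$1=="cache"{print $2}' /sys/fs/cgroup/memory/memory.stat   # cache after: mostly reclaimed
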
  • Problem analysis:
  1. When data is written to a file system, it is not committed to disk immediately (that would be very inefficient); instead it is written to the page cache in memory and periodically written back to disk in batches.

    'Flush' on Linux is responsible for writing dirty pages to disk. It is a daemon that wakes up periodically, determines whether anything needs to be written to disk and, if so, performs the writeback.

    Flush runs outside the container: the container does not have its own kernel and therefore no flush daemon of its own, so it must wait for the physical machine to flush, or for someone to manually run echo 1 > /proc/sys/vm/drop_caches on the physical machine.

  2. The value of /proc/sys/vm/dirty_background_ratio is 10%

    vm.dirty_background_ratio is the percentage of memory that may be filled with "dirty" data before background writeback starts; background processes such as pdflush/flush/kdmflush then write the dirty data to disk. For example, if the host has 512 GB of memory, up to 51.2 GB of dirty data can sit in the page cache before the kernel starts cleaning it.

  3. When buffer/cache memory has to be handed over to a process, the kernel first writes the dirty data back to disk to keep it consistent, and only then frees the pages and gives them to the process. If a process suddenly requests a lot of memory while the workload is producing a lot of dirty data (such as logs) that has not yet been written back, allocating memory to that process becomes very slow and the system shows high I/O.
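
The writeback thresholds and the amount of dirty data can be inspected on the host (a sketch; run as root on the physical machine, not inside the container):

# Percentage of memory allowed to hold dirty pages before background / forced writeback
sysctl vm.dirty_background_ratio vm.dirty_ratio
# Dirty data currently waiting to be flushed
grep -E '^(Dirty|Writeback):' /proc/meminfo
# Flush dirty pages and drop the clean page cache (host only; containers share
# the host kernel and cannot do this themselves)
sync
echo 1 > /proc/sys/vm/drop_caches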

4. The OOM problem

Phenomenon:

[k8s@TX-220-54-9 ~]$ kubectl get po --all-namespaces -o wide |grep OOM
model-deployment   svc-0356fa3-960d-4c89-8020-741eae1851c6   0/1   OOMKilled   0   2d19h   10.244.2.17   tx-220-54-9.h.chinabank.com.cn   <none>   <none>

Problem analysis: the container's Docker ID is 2e8e56c5bf8a2e4a5c5ef824f1323d44f2194b9ffbb4336d579bd0ff435a2f98. The kernel log (dmesg) on host 54-9 shows the following:

[7600330.400352] Memory cgroup stats for /kubepods.slice/kubepods-pod9df61518_c02a_11e9_9694_246e967d5d94.slice/docker-2e8e56c5bf8a2e4a5c5ef824f1323d44f2194b9ffbb4336d579bd0ff435a2f98.scope: cache:0KB rss:66947340KB rss_huge:11945984KB shmem:0KB mapped_file:0KB dirty:0KB writeback:2508KB swap:0KB inactive_anon:0KB active_anon:66949060KB inactive_file:0KB active_file:4KB unevictable:0KB
[7600330.400359] Tasks state (memory values in pages):
[7600330.400359] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7600330.400665] [  26635]     0 26635      255        1    28672        0          -998 pause
[7600330.400668] [  26792]     0 26792     5522     2561    86016        0          -998 bash
[7600330.400677] [  26971]     0 26971  9650443    28076  1048576        0          -998 java
[7600330.400679] [  27038]     0 27038     5513     2523    77824        0          -998 bash
[7600330.400680] [  27044]     0 27044     5515     2554    81920        0          -998 bash
[7600330.400683] [  27136]     0 27136    21533     1321   217088        0          -998 su
[7600330.400685] [  27137]  1000 27137     4982     2516    81920        0          -998 bash
[7600330.400687] [  27220]  1000 27220 17824362 16715656 134701056        0          -998 python
[7600330.400753] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-2e8e56c5bf8a2e4a5c5ef824f1323d44f2194b9ffbb4336d579bd0ff435a2f98.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-pod9df61518_c02a_11e9_9694_246e967d5d94.slice,task_memcg=/kubepods.slice/kubepods-pod9df61518_c02a_11e9_9694_246e967d5d94.slice/docker-2e8e56c5bf8a2e4a5c5ef824f1323d44f2194b9ffbb4336d579bd0ff435a2f98.scope,task=python,pid=27220,uid=1000
[7600330.400758] Memory cgroup out of memory: Kill process 27220 (python) score 0 or sacrifice child

You can see that the cgroup's cache is 0 while its rss is about 67 GB (66947340 KB), so the memory really was requested by the user's code: the user needs to either review the code or request more memory for the pod.
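
Two generic commands (a sketch, not output from this incident) that help correlate an OOMKilled pod with the kernel log on its node:

# On the node: recent cgroup OOM kills, with human-readable timestamps
dmesg -T | grep -iE 'memory cgroup out of memory|oom-kill' | tail -n 5
# Confirm from the Docker side that the container was OOM-killed
docker inspect --format '{{.State.OOMKilled}}' 2e8e56c5bf8a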

Solution:

  1. Change the Prometheus memory monitoring metric from container_memory_usage_bytes to container_memory_working_set_bytes, or display both the cache and the working-set memory

  2. User OOM problem

    A. The user should review the code and, together with operations, check the kernel logs to analyze the cause.

    B. Consider disabling the cgroup OOM killer: when the kernel cannot allocate enough memory to a process in the cgroup, the process is paused until memory is freed and then continues (a sketch is shown below).
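
For option B, a minimal sketch of disabling the cgroup v1 OOM killer for one container on the host (the cgroup path is illustrative and depends on the cgroup driver and pod layout; run as root):

# <pod-uid> and <container-id> are placeholders for the real pod UID and container ID
CG=/sys/fs/cgroup/memory/kubepods.slice/kubepods-pod<pod-uid>.slice/docker-<container-id>.scope
echo 1 > $CG/memory.oom_control      # sets oom_kill_disable to 1
cat $CG/memory.oom_control           # shows oom_kill_disable and under_oom
# With the OOM killer disabled, a process that hits the memory limit is paused
# (under_oom becomes 1) until memory is freed, instead of being killed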