When using PyTorch's DataLoader with num_workers set to a non-zero value, errors can be reported. This article documents two such errors and their solutions.

DataLoader – num_workers

  • PyTorch's DataLoader has a parameter num_workers, which specifies the number of worker processes used to load data.

  • If data loading is heavy, using several workers naturally saves a lot of loading time: the workers prepare batches in parallel while the network trains, so after each training step the next batch can be taken directly from memory. Setting num_workers greater than 0 therefore speeds up data loading; ideally there are enough workers that the network never has to wait for data, which is the real benefit of workers for accelerating training.

  • Using more workers consumes more memory and CPU, and in particular more shared memory.

  • Setting num_workers greater than 0 makes the DataLoader spawn worker subprocesses (multiprocessing) for loading; a minimal usage sketch follows this list.
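
For reference, a minimal sketch of the parameter in use; the TensorDataset built from random tensors is only there to make the example self-contained:

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Toy dataset of random "images" and labels, only to keep the sketch self-contained.
    dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

    # num_workers=4 spawns 4 worker subprocesses that prepare batches in parallel;
    # num_workers=0 would load everything in the main process instead.
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

    for images, labels in loader:
        pass  # the training step would go here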

Problem description

Given how num_workers works, two types of errors can occur in practice (these are the two I encountered):

  • Insufficient shared memory:
RuntimeError: DataLoader worker (pid XXX) is killed by signal: Bus error
  • A segmentation fault in a worker causes a deadlock, which leaves the program stuck with blocked workers:
ERROR: Unexpected segmentation fault encountered in worker.

or

RuntimeError: DataLoader worker (pid 4499) is killed by signal: Segmentation fault. 

or

RuntimeError: DataLoader worker (pid(s) ****) exited unexpectedly

The following are two solutions to these problems.

Problem 1 RuntimeError: DataLoader worker (pid XXX) is killed by signal: Bus error

Cause of the problem

  • This problem generally occurs inside Docker. Docker's default shared memory is only 64 MB, so when there are many workers the space runs out and the error is raised.
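
A minimal sketch using Python's standard library to check the shared-memory mount from inside the container (/dev/shm is the conventional Linux shared-memory path; the 64 MB figure is Docker's default):

import shutil

# Report the size of the shared-memory mount that DataLoader workers rely on.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 1024**2:.0f} MB, free: {free / 1024**2:.0f} MB")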

The solution

1 Workaround: give up the feature
  • Set num_workers to 0.
2 The proper fix
  • When creating the Docker container, configure a larger shared memory by adding the flag --shm-size="15g", which sets the shared memory to 15 GB (adjust it to your actual situation):
nvidia-docker run -it --name [container_name] --shm-size="15g" ...
  • Verify the result with df -h:
# df -h
Filesystem      Size  Used  Avail Use% Mounted on
overlay         3.6T  3.1T   317G  91% /
tmpfs            64M     0    64M   0% /dev
tmpfs            63G     0    63G   0% /sys/fs/cgroup
/dev/sdb1       3.6T  3.1T   317G  91% /workspace/tmp
shm              15G  8.1G   7.0G  54% /dev/shm
tmpfs            63G   12K    63G   1% /proc/driver/nvidia
/dev/sda1       219G  170G    39G  82% /usr/bin/nvidia-smi
udev             63G     0    63G   0% /dev/nvidia3
tmpfs            63G     0    63G   0% /proc/acpi
tmpfs            63G     0    63G   0% /proc/scsi
tmpfs            63G     0    63G   0% /sys/firmware
  • shm is the shared memory space (/dev/shm).

Problem 2 RuntimeError: DataLoader worker (pid(s) ****) exited unexpectedly

Cause of the problem

  • The DataLoader runs several workers in parallel; if code running inside the workers also does its own multithreading and something goes wrong there, a segmentation fault can occur and the loader is prone to deadlock.
  • The exact cause varies with the environment; in my case it was OpenCV's internal multithreading clashing with the DataLoader workers.
  • My OpenCV version at the time was 3.4.2; the same code runs without problems on OpenCV 4.2.0.34.
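
Whether an environment is affected depends on the installed OpenCV build, so it can help to print the version and OpenCV's default internal thread count; a minimal check (cv2.getNumThreads() reports how many threads OpenCV will use internally):

import cv2

# Show which OpenCV build is installed and how many internal threads it defaults to.
print("OpenCV version:", cv2.__version__)
print("OpenCV internal threads:", cv2.getNumThreads())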

The solution

1 Workaround: give up the feature
  • Set num_workers to 0.
2 The proper fix
  • In the Dataset's __getitem__ method, disable OpenCV's internal multithreading:
def __getitem__(self, idx):
    import cv2
    cv2.setNumThreads(0)
    ...
