Part 2: Detailed explanation of the Keras version of the Faster R-CNN algorithm (2. ROI calculation and others)

After finishing Udacity’s machine learning and deep learning courses a few days ago, I felt I had only just stepped over the threshold of deep learning, so I started working through Stanford’s CS231n (the latest release is the Spring 2017 class) and promptly fell into the pit of computer vision. Beyond object recognition, the field also covers localization, object detection, semantic segmentation, instance segmentation, adversarial generation (GANs), transfer learning, and more. It was an eye-opener. Let’s start with detection and go!

Detection, of course, means rbg’s (Ross Girshick’s) series of work, from R-CNN all the way to YOLO: “YOLO: Real-Time Object Detection” and “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”.

Since I am not very proficient with TensorFlow and most of my projects are done in Keras, I found a Keras version of Faster R-CNN on GitHub to learn from. Basically, after cloning it you only need a few tweaks to the code to get it running. I trained on the Oxford pet dataset with my main graphics card, a GTX 970 (if you are still sighing because you don’t even have a single GTX 970, I suggest you go get one right away), for a bit more than an hour and already got fairly usable detection results. The results are shown below.

The next step is to understand the code. The core idea of Faster R-CNN is to generate region proposals with an RPN instead of the previously independent proposal step, which makes the whole pipeline end-to-end trainable and speeds up the algorithm. So reading the RPN is the first step to understanding Faster R-CNN. The following code shows how the ground truth used to train the RPN is computed; once you fully understand it, you understand the principle of the RPN.

The calculation process is quite long, but it involves no complex mathematics. I drew a rough flow chart; following it should make the code much easier to understand.

Here’s the code:

def calc_rpn(C, img_data, width, height, resized_width, resized_height, img_length_calc_function):

	downscale = float(C.rpn_stride)
	anchor_sizes = C.anchor_box_scales
	anchor_ratios = C.anchor_box_ratios
	num_anchors = len(anchor_sizes) * len(anchor_ratios)

	# calculate the output map size based on the network architecture

	(output_width, output_height) = img_length_calc_function(resized_width, resized_height)

	n_anchratios = len(anchor_ratios)

	# initialise empty output objectives
	y_rpn_overlap = np.zeros((output_height, output_width, num_anchors))
	y_is_box_valid = np.zeros((output_height, output_width, num_anchors))
	y_rpn_regr = np.zeros((output_height, output_width, num_anchors * 4))

	num_bboxes = len(img_data['bboxes'])

	num_anchors_for_bbox = np.zeros(num_bboxes).astype(int)
	best_anchor_for_bbox = -1*np.ones((num_bboxes, 4)).astype(int)
	best_iou_for_bbox = np.zeros(num_bboxes).astype(np.float32)
	best_x_for_bbox = np.zeros((num_bboxes, 4)).astype(int)
	best_dx_for_bbox = np.zeros((num_bboxes, 4)).astype(np.float32)

	# get the GT box coordinates, and resize to account for image resizing
	gta = np.zeros((num_bboxes, 4))
	for bbox_num, bbox in enumerate(img_data['bboxes']):
		# get the GT box coordinates, and resize to account for image resizing
		gta[bbox_num, 0] = bbox['x1'] * (resized_width / float(width))
		gta[bbox_num, 1] = bbox['x2'] * (resized_width / float(width))
		gta[bbox_num, 2] = bbox['y1'] * (resized_height / float(height))
		gta[bbox_num, 3] = bbox['y2'] * (resized_height / float(height))

Let’s look at the parameters first. C holds the configuration information. img_data contains the path of one image, its bbox coordinates, and the corresponding class labels (one image may have several bboxes, i.e. several objects). img_length_calc_function is a method that, based on our settings, computes the size of the feature map produced by the network from the size of the input image.
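For reference, here is a rough sketch of what one img_data entry might look like. The field names are taken from how the code below accesses img_data; the file path and numbers are made up for illustration.

# hypothetical example of one img_data entry; calc_rpn itself only uses
# 'bboxes' and its x1/x2/y1/y2/class fields
img_data = {
	'filepath': 'images/cat_001.jpg',   # made-up path
	'width': 800,
	'height': 600,
	'bboxes': [
		{'class': 'cat', 'x1': 200, 'x2': 400, 'y1': 150, 'y2': 300},
		{'class': 'dog', 'x1': 10,  'x2': 120, 'y1': 40,  'y2': 200},
	],
}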

downscale is the scaling factor from the image to the feature map. anchor_sizes and anchor_ratios are the parameters that define the candidate box (anchor) shapes; for example, 3 sizes and 3 ratios combine into 9 anchors of different shapes and sizes. Then img_length_calc_function is called to compute the dimensions of the feature map.
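As a concrete illustration: for a VGG16-style backbone with four 2×2 pooling layers, the feature map is 1/16 the size of the input, which is also why a stride of 16 is a common value for C.rpn_stride. A minimal sketch of such a function (the real one comes from the repo’s network definition; this is only an assumption about its shape) could be:

def img_length_calc_function(resized_width, resized_height, stride=16):
	# map image dimensions to feature-map dimensions for a backbone
	# whose total downsampling factor equals stride (assumed 16 here)
	return resized_width // stride, resized_height // stride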

The next step initialises a few variables that you can ignore until they are used later. Since all following calculations are based on the resized image, x1, x2, y1 and y2 of each bbox are scaled to match the resized image. The result is stored in gta, with shape (num_of_bbox, 4).
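One detail worth noting: gta stores each box in (x1, x2, y1, y2) order rather than the more common (x1, y1, x2, y2). A tiny worked example with made-up numbers:

width, height = 800, 600                  # original image
resized_width, resized_height = 400, 300  # image after resizing
bbox = {'x1': 200, 'x2': 400, 'y1': 150, 'y2': 300}

gta_row = [
	bbox['x1'] * (resized_width / float(width)),    # 100.0
	bbox['x2'] * (resized_width / float(width)),    # 200.0
	bbox['y1'] * (resized_height / float(height)),  # 75.0
	bbox['y2'] * (resized_height / float(height)),  # 150.0
]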

for anchor_size_idx in range(len(anchor_sizes)):
	for anchor_ratio_idx in range(n_anchratios):
		anchor_x = anchor_sizes[anchor_size_idx] * anchor_ratios[anchor_ratio_idx][0]
		anchor_y = anchor_sizes[anchor_size_idx] * anchor_ratios[anchor_ratio_idx][1]

		for ix in range(output_width):
			# x-coordinates of the current anchor box
			x1_anc = downscale * (ix + 0.5) - anchor_x / 2
			x2_anc = downscale * (ix + 0.5) + anchor_x / 2

			# ignore boxes that go across image boundaries
			if x1_anc < 0 or x2_anc > resized_width:
				continue

			for jy in range(output_height):

				# y-coordinates of the current anchor box
				y1_anc = downscale * (jy + 0.5) - anchor_y / 2
				y2_anc = downscale * (jy + 0.5) + anchor_y / 2

				# ignore boxes that go across image boundaries
				if y1_anc < 0 or y2_anc > resized_height:
					continue

				# bbox_type indicates whether an anchor should be a target
				bbox_type = 'neg'

				# this is the best IOU for the (x,y) coord and the current anchor
				# note that this is different from the best IOU for a GT bbox
				best_iou_for_loc = 0.0

The code above computes the width and height of each anchor and, more importantly, treats each point of the feature map as an anchor centre: multiply by downscale to map it back to the actual image, combine it with the anchor size, and skip any anchor that extends beyond the image. Rectangles of different sizes and aspect ratios now spring up all over the image. We iterate over these boxes and perform the following calculation for each one:
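For example, with commonly used settings such as anchor_box_scales = [128, 256, 512] and anchor_box_ratios = [[1, 1], [1, 2], [2, 1]] (treat these as assumed defaults, not necessarily your config), the two outer loops enumerate these 9 anchor widths and heights:

anchor_sizes = [128, 256, 512]
anchor_ratios = [[1, 1], [1, 2], [2, 1]]

for size in anchor_sizes:
	for ratio in anchor_ratios:
		# width and height of one anchor shape
		print(size * ratio[0], size * ratio[1])
# -> (128, 128) (128, 256) (256, 128)
#    (256, 256) (256, 512) (512, 256)
#    (512, 512) (512, 1024) (1024, 512)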

# bbox_type indicates whether an anchor should be a target
bbox_type = 'neg'

# this is the best IOU for the (x,y) coord and the current anchor
# note that this is different from the best IOU for a GT bbox
best_iou_for_loc = 0.0

for bbox_num in range(num_bboxes):

	# get IOU of the current GT box and the current anchor box
	curr_iou = iou([gta[bbox_num, 0], gta[bbox_num, 2], gta[bbox_num, 1], gta[bbox_num, 3]], [x1_anc, y1_anc, x2_anc, y2_anc])
	# calculate the regression targets if they will be needed
	if curr_iou > best_iou_for_bbox[bbox_num] or curr_iou > C.rpn_max_overlap:
		cx = (gta[bbox_num, 0] + gta[bbox_num, 1]) / 2.0
		cy = (gta[bbox_num, 2] + gta[bbox_num, 3]) / 2.0
		cxa = (x1_anc + x2_anc) / 2.0
		cya = (y1_anc + y2_anc) / 2.0

		tx = (cx - cxa) / (x2_anc - x1_anc)
		ty = (cy - cya) / (y2_anc - y1_anc)
		tw = np.log((gta[bbox_num, 1] - gta[bbox_num, 0]) / (x2_anc - x1_anc))
		th = np.log((gta[bbox_num, 3] - gta[bbox_num, 2]) / (y2_anc - y1_anc))

Two variables are defined here, bbox_type and best_iou_for_loc, which will be used later. The IoU between the anchor and the GT box (gta) is then computed; that function is straightforward, so I will not expand on it. Next, if the IoU is greater than best_iou_for_bbox[bbox_num] or greater than the threshold we set, the centre points of the GT box and of the anchor are calculated, and from the centres and box sizes the offsets tx, ty, tw and th are computed. These offsets are the regression targets of the RPN. Why compute them? Because a region proposed by the RPN is only inferred from one of 9 anchor shapes and may not be very accurate, so at prediction time a regression step refines the box instead of using the anchor coordinates directly.
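To make the target computation easier to read on its own, here is the same arithmetic pulled out into a standalone helper (a hypothetical name, not part of the repo). Both boxes are reduced to centre/width/height, and the targets follow the usual Faster R-CNN parameterisation: tx = (cx - cxa) / wa, ty = (cy - cya) / ha, tw = log(w / wa), th = log(h / ha).

import numpy as np

def rpn_regression_targets(gt_box, anchor_box):
	# gt_box follows the gta ordering (x1, x2, y1, y2);
	# anchor_box is (x1_anc, y1_anc, x2_anc, y2_anc)
	x1, x2, y1, y2 = gt_box
	x1a, y1a, x2a, y2a = anchor_box

	cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0        # GT centre
	cxa, cya = (x1a + x2a) / 2.0, (y1a + y2a) / 2.0  # anchor centre

	tx = (cx - cxa) / (x2a - x1a)                    # centre offsets, normalised
	ty = (cy - cya) / (y2a - y1a)                    # by the anchor width/height
	tw = np.log((x2 - x1) / (x2a - x1a))             # log scale factors
	th = np.log((y2 - y1) / (y2a - y1a))
	return tx, ty, tw, th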

The next step is to label the anchor according to how well it overlaps the ground truth.

if img_data['bboxes'][bbox_num]['class'] != 'bg':
	# all GT boxes should be mapped to an anchor box, so we keep track of which anchor box was best
	if curr_iou > best_iou_for_bbox[bbox_num]:
		best_anchor_for_bbox[bbox_num] = [jy, ix, anchor_ratio_idx, anchor_size_idx]
		best_iou_for_bbox[bbox_num] = curr_iou
		best_x_for_bbox[bbox_num,:] = [x1_anc, x2_anc, y1_anc, y2_anc]
		best_dx_for_bbox[bbox_num,:] = [tx, ty, tw, th]

	# we set the anchor to positive if the IOU is >0.7 (it does not matter if there was another better box, it just indicates overlap)
	if curr_iou > C.rpn_max_overlap:
		bbox_type = 'pos'
		num_anchors_for_bbox[bbox_num] += 1
		# we update the regression layer target if this IOU is the best for the current (x,y) and anchor position
		if curr_iou > best_iou_for_loc:
			best_iou_for_loc = curr_iou
			best_regr = (tx, ty, tw, th)

	# if the IOU is >0.3 and <0.7, it is ambiguous and not included in the objective
	if C.rpn_min_overlap < curr_iou < C.rpn_max_overlap:
		# gray zone between neg and pos
		if bbox_type != 'pos':
			bbox_type = 'neutral'

# turn on or off outputs depending on IOUs
if bbox_type == 'neg':
	y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
	y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
elif bbox_type == 'neutral':
	y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
	y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
elif bbox_type == 'pos':
	y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
	y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
	start = 4 * (anchor_ratio_idx + n_anchratios * anchor_size_idx)
	y_rpn_regr[jy, ix, start:start+4] = best_regr

All of this applies only when the class of the bbox is not ‘bg’, i.e. not background. If the IoU is better than the best value recorded so far for this bbox, a series of “best” records are updated. If the IoU exceeds the threshold we set, the anchor is labelled positive, meaning it overlaps some bbox strongly, and num_anchors_for_bbox for that bbox is incremented by 1. If the IoU is also greater than best_iou_for_loc, then best_regr is set to the current offsets. Here best_iou_for_loc is the best IoU seen so far for this particular anchor; my understanding is that if one anchor overlaps several bboxes well enough to count as positive, we keep the regression targets of the bbox with the highest IoU. Remember that at this stage we only need to find good candidate regions; which class is inside them does not matter yet. Finally, if the IoU lies between the minimum and maximum thresholds, we cannot be sure whether the anchor is background or object, so it is labelled neutral.

Next, the anchor is recorded according to bbox_type: y_is_box_valid and y_rpn_overlap respectively indicate whether this anchor is usable for training and whether it contains an object.
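To make the indexing above less cryptic, here is how a single anchor is addressed in the three target tensors (the numbers are arbitrary, just for illustration):

jy, ix = 10, 7                    # a position on the feature map
anchor_size_idx, anchor_ratio_idx = 1, 2
n_anchratios = 3                  # number of ratios

anchor_idx = anchor_ratio_idx + n_anchratios * anchor_size_idx   # -> 5

# y_is_box_valid[jy, ix, anchor_idx]  -> 1 if the anchor is pos or neg (usable for training)
# y_rpn_overlap[jy, ix, anchor_idx]   -> 1 only if the anchor is positive
# y_rpn_regr[jy, ix, 4*anchor_idx : 4*anchor_idx + 4] -> (tx, ty, tw, th) for a positive anchor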

# ensure that every bbox has at least one corresponding anchor
for idx in range(num_anchors_for_bbox.shape[0]):
	if num_anchors_for_bbox[idx] == 0:
		# no box with an IOU greater than zero ...
		if best_anchor_for_bbox[idx, 0] == -1:
			continue
		y_is_box_valid[
			best_anchor_for_bbox[idx,0], best_anchor_for_bbox[idx,1], best_anchor_for_bbox[idx,2] + n_anchratios *
			best_anchor_for_bbox[idx,3]] = 1
		y_rpn_overlap[
			best_anchor_for_bbox[idx,0], best_anchor_for_bbox[idx,1], best_anchor_for_bbox[idx,2] + n_anchratios *
			best_anchor_for_bbox[idx,3]] = 1
		start = 4 * (best_anchor_for_bbox[idx,2] + n_anchratios * best_anchor_for_bbox[idx,3])
		y_rpn_regr[
			best_anchor_for_bbox[idx,0], best_anchor_for_bbox[idx,1], start:start+4] = best_dx_for_bbox[idx, :]

Here another problem arises: many bboxes may not find a positive anchor at all, which would make those training samples unusable. So a compromise is used to guarantee that every bbox has at least one anchor corresponding to it. The code above implements it, and it is fairly simple: for a bbox with no positive anchor, we take the best-overlapping anchor recorded for it, on the condition that it at least overlaps the bbox a little; an anchor that has nothing to do with it at all would be too much to ask.

y_rpn_overlap = np.transpose(y_rpn_overlap, (2, 0, 1))
y_rpn_overlap = np.expand_dims(y_rpn_overlap, axis=0)

y_is_box_valid = np.transpose(y_is_box_valid, (2, 0, 1))
y_is_box_valid = np.expand_dims(y_is_box_valid, axis=0)

y_rpn_regr = np.transpose(y_rpn_regr, (2, 0, 1))
y_rpn_regr = np.expand_dims(y_rpn_regr, axis=0)

pos_locs = np.where(np.logical_and(y_rpn_overlap[0, :, :, :] == 1, y_is_box_valid[0, :, :, :] == 1))
neg_locs = np.where(np.logical_and(y_rpn_overlap[0, :, :, :] == 0, y_is_box_valid[0, :, :, :] == 1))

num_pos = len(pos_locs[0])

Next, some numpy operations reshape the targets: each array is transposed to channels-first and given a batch dimension, and np.where is then used to locate the positive and negative anchors.

num_regions = 256

if len(pos_locs[0]) > num_regions/2:
	val_locs = random.sample(range(len(pos_locs[0])), len(pos_locs[0]) - num_regions/2)
	y_is_box_valid[0, pos_locs[0][val_locs], pos_locs[1][val_locs], pos_locs[2][val_locs]] = 0
	num_pos = num_regions/2

if len(neg_locs[0]) + num_pos > num_regions:
	val_locs = random.sample(range(len(neg_locs[0])), len(neg_locs[0]) - num_pos)
	y_is_box_valid[0, neg_locs[0][val_locs], neg_locs[1][val_locs], neg_locs[2][val_locs]] = 0

Since there are far more negative anchors than positive ones, a cap on the total number of regions (256) is set here, and the positive and negative anchors are randomly subsampled so the two stay balanced.

y_rpn_cls = np.concatenate([y_is_box_valid, y_rpn_overlap], axis=1)
y_rpn_regr = np.concatenate([np.repeat(y_rpn_overlap, 4, axis=1), y_rpn_regr], axis=1)

return np.copy(y_rpn_cls), np.copy(y_rpn_regr)

Finally, the function returns two values, y_rpn_cls and y_rpn_regr, which are used respectively to supervise whether each anchor contains an object and the box-regression offsets.
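The way these targets are laid out suggests how they are consumed: in y_rpn_cls the first num_anchors channels (y_is_box_valid) act as a mask that selects only the sampled pos/neg anchors, and the last num_anchors channels (y_rpn_overlap) are the 0/1 labels; y_rpn_regr is built the same way, with a repeated mask in front of the 4-value targets. Below is a hedged numpy sketch of a masked RPN classification loss in that spirit, not the repo’s actual loss function, assuming y_pred has the same (1, num_anchors, H, W) layout as the target:

import numpy as np

def rpn_cls_loss_sketch(y_rpn_cls, y_pred, num_anchors, eps=1e-4):
	valid = y_rpn_cls[:, :num_anchors, :, :]    # 1 for anchors that should contribute
	labels = y_rpn_cls[:, num_anchors:, :, :]   # 1 = object, 0 = background
	y_pred = np.clip(y_pred, eps, 1.0 - eps)
	bce = -(labels * np.log(y_pred) + (1 - labels) * np.log(1 - y_pred))
	# average the cross-entropy over the valid (pos/neg) anchors only
	return np.sum(valid * bce) / (np.sum(valid) + eps)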

Let’s take a look at the structure of RPN layer in the network:

def rpn(base_layers,num_anchors):
    x = Convolution2D(512, (3, 3), padding='same', activation='relu', kernel_initializer='normal', name='rpn_conv1')(base_layers)

    x_class = Convolution2D(num_anchors, (1, 1), activation='sigmoid', kernel_initializer='uniform', name='rpn_out_class')(x)
    x_regr = Convolution2D(num_anchors * 4, (1, 1), activation='linear', kernel_initializer='zero', name='rpn_out_regress')(x)

    return [x_class, x_regr, base_layers]

A 3×3 convolution slides over the base feature map, and then a 1×1 convolution produces num_anchors channels: at each of the (w × h) positions, the sigmoid activation says whether the corresponding anchor contains an object, and this output is trained against the y_rpn_cls we just calculated. In the same way, a second 1×1 convolution outputs x_regr with num_anchors * 4 channels, trained against y_rpn_regr.
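As a usage illustration, here is one way to attach the rpn() head above to a feature extractor. I use VGG16 from keras.applications purely as a stand-in (the repo builds its own base network, so treat this as an assumed example, not its actual wiring):

from keras.applications.vgg16 import VGG16
from keras.layers import Input
from keras.models import Model

num_anchors = 9  # e.g. 3 scales x 3 ratios

img_input = Input(shape=(None, None, 3))
base = VGG16(include_top=False, weights='imagenet', input_tensor=img_input)
base_layers = base.get_layer('block5_conv3').output  # feature map before the last pooling

x_class, x_regr, _ = rpn(base_layers, num_anchors)
model_rpn = Model(img_input, [x_class, x_regr])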

After the region proposals are obtained, the other important idea is ROI pooling, which converts feature-map regions of different shapes into a fixed shape so they can be fed to the fully connected layers for the final prediction. I will write that up once I have worked through it. Since I am still learning, there may be plenty of misunderstandings here; corrections are welcome~