1 Introduction

CoRL (the Conference on Robot Learning) is a top robot-learning conference that was established last year; this year's edition is only its second. The conference homepage is:

http://www.robot-learning.org/

More than 70 papers were accepted this year, many of them related to robotic arms. In this blog post we will go through the papers and see where the cutting edge of Robotic Manipulation/Grasping lies.

2 Paper List

[1] Grasp2Vec: Learning Object Representations from Self-Supervised Grasping

[2] Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects

[3] Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation

[4] Reinforcement Learning of Active Vision for Manipulating Objects under Occlusions

[5] QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

[6] Sim-to-Real Reinforcement Learning for Deformable Object Manipulation

[7] Task-Embedded Control Networks for Few-Shot Imitation Learning

[8] SURREAL: Open-Source Reinforcement Learning Framework and Robot Manipulation Benchmark

[9] ROBOTURK: A Crowdsourcing Platform for Robotic Skill Learning through Imitation


3 Grasp2Vec: Learning Object Representations from Self-Supervised Grasping

This paper is from Google Brain and Sergey Levine's team at UC Berkeley.

https://sites.google.com/site/grasp2vec/

The task in this paper is goal-conditioned grasping: given a picture of an object as the goal, the robot arm must grasp the corresponding object. Setting this paper's methodology aside, there is a very direct solution to this problem:

1) First detect the objects and draw a bounding box around each one

2) Match the goal image against the detections to find the corresponding object

3) Train the robot arm to grasp the object given the object's feature information as input

The trouble with this approach is that it requires a lot of manual annotation, which is time-consuming and laborious. Off-the-shelf object detectors can be used, but they were not trained for this scene, so there will inevitably be deviations. The main question this paper considers is therefore whether the object representation can be learned through self-supervision, so as to facilitate grasping.

The idea is simple: there are only so many objects in the scene, and if the arm takes one away, exactly one object disappears from the image. This change can be used as a signal for self-supervised representation learning:


To construct this self-supervised loss, the authors make an interesting assumption:


That is, the feature of the scene observed before the grasp minus the feature of the scene after the grasp should equal the feature of the object the manipulator picked up. It is a hypothesis that does not seem entirely plausible, but the experiments show it works.
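As a rough reconstruction of that assumption (the symbols below are my own paraphrase, not necessarily the paper's notation): with a scene encoder $\phi_s$ and an object/goal encoder $\phi_o$, the claim is

$$\phi_s(s_{\text{pre}}) - \phi_s(s_{\text{post}}) \approx \phi_o(o_{\text{grasped}}),$$

and a metric-learning objective pulls the left-hand and right-hand sides together for the grasped object while pushing them apart for other objects.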

As you can see, self-supervised learning in this way carries almost no labeling cost; you only need to run repeated grasping trials to collect the data.

After training, the authors show the effect of the self-supervised features through heatmaps, which directly achieve the localization that would otherwise require object detection and matching:


Building on this feature extraction ability, standard Q-learning is used for the subsequent grasping policy training.

A little comment:

1) Idea innovation: ⭐️⭐️⭐️⭐️

2) Practical value: ⭐️⭐️ It can greatly reduce manual labeling work, but considering that manually labeled data is not that expensive, manual labeling might actually achieve better results.

3) Difficulty of reproduction: ⭐️ very difficult

As we know, Google Brain has a robot arm farm, so reproducing a Google Brain paper is extremely difficult for anyone without that hardware.

4 Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation

This paper comes from MIT and shares the research focus of the previous one: learning object representations. It is, however, stronger than the previous one and won this year's CoRL Best Paper Award. Media coverage and the code were released earlier, so let's take a look at what this paper has to offer.

https://github.com/RobotLocomotion/pytorch-dense-correspondence

Let’s look at the methodology in detail.

The core idea of this paper is to build a pixel-level image descriptor, i.e. a dense visual object descriptor. With this description, no matter how the camera viewpoint changes, or even how the object itself deforms, the descriptor stays invariant. With this foundation, you can use it for object grasping.

So how do you do that?

The authors use a ResNet to construct the Dense Object Net. The input image has dimensions W x H x 3, and the output has dimensions W x H x D, where D is the feature dimension associated with each pixel. To illustrate what this D means, let's first look at how the W x H x D output is used, and then how it is trained.
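As a minimal sketch of what such a network could look like (this is my own PyTorch illustration, not the authors' released code; the descriptor dimension and the upsampling choice are assumptions):

```python
# Hedged sketch: a fully convolutional network that maps a W x H x 3 image
# to a W x H x D descriptor image (not the authors' exact architecture).
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class DenseDescriptorNet(nn.Module):
    def __init__(self, descriptor_dim=3):
        super().__init__()
        backbone = torchvision.models.resnet34()  # pretrained weights could be loaded here
        # Keep everything up to the last conv stage (drop avgpool / fc).
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.head = nn.Conv2d(512, descriptor_dim, kernel_size=1)

    def forward(self, img):                        # img: (B, 3, H, W)
        x = self.features(img)                     # (B, 512, H/32, W/32)
        x = self.head(x)                           # (B, D, H/32, W/32)
        # Upsample back to full resolution so every pixel gets a descriptor.
        return F.interpolate(x, size=img.shape[-2:], mode="bilinear",
                             align_corners=False)  # (B, D, H, W)
```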

We construct a distance that measures the feature difference between pixels in two different images, namely the ordinary L2 distance:


Here f(·) is the Dense Object Net, and f(I)(u) is the 1 x 1 x D descriptor corresponding to pixel u.
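The equation image is not reproduced here; reconstructed from the surrounding description, the pixel distance is presumably the plain L2 distance between the two per-pixel descriptors:

$$D(I_a, u_a, I_b, u_b) = \lVert f(I_a)(u_a) - f(I_b)(u_b) \rVert_2 .$$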

So, suppose we want to do position-specific grasping: the user specifies a position such as the heel of a shoe, and the robot arm should grasp the shoe by the heel. We can compare the descriptor of the pixel the user specified in the reference picture with the descriptors of the image the arm currently sees, and pick the pixel whose descriptor distance is smallest; that pixel is taken to be the corresponding heel position. This is the direct value of the dense object descriptor: it enables very precise localization, which the previous paper could not do.
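Concretely, that lookup step could be as simple as the following sketch (a hypothetical helper, assuming descriptor tensors in PyTorch layout):

```python
# Hedged sketch of matching a user-specified reference pixel into the live
# camera image using the descriptor distance above.
import torch

def find_best_match(ref_descriptor, live_descriptors):
    """ref_descriptor: (D,) descriptor at the user-clicked pixel.
    live_descriptors: (D, H, W) descriptor image of the current camera view.
    Returns the (row, col) of the pixel with the smallest L2 distance."""
    d, h, w = live_descriptors.shape
    diff = live_descriptors - ref_descriptor.view(d, 1, 1)
    dist = diff.norm(dim=0)            # (H, W) distance map
    idx = torch.argmin(dist)           # flattened index of the best match
    return divmod(idx.item(), w)       # (row, col)
```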

So how is the Dense Object Net trained?

The idea is relatively simple: a self-supervised pixelwise contrastive loss. For two different pictures of the same object we find pairs of matching pixels and pairs of non-matching pixels, then train the network to minimize the descriptor distance between matching pixels and to push the distance between non-matching pixels above a margin, as sketched just below.
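The loss figure is likewise not reproduced here; written out in the usual contrastive form (with $M$ a margin hyperparameter and $D(\cdot)$ the pixel distance above, averaged over sampled pixel pairs), it looks roughly like:

$$\mathcal{L} = \frac{1}{N_{\text{match}}}\sum_{\text{matches}} D(\cdot)^2 \;+\; \frac{1}{N_{\text{non-match}}}\sum_{\text{non-matches}} \max\bigl(0,\, M - D(\cdot)\bigr)^2 .$$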


However, in order to obtain these matching and non-matching points, an RGB-D camera is needed: the object is first reconstructed in 3D, and the point correspondences are then found from that reconstruction. The computational cost here is somewhat higher, and to build the description of an object the robot has to move its camera around the object. 3D reconstruction is a technology in its own right, and it is of course mature; it is even applied in the Huawei Mate 20 Pro.

Another important property of this method is that it is task-agnostic and generalizes well: it works on new objects, and can even distinguish different instances of the same object class.

A little comment:

1) Idea innovation: ⭐️⭐️⭐️⭐️⭐️ This method does not build a point cloud directly; instead it constructs the corresponding feature representation on the 2D image, which is quite different, and achieving such good results at the pixel level is almost unbelievable.

2) Practical value: ⭐️⭐️⭐️⭐️⭐️ Because each object can store its feature description independently, this method can be applied widely, so it is very valuable.

3) Difficulty of reproduction: ⭐️⭐️ It is still difficult to reproduce the whole pipeline: from 3D reconstruction to extracting matching points to training the neural network, a large amount of engineering is required. However, the authors have open-sourced the neural network part and also provide a dataset of matching points, which helps a great deal.

5 Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects

This paper continues the theme of the previous two: the core is vision. From this we can see that the application of computer vision to robots still has a lot of room for development.


The robot-arm grasping problem can be divided directly into two parts: the vision side and the mechanical side. Google tends to go more end-to-end, but it is often better to separate the problems. The vision side is arguably the more important one: as long as the objects can be distinguished, the mechanical side can also be handled by traditional control methods. That is why we see so many purely vision-focused studies.

Back to this paper, its idea is different: use synthetic simulation data to learn object pose estimation.


The core contribution of this paper lies mainly in the use of photorealistic simulation images; to put it bluntly, the more realistic the simulation, the better the effect. As for the neural-network training details, we will not analyze them in detail here.

1) Idea innovation: ⭐️⭐️ Maybe not too much innovation, but the effect is good.

2) Practical value: ⭐️⭐️

3) Difficulty of reproduction: ⭐️⭐️⭐️ The algorithm itself poses no major problems; the main work is the dataset. Data makes everything easier.

6 Reinforcement Learning of Active Vision for Manipulating Objects under Occlusions


The research direction of this paper differs from the previous ones. One way to do research is to tackle the problems everyone is already considering; another is to construct new problems. In robot learning, new problems are easy to create, and this paper is an example. Usually we only study the motion of the robot arm, so can the camera move as well? One might ask what the point of moving the camera is. There is a point, for example when the object is occluded. If the new problem makes sense, it can be studied. This is basically one standard recipe for churning out papers: not much innovation in the method, but changing the problem a little counts as innovation. The other standard recipe is the reverse: the same problem with a slightly different method. In terms of method, this paper uses a standard actor-critic, with some changes to the inputs and outputs because both the camera and the robot arm need to be controlled, so I will not analyze it in detail here.

1) Idea innovation: ⭐️⭐️

2) Practical value: ⭐️⭐️

3) Difficulty of reproduction: ⭐️⭐️⭐️

7 QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

This paper won CoRL’s Best Systems Paper Award. We have already analyzed it in our previous blog, so we won’t re-analyze it here.

Flood Sung: Where is the frontier of robotic grasping? (zhuanlan.zhihu.com)

The reason this paper attracts attention is not the novelty of the method, but that large-scale end-to-end training pushes the grasping performance to a very high level. We should ask: why do robot arms need neural networks at all? What are the drawbacks of traditional control? From this paper we can see that neural networks can learn control strategies that traditional control cannot, and that is enough. In the long run, end-to-end will beat non-end-to-end.

1) Idea innovation: ⭐️⭐️⭐️⭐️

2) Practical value: ⭐️⭐️⭐️⭐️

3) Difficulty of reproduction: ⭐️

8 Sim-to-Real Reinforcement Learning for Deformable Object Manipulation

This paper is similar to Reinforcement Learning of Active Vision for Manipulating Objects under Occlusions in that both study a less-common problem. This paper studies the manipulation of deformable objects, such as towels.


https://sites.google.com/view/sim-to-real-deformable

The results are good. The method uses an improved version of DDPG that combines the following ingredients (a sketch of the N-step return ingredient appears after the list):

1) Prioritised Replay

2) N-step returns

3) DDPGfD

4) Behavioural Cloning

5) Reset to demonstration

6) TD3

7) Asymmetric actor-critic

In other words, it draws on the off-policy continuous-control RL algorithms developed by DeepMind.
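As a minimal sketch of one of those ingredients, N-step returns, here is how the bootstrapped target could be computed (a generic illustration, not the paper's code; the discount factor and trajectory format are assumptions):

```python
# Hedged sketch: computing an N-step return target for an off-policy critic.
# rewards/dones describe one trajectory segment; bootstrap_value is the
# critic's estimate at the state N steps ahead.
def n_step_return(rewards, dones, bootstrap_value, gamma=0.99):
    """Return sum_{k<N} gamma^k r_k + gamma^N * V(s_N), truncated at episode end."""
    ret = bootstrap_value
    for r, done in zip(reversed(rewards), reversed(dones)):
        ret = r + gamma * ret * (1.0 - float(done))
    return ret

# Example: 3-step return with no terminal states along the way.
target = n_step_return([0.0, 0.0, 1.0], [False, False, False], bootstrap_value=0.5)
```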

1) Idea innovation: ⭐️⭐️

2) Practical value: ⭐️⭐️

3) Difficulty of reproduction: ⭐️⭐️⭐️⭐️

9 Task-Embedded Control Networks for Few-Shot Imitation Learning

This paper is worth discussing because it studies a very important problem, few-shot imitation learning. This is a problem that Chelsea Finn's MAML has addressed before. However, MAML has serious drawbacks: it does not scale to large networks and is inconvenient to train, so this paper more or less beats the MAML approach. The key is that the method is very simple: deep metric learning is used to construct an embedding for each task:


The embeddings of different tasks are pushed apart while embeddings of the same task are pulled as close together as possible. Moreover, the demo only needs image (state) data, not the corresponding action data. When I first saw this paper's method I had a question: how do you feed a long sequence of demo data into the network?

It turns out the paper only uses the images at the beginning and end of the demo. The tasks constructed for few-shot imitation learning are really very simple, namely placing an object in a bowl while varying the object type and the bowl style, and how the middle of the motion is executed does not matter. So it is not so much a task embedding as a goal embedding; it does not really involve much imitation learning. (A sketch of this kind of task-embedding loss follows below.)
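As a minimal sketch of that goal/task-embedding idea (the encoder, margin, and cosine-similarity hinge below are my own assumptions, not necessarily the paper's exact formulation):

```python
# Hedged sketch of a task-embedding metric loss: embed the first and last
# frames of a demo, normalize, and pull same-task embeddings together while
# pushing different-task embeddings apart with a hinge on cosine similarity.
import torch
import torch.nn.functional as F

def embed_demo(encoder, first_frame, last_frame):
    """Concatenate first/last frames channel-wise and return a unit-norm task embedding."""
    x = torch.cat([first_frame, last_frame], dim=1)   # (B, 6, H, W)
    return F.normalize(encoder(x), dim=-1)            # (B, E)

def task_embedding_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss: same-task similarity should exceed different-task similarity by a margin."""
    sim_pos = (anchor * positive).sum(dim=-1)          # cosine sim (embeddings are unit-norm)
    sim_neg = (anchor * negative).sum(dim=-1)
    return F.relu(margin + sim_neg - sim_pos).mean()
```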

This leads to an interesting observation: the current experimental design for few-shot imitation learning is not very good. More complex experiments are needed, ones that at least require the robot to pay attention to the middle of the demo; otherwise it is not really imitation.

1) Idea innovation: ⭐️⭐️

2) Practical value: ⭐️⭐️

3) Difficulty of reproduction: In general, the biggest significance of this paper is not that its results beat MAML; after all, MAML has already been thoroughly beaten in few-shot learning. What should be said is that 1) few-shot learning methods can be transferred to robot applications, and 2) this problem is still so cutting-edge that the experimental designs remain too simple, leaving great room for development.

10 SURREAL: Open-Source Reinforcement Learning Framework and Robot Manipulation Benchmark

11 ROBOTURK: A Crowdsourcing Platform for Robotic Skill Learning through Imitation

The last two papers are from Fei-Fei Li's group, and they are very characteristic of her work: building platforms.

SURREAL


Benchmarks for robot learning problems are genuinely lacking. ROBOTURK in particular collects imitation-learning data at large scale through crowdsourcing, which is really interesting. Fei-Fei Li has also set up a dedicated robotics lab, which is worth following:

http://pair.stanford.edu/

12 Summary

The above basically covers the most cutting-edge papers in the field of Robotic Manipulation, from which we can see:

1) Current research at least targets arbitrary objects rather than fixed object types. With the help of deep learning for computer vision, I believe this part can be improved further.

2) Research platforms are improving, and more and more big names will enter the field of robot learning, which looks set to be the next boom area. It is hard to get people to work on results like Google Brain's when they do not even have the hardware to replicate them.

3) There is no significant improvement at the algorithm level, but that does not prevent the results from improving. At present, platform factors largely limit what the algorithms can achieve.

Overall, the development of this field is very exciting, and I look forward to what comes next!