The Current State of Progress in Robotics and What Is to Come Soon

Dmytro Kuzmenko
6 min read · Jan 5, 2024


Introduction

Hello everyone! Once again, it has been a while since my last article was published on Medium. A lot has changed: I have become an Assistant Professor at NaUKMA and started my PhD in CS, with a research focus on Reinforcement Learning and Robotics. This is going to be my first article on the domain I want to dedicate myself to for at least the next four years: Robotics. I have been actively familiarizing myself with different tasks, benchmarks, robots, and research setups, and I have not yet decided whether I want to pursue manipulation or navigation tasks.

Well, I lean toward manipulation, inspired by DeepMind’s recent summary of their methods and results [1] for RT-2, RT-Trajectory, and SARA [7], but I haven’t drafted and ordered my robotic parts yet, so there is still time to think :)

Deep RL or VLA

The central dilemma in robotic manipulation and navigation, the way I see it, is whether to use refined Deep Reinforcement Learning approaches or to build on recent state-of-the-art pipelines and models.

Vision. Language. Action.

The latter rest on such cornerstones as large foundation models like PaLI-X [2], a Vision-Language Model (VLM) by Google Research, and the technique of action-space discretization that aligns robot data with a VLM, most notably within the RT-2 [3] framework. This setup forms a Vision-Language-Action (VLA) model, which I shall go over in a bit. To put it simply, the robotic control policy works as follows: the VLM accepts an image from the camera and describes the surroundings for the LLM; the LLM, in turn, generates instructions and actions and filters them; and the robot, trained on a large-scale, diverse task dataset, performs the task. A lot is going on and the system is quite complex, but it works very well in practice, although it is not yet a completely finalized product.
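
To make that loop a bit more concrete, here is a tiny sketch of the closed-loop structure I described above. The `describe_scene`, `propose_action`, and `execute` functions are toy stand-ins I made up so the example runs end to end; they are not part of RT-2 or any real robot API, and a real VLA collapses the describe/propose steps into a single model that emits action tokens directly.

```python
# Hypothetical sketch of a VLA-style closed control loop (not the RT-2 API).
# The three "models" below are toy stand-ins so the example runs end to end.
from dataclasses import dataclass
import random

@dataclass
class Action:
    dx: float
    dy: float
    gripper: int  # 0 = open, 1 = closed

def describe_scene(image) -> str:
    # Stand-in for the VLM: turn camera pixels into a text description.
    return "a red block on the table, gripper is open"

def propose_action(description: str, instruction: str) -> Action:
    # Stand-in for the LLM head: map text to a low-level action.
    # A real VLA emits discretized action tokens instead.
    return Action(dx=random.uniform(-0.01, 0.01), dy=0.01, gripper=0)

def execute(action: Action) -> None:
    # Stand-in for the robot controller.
    print(f"move ({action.dx:+.3f}, {action.dy:+.3f}), gripper={action.gripper}")

def control_loop(instruction: str, steps: int = 5) -> None:
    for _ in range(steps):
        image = None                      # a camera frame would go here
        description = describe_scene(image)
        action = propose_action(description, instruction)
        execute(action)                   # closed loop: act, then observe again

control_loop("pick up the red block")
```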

Deep Reinforcement Learning

On the other hand, we have the well-known application of Reinforcement Learning (RL) to robotics. The most standard flavor is online RL, which doesn’t take advantage of any pretraining, can generalize poorly to out-of-distribution cases, and sees little variety beyond the experience the agent collects itself.

Offline RL

There is also offline RL, which leverages large-scale RL datasets to pretrain a policy and then refines it during fine-tuning. In many scenarios offline RL is quite beneficial, especially its modifications: for example, the authors of the Conservative Q-Learning (CQL) [4] paper achieved a substantial outperformance of existing offline RL methods, often learning policies that attain 2–5 times higher final return. While offline RL is quite expensive in terms of data requirements, it helps alleviate issues with out-of-distribution generalization, eases policy convergence, and makes the resulting policy much more robust.
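
To give a flavor of what “conservative” means here, the sketch below shows the core CQL regularizer for a discrete-action critic: push Q-values down over all actions (via a log-sum-exp) while pushing them up on the actions actually present in the dataset. This is my own minimal PyTorch illustration of the idea, not the authors’ implementation, and the tensors are random stand-ins for a real offline batch.

```python
# Minimal sketch of the CQL regularizer for discrete actions (illustrative only).
import torch

batch_size, n_actions = 32, 6
q_values = torch.randn(batch_size, n_actions, requires_grad=True)  # Q(s, ·) from a critic
dataset_actions = torch.randint(0, n_actions, (batch_size, 1))     # actions seen in the data

# Conservative term: logsumexp over all actions minus Q on the dataset actions.
q_data = q_values.gather(1, dataset_actions).squeeze(1)
cql_penalty = (torch.logsumexp(q_values, dim=1) - q_data).mean()

# In practice this penalty is added, with some weight alpha, to the usual Bellman loss:
# loss = bellman_error + alpha * cql_penalty
print(cql_penalty.item())
```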

A typical RL-based robotic control system involves setting up a policy, selecting an algorithm such as Soft Actor-Critic, some variant of Q-learning, PPO, DDPG, or something else entirely, and maximizing the expected return by penalizing bad decisions and rewarding good ones; the visual control part is, of course, also part of the system. Then, hopefully, the policy converges to an optimal one. There is nothing wrong with RL for robotic control, and there are great recent examples of its usage in robotics: NoMaD [5], for instance, introduces a diffusion policy that can model highly complex multimodal distributions (when the robot is at a junction, the policy might need to assign high probabilities to the left and right turns, but low probability to any action that might result in a collision).
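
As a reference point for that generic recipe, here is a tiny, self-contained REINFORCE-style loop on a made-up one-step task: the policy has to learn which of four actions hides the highest reward. The environment, reward values, and hyperparameters are my own toy choices and have nothing to do with a real robot.

```python
# Toy REINFORCE loop: learn which action hides the highest reward (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)
n_actions = 4
true_rewards = torch.tensor([0.1, 0.2, 1.0, 0.3])    # the hidden "environment"

policy = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
state = torch.ones(1, 1)                              # dummy single state

for step in range(500):
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    reward = true_rewards[action] + 0.05 * torch.randn(1)   # noisy feedback
    loss = -(dist.log_prob(action) * reward).mean()          # policy-gradient estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(policy(state).argmax().item())  # should converge to action 2
```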

Gap Between the Two

The problem with current RL in robotics is that it is not yet at the powerhouse level of VLAs. It’s a question of time; I’m quite sure we will see a rise in the number of papers and approaches that focus heavily on RL before long. But for now, the power of vision and text Transformers, and the many ways to optimize them, is the decisive advantage that produces powerful results and pushes the robotics domain further (something we will be seeing a lot of in 2024).

That concludes my thoughts on the matter and some brief updates. Now I’d like to give a very quick overview of what Google DeepMind shared with us in [1]. I will go over RT-2 and SARA, highlighting the most important takeaways.

RT-2: Vision-Language-Action Models

TL;DR: visual perception comes first; the VLM generates a description for the LLM; the LLM responds with a set of commands that takes the observed objects into account; the robot performs the command (with or without Chain-of-Thought). PaLI-X 5B is a great backbone, while PaLI-X 55B is the most powerful one; co-fine-tuning > fine-tuning > training from scratch.

Co-fine-tuning

The authors introduce co-fine-tuning (a combination of fine-tuning and co-training in which some of the old vision and text data is kept around) of an existing VLM with robot data. The robot data includes the current image, the language command, and the robot action at that particular time step.
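
The mixing itself can be pictured as drawing each batch partly from the original web-scale vision-language data and partly from the robot demonstrations. The toy datasets, the 50/50 ratio, and the helper name below are my own assumptions for illustration, not the recipe from the paper.

```python
# Illustrative co-fine-tuning batch mixer (toy data and an assumed ratio, not the RT-2 recipe).
import random

web_vl_data = [("photo of a cat", "a cat sitting on a sofa")] * 1000              # original VLM data
robot_data = [("camera frame + 'pick up the can'", "<discretized action tokens>")] * 200

def cofinetune_batches(num_batches: int, batch_size: int = 8, robot_fraction: float = 0.5):
    """Yield batches that mix web vision-language samples with robot samples."""
    for _ in range(num_batches):
        n_robot = int(batch_size * robot_fraction)
        batch = random.sample(robot_data, n_robot) + random.sample(web_vl_data, batch_size - n_robot)
        random.shuffle(batch)
        yield batch

for batch in cofinetune_batches(2):
    print(len(batch), "samples, first one:", batch[0][0])
```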

To make RT-2 easily compatible with large, pre-trained vision-language models, the authors take a simple approach: they represent robot actions as another language, which can be cast into text tokens and trained together with Internet-scale vision-language datasets. Essentially, the actions are discretized and fed into the language model as tokens.

Fig 1. Actions are represented as text strings and can be considered as commands for the robot. This simple representation makes it straightforward to fine-tune any existing VLM and turn it into a VLA. During inference, the text tokens are de-tokenized into robot actions, enabling closed-loop control (https://robotics-transformer2.github.io/).
Fig 2. This is what the key idea looks like: discretize actions by tokenizing them; as they are easily processed by large transformer models, de-tokenize them in the end for a meaningful output command for a robot (https://robotics-transformer2.github.io/).
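
The tokenization itself fits in a few lines of code. Below is my own minimal sketch of the general idea: discretize each dimension of a continuous action into an integer bin, write the bins out as a text string, and invert the mapping at inference time. The bin count, action range, and formatting are assumptions for illustration, not the exact RT-2 scheme.

```python
# Minimal action (de-)tokenization sketch (assumed 256 bins and range, not the exact RT-2 scheme).
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0   # assumed normalized action range per dimension

def action_to_text(action: np.ndarray) -> str:
    """Discretize each action dimension into an integer bin and join them as a string."""
    bins = np.clip(((action - LOW) / (HIGH - LOW) * (N_BINS - 1)).round(), 0, N_BINS - 1)
    return " ".join(str(int(b)) for b in bins)

def text_to_action(text: str) -> np.ndarray:
    """De-tokenize the string back into a continuous action vector."""
    bins = np.array([int(t) for t in text.split()], dtype=np.float64)
    return LOW + bins / (N_BINS - 1) * (HIGH - LOW)

action = np.array([0.1, -0.25, 0.0, 0.8])   # e.g. end-effector deltas
tokens = action_to_text(action)             # something like "140 96 128 230"
recovered = text_to_action(tokens)
print(tokens, recovered)                    # recovered ≈ original, up to bin resolution
```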

Main architecture choices:

  1. The model size: 5B vs 55B for the RT-2 PaLI-X variant.
  2. Training recipe: training the model from scratch vs fine-tuning vs co-fine-tuning.
Fig 3. RT-2 performance on different benchmarks against RT-1 and VC-1. There are two backbones for RT-2 — PaLM-E and PaLI-X (https://robotics-transformer2.github.io/).
Fig 4. The difference in performance in unseen scenarios for different ways of training. Co-fine-tuning is a clear leader (https://robotics-transformer2.github.io/).

RT-2 can exhibit signs of chain-of-thought reasoning similar to vision-language models. The authors qualitatively observe that RT-2 with chain-of-thought reasoning can handle more sophisticated commands because it is given room to plan its actions in natural language first.
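
For intuition, a chain-of-thought rollout might look roughly like the made-up example below, with a short natural-language plan preceding the action tokens. The field names and token values are purely my illustration, not the paper’s exact format:

Instruction: bring me something to quench my thirst.
Plan: pick up the water bottle.
Action: 1 132 114 240 98 101 127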

SARA: Self-Adaptive Robust Attention

TL;DR: there is a type of attention mechanism from 2020, Performer-ReLU attention [6], which reduces the quadratic complexity of attention computation to a linear one. The SARA authors adopt this Performer attention and introduce up-training, which yields a 14% speed increase and a >10% performance increase within the RT-2 framework. SARA does well on both PaLI-ViT (spatial) and PCT (point-cloud) tasks.
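
To see where the linear complexity comes from, here is my own simplified, non-causal sketch of ReLU-based linear attention: applying a ReLU feature map to queries and keys lets you compute K^T V once (linear in sequence length) instead of materializing the full L x L attention matrix. This is an illustration of the idea, not the Performer or SARA codebase.

```python
# Simplified non-causal ReLU linear attention (illustrative, not the Performer/SARA code).
import torch
import torch.nn.functional as F

def relu_linear_attention(q, k, v, eps=1e-6):
    """q, k: (batch, seq_len, dim); v: (batch, seq_len, dim_v).
    Cost is O(seq_len * dim * dim_v) instead of O(seq_len^2 * dim)."""
    q, k = F.relu(q), F.relu(k)                        # positive feature maps
    kv = torch.einsum("bld,ble->bde", k, v)            # K^T V, computed once
    z = torch.einsum("bld,bd->bl", q, k.sum(dim=1))    # per-query normalizer
    out = torch.einsum("bld,bde->ble", q, kv)
    return out / (z.unsqueeze(-1) + eps)

q, k, v = (torch.randn(2, 1024, 64) for _ in range(3))
print(relu_linear_attention(q, k, v).shape)            # torch.Size([2, 1024, 64])
```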

Up-training

At some point during the pre-training or fine-tuning of the original model, the regular attention layers are replaced, and training then continues with the efficient attention mechanism.

In the RT-2 setting, efficient attention replaces regular attention in the ViT encoder tower of the PaLI model. When combined with other methods, such as keeping a short history of frames and applying new action-tokenization techniques, SARA provides accuracy gains and substantial speed-ups.
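
Mechanically, up-training can be pictured as swapping the attention module inside an already-trained block and then simply resuming training so the rest of the network adapts to the new mechanism. The toy block below is entirely my own; it only illustrates the swap-and-continue pattern, not the SARA implementation.

```python
# Toy illustration of "up-training": swap attention in a pretrained block, then keep training.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one ViT encoder block with a pluggable attention module."""
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)  # softmax attention
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        return x + self.mlp(attn_out)

class LinearSelfAttention(nn.Module):
    """Drop-in efficient replacement using a ReLU linear-attention kernel (as sketched above)."""
    def __init__(self, dim=64):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x, *_args, **_kwargs):
        q, k, v = torch.relu(self.q(x)), torch.relu(self.k(x)), self.v(x)
        kv = torch.einsum("bld,ble->bde", k, v)
        z = torch.einsum("bld,bd->bl", q, k.sum(dim=1)) + 1e-6
        return torch.einsum("bld,bde->ble", q, kv) / z.unsqueeze(-1), None

block = ToyBlock()                       # imagine this block has already been (pre-)trained
block.attn = LinearSelfAttention()       # step 1: replace the attention module
optimizer = torch.optim.Adam(block.parameters(), lr=1e-4)
x = torch.randn(2, 196, 64)              # step 2: continue training as usual
loss = block(x).pow(2).mean()            # dummy objective just to show the update
optimizer.zero_grad()
loss.backward()
optimizer.step()
```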

Fig 5. PaLI-X backbone of RT-2 scheme. Frames are encoded via SARA variants of the ViTs (sViT). Text instruction is separately pre-processed by the text Transformer (TT). In the fuser, all resulting embeddings are concatenated and interact with each other via self-attention (https://sites.google.com/view/rtsara/).
Fig 6. Left: Speed tests for PCT benchmark. Right: Speed tests (on a CPU) for PaLI-ViT (https://sites.google.com/view/rtsara/).

Conclusions

I hope you had a nice read and enjoyed this little overview of Robotic Transformers and their components, along with some thoughts of mine on Reinforcement Learning for Robotics! Don’t forget to follow if you liked the content; there will be more. Cheers!

References

  1. Shaping the future of advanced robotics, The Google DeepMind Robotics Team.
  2. PaLI, Google Research.
  3. RT-2, Google DeepMind.
  4. Conservative Q-Learning for Offline Reinforcement Learning.
  5. NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration.
  6. Rethinking Attention with Performers, 2020.
  7. SARA: Self-Adaptive Robust Attention.
