PaLM-E: An Embodied Multimodal Language Model
Key Takeaways
- Demonstrates that a generalist, transfer-learned, multi-embodiment decision-making agent can be trained by mixing embodied data into the training of a multimodal LLM.
- Introduces novel architectural ideas such as neural scene representations and entity-labelling multimodal tokens.
- Inputs such as images and state estimates are embedded into the same latent embedding space as language tokens and processed by the self-attention layers of a Transformer-based LLM in the same way as text.
- The main idea is therefore to inject tokens from images and sensor modalities into the language embedding space of a pre-trained language model (see the sketch after this list).
- PaLM-E is trained to generate plans directly, without relying on auxiliary models for grounding. This enables direct integration of the rich semantic knowledge stored in pre-trained LLMs into the planning process (illustrated below).
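
Below is a minimal sketch of the injection step, assuming a decoder-only LLM that accepts precomputed input embeddings and a pooled feature vector per observation. The class name, `tokens_per_obs`, and the single linear projection are illustrative choices, not the paper's exact encoders (PaLM-E uses, e.g., ViT image encoders and neural scene representations).

```python
import torch
import torch.nn as nn


class MultimodalPrefixEncoder(nn.Module):
    """Maps image/state features to 'soft tokens' living in the LLM's embedding space."""

    def __init__(self, obs_dim: int, llm_embed_dim: int, tokens_per_obs: int = 4):
        super().__init__()
        self.tokens_per_obs = tokens_per_obs
        self.llm_embed_dim = llm_embed_dim
        # Affine projection: one observation feature vector -> several LLM-sized token embeddings.
        self.proj = nn.Linear(obs_dim, tokens_per_obs * llm_embed_dim)

    def forward(self, obs_features: torch.Tensor) -> torch.Tensor:
        # obs_features: (batch, obs_dim), e.g. a pooled image embedding or a robot state estimate.
        x = self.proj(obs_features)
        return x.view(obs_features.shape[0], self.tokens_per_obs, self.llm_embed_dim)


def build_input_embeddings(text_embeds: torch.Tensor,
                           obs_tokens: torch.Tensor,
                           insert_pos: int) -> torch.Tensor:
    # text_embeds: (batch, seq_len, d); obs_tokens: (batch, k, d).
    # Splice the observation tokens into the text-embedding sequence at insert_pos, then
    # feed the result to the LLM exactly as it would process ordinary text embeddings.
    return torch.cat(
        [text_embeds[:, :insert_pos], obs_tokens, text_embeds[:, insert_pos:]], dim=1
    )
```

The design choice this illustrates is that the LLM itself is unchanged: only the projection (and optionally the observation encoder) has to produce vectors of the right dimensionality, and the self-attention layers treat them like any other token embeddings.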
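
To make the last takeaway concrete, here is an illustrative view of how a multimodal prompt and a decoded plan fit together. The `<img>` marker, the exact wording of the plan steps, and the re-planning loop described in the comments are assumptions for illustration, not verbatim from the paper.

```python
# Illustrative only: observation tokens are spliced in at the <img> marker position
# (using build_input_embeddings above); everything else is ordinary text.
prompt = "Given <img>. Q: How to bring me the rice chips from the drawer? A:"

# The LLM decodes the plan directly as text, e.g. (hypothetical output):
#   "1. Go to the drawer. 2. Open the drawer. 3. Take the rice chips out of the drawer.
#    4. Bring it to the user."
# Each decoded step is then executed by a low-level skill/policy, and the model can
# re-plan on a fresh observation, without any auxiliary grounding model in between.
```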