Naive question, perhaps: What roles are multimodal models likely to play in the future of AI?
The comments here and in the Economist article seem to focus only on large language models. In its initial announcement of GPT-4 in March, OpenAI described it as “a large multimodal model (accepting image and text inputs, emitting text outputs),” but it hasn’t yet released the image capability to the public.
What will happen when models are trained not only on text, and not only on text and images, but also on video, audio, chemical analyses of the air and other substances in our surroundings, tactile data from devices that explore the physical world, and so on?