That's entirely a limitation of current implementations, not something fundamental. There's no reason to believe a reasoning model could NOT be trained to stream multimodal input, run a burst of reasoning on each step, and interject when it judges it appropriate.
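Roughly the control loop I have in mind looks like the sketch below (all names are hypothetical, and a random score stands in for whatever the model's reasoning would actually produce): consume the stream step by step, do a short reasoning burst per step, and only speak when some internal "urge to interject" crosses a threshold.

```python
from dataclasses import dataclass
from typing import Iterable, Optional
import random

@dataclass
class Observation:
    """One step of a multimodal stream (fields here are purely illustrative)."""
    audio_text: str   # e.g. partial ASR transcript for this time slice
    vision_tag: str   # e.g. a coarse label from a vision encoder

@dataclass
class Thought:
    summary: str
    urge_to_speak: float  # 0..1, how strongly the model "wants" to interject

def reason_step(history: list[Thought], obs: Observation) -> Thought:
    """Stand-in for a short reasoning burst over the latest observation.

    A real model would condition on the full history and the raw inputs;
    here a random score just keeps the control loop runnable.
    """
    urge = random.random()
    return Thought(
        summary=f"saw {obs.vision_tag!r}, heard {obs.audio_text!r}",
        urge_to_speak=urge,
    )

def maybe_interject(thought: Thought, threshold: float = 0.8) -> Optional[str]:
    """Interject only when the reasoning burst crosses a confidence threshold."""
    if thought.urge_to_speak >= threshold:
        return f"[interjects] {thought.summary}"
    return None

def run_stream(stream: Iterable[Observation]) -> None:
    history: list[Thought] = []
    for obs in stream:
        thought = reason_step(history, obs)  # burst of reasoning per step
        history.append(thought)
        utterance = maybe_interject(thought)
        if utterance:
            print(utterance)                 # speak only when it chooses to

if __name__ == "__main__":
    demo = [
        Observation(audio_text="so anyway I told him", vision_tag="person gesturing"),
        Observation(audio_text="no, I do it", vision_tag="toddler grabbing spoon"),
    ]
    run_stream(demo)
```

Nothing about that loop requires the model to wait for a full utterance before deciding whether to respond; the turn-taking we see today is a product of how the systems are wired, not of the models themselves.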
I'm not sure training on language data will teach a model to experiment with the social system the way being a toddler does, but maybe. Where does the glance of assertive independence as the spoon turns get in there? Will the robot try to make its eyes gleam mischievously, as is so often written?