In a way, they already are. Look at the architecture of GANs, which use a secondary model (the discriminator) to judge the quality of the first model's output. The discriminator's judgment actually drives the generator's training, and it's typically discarded at inference time, but it's a good and simple example of how models can be composed to build something more advanced.
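To make the composition idea concrete, here's a toy sketch (stdlib only, not a real GAN, and every function name is made up): one "generator" proposes candidates and a separate "judge" model scores them. In an actual GAN the judge's score is backpropagated to train the generator; here we just use it to filter outputs.

```python
import random

def generator():
    # Hypothetical generator: proposes a random candidate in [0, 1].
    return random.uniform(0.0, 1.0)

def judge(candidate):
    # Hypothetical judge: scores how close the candidate is to 0.75.
    return 1.0 - abs(candidate - 0.75)

def best_of_n(n=100):
    # Compose the two models: keep the candidate the judge rates highest.
    candidates = [generator() for _ in range(n)]
    return max(candidates, key=judge)

print(best_of_n())
```

Even this degenerate best-of-n filtering already gets you better outputs than the generator alone, which is the core of the "models judging models" pattern.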
They used the larger, more expensive text-davinci-003 model to generate training data for fine-tuning the smaller, cheaper 7B LLaMA model.
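The data flow of that recipe can be sketched in a few lines. This is only an illustration: both "models" below are plain Python stand-ins (the teacher fakes an expensive API, the student just memorizes), and all names are invented; the point is that the teacher's outputs become the student's supervised training targets.

```python
def teacher(instruction):
    # Stand-in for the large, expensive model (e.g. text-davinci-003).
    return f"Answer to: {instruction}"

def build_dataset(instructions):
    # Use the teacher's outputs as supervised training targets.
    return [(inst, teacher(inst)) for inst in instructions]

class Student:
    """Stand-in for the smaller model being fine-tuned."""
    def __init__(self):
        self.memory = {}

    def fine_tune(self, dataset):
        # "Training" here is just memorization, for illustration.
        for inst, resp in dataset:
            self.memory[inst] = resp

    def respond(self, instruction):
        return self.memory.get(instruction, "unknown")

seed_instructions = ["Summarize GANs", "Explain fine-tuning"]
student = Student()
student.fine_tune(build_dataset(seed_instructions))
print(student.respond("Summarize GANs"))
```

In the real pipeline the teacher's generations go through filtering and the student does gradient-based fine-tuning, but the teacher-to-dataset-to-student shape is the same.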