Yeah, that's a fairly well studied one. Most of these techniques are rather "lossy" compared to actually extending the context window. The most likely "real solution" is going to be a combination of various tricks plus finetuning on longer contexts to extend the window itself.
Yes! The obvious answer is to just increase the number of positions and train for that. This requires a ton of memory however (attention scales with the square of the context length), so most models are currently trained at 4k/8k and then finetuned to longer contexts, similar to how many of the image models are handled.
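To make the quadratic blow-up concrete, here's a rough back-of-envelope sketch of naive training-time attention memory (the layer/head counts are made-up round numbers, not any specific model, and tricks like FlashAttention avoid materializing these matrices):

```python
# Illustration of why attention memory grows with the square of context length.
# Layer/head counts are illustrative assumptions, not a specific model.
def attention_score_memory_gb(ctx_len, n_layers=32, n_heads=32, bytes_per_elem=2):
    # Each head materializes a (ctx_len x ctx_len) score matrix when attention
    # is computed naively and activations are kept around for the backward pass.
    return n_layers * n_heads * ctx_len * ctx_len * bytes_per_elem / 1e9

for ctx in (4096, 8192, 32768):
    print(f"{ctx:6d} tokens -> ~{attention_score_memory_gb(ctx):,.0f} GB of score matrices")
```

Doubling the context quadruples that term, which is why most people stop at 4k/8k for the pretraining run.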
However, there's been some work on "getting extra mileage" out of current models, so to speak, with rotary position embedding (RoPE) interpolation and a few other tricks. These, in combination with finetuning, are the current method many are using at the moment IIRC.
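The interpolation trick boils down to squeezing longer sequences into the position range the model already saw during pretraining. A minimal sketch, assuming standard RoPE and made-up context lengths (2048 pretrained, 8192 target):

```python
import torch

# Minimal sketch of RoPE with position interpolation: positions are scaled
# down so an 8192-token sequence maps onto the 0..2048 range seen in training.
# Dimensions and lengths here are illustrative assumptions.
def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
    # scale < 1.0 is the interpolation step, e.g. 2048 / 8192 = 0.25
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float() * scale, inv_freq)  # (seq_len, dim/2)

angles = rope_angles(torch.arange(8192), scale=2048 / 8192)
cos, sin = angles.cos(), angles.sin()  # applied to q/k the usual RoPE way
```

A short finetune at the longer length is still needed on top of this, but it's far cheaper than pretraining at 8k+ from scratch.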
The bottleneck is quickly going to be inference. Since attention in current transformer models scales with the square of the context length, the memory requirements go up very quickly. IIRC a 4090 can _barely_ fit a 4-bit 30B model in memory with a 4096-token context.
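As a rough sanity check on that 4090 number (the shapes are my assumptions, roughly LLaMA-30B-like), the weights plus the KV cache alone already land near the 24 GB limit, before counting the attention scores themselves:

```python
# Back-of-envelope for a 4-bit ~30B model on a 24 GB card.
# Shapes are assumptions (roughly LLaMA-30B-like: 60 layers, 6656 hidden dim).
n_params  = 30e9
weight_gb = n_params * 0.5 / 1e9                          # 4-bit ≈ 0.5 bytes/param -> ~15 GB
n_layers, d_model, ctx = 60, 6656, 4096
kv_cache_gb = 2 * n_layers * d_model * ctx * 2 / 1e9      # K and V in fp16 -> ~6.5 GB
print(f"weights ~{weight_gb:.0f} GB + KV cache ~{kv_cache_gb:.1f} GB "
      f"= ~{weight_gb + kv_cache_gb:.0f} GB of 24 GB")
```

And the KV cache part grows linearly with context, so pushing to 16k/32k blows past consumer cards even before the quadratic attention cost bites.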
From my understanding, some form of RNN is likely to be the next step for longer context. See RWKV as an example of a decent RNN: https://arxiv.org/abs/2305.13048
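The appeal is that a recurrent model carries a fixed-size state instead of attending over the full history, so per-token memory doesn't grow with context. A toy sketch of that property (a generic RNN cell, not RWKV's actual update rule):

```python
import torch

# Toy illustration: the recurrent state is a fixed-size vector no matter how
# many tokens came before. This is a generic RNN cell, NOT RWKV's formulation.
d = 512
W_x, W_h = torch.randn(d, d) * 0.01, torch.randn(d, d) * 0.01
state = torch.zeros(d)

def step(token_embedding, state):
    # Per-step memory is O(d), independent of sequence length.
    return torch.tanh(W_x @ token_embedding + W_h @ state)

for tok in torch.randn(10_000, d):   # 10k "tokens", the state never grows
    state = step(tok, state)
```

The open question is how much of the long-range information such a fixed-size state can actually retain compared to full attention.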
Here are a bunch of other related methods:
Summarizing context - https://arxiv.org/abs/2305.14239
continuous finetuning - https://arxiv.org/pdf/2307.02839.pdf
retrieval augmented generation - https://arxiv.org/abs/2005.11401
knowledge graphs - https://arxiv.org/abs/2306.08302
augmenting the network with a side network - https://arxiv.org/abs/2306.07174
another long term memory technique - https://arxiv.org/abs/2307.02738