Are the inputs and the responses in a non-English language? LLM APIs can get costly for non-English text, sometimes as much as 10x, because tokenizers consume more tokens per word. Not sure what the solution is there.
Also, maybe you can use some kind of caching combined with an embedding search to serve a previous response when the input is similar above a certain threshold.
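A minimal sketch of that semantic-cache idea: embed each prompt, and before calling the LLM, check whether any cached prompt's embedding is within a cosine-similarity threshold. The `SemanticCache` class and `toy_embed` function here are hypothetical names; a real system would use a proper embedding model and an ANN index rather than the bag-of-words stand-in and linear scan shown.

```python
import math

class SemanticCache:
    """Cache LLM responses, reusing one when a new prompt's
    embedding is similar enough to a cached prompt's embedding."""

    def __init__(self, embed_fn, threshold=0.9):
        self.embed_fn = embed_fn    # maps text -> list[float]
        self.threshold = threshold  # cosine-similarity cutoff
        self.entries = []           # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def lookup(self, prompt):
        emb = self.embed_fn(prompt)
        best, best_sim = None, 0.0
        for cached_emb, response in self.entries:  # linear scan; fine for a sketch
            sim = self._cosine(emb, cached_emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def store(self, prompt, response):
        self.entries.append((self.embed_fn(prompt), response))

# Toy embedding: word counts over a tiny fixed vocabulary.
# A real deployment would call an embedding model here instead.
VOCAB = ["refund", "order", "cancel", "shipping", "status"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

cache = SemanticCache(toy_embed, threshold=0.8)
cache.store("how do I cancel my order", "You can cancel from the orders page.")

# Near-duplicate prompt: served from cache, no LLM call needed.
hit = cache.lookup("cancel my order how")
# Unrelated prompt: cache miss, falls through to the real LLM call.
miss = cache.lookup("what is the shipping status")
```

The threshold is the knob to tune: too low and you serve stale or wrong answers to merely similar questions, too high and you rarely get a cache hit.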