We've done some transcription exercises. The way to get the timestamps to line up is to break the audio into one-minute chunks and pass them in one after another, so the chat completions prompt looks like:

Here's minute 1 of audio [ffmpeg 1st minute cut out.wav]
Here's minute 2 of audio [ffmpeg 2nd minute cut out.wav]
Here's minute 3 of audio [ffmpeg 3rd minute cut out.wav]

and so on.

The cutting step is simple and the token count stays pretty much the same, but the per-minute labels are the crucial extra detail: they give the model a time anchor for each chunk, which makes the timestamps in the transcript much more accurate.
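For anyone who wants to try this, here is a rough Python sketch of the cutting and prompt-building steps. It is one way to do it, not our exact code. The function names, the output file pattern, and the 16 kHz mono WAV settings are all made up for illustration. The `input_audio` part shape is an assumption modeled on OpenAI-style chat completions audio input, so adapt it to whatever your provider expects.

```python
import base64
import subprocess
from pathlib import Path


def ffmpeg_split_command(src: str, out_pattern: str = "minute_%03d.wav") -> list[str]:
    """Build an ffmpeg command that cuts `src` into 60-second WAV files."""
    return [
        "ffmpeg", "-i", src,
        "-ac", "1", "-ar", "16000",   # assumption: mono 16 kHz is enough for speech
        "-c:a", "pcm_s16le",          # re-encode to PCM so cuts land on exact sample boundaries
        "-f", "segment", "-segment_time", "60",
        out_pattern,
    ]


def build_content(chunks: list[bytes]) -> list[dict]:
    """Interleave a 'Here's minute N of audio' label with each audio chunk."""
    parts: list[dict] = []
    for minute, wav_bytes in enumerate(chunks, start=1):
        parts.append({"type": "text", "text": f"Here's minute {minute} of audio"})
        parts.append({
            "type": "input_audio",  # assumed part shape; check your provider's docs
            "input_audio": {
                "data": base64.b64encode(wav_bytes).decode("ascii"),
                "format": "wav",
            },
        })
    return parts


def split_and_build(src: str, workdir: str = ".") -> list[dict]:
    """Run ffmpeg, then load the chunks in order and build the message content."""
    pattern = str(Path(workdir) / "minute_%03d.wav")
    subprocess.run(ffmpeg_split_command(src, pattern), check=True)
    files = sorted(Path(workdir).glob("minute_*.wav"))  # zero-padded names sort correctly
    return build_content([f.read_bytes() for f in files])
```

The list returned by `split_and_build("talk.mp3")` would go in as the `content` of a single user message, with the actual transcription instructions added as one more text part.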

We've also experimented with passing in a regular speech-to-text (non-LLM) transcript for reference, which again helps the LLM do better.
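A minimal sketch of adding that reference transcript, assuming the prompt is a list of text/audio content parts. The helper name, the wording of the hint, and putting it at the front of the prompt are all illustrative guesses rather than something we have tuned.

```python
def with_reference_transcript(content_parts: list[dict], reference_text: str) -> list[dict]:
    """Return a new content list with a non-LLM transcript prepended as a hint."""
    hint = {
        "type": "text",
        "text": (
            "Reference transcript from a conventional speech-to-text system. "
            "It may contain errors, so use it only as a hint:\n"
            + reference_text.strip()
        ),
    }
    # Build a new list so the caller's original parts are left untouched.
    return [hint, *content_parts]
```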