Accurate word timestamps seem to add overhead and require a post-processing step such as forced alignment (a speech technique that automatically aligns audio files with their transcripts).
I recently dug into forced alignment and discovered that most of the newer models don't operate on word boundaries, phonemes, etc., but rather chunk the audio with overlap and match words against context. Older HMM-style models have shorter strides (10 ms vs 20 ms).
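To make the stride point concrete, here is a rough sketch of how per-frame labels turn into word timestamps. The function name `frames_to_spans`, the `"_"` blank symbol, and the frame labels are all made up for illustration; a real aligner would produce the per-frame labels, and usually at token level rather than word level. The only point is that every boundary lands on a multiple of the stride, so a 20 ms stride cannot place a boundary more finely than 20 ms.

```python
from itertools import groupby


def frames_to_spans(frame_labels, stride_ms, blank="_"):
    """Collapse runs of identical per-frame labels into (label, start_ms, end_ms)."""
    spans = []
    frame_idx = 0
    for label, run in groupby(frame_labels):
        run_len = len(list(run))
        if label != blank:
            # Start and end are frame indices scaled by the stride,
            # so they are always multiples of stride_ms.
            spans.append((label, frame_idx * stride_ms, (frame_idx + run_len) * stride_ms))
        frame_idx += run_len
    return spans


# Hypothetical per-frame output of an aligner (not real model output).
labels = ["_", "hi", "hi", "_", "_", "there", "there", "there", "_"]

print(frames_to_spans(labels, stride_ms=20))
print(frames_to_spans(labels, stride_ms=10))
```

The same label sequence covers half as much audio at a 10 ms stride, which is another way of saying the shorter stride gives twice the time resolution per frame.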
I tried searching the Kaldi/Sherpa ecosystem, but most of what I found leads nowhere or points to very small, inaccurate models.
I'd appreciate any tips on the subject.