> But the 120b model has just as bad if not worse formatting issues, compared to the 20b one
What runtime/tools are you using? Haven't been my experience at all, but I've also mostly used it via llama.cpp and my own "coding agent". It was slightly tricky to get the Harmony parsing in place and working correct, but once that's in place, I haven't seen any formatting issues at all?
The 20B is definitely worse than 120B for me in every case and scenario, but it is a lot faster. Are you running the "native" MXFP4 weights or something else? That would have a drastic impact on the quality of responses you get.
Edit:
> Migth also be because of 120b not liking being in q8
Yeah, that's definitely the issue, I wouldn't use either without letting them be MXFP4.
What runtime/tools are you using? Haven't been my experience at all, but I've also mostly used it via llama.cpp and my own "coding agent". It was slightly tricky to get the Harmony parsing in place and working correct, but once that's in place, I haven't seen any formatting issues at all?
The 20B is definitely worse than 120B for me in every case and scenario, but it is a lot faster. Are you running the "native" MXFP4 weights or something else? That would have a drastic impact on the quality of responses you get.
Edit:
> Migth also be because of 120b not liking being in q8
Yeah, that's definitely the issue, I wouldn't use either without letting them be MXFP4.