Funny thing is the structured output in the last example.
```
{
"reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
"finding": "Possible nil‑pointer dereference",
"confidence": 0.81
}
```
You know the confidence value is completely bogus, don't you?
```
{
  "reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
  "finding": "Possible nil‑pointer dereference",
  "confidence": 0.81,
  "confidence_in_confidence_rating": 0.54,
  "confidence_in_confidence_rating_in_confidence_rating": 0.12,
  "confidence_in_confidence_rating_in_confidence_rating_in_confidence_rating": 0.98,
  // Etc...
}
```
When I was younger and more into music, I would often judge whether a drummer at a concert was "good" by whether they were better than me. I knew enough about drumming to tell how good someone was at the different parts of the skill, but I also knew enough to know that I wasn't even close to having what it took to be a professional drummer.
This is how I feel about this blog post. I've barely scratched the surface of the innards of LLMs, but even I know it should be completely obvious to anyone who has built a product around one that these confidence levels are completely made up.
I'd never heard of or used cubic before today, but that part of the blog post, along with its obvious LLM-generated quality, gives a terrible first impression.
elzbardico is pointing out that the author has the confidence value generated as part of the model's response itself, rather than it being an actual measure of confidence in the output.
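For what it's worth, if you wanted a number that isn't self-reported, token log-probabilities are one place to start. A minimal sketch, assuming the OpenAI chat completions API and its `logprobs` option; the mean-logprob proxy is crude, and none of this is from the blog post:
```
import math

from openai import OpenAI

client = OpenAI()

def confidence_from_logprobs(prompt: str) -> float:
    # One normal completion, but with per-token log-probabilities attached.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    token_logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
    if not token_logprobs:
        return 0.0
    # Geometric mean of token probabilities: a rough "how sure was the
    # sampler" signal, grounded in sampling statistics, not self-report.
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```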
This trick is used by many apps (including GitHub Copilot reviews). The way I see it, if the agent has an eager-to-please problem, then you give it a way out.
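Sketching that "way out" hypothetically (the Pydantic model is my assumption; the field names just mirror the JSON above):
```
from pydantic import BaseModel, Field

class Finding(BaseModel):
    reasoning: str
    finding: str
    # Self-reported and uncalibrated; the point is only to let an
    # eager-to-please model hedge instead of asserting every finding
    # at full strength.
    confidence: float = Field(ge=0.0, le=1.0)
```
Downstream you would then drop or down-rank findings below some threshold, rather than reading the number as a probability.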
I immediately noticed the same thing, but to be fair, we don't know whether it's enriched by a separate service that checks the response and uses some heuristics to compute that value. If not, then yeah, it's an entirely made-up and useless value.
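Purely hypothetical, but such a service could be as simple as cross-checking the model's claims against the file it reviewed; every name below is made up:
```
import re

def enrich_confidence(finding: dict, source_lines: list[str]) -> dict:
    """Recompute 'confidence' from checkable facts, ignoring the model's own number."""
    score = 0.0
    # Do the cited line numbers actually exist in the file?
    cited = [int(n) for n in re.findall(r"line (\d+)", finding["reasoning"])]
    if cited and all(1 <= n <= len(source_lines) for n in cited):
        score += 0.5
        # Do the backticked identifiers appear on the cited lines?
        idents = re.findall(r"`([^`]+)`", finding["reasoning"])
        if idents and all(
            any(ident in source_lines[n - 1] for n in cited) for ident in idents
        ):
            score += 0.5
    return {**finding, "confidence": score}
```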
You know everything is made up, right? And yet it just works. I too use a confidence score in a bug-finder app, GitHub seems to use them in Copilot reviews, and people will keep using them until it's shown they don't work anymore.
> Sadly, this also failed. The LLM's judgment of its own output was nearly random. This also made the bot extremely slow because there was now a whole new inference call in the workflow.
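The post doesn't show that step, but the pattern it describes (a second inference call grading the first model's output) would look roughly like this sketch; the model choice and prompt are my assumptions:
```
def judge_finding(client, finding: dict) -> float:
    # client: an openai.OpenAI instance. This is the extra hop the post
    # describes: ask a second model to grade the first model's finding,
    # which is also the call that made the bot slow.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "On a scale from 0 to 1, how likely is this code-review "
                "finding to be a real bug? Reply with only the number.\n"
                f"{finding}"
            ),
        }],
    )
    return float(resp.choices[0].message.content.strip())
```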