Funny thing is the structured output in the last example.
```
{
"reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
"finding": "Possible nil‑pointer dereference",
"confidence": 0.81
}
```
You know the confidence value is completely bogus, don't you?
```
{
  "reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
  "finding": "Possible nil‑pointer dereference",
  "confidence": 0.81,
  "confidence_in_confidence_rating": 0.54,
  "confidence_in_confidence_rating_in_confidence_rating": 0.12,
  "confidence_in_confidence_rating_in_confidence_rating_in_confidence_rating": 0.98,
  // Etc...
}
```
When I was younger and more into music, I would often judge whether a drummer at a concert was "good" by whether they were better than me. I knew enough about drumming to tell how good someone was at the different parts of the skill, but I also knew enough to know that I wasn't even close to having what it took to be a professional drummer.
This is how I feel about this blog post. I've barely scratched the surface of the innards of LLMs, but even I know it should be completely obvious to anyone who has built a product around one that these confidence levels are completely made up.
I'd never heard of or used cubic before today, but that part of the blog post, along with its obvious LLM-generated quality, gives a terrible first impression.
elzbardico is pointing out that the author has the confidence value generated as part of the model's response itself, rather than it being an actual measure of confidence in the output.
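For what it's worth, if you wanted a number that isn't self-reported, token log-probabilities are one place to start. A minimal sketch, assuming the OpenAI chat completions API and its `logprobs` option; the mean-logprob proxy is crude, and none of this is from the blog post:
```
import math

from openai import OpenAI

client = OpenAI()

def confidence_from_logprobs(prompt: str) -> float:
    # One normal completion, but with per-token log-probabilities attached.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    token_logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
    if not token_logprobs:
        return 0.0
    # Geometric mean of token probabilities: a rough "how sure was the
    # sampler" signal, grounded in sampling statistics, not self-report.
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```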
This trick is used by many apps (including GitHub Copilot reviews). The way I see it, if the agent has an eager-to-please problem, then you give it a way out.
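Sketching that "way out" hypothetically (the Pydantic model is my assumption; the field names just mirror the JSON above):
```
from pydantic import BaseModel, Field

class Finding(BaseModel):
    reasoning: str
    finding: str
    # Self-reported and uncalibrated; the point is only to let an
    # eager-to-please model hedge instead of asserting every finding
    # at full strength.
    confidence: float = Field(ge=0.0, le=1.0)
```
Downstream you would then drop or down-rank findings below some threshold, rather than reading the number as a probability.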
I immediately noticed the same thing, but to be fair, we don't know whether it's enriched by a separate service that checks the response and uses some heuristics to compute that value. If not, then yeah, it's an entirely made-up and useless value.
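Purely hypothetical, but such a service could be as simple as cross-checking the model's claims against the file it reviewed; every name below is made up:
```
import re

def enrich_confidence(finding: dict, source_lines: list[str]) -> dict:
    """Recompute 'confidence' from checkable facts, ignoring the model's own number."""
    score = 0.0
    # Do the cited line numbers actually exist in the file?
    cited = [int(n) for n in re.findall(r"line (\d+)", finding["reasoning"])]
    if cited and all(1 <= n <= len(source_lines) for n in cited):
        score += 0.5
        # Do the backticked identifiers appear on the cited lines?
        idents = re.findall(r"`([^`]+)`", finding["reasoning"])
        if idents and all(
            any(ident in source_lines[n - 1] for n in cited) for ident in idents
        ):
            score += 0.5
    return {**finding, "confidence": score}
```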
You know everything is made up, right? And yet it just works. I too use a confidence score in a bug-finder app, GitHub seems to use them in Copilot reviews, and people will keep using them until it's shown they don't work anymore.
> Sadly, this also failed. The LLM's judgment of its own output was nearly random. This also made the bot extremely slow because there was now a whole new inference call in the workflow.
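The post doesn't show that step, but the pattern it describes (a second inference call grading the first model's output) would look roughly like this sketch; the model choice and prompt are my assumptions:
```
def judge_finding(client, finding: dict) -> float:
    # client: an openai.OpenAI instance. This is the extra hop the post
    # describes: ask a second model to grade the first model's finding,
    # which is also the call that made the bot slow.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "On a scale from 0 to 1, how likely is this code-review "
                "finding to be a real bug? Reply with only the number.\n"
                f"{finding}"
            ),
        }],
    )
    return float(resp.choices[0].message.content.strip())
```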