
Funny thing is the structured output in the last example.

    {
      "reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
      "finding": "Possible nil‑pointer dereference",
      "confidence": 0.81
    }

You know the confidence value is completely bogus, don't you?



Easy fix, just have the LLM generate:

    {
      "reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
      "finding": "Possible nil‑pointer dereference",
      "confidence": 0.81,
      "confidence_in_confidence_rating": 0.54,
      "confidence_in_confidence_rating_in_confidence_rating": 0.12,
      "confidence_in_confidence_rating_in_confidence_rating_in_confidence_rating": 0.98,
      // Etc...
    }


Wasteful. `confidence`'s type should be Array<number>, wherein confidence[N] gives the Nth derivative confidence rating.
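
Something like this shape, sketched in TypeScript (the `Finding` interface is made up, just to illustrate the joke):

    // Hypothetical shape: confidence[0] is confidence in the finding,
    // confidence[1] is confidence in confidence[0], and so on.
    interface Finding {
      reasoning: string;
      finding: string;
      confidence: number[];
    }

    const example: Finding = {
      reasoning: "`cfg` can be nil on line 42; dereferenced without check on line 47",
      finding: "Possible nil-pointer dereference",
      confidence: [0.81, 0.54, 0.12, 0.98],
    };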


confidence all the way down


Confidence is all you need.


True in many situations in life.


When I was younger and more into music, I would often judge whether a drummer at a concert was "good" based on whether they were better than me. I knew enough about drumming to tell how good someone was at the different parts of that skill, but also enough to know I was nowhere close to having what it took to be a professional drummer.

This is how I feel about this blog post. I've barely scratched the surface of the innards of LLMs, but even I can see what should be completely obvious to anyone who has a product built around them: these confidence levels are completely made up.

I'd never heard of or used cubic before today, but that part of the blog post, along with its obvious LLM-generated quality, gives a terrible first impression.


I too once fell into the trap of having an LLM generate a confidence value in a response. This is a very genuine concern to raise.


Do you mean that there is no correlation between confidence and false positives or other errors?


elzbardico is pointing out that the author is having the confidence value generated as part of the response, rather than it being an actual measure of confidence in the output.
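
For illustration, the difference in a rough TypeScript sketch (`getTokenLogprobs` is a hypothetical helper standing in for whatever per-token log-probabilities your model API exposes; a self-reported `"confidence": 0.81` field, by contrast, is just more sampled text):

    // Hypothetical helper: returns per-token log-probabilities for the
    // generated answer. Not any particular vendor's API.
    declare function getTokenLogprobs(prompt: string): Promise<number[]>;

    // One crude "confidence of the output": average token probability.
    // Still not calibrated, but at least derived from the model's own
    // probabilities rather than from text the model was asked to emit.
    async function modelDerivedConfidence(prompt: string): Promise<number> {
      const logprobs = await getTokenLogprobs(prompt);
      const avg = logprobs.reduce((a, b) => a + b, 0) / logprobs.length;
      return Math.exp(avg); // back onto a 0..1 probability scale
    }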


Is there solid research on this?


This trick is used by many apps (including GitHub Copilot reviews). The way I see it, if the agent has an eager-to-please problem, this gives it a way out.


Thanks. I was talking about the confidence measure.


Could you have a higher-order reasoning LLM generate a better confidence rating? That's how eval frameworks generally work today.
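
A minimal sketch of that pattern (LLM-as-judge), assuming a hypothetical `callModel` helper rather than any particular framework's API:

    // Hypothetical helper standing in for your LLM client.
    declare function callModel(prompt: string): Promise<string>;

    // Second pass: a separate judge call rates the first model's finding.
    // The score now comes from a model looking at the finding, not from a
    // number the first model printed alongside it, though it is still an
    // opinion rather than a calibrated probability.
    async function judgeFinding(diff: string, finding: string): Promise<number> {
      const verdict = await callModel(
        `Here is a diff:\n${diff}\n\n` +
        `A review bot reported: "${finding}".\n` +
        `On a scale from 0 to 1, how likely is this to be a real bug? ` +
        `Reply with only the number.`
      );
      const score = Number.parseFloat(verdict);
      return Number.isFinite(score) ? Math.min(Math.max(score, 0), 1) : 0;
    }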


I immediately noticed the same thing, but to be fair, we don't know whether the value is enriched by a separate service that checks the response and uses some heuristics to compute it. If not, then yeah, it's an entirely made-up and useless value.


You know everything is made up, right? And yet it just works. I too use a confidence score in a bug-finder app, GitHub seems to use them in Copilot reviews, and people will keep using them until it's shown not to work anymore.

On the other hand, this post https://www.greptile.com/blog/make-llms-shut-up says it didn't work in their case:

> Sadly, this also failed. The LLMs judgment of its own output was nearly random. This also made the bot extremely slow because there was now a whole new inference call in the workflow.



