Thanks, interesting reference. However, their analysis doesn't tell us much about the quality of Grokipedia. Would be more interested in something like hallucination density, but I know of no way that could be measured.
That is a certainty. I was once asked to calculate how much time we would save through our company's code reuse program. I read all the material on estimating savings, but then proved it was all ridiculous.
I came across a study that attempted to estimate how long it took to build libraries that had already been built. In this case there were no unknowns; you had the entire code. Estimates were still off by orders of magnitude. If we can't estimate the work when the work is already done, how could we ever estimate the work when we know less?
Not sure how you get around the contamination problems. I use these every day, and they are extremely prone to making errors that are hard to perceive.
They are not reliable tools for any tasks that require accurate data.
I think it is creating a growing interest in authenticity among some, although it still feels like a minority opinion. Every content platform is being flooded with AI content. Social media floods it into all of my feeds.
I wish I could push a button and filter it all out. But that's the problem we have created: it is nearly impossible to do. If you want to consume truly authentic human content, it is nearly impossible to know what qualifies. Everyone I interact with now might just be a bot.
> AI is not inevitable fate. It is an invitation to wake up. The work is to keep dragging what is singular, poetic, and profoundly alive back into focus, despite all pressures to automate it away.
This is the struggle. The race to automate everything. Turn all of our social interactions into algorithmic digital bits. However, I don't think people are just going to wake up from calls to wake up, unfortunately.
We typically only wake up to anything once it is broken. Society has to break from the over-optimization of attention and engagement. Not sure how that is going to play out, but we certainly aren't slowing down yet.
For example, take a look at the short clip I have posted here. It is an example of just how far everyone is scaling bot and content farms. It is an absolute flood of noise into all of our knowledge repositories.
https://www.mindprison.cc/p/dead-internet-at-scale
John Dewey on a similar theme, about the desire to make everything frictionless and the role of friction. The fallacy that because "a thirsty man gets satisfaction in drinking water, bliss consists in being drowned."
> The fallacy in these versions of the same idea is perhaps the most pervasive of all fallacies in philosophy. So common is it that one questions whether it might not be called the philosophical fallacy. It consists in the supposition that whatever is found true under certain conditions may forthwith be asserted universally or without limits and conditions.
> Because a thirsty man gets satisfaction in drinking water, bliss consists in being drowned. Because the success of any particular struggle is measured by reaching a point of frictionless action, therefore there is such a thing as an all-inclusive end of effortless smooth activity endlessly maintained.
> It is forgotten that success is success of a specific effort, and satisfaction the fulfilment of a specific demand, so that success and satisfaction become meaningless when severed from the wants and struggles whose consummations they are, or when taken universally.
I remember a few years back, here on HN everyone was obsessed with diets and supplements and optimizing their nutrients.
I remember telling someone that eating is also a cultural and pleasurable activity, that it's not just about nutrients, and that it's not always meant to be optimized.
It wasn't well received.
Thankfully, these days that kind of post is much less common here. That particular fad seems to have lost its appeal.
Oh yeah, it’s both funny and understandable how we’ve swung from the mania of Huel-esque techbro nutrition to the current holistic-eating “beef tallow” and no-seed-oils movement. I think we realized guzzling slop alone is spiritually empty.
The Culture dives into this concept with the idea of hegemonizing swarms, and Bostrom touches on this with optimizing singletons.
Humans are amazing min/maxers; we create vast and, at least temporarily, productive monocultures. At the same time, a scarily large portion of humanity will burn and destroy something of beauty if it brings them one cent of profit.
Myself I believe technology and eventually AI were our fate once we became intelligence optimizers.
> Myself I believe technology and eventually AI were our fate once we became intelligence optimizers.
Yes, everyone talks about the Singularity, but I see the instrumental point of concern to be something prior, which I've called the Event Horizon. We are optimizing, but no longer with any understanding of the outcomes.
"The point where we are now blind as to where we are going. The outcomes become increasingly unpredictable, and it becomes less likely that we can find our way back as it becomes a technology trap. Our existence becomes dependent on the very technology that is broken, fragile, unpredictable, and no longer understandable. There is just as much uncertainty in attempting to retrace our steps as there is in going forward."
> but no longer with any understanding of the outcomes.
A concept in driving where your braking distance exceeds your view/headlight range at any given speed. We've stomped on the accelerator and the next corner is rather sharp.
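To put rough numbers on the analogy, here is a minimal sketch; the reaction time, deceleration, and headlight range below are assumed values for illustration, not measurements:

```python
# Rough sketch of "overdriving your headlights": the speed at which total
# stopping distance exceeds how far ahead you can actually see.
# All numbers are illustrative assumptions.

REACTION_TIME_S = 1.5          # assumed driver reaction time
DECELERATION_MS2 = 7.0         # assumed braking deceleration on dry asphalt
HEADLIGHT_RANGE_M = 60.0       # assumed low-beam visibility

def stopping_distance_m(speed_kmh: float) -> float:
    v = speed_kmh / 3.6                        # convert km/h to m/s
    reaction = v * REACTION_TIME_S             # distance covered before braking starts
    braking = v ** 2 / (2 * DECELERATION_MS2)  # distance covered while braking
    return reaction + braking

for kmh in (60, 80, 100, 120):
    d = stopping_distance_m(kmh)
    status = "outdrives headlights" if d > HEADLIGHT_RANGE_M else "within view"
    print(f"{kmh} km/h: stop in {d:.0f} m ({status})")
```

With these assumed numbers, anything much above 60 km/h already stops beyond what the low beams illuminate, which is the whole point of the metaphor.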
Isaac Asimov did a fictional version of this in the Foundation trilogy.
That's a very idealistic view, to believe there was ever a point where some people had a genuinely clearer, more precise, and accurate picture of what was going to come.
> However, I don't think people are just going to wake up from calls to wake up, unfortunately.
> We typically only wake up to anything once it is broken. Society has to break from the over-optimization of attention and engagement.
I don't think anyone will be waking up as long as their pronouns are 'we' and 'us' (or 'people', 'society'). Waking up or individuation is a personal, singular endeavour - it isn't a collective activity. If one hasn't even grasped who one is, if one is making a category error and identifies as 'we' rather than 'I', all answers will fail.
I think it is actually worse than that. The hype labs are still defiantly trying to convince us that somehow merely scaling statistics will lead to the emergence of true intelligence. They haven't reached the point of being "surprised" as of yet.
> or how we would measure meaningful progress in this direction.
"First, we should measure is the ratio of capability against the quantity of data and training effort. Capability rising while data and training effort are falling would be the interesting signal that we are making progress without simply brute-forcing the result.
The second signal for intelligence would be no modal collapse in a closed system. It is known that LLMs will suffer from model collapse in a closed system where they train on their own data."
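As a toy illustration of that second signal (an assumed setup for illustration only, not a claim about real LLM training dynamics), a closed loop can be simulated by repeatedly refitting a simple distribution on its own samples:

```python
# Toy sketch of model collapse in a closed system: a Gaussian "model" is
# repeatedly refit only on samples it generated itself. Purely illustrative;
# no claim that this maps directly onto LLM training.
import random
import statistics

random.seed(42)

mean, spread = 0.0, 1.0     # generation-0 model: N(0, 1)
SAMPLES_PER_GEN = 20        # small sample size makes the effect visible quickly

for generation in range(1, 31):
    # 1. Generate synthetic data from the current model.
    data = [random.gauss(mean, spread) for _ in range(SAMPLES_PER_GEN)]
    # 2. Refit the model on that synthetic data alone (naive fit).
    mean = statistics.fmean(data)
    spread = statistics.pstdev(data)
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mean={mean:+.3f}  spread={spread:.3f}")

# The refit spread tends to shrink across generations (the naive fit is
# biased low and the errors compound), so the closed loop gradually loses
# the diversity of the original distribution.
```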
I agree that those both are very helpful metrics, but they are not a definition of intelligence.
Yes, humans can learn to comprehend and speak language with orders of magnitude fewer examples than LLMs; however, we also have very specific hardware for that, evolved over millions of years. It's plausible that language acquisition in humans is more akin to fine-tuning an LLM than to training one from the ground up. Either way, this metric compares apples to oranges when comparing real and artificial intelligence.
Model collapse is a problem in AI that needs to be solved, and maybe solving it is even a necessary condition for true intelligence, though certainly not a sufficient one, and hence not an equivalent definition of intelligence either.
The bar you asked for was "meaningful progress". And as you state, "both are very helpful metrics", it seems the bar is met to the degree it can be.
I don't think we will see a definitive test as we can't even precisely define it. Other than heuristic signals such as stated above, the only thing left is just observing performance in the real world. But I think the current progress as measured by "benchmarks" is terribly flawed.
I think this is one of the distinguishing attributes of human failures. Human failures have some degree of predictability. We know when we aren't good at something, and we then devise processes to close that gap: consultations, training, process reviews, use of tools, etc.
The failures we see in LLMs are distinctly of a different nature. They often appear far more nonsensical and have more of a degree of randomness.
LLMs as a tool would be far more useful if they could indicate what they are good at, but since they cannot self-reflect on their knowledge, that is not possible. So they are equally confident in everything, regardless of its correctness.
I think the last few years are a good example of how this isn't really true. Covid came around and everyone became an epidemiologist and public health expert. The people in charge of the US government right now are also a perfect example. RFK Jr. is going to get to the bottom of autism. Trump is ruining the world economy seemingly by himself. Hegseth is in charge of the most powerful military in the world. Humans pretending they know what they're doing is a giant problem.
They are different contexts of errors. Take any of the humans in your example and give them an objective task, such as taking a piece of literal text and reliably interpreting its meaning, and they can do so.
LLMs cannot do this. There are many types of human failures, but we somewhat know the parameters and context of those failures. Political/emotional/fear domains etc have their own issues, but we are aware of them.
However, LLMs cannot perform purely objective tasks like simple math reliably.
> Take any of the humans in your example and give them an objective task, such as taking a piece of literal text and reliably interpreting its meaning, and they can do so.
I’m not confident that this is so. Adult literacy surveys (see e.g. https://nces.ed.gov/use-work/resource-library/report/statist...) consistently show that most people can’t reliably interpret the meaning of complex or unfamiliar text. It wouldn’t surprise me at all if RFK Jr. is antivax because he misunderstands all the information he sees about the benefits of vaccines.
Depends on the context. I've seen a lot of value from deploying LLMs in things like first-line customer support, where a suggestion that works 60% of the time is plenty valuable, especially if the bot can crank it out in 10 seconds when a human would take 5-10 minutes to get on the phone.
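Back-of-the-envelope, using the figures above plus an assumed escalation path when the bot misses (illustrative numbers only, not data from a real deployment):

```python
# Rough expected-wait arithmetic for first-line support.
BOT_HIT_RATE = 0.60        # fraction of issues the bot's suggestion resolves
BOT_TIME_MIN = 10 / 60     # ~10 seconds for the bot to answer
HUMAN_WAIT_MIN = 7.5       # assumed 5-10 minutes to reach a human (midpoint)

# Human-only baseline: every customer waits for a person.
human_only = HUMAN_WAIT_MIN

# Bot-first: instant suggestion, escalation to a human when it misses.
bot_first = BOT_TIME_MIN + (1 - BOT_HIT_RATE) * HUMAN_WAIT_MIN

print(f"human-only expected wait: {human_only:.1f} min")
print(f"bot-first expected wait:  {bot_first:.1f} min")
```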
I'm not sure what you're referring to, since profitability wasn't a metric I used. I agree not all profitable things should exist, but increasing the availability of customer support seems to me like a clearly good thing.
Perhaps you're thinking that profit-chasing is the only reason companies don't offer good customer support today? That's not accurate. Providing enough smart, well-resourced human beings to answer every question your customers can come up with is a huge operational challenge, unless your product is absolutely dead simple or you're small enough to make random employees help in their spare time.
> I've seen a lot of value from deploying LLMs in things like first-line customer support, where a suggestion that works 60% of the time is plenty valuable
Valuable to whom? How so?
At that rate I would end my business with such a company.
If you claim such a terrible level of customer support is valuable, I question your judgement of value.
Valuable to customers, because it allows them to get instant advice that will often solve their problem. I strongly suspect that some companies you do business with have already integrated LLMs into their customer support workflow - it's very common these days.
> most people can’t reliably interpret the meaning of complex or unfamiliar text
But LLMs fail the most basic tests of understanding that don't require complexity. They have read everything that exists. What would even be considered unfamiliar in that context?
> RFK Jr. is antivax because he misunderstands all the information he sees about the benefits of vaccines.
These are areas where information can be contradictory. Even this statement is questionable in its most literal interpretation. Has he made such a statement? Is that a correct interpretation of his position?
The errors we are criticizing in LLMs are not areas of conflicting information or difficult to discern truths. We are told LLMs are operating at PhD level. Yet, when asked to perform simpler everyday tasks, they often fail in ways no human normally would.
> But LLMs fail the most basic tests of understanding that don't require complexity.
Which basic tests of understanding do state-of-the-art LLMs fail? Perhaps there's something I don't know here, but in my experience they seem to have basic understanding, and I routinely see people claim LLMs can't do things they can in fact do.
It is an example that shows the difference between understanding and patterns. No model actually understands the most fundamental concept of length.
LLMs can seem to do almost anything for which there are sufficient patterns to train on. However, there aren't infinite patterns available to train on. So, edge cases are everywhere. Such as this one.
I don't see how this shows that models don't understand the concept of length. As you say, it's a vision test, and the author describes how he had to adversarially construct it to "move slightly outside the training patterns" before LLMs failed. Doesn't it just show that LLMs are more susceptible to optical illusions than humans? (Not terribly surprising that a language model would have subpar vision.)
But it is not an illusion, and the answers make no sense. In some cases the models pick exactly the opposite answer. No human would do this.
Yes, outside the training patterns is the point. I have no doubt if you trained LLMs on this type of pattern with millions of examples it could get the answers reliably.
The whole point is that humans do not need that kind of training data. They understand such concepts from one example.