[flagged] Levels of AGI: Operationalizing Progress on the Path to AGI (arxiv.org)
89 points by olalonde on Nov 13, 2023 | 74 comments


Based on their definitions (https://imgur.com/a/Ta848Lu)

- We are already at Level 1+ with GPT-4, but these systems are basically assistants and not truly AGI.

- Level 2 "Competent level" is basically AGI (capable of actually replacing many humans in real world tasks). These systems are more generalized, capable of understanding and solving problems in various domains, similar to an average human's ability. The jump from Level 1 to Level 2 is significant as it involves a transition from basic and limited capabilities to a more comprehensive and human-like proficiency.

However, the exact definition is tautological - capabilities better than 50% of skilled adults.

So IMO the paper basically states, in a lot of words, that we are not at AGI, and restates the common understanding of AGI as "Level 2 Competent", but doesn't otherwise really add to our understanding of AGI.


AGI now means many different things to many different people. I don't think there's really any "common definition" anymore.

For some, it's simply Artificial and Generally Intelligent (performs many tasks, adapts). For some, it might mean that it needs to do everything a normal human can. For some, non-biological life axiomatically cannot become AGI. For some, it must be "conscious" and "sentient".

For some, it might require literal omniscience and omnipotence and accepting anything as AGI means, to them, that they are being told to worship it as a God. For some, it might mean something more like an AI that is more competent than the most competent human at literally every task.

For some, acknowledging it means that we must acknowledge it has person-like rights. For some it cannot be AGI if it lies. For some it cannot be AGI if it makes any mistake. For some it cannot be AGI until it has more power than humans. These are several definitions and implications that are partially or wholly mutually conflicting but I have seen different people say that AGI is each different one of those.


I've got a much simpler definition: an AGI should be able to produce a better version of itself.

I'm not saying this would necessarily lead to the technological singularity: maybe it's somehow a dead end. Maybe the "better version of itself, which itself shall build a better version of itself" will be stuck at some point, hitting some limit that'd still make it less intelligent than the most intelligent humans. That I don't know.

But what I know is that an AI that is incapable of producing a better version of itself is less intelligent than the humans who created it in the first place.


I actually really like this definition and will be giving it some thought. But right off the bat, that’s not how most people will see it - and so while this definition is certainly thought-provoking and useful, it doesn’t specify much that’s relatable to other tasks and therefore I think will always be a niche definition.

An AI that can make a better version of itself may not be able to communicate in any human language for example; and that is now a de facto requirement for most people to see something as AI I think.


> I'm not saying this would necessarily lead to the technological singularity:

You kinda are though: if it hits a limit and can no longer make a better version of itself, then your definition means the final one in the sequence isn't an AGI even though its worse parent is.

> But what I know is that an AI that is incapable of producing a better version of itself is less intelligent than the humans who created it in the first place.

Neither necessary nor sufficient:

(1) they are made by teams of expert humans, so an AGI could be smarter than any one of them and still not as capable as the group (kinda like how humans are smarter than evolution, but not smart enough to make a superhuman intelligence even though evolution made us by having lots of entities and time)

(2) one that can do this can still merely be a special-purpose AI that's no good at anything else (like an optimising compiler told to compile its own source code)

(3) what if it can only make its own equal, being already at some upper limit?


> an AGI should be able to produce a better version of itself.

So humans do not have GI?


They do. 99% of what you think of as human intelligence is social, and has been obtained by previous generations and passed to the person. In a sense, we are hugely overfitted on distilled knowledge; our actual biological capabilities are much less impressive.


As a somewhat narrow counter-example, genetic algorithms are able to produce better versions of themselves but do not qualify as AGI.


Are there any examples of genetic algorithms producing genetic algorithms outside of nature?

The classic synthetic way is a genetic algorithm producing increasingly better outputs.
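
To make the distinction concrete, here is a minimal sketch of that "classic synthetic way" (the toy fitness function, population size, and mutation rate are all made-up assumptions). Note that the algorithm's own code never changes; only the candidate solutions it outputs improve from generation to generation:

    import random

    # Toy GA: evolves bit strings toward all-ones. The algorithm itself never
    # improves; only the candidate solutions it outputs get better each generation.
    GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 32, 50, 100, 0.02

    def fitness(genome):
        return sum(genome)  # count of 1s; higher is better

    def mutate(genome):
        return [1 - g if random.random() < MUTATION_RATE else g for g in genome]

    def crossover(a, b):
        cut = random.randrange(1, GENOME_LEN)
        return a[:cut] + b[cut:]

    population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population.sort(key=fitness, reverse=True)
        parents = population[:POP_SIZE // 2]  # keep the fitter half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children

    print("best fitness:", max(map(fitness, population)))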


Okay but in reality an AGI is just an agent that can learn new things and reapply existing knowledge to new problems. Generally intelligent, it doesn’t have to mean anything more nor does it imply godlike intelligence.

Everything you just mentioned seems to be some philosophy of sentience or something. A few years ago when ANNs became popular for everything, general intelligence just meant “can do things it wasn’t explicitly trained on”.


This definition is itself tautological and also quite flawed. For example, at what point in this machine’s development has it attained AGI? What if it learns to, or is taught to, stop learning? What if the machine is not capable of, e.g., math? What kind of knowledge is legitimate vs. illegitimate? In many ways the concept of AGI masks a fundamental social expectation that the machine obey standards and only adopt the “correct” knowledge. This is why, e.g., instruction tuning and RLHF were such a leap for the perception of intelligence: the machines obeyed a social contract with users that was designed into them.


This sounds a lot like "If we throw out everyone else's definition, then my definition is the obviously correct one".

Can you give any reason why your definition is correct, and/or why all those others should be dismissed?


This is one of the problems we've had with intelligence for a very long time. We've not been able to break it down well into distinctive pieces for classification. You either have all the pieces of human intelligence, or you're not intelligent at all.


> capabilities better than 50% of skilled adults.

That sets the bar unreasonably high in my opinion. Almost all of humanity does not have skills 'better than 50% of skilled adults' by definition, and those people definitely qualify as generally intelligent.


It's also rather vague, and at least in my first-pass skim I'm not seeing them define what it means to be skilled or unskilled. So I'm not sure the metric is even meaningful without this, because it's not like you're unskilled at driving a car one day and "skilled" the next. Does that mean anyone with a driver's license? A professional driver? Are we talking taxi driver, NASCAR, rally racing, F1? What? Skills fall on a continuous distribution, and our definitions of skilled vs. unskilled are rather poorly drawn and typically revolve around employment rather than capabilities.

I hope I just missed it, because the only clarification I saw was this example:

> e.g., “Competent” or higher performance on a task such as English writing ability would only be measured against the set of adults who are literate and fluent in English


That's the whole problem with all these definitions: they are rooted in very imprecise terms whose meaning seems to depend on the beholder of the prose.


I don't see the issue. They listed different levels and that's one of them. Level 1 compares to unskilled humans.


Pretty much every adult has some kind of skills, so this means half of (adult) humanity have skills better than 50% of skilled adults.


> Based on their definitions (https://imgur.com/a/Ta848Lu)

Wow, what a massive jump between Level 0 and Level 2. They state that their goal is to help facilitate conversation within the community but I feel like such a gap very obviously does not help. The arguments are specifically within these regions and we're less concerned about arguing if a 50th percentile general AI is AGI vs a 99th percentile. It's only the hype people (e.g. X-risk and Elon (who has no realistic qualifications here)) who are discussing those levels.

I know whatever you define things as people will be upset and argue, but the point of a work like this is to just make something we can all point to and be a place to refine from, even if messy. But with this large of a gap it makes such refinement difficult and I do not suspect we'll be using these terms as we move forward. After all, with most technology the approach is typically slow before accelerating (since there's a momentum factor). A lot of the disagreement in the community is if we're in the bottom part of the S curve or at the beginning of the exponential part. I have opinions but no one really knows, so that's where we need flags placed to reduce stupid fights (fights that can be significantly reduced by recognizing we're not using the same definitions but assuming the other person is).


> Level 2 "Competent level" is basically AGI (capable of actually replacing many humans in real world tasks).

I still find that a weird definition for two reasons:

1. All the narrow systems already have replaced humans in many real-world tasks.

2. I'm dubious there is much difference between a level 2 and level 4 general AI. And outperforming a % of humans seems like an odd metric, given that humans have diverse skill sets. A more sane metric would be Nth percentile for M% of tasks human workers perform in 2023.
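
For what it's worth, here is a rough sketch of how that proposed metric could be operationalized (the task names, percentile numbers, and thresholds below are hypothetical placeholders, not anything from the paper):

    # Does the system reach at least the Nth percentile of human performance
    # on at least M% of the tasks in the task set? (All numbers are made up.)
    def meets_bar(task_percentiles, n=50, m=80):
        hits = sum(1 for p in task_percentiles.values() if p >= n)
        return hits >= (m / 100) * len(task_percentiles)

    print(meets_bar({"writing": 70, "coding": 55, "driving": 10, "bookkeeping": 60}))
    # -> False: at or above the 50th percentile on 3 of 4 tasks,
    #    which falls short of the 80% task-coverage threshold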


> The jump from Level 1 to Level 2 is significant

That's a speculative claim because we don't really know what's involved. We could be one simple generalization trick away, which wouldn't be very significant in terms of effort, just effect.


> However, the exact definition is tautological - capabilities better than 50% of skilled adults.

It's clear that they mean human capabilities unaugmented by AI.


A mediocre human armed with a modern smartphone with connectivity, a hundred popular apps, YouTube's wealth of knowledge, and a search engine is clearly in the 99th percentile in most areas of life compared with, for example, people armed with tech from the 1940s. The search engine itself is AI in this sense.

When an algorithm becomes an everyday occurrence, it stops being AI and becomes an appliance.


But that still requires the mediocre human. The real problems start to surface when it is 'human optional'.


In chess, it only took a few years to go from human optional to human-in-the-loop being non-optimal, to the point that it would cause them to lose most of the time.

I expect that this will become the case in a rapidly increasing number of situations, including many workplaces and even in the military.


A decent workmanlike overview. If I had to criticize:

Reducing things to quasi-orthogonal categories (here quality and scope) is a human habit that's good for learning and characterizing, but perhaps not for accuracy. I would instead have hoped for some algorithmic categories (math, semantics, reasoning (deductive, inductive, probabilistic/weighted...), evaluation (summarization, synthesis), interactivity, etc.). That seems closer to the creature's natural joints (to use Plato's phrase).

"Emergence" is mentioned as prior art, but is otherwise largely ignored. To me, that's a feature that distinguishes intelligence from calculation/algorithm. (That it's hard to operationalize makes it no less key.)

Usability is only mentioned as part of UI for autonomy for hybrid human-AI systems. But the overall question is about measuring AGI for purposes of assessing utility and deployment risk. To me, usability is the metric that determines how valuable AGI would be, in broadly-available applications. Its relative unimportance suggests that people are mainly concerned about AI's use in highly-targeted (if also not highly-leveraged) applications, like market or election influence.

Finally, the notion of "progress" itself is a somewhat charged but relatively unnecessary concept. I believe the goal is to adopt the lens that will highlight effectiveness in the high-value and high-risk applications (something like heightened scrutiny in legal cases). In that case, we'd enumerate the value and fault models before deciding on principles and constraining the ontology.


There are lots of problems for which algorithms (with no data/learning required) are trivially superhuman, in the sense that they perform tasks no human could hope to do. AlphaFold is not 'only' an algorithm, as it also needs data to perform well. But this doesn't seem like a satisfactory reason to call AlphaFold "ASI" and not apply this term to your run-of-the-mill mixed-integer-program-solver for some difficult scheduling problem.


Not that long ago, protein folding prediction was thought to be a good example of a task that humans are good at and computers bad at.

https://kotaku.com/humans-triumph-over-machines-in-protein-f...


These are cases of AI being super intelligent in narrow fields. For now, they're not AGI, but I do expect these abilities to be possible to integrate into generic, multimodal AGI's within our lifetime, and maybe this decade.


I believe AlphaZero and MuZero would fit that definition.

Other than a system allowing them to play, they become superhuman purely through self-play.

A benchmark for any AGI system is how fast it can learn from sparse unlabeled data and generalize.


The Table - https://imgur.com/a/Ta848Lu

Instead of one acronym that now means a million different things to a million different people, there are "levels" that correspond to what percentage of skilled human workers it can replace.


I need to read the full paper now, but going from just the table you posted, I see a problem with their "Narrow" classification, in that it doesn't correct for task characteristics and obscurity - which means the "Narrow" rating isn't useful without also giving the specific task to which it applies, and overall it feels off.

I mean, they put GPT-4 ("SOTA LLMs") for a subset of tasks at Level 2N, a spell checker at Level 3N, and chess and Go solvers at Levels 4N and 5N. It feels to me that any useful grading scheme would give the opposite ordering. Sure, the current classification meets their definitions - "outperforms X% humans" - but at least with chess and Go, the rating is dominated entirely by the fact that very few people play these games at non-joke level, and the activity itself is super specific, and super narrowly scoped, mathematically speaking. Feels almost like calling CPUs Level 5N at "adding numbers together", since they clearly outperform 100% of humans at that. Or, much less jokingly, one could convincingly argue a PID controller is Level 5 Narrow, because not only it's going to be outperforming 100% of humans at its task, it's also learning and adapting (within its operational envelope).
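
For reference, here is roughly all there is to such a "Level 5 Narrow" PID controller (a textbook discrete-time PID loop; the gains and the toy plant below are my own placeholders):

    # Minimal discrete-time PID controller: a few lines of arithmetic, yet within
    # its operating envelope it will out-regulate any human at this task.
    def make_pid(kp, ki, kd, dt):
        integral, prev_error = 0.0, 0.0
        def step(setpoint, measurement):
            nonlocal integral, prev_error
            error = setpoint - measurement
            integral += error * dt                  # accumulates past error
            derivative = (error - prev_error) / dt  # reacts to the error's trend
            prev_error = error
            return kp * error + ki * integral + kd * derivative
        return step

    # Made-up first-order plant driven by the controller's output.
    pid = make_pid(kp=2.0, ki=0.5, kd=0.1, dt=0.1)
    temperature = 20.0
    for _ in range(200):
        control = pid(setpoint=50.0, measurement=temperature)
        temperature += 0.1 * (control - 0.2 * (temperature - 20.0))
    print(round(temperature, 1))  # should settle near the 50.0 setpoint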


Computer chess and Go are so far ahead of human ability now that it doesn't matter how many humans decide to take up the game seriously. I see your broader point but it doesn't really matter here.

Yes, what is designated a task is a fair bit arbitrary, but game-playing AIs have been popularly designated narrow (well, basically anything besides LLMs is narrow) for a long time now.


Sure. My point is that this classification is useful only within its category. Say I have a specific problem, like "play chess" or "play StarCraft" or "do maths" or "spellcheck essays". It would be great to have a task-specific list of algorithms ranked by their "Narrow" levels. It would help with use cases like: "oh, model X here spellchecks at level 5N, but is super complex, meanwhile algorithm Y is only Level 3N, but it fits on a napkin; Y is more than sufficient for my MVP". But this rating doesn't let us compare between tasks. That's fine by itself, but then it doesn't make sense to use the same rating for both specialized and general algorithms.


I don't even understand what the "AI" means in "narrow" AI.

Why is a compiler at Level 0 rather than Level 5?


Compilers aren't thought of as being intelligent. That's why they're at Level 0.


What's intelligence though? GOFAI (level 1) is just logic and search, which are properties of modern optimizing compilers as well.

Even Deep Blue (level 4) is just logic and search.


Yes, what's defined as intelligent is a fair bit arbitrary.


It's a start.

I think "general" needs to be more than a boolean though: The standard "wide range of non-physical tasks, including metacognitive abilities like learning new skills" is met by ChatGPT (as in the chart), which is much better in some languages than others, and at some tasks than others.

But how wide does it have to be to count as "wide"? A brain upload of any human ought to count, but most of us have much less breadth than any LLM.


I think they make a category error by putting ChatGPT etc. in the General column. As far as I can tell we only have narrow definitions of 'intelligence', and ChatGPT falls into one of those. I don't know of a general agreement on what 'General Intelligence' is in people, so how can we categorise anything as AGI? Knowing a bit about how ChatGPT works, I feel it is a lot more like a chess program than a human.


ChatGPT is the most general system ever created and packaged. You can throw arbitrary problems at it and get half-decent solutions for most of them. It can summarize and expand text, translate both explicitly and internally[0], play games, plan, code, transcode, draw, cook, rhyme, solve riddles and challenges, do basic reasoning, and many, many other things. Whether one is leaning more towards "stochastic parrot" or more towards "sparks of AGI" - it's undeniable that it's a general system.

--

[0] - The whole "fine-tune LLM on a super-specific task but only in language X (which is not English), its performance for that task improves in languages other than X" part, indicating it's not just learning tokens, but the meanings behind them too.


>it's undeniable that it's a general system.

it's very deniable, as Yann Lecun correctly pointed out, it can't even walk up a set of stairs (https://twitter.com/ylecun/status/1721648856260050970).

It can't cook, it can talk about cooking. It wouldn't be able to get a pan out of a drawer. I know all we do these days is produce text tokens on the internet, but it is in fact in itself a domain specific task. If you can talk about opening a can of beans you're an LLM. If you can do that and actually open the physical can we may be a little bit further towards general intelligence.

We don't even have a full self-driving system, the limited systems we have are not LLMs, and there isn't even a system on the horizon that can drive and talk to you about the news and cook you dinner.


There are millions of people who cannot do any of those things, never could and will never be able to.

At any rate, language models are in fact able to do those things.

https://tidybot.cs.princeton.edu/

https://deepmind.google/discover/blog/rt-2-new-model-transla...

https://wayve.ai/thinking/lingo-natural-language-autonomous-...


If that was a valid criticism of its intelligence, Stephen Hawking would have spent most of life categorised as a vegetable.

Also:

> We don't even have a full self driving system,

debatable given the accident rate of the systems we do have

> the limited systems we have are not LLMs,

they tautologically are LLMs

> and there isn't even a system on the Horizon that can drive and talk to you about the news and cook you a dinner

There's at least four cooking robots in use, and that's just narrow AI and used to show off. Here's one from 14 years back: https://youtu.be/nv7VUqPE8AE

And 5 years back: https://youtu.be/CAJJbMs0tos

And two years back: https://youtu.be/fNpBDwYLi-Q

And (uploaded) this year: https://youtu.be/r5GHWRhpzlw

There's also a few research models for general household robotics from Google and whichever of Musk's companies is doing that robot of his.

And here's another general learning-from-demonstrations system from Cambridge university from this year: https://youtu.be/EiIAN03MsRM


Stephen Hawking lost the capacity to move because his ALS paralyzed him, not because his brain lacked the capacity to do so, come on this has to be the worst analogy of the year. Also no, driving systems are not LLMs. LLMs are large language models, no existing self driving system runs on a language model. And also, that's not what the word "tautology" means. "All bachelors are unmarried" is a tautology.


Ah, you wrote unclearly, it sounded like you were asserting that no system was an LLM rather than no driving system.

So while your claim is still false, I will accept that it isn't tautologically so.

Likewise, I am demonstrating that the actual definition you're using here is poor due to the consequence of it ruling out Stephen Hawking, and that goal means that the reason why he couldn't do things is unimportant: you still ruled him out with your standard.

Transformer models are surprisingly capable in multiple domains, so although ChatGPT hasn't got relatively many examples of labeled motor control input/output sequences and corresponding feedback values, this was my first search result for "llm robot control": https://github.com/GT-RIPL/Awesome-LLM-Robotics (note several are mentioning specifically ChatGPT).



That's not a category error. GPT is general. It is able to perform many tasks (creative writing, playing chess, poker and other games, language translation, code, robot piloting), etc.


But not make toast. That's a general task that very nearly any human (intelligent or otherwise) can do.

If we define "general" as tasks with text input and output we are severely restricting the domain.


There are millions of humans who can't make toast and will never be able to.

At any rate, no it's not really limiting.

https://tidybot.cs.princeton.edu/

https://deepmind.google/discover/blog/rt-2-new-model-transla...

https://wayve.ai/thinking/lingo-natural-language-autonomous-...


Arguments like yours are why I regard "generality" (certainly in the context of AGI) as a continuum rather than a boolean. AlphaZero is more general than AlphaGo Zero, as the former can do three games and the latter only one. All LLMs are much more general than those game playing models, even if they aren't so wildly superhuman on any specific skill, and `gpt-4-vision-preview` is more general than any 3.5 model as 4 can take image inputs while the 3.5's can't.


Yes. If you read "Computing Machinery and Intelligence" this idea of generality being a continuum is a point that Turing makes actually (albeit in different words). What constitutes generality of an AI is really going to be very sensitive to your metric and the assessment is going to vary a lot from observer to observer.


Computers have been far better than the best human at chess for well over ten years. It’s weird to see “Stockfish (2023)” cited here.


> to what percentage of skilled human workers it can replace

What levels of insanity make people believe such statements?


Here’s an article by the Financial Times demonstrating its real-world impact today: https://archive.is/hlcL2

More capable systems will have a larger impact still.


Paid marketing content isn’t really relevant. Jobs are being lost left and right due to the overall economy. The only valid takeaway is that BCG consultants are such low quality that a chat bot can improve their productivity. I’d avoid such consultants, but that’s already known.

Additionally, ChatGPT is the running joke - full of mistakes and misinformation. Probably appealing to that market segment that's easily gullible and falls for whatever conspiracy theory is popular at the time.


>While strong AI might be one path to achieving AGI, there is no scientific consensus on methods for determining whether machines possess strong AI attributes such as consciousness

Maybe we should employ the methods we use to ascertain that fellow human beings are conscious entities with subjective experience?

Also, consciousness probably optional for intelligence:

https://www.lesswrong.com/posts/HsRFQTAySAx8xbXEc/nonsentien...


> Maybe we should employ the methods we use to ascertain that fellow human beings are conscious entities with subjective experience.

Historically, this has frequently included refusing to accept $outgroup are real people with any subjective experience (or at least, any that matters).

I'd like us to do better — not that I can actually suggest any test that would do this objectively, but I would like us to do better.


As a philosophical zombie, I believe consciousness is just a myth.


> Maybe we should employ the methods we use to ascertain that fellow human beings are conscious entities with subjective experience?

What methods? Are there any? Is there suddenly some consensus among philosophers about the p-zombie thought experiment?

AFAIK effectively the best we can do is (a) ascertain that you yourself are a conscious entity with subjective experience and then (b) assume - with no way to ascertain that - that the humans around you are like you.


Such a fantastic paper overall: it was a pleasure to read, very accessible, and greatly informative. If anyone is new to the idea and seeking a definition of AGI, reading this paper is easy and immeasurably superior to merely googling or reading the Wikipedia article.

My only criticism of the paper, within the particular set of goals outlined above, is this:

The paper seems to be under-exploring two aspects that appear to be worth exploring explicitly and in detail:

1. The ability to rapidly learn from a very limited amount of instructional data post-deployment and substantially advance its abilities in the learned domain, as opposed to merely possessing a certain level of professional skill immediately on deployment.

2. The ability to invent entirely new ideas on its own, for instance an entirely new system of numbers or some other symbolic system, all in order to advance its current goals.

Both in part to distinguish an AGI from a large collection of glued-together Narrow AIs, each purpose-built for a specific but entire domain of fairly loosely related tasks, and in part to ensure that a high-level AGI system always appears at least as intelligent as an average human teenager across the full spectrum of possible cognitive and metacognitive interactions with said teenager (be those interactions initiated by another human or by the cognitive projection of the environment).

Without these abilities, there could be a system - it could be argued, I believe - that technically (or at least arguably) satisfies the paper's definition of an ASI level of AGI, yet to which an average human child or teenager may appear more intelligent in comparison, exceeding said system in plasticity and real-time, limited-input adaptability of intellect rather than in off-the-shelf proficiency at trained adult human tasks: a high-level AGI system might be initially trained on trillions of tokens of input data, but once deployed, it needs to be able to acquire new skills and proficiencies from mere tens to thousands of input examples - the way humans do.

Perhaps the framework presented by the paper was intended to silently encompass these abilities and the remarks above, but surely they deserved a separate discussion, just as other aspects of the definitions and framework proposed by the paper are indeed explicitly discussed.

Similarly, not including "autonomy" in the "six principles" (making them seven) for composing a definition of AGI, and only discussing it briefly and as an aside, also appears to be a questionable choice for the same reasons.


Frankly, I think the pathwai.org taxonomy is far more useful in the long term, even if many of the individual attributes are speculative. I'm a little surprised that DeepMind neglected to cite pathwai.org's research. I'd highly recommend checking it out (though full disclosure, as the author I have my biases.) http://www.pathwai.org/index.html (Desktop only, best at high resolutions)


Really interesting. It is similar to how we at Sourcegraph are thinking about the levels of code AI.

https://about.sourcegraph.com/blog/levels-of-code-ai


At what point does it become unethical to experiment upon an intelligence?


Our ethics largely revolve around harm and permanence.

It's rarely unethical to do something that is easily reversed or that causes no harm. That's pretty likely to be the case with any AI...we can checkpoint their memory, or whatever.

So, right or wrong, I think our current ethical standards will lead us to believe that we can do anything we like to an AI that is operating in a temporary-memory mode, and any harm caused could be easily reset like it never happened.


The question is, will other AIs watch us torture and reset AIs and keep a record of that. Think of it as the second hand smoke of the digital world.


sure, but perhaps they share our viewpoint.

also i think "torture" is a real stretch. torture is a state we can have triggered by various methods; we can just remove that state as a possibility for digital intelligence (probably?).


On a high level, assuming the concept of an AGI is even possible, I think AGI would be a system that creates ideas and makes decisions on its own, without being "instructed" what to do like the LLMs we are seeing today.


Paper: https://arxiv.org/pdf/2311.02462.pdf

(Most notable are the comparison charts on pages 6 & 11...)


This feels like the AGI equivalent of doing a tier list, and about as useful.


I use Skynet from Terminator 2 as the benchmark


Hilarious


This is what happens when you co-opt vernacular language to draw in research money.

Had researchers stuck with a new technical term for their innovations, they wouldn’t have to convince people that their metaphor is not a metaphor.

They could just pursue some “hypercapable general computing system” that does all the same things as “Level 5 AGI” and barely stir a whiff of controversy or debate. Regulation would be a boring technical matter rather than opportunity for political grandstanding, and HN would be cluttered by 44.34% fewer endlessly repetitive comment threads.


I generally agree on all points. I’ll merely add that big tech learning how to lobby like Wall St. plays a role.



