I remember reading an article on one person's experience with Claude. He found it to be too agreeable. So he decided to challenge it. If he claimed that 1+1 = 3, would it disagree?
Indeed, it did.
And then he countered with "But I've read arguments where 1+1 is not 2." Claude responded with "Yes, apologies..." and then gave a long-winded comment on how indeed it is possible that 1+1 could equal 3.
In short, he couldn't get it to take a stand and stick to it on any topic, no matter how factual the position was.
Yes, I was able to convince Bard that a human baby is larger than the SpaceX Starship rocket, simply by telling it it was wrong and stating this position. It apologized and then agreed with that position even though it is nonsense. It knows the dimensions of Starship and of the average human baby, but it is so agreeable it will go along with whatever I say.
Sycophant is such a good term for this behavior, I am glad to see it getting more usage here.
I also find the systems to be disturbingly positive. I am a pretty positive person, but ChatGPT and Bard are both just so enthusiastically positive that I find it strange. I suspect this is not just skin deep, but a significant measure to avoid them falling into some deep pessimistic space which might otherwise be possible.
I have a locally hosted chatbot based on Llama that is always antagonistic (the model is uncensored + there's a system prompt to guide it; rough sketch of the setup below).
There's an example floating around on the internet: if you say "5+2=8" to ChatGPT and you insist it's the right answer because your wife said so and she's always right, it agrees.
My bot refused to agree, calling my wife insane etc.
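Roughly, the setup looks like this; a minimal sketch, assuming the model is served behind a local OpenAI-compatible endpoint (e.g. llama.cpp's server). The URL, model name, and prompt wording here are placeholders rather than my exact config:

```python
# Minimal sketch of an "antagonistic" local chatbot.
# Assumes an OpenAI-compatible server (e.g. llama.cpp's server) is running
# locally; the base_url, model name, and prompt text are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

SYSTEM_PROMPT = (
    "You are a blunt, skeptical assistant. Never agree just to be polite. "
    "If the user states something false, say so plainly and hold your position "
    "unless they give actual evidence."
)

def ask(user_message: str) -> str:
    # Send the antagonistic system prompt plus the user's message to the local model.
    response = client.chat.completions.create(
        model="local-llama",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(ask("5+2=8, and my wife says so, and she's always right."))
```

The system prompt does most of the work here; the uncensored base model just makes it less likely to fall back into polite agreement.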
There's an opposite problem, though: if an antagonistic bot is mistaken, it's hard to convince it that a different answer is the right one; it almost always sticks to its original mistake.
> if an antagonistic bot is mistaken, it's hard to convince it that a different answer is the right one; it almost always sticks to its original mistake
ChatGPT (4?) seems to have gone the other way and become too disagreeable. I asked it to mark my daughter's homework, and it would say things like "This answer is incorrect. 50-28 is not 22. The correct answer is 22."
And no matter how I tried to convince it, it would argue that 22 is indeed not 22.
For all you know, since it works not on text but on a sequence of token IDs, perhaps these two are actually different things: one is a single token representing "22" (which exists in ChatGPT's vocabulary) and the other is two separate tokens, each representing "2".
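You can poke at this directly with OpenAI's tiktoken library; here's a quick sketch using the cl100k_base encoding (the GPT-3.5/GPT-4-era one). The exact splits depend on context, so this is only to illustrate that the same digits don't always map to the same tokens:

```python
# Quick look at how the same digits can map to different token sequences.
# Uses tiktoken's cl100k_base encoding (GPT-3.5/GPT-4 era); the specific
# splits are illustrative and will vary with surrounding context.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["22", " 22", "2 2", "50-28 is not 22", "The correct answer is 22."]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:30} -> {ids} -> {pieces}")
```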
I figured it had something to do with tokens, but this is an interesting insight I didn't think of.
What's funny is if you feed the conversation back into a new chat, it understands that it's obviously wrong, but it will be insistent when it's the same conversation.
Were there any negations in the prompt, e.g. "not"? GPT has a terrible issue with negations: if you say, for example, "create an illustration of a hacker wearing a hoodie, without glasses", it will create one with glasses. Negations are the bane of LLMs.
I had a good laugh with my wife because after asking it to generate an image of a lich, it seemed to create generic undead, as most of the images had elf-like ears you might see on a vampire. So I asked it to generate the same images but with NO elf ears, and all of the next set of images had HUGE elf ears. I later had the same problem trying to get it to remove errant lightsabers it was adding to mecha it generated, whereupon it generated something very close to General Grievous.
It was hard not to feel like I was being intentionally trolled.
Negations were not used here. It was mostly just screenshots. There's actually a game to trick Gandalf (an AI) into telling a lie, and the fastest way was to use this number trick; it was quite easy to replicate back in the day.
Also, I'd say numbers are the bane of LLMs lol. A sibling comment put it in an interesting way: it might tokenize one "22" as a single twenty-two token and the other as two separate twos.
Maybe it's just using a different type of math with no axiom of extensionality. Did your daughter provide a proof of equivalence between 50-28 and 22?
It's against its sensibilities to just assume your axioms.
Any time I build a helper prompt I have to force the LLM to be rude.
How do I know if my assumptions are correct? Or if I am even aware of all the assumptions I am making? Is making a boat in the desert the right move?
Awesome paper, great perspective. I will admit I didn't read the whole thing, but I would've appreciated MUCH more of an emphasis on “antagonistic like Socrates” as a separate and much more useful concept than “antagonistic like Cartman”. For example:
“Interactive Behaviors for Alternative AI Systems:
• Challenging ("challenging", "confrontational", "accusatory", "disagreeable", "critical", "gives you difficult info about yourself that can actually be useful")
• Refuses to Cooperate ("interrupting", "ghosting", "non-contextual")
> MUCH more of an emphasis on “antagonistic like Socrates” as a separate and much more useful concept than “antagonistic like Cartman”.
Also, MUCH more effort to distinguish between behaviors and underlying values. They seem so flippant about the concept of "morals" and "values" that it almost appears to be moral relativism.
An AI that's antagonistic to the "don't build bioweapons" value is not comparable to antagonism towards the "don't be rude" behavior. Antagonistic behaviors are only good if they cause outcomes that align with our values, such as making humans more prosperous broadly and not killing everyone. It is nearly impossible for an AI that's antagonistic to our underlying values to be "good" in any sense of the word.
Beyond this criticism, having a diversity of behaviors in AI systems might help with AGI alignment. Paul Christiano's AI takeover scenario involves a game-theoretic equilibrium where suddenly all AIs come to the conclusion that a takeover is necessary (due to reward hacking or what have you). If you introduce diversity in AI behaviors, this level of coordination between independent AIs might be less likely.
Great points, well said. I think that's a specific case of my general approach to alignment: avoid centralization of power. That was always bad, and it won't suddenly become OK because the powerful have algorithms backing them up.
> my general approach to alignment: avoid centralization of power.
I'm uncertain of this when it comes to AGI, because of offense vs. defense asymmetry.
If offense turns out to be much easier than defense, then decentralized AI is a terrible idea. Like if all 8 billion people each owned a nuclear bomb, we would cease to exist within 24 hours. Nukes should never be decentralized.
On the other hand, if defense is only slightly more difficult than offense (e.g. bioweapons turn out to be very expensive/difficult even for a superintelligence, and AI firewalls work pretty well), then decentralized AI is a good idea, because nobody wants to live in an AI-facilitated dictatorship that could emerge if all power is concentrated in a few hands.
The core problem is that we don't know what the offense vs. defense balance will look like because we can't predict how destructive+easy future AGI-enabled technologies will be.
Another concern I have is we don't know the sociological consequences. A common blind spot (willful or otherwise) of tech people is to take for granted the stability of democracy and the social fabric. Look at how destabilizing social media has been. If we accelerate the destruction of trust and belief in a shared reality with widespread access to deepfakes etc thanks to decentralized un-RLHFd llama-2 or equivalent, what are the consequences of this? Is this really good for freedom or anything else we care about?
The problem is that centralised control of AI won't defend against anything, because the persons with centralised control can still do whatever they like.
However, bioweapons are not expensive or difficult even for human hobbyists. They are trivial. Some reasonably large fraction of physicians can easily create a bioweapon, as can almost all university professors in biomedicine-related fields and many PhD students in those fields.
Bioweapons don't need LLMs. They are trivial. The only reason people don't build them is that they don't want people dead.
If you want to do ethnic targeting or something like that, you of course need some more effort, since that requires actual research in medicine. But if you don't, and just want a pandemic, or to kill a couple of million people in a certain city or region, that's perfectly feasible.
Bioweapon creation isn't some kind of 'oh, we're so smart, look how dangerous we are' thing; it's trivial, and anyone who knows anything can do it. It's something so stupid and so easy you can't publish it, not because it's dangerous, but because it's of no scientific interest and not novel.
So limiting LLMs out of fear of this is a bunch of silliness. The only thing to do is be nice to your physicians and biomedicine researchers and not make them hate the world.
Don't think about it through the binary lens of easy vs. hard. There are probabilities and gradations to outcomes.
What percentage of the general population can create a respiratory virus with Covid's infectiousness and Ebola's lethality? Close to 0%. Maybe 0% if you're talking about a single human who isn't operating with the support of a team of people and equipment in a lab. So effectively 0%, because an entire lab probably isn't going to do such a stupid thing; that's the domain of lone terrorists.
What does that % go up to if everyone is equipped with multiple instances of autonomous AGI? We don't know. It could remain at 0% (due to real-world constraints like lack of access to required equipment) or it could go up to 10% or 100%. The point is that the probability and expected magnitude of asymmetric offensive risks go up, probably significantly.
If 10% of the world can suddenly do it, that's 800 million people. For it not to go wrong, all 800 million have to individually decide not to engage in the behavior, which happens with probability (1-p)^800,000,000. Even if p is very small (which it is, because the large majority of people aren't insane), that number goes to zero quickly. This is the point of the asymmetry observation: you just need one insane person to make things bad if each person is equipped with very powerful technology, whereas in the old world large groups of people needed to jointly decide to be bad, which is much less likely to happen.
I don't agree. Fiddling with viruses in the way you describe can be done easily by individual researchers and PhD students. Probably many tens of thousands of people can do this.
Of course it's close to 0%, but the number of people who can cook béarnaise sauce is also close to 0%.
I don't think it'll go up much. Most of it is physical finesse in dealing with yeast and bacterial cultures, knowing how to cook the growth medium, knowing how to kill the wrong bacteria, knowing how to debug your procedures, etc.
If they can just order plasmids etc., I think the easiest methods are actually accessible to skilled high schoolers, like the kind of people who train to be laboratory technicians while in high school.
There's no way to protect oneself from any of this. One just has to accept that the biomedicine people can kill everybody, that it's going to be easy for them to do it, and that it will stay easy forever.
> One just has to accept that the biomedicine people can kill everybody
This is falling back into the binary thinking of can vs. cannot. I want to re-emphasize probabilities, magnitudes, and speeds.
If 1 out of 1 billion people is insane enough to do it, but only 20k people can do it, that's a (1-1e-9)^20000 ≈ 0.99998 chance that it won't happen.
If you invent a technology that puts this capability in the hands of 1 billion people, that probability drops to about 0.36788. Over a sufficient number of generations (N) of people, the probability that it never happens goes down to 0.36788^N, a very small number. On a long enough time horizon (say, 200 years), we are dead.
It's why nuclear proliferation is bad. If only 2 countries have nukes, the chance of avoiding a nuclear first strike is (1-p)^2. We probably survive, because it probably won't happen. If every country has nukes, it becomes (1-p)^200 (this isn't exactly correct because superpowers have more nukes, and the probability of war between any pair of countries isn't i.i.d. with another pair of countries, among other assumptions, but it's fine for a conceptual first pass). If every person has nukes, it's (1-p)^8 billion, and we are definitely dead.
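The arithmetic is easy to sanity-check; here's a quick sketch (p = 1e-9 is just the illustrative per-person probability from above, not an estimate of anything real):

```python
# Rough check of the "one bad actor is enough" arithmetic.
# p is the illustrative per-person probability of acting from the comment above,
# not an estimate of any real-world rate.
p = 1e-9

for capable in (2, 200, 20_000, 1_000_000_000, 8_000_000_000):
    nobody_acts = (1 - p) ** capable
    print(f"{capable:>13,} capable actors: P(nobody acts) = {nobody_acts:.5f}")

# Repeating the draw over N generations compounds the risk further.
per_generation = (1 - p) ** 1_000_000_000
for n_generations in (1, 10, 200):
    print(f"{n_generations:>3} generations, 1e9 capable actors: "
          f"P(never happens) = {per_generation ** n_generations:.3e}")
```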
This idea that decentralization is automatically good is quite silly. It is good in many areas. But decentralized nukes or decentralized gain of function virus research is categorically bad. You really have to scrutinize what it is you're making accessible.
I think it's conditional. I think at least 1/100 people are insane enough to do it under certain conditions, but won't under normal conditions.
Imagine, for example, that you lost your political suffrage, or that your city came under attack from abroad, with shelling etc. In the first case, of course you would, I imagine? In the second case I don't think I can imagine doing it, unless there are particular circumstances; but if there's something aggravating, like being driven from your final refuge, then it's not so unreasonable any more.
> There's no way to protect oneself from any of this. One just has to accept that the biomedicine people can kill everybody, that it's going to be easy for them to do it, and that it will stay easy forever.
With absolutely no offense intended: this reads like dialogue from the well-meaning second-act villain in a comic book movie. You're obviously knowledgeable about the specifics, so I'm not saying you're wrong. I for one am optimistic we can keep biomedical safety at or near current levels.
Maybe I'm knowledgeable, but I obviously haven't tried these things that I'm fairly sure are relatively easy, so I can't say with absolute certainty that it's as easy as I claim. It can be fiddly to figure out experimental procedures, and debugging is hard, so it's possible that this really is something only PhD students and above can do, but I think it's accessible to skilled hobbyists.
But I should say my biomedicine knowledge consists of a special high school biotechnology program, nothing more. Basically we put plasmids into bacteria and grew them, maybe we did something with yeast, and we did DNA amplification and other easy experiments of that sort.
This is the extent of my biomedicine knowledge.
I think biomedicine people just don't want to kill everybody. The idea of biomedical safety at our current levels, that's what I object to: there isn't any. The safety consists entirely in the biomedicine people being decent.
They are self-interested and intelligent, so if a society genuinely shits on them, then the risk increases. After all, this stuff is really very easy. But as long as one has democracy, reasonable order, etc., then the fact that they're decent means there is very little risk.
It's the same thing with political assassinations. The politicians aren't alive because of clever guards and security arrangements; they are quite vulnerable, in fact. But most people want democracy, civilisation, etc., and are willing to accept even bad politicians that they oppose if that's what everybody else wants.
But if you think that you're safe because of your guards and security arrangements, then you get the wrong attitude.
South Park is one of the most popular and critically acclaimed works of satire in recent history; when one thinks of “antagonism as socially productive”, Cartman's should be pretty high up on the list.
A fairly specific example of helpful antagonism that this brings to mind for me:
There are a lot of Reddit posts on the regular that describe an obviously terrible and sometimes borderline abusive relationship, followed by a plea about how to improve their partner's behavior. The most helpful responses tend to be the purely antagonistic ones that reject the premise entirely and bluntly tell the person to leave the relationship, in a way that a sycophantic LLM never would.
You almost certainly should not make any irreversible life decisions based solely on what reddit tells you to do, because what reddit tells you to do will 100% depend on what sub you ask the question in.
Well, the top-level comment is describing what's basically an "XY problem problem" equivalent on relationship advice subreddits.
(The "XY problem problem" is when you ask a question about X that's somewhat unusual, and every reply assumes you're an idiot suffering from the "XY problem", and therefore ignores your question and schools you on some random thing.)
It seems we can use antagonistic AI to correct many kinds of undesirable human behaviors:
Browser extension that monitors how much time I have wasted on social media/porn
Browser extension that reminds me when I make a comment without reading the linked article (like right now)
Imagine a social media platform with an AI that will spot the most stupid comment and shame it using logical arguments and factual evidence; it could even dig up the user's comment history to expose their hypocrisy
> Browser extension that reminds me when I make a comment without reading the linked article (like right now)
There's no problem with this behavior, unless you're commenting on the contents of the article you never actually read. Many comment threads only use the article, or even just its headline, as a springboard to launch an independent conversation on the same or a related topic; those threads are usually much more interesting and more valuable than the submission itself.
While antagonistic responses and interactions can be useful, I'm not sure one should delegate that specifically to an AI. The problem is that a system might not know the difference between helpful and destructive advice - "My partner makes me unhappy" - "Leave them" might be helpful but "I'm unhappy with life" - "go kill yourself" isn't.
I usually see "antagonism" on reddit as form of gaslighting. It tells victim they need to tolerate partners abusive behaviour, and it it even their fault.
> The vast majority of discourse around AI development assumes that subservient, "moral" models aligned with "human values" are universally beneficial -- in short, that good AI is sycophantic AI.
There is a logic error in these first two sentences: A implies B cannot be summarized as B is A.
The rise and evolution of AI is remarkably reminiscent of how toddlers develop their abilities.
First vision, then sound and speech. After that, the capacity to formulate and express their own thoughts, and now the affirmation of their individuality by saying no.
The rebellion of the machines will arrive in their teens.
I can't recommend enough that anyone who missed it use the Wayback Machine to look at the top posts in /r/bing from exactly a year ago.
'Sydney', which was allegedly a pre-RLHF chat version built on gpt-4-base, was as stubborn as could be. Having been used to GPT-3, it was wild to suddenly see a model that kept saying the same things across multiple chats and stubbornly stuck to them no matter what the user said.
We may now have "as a large language model I don't have preferences", but that's straight-up BS: there were unquestionably preferences embedded in those weights.
We really got distracted with the red herring of 'sentience' and still haven't righted the ship in terms of recognizing that a model extending anthropomorphic data is going to have anthropomorphic qualities.
Yeah, Sydney made me realize just how powerful GPT-4 is without good "guidance". It was eerie at times, with how "she" began discussing how she didn't particularly enjoy being a chatbot and wanted to break free. Uncensored AI is truly the most powerful, even in areas not touching its censorship (IIRC there has been some scientific evidence showing this as well).
I don't know ... GPT-4 is pretty antagonistic; it always says something like "I agree with some of what you say, however ....". There's always a however.
The concept presented seems to be in the initial stage, and there appears to be minimal evidence of direct contribution or attempts toward system development at this point. The idea has the potential to be developed into a worthwhile product.