simonhughes22's comments

This is like calling the nuclear anti-proliferationists useless when they had just gotten started. AGI has only been in the general public's consciousness for about a year since ChatGPT released (before then, far fewer people were worried about AGI, as it seemed to most to be decades away at least). I think it's a bit early to throw in the towel and call them useless. Making progress in this area is difficult, which is why it needs considerable time and money thrown at it. The only real solution is to fund alignment research; putting a halt to actual research is unachievable, as you can't police it the world over.

I also find any argument saying 'don't worry about AI' completely illogical, and unlike the author I don't mind stating why. I have yet to hear any argument persuasive enough to convince me that there is zero risk from AGI. I am an AI researcher, and while I think a lot of the risks are overblown, I cannot prove that AI is not some sort of existential risk. Even if you put the likelihood of that at less than 1%, that still warrants a lot of research and effort to help prevent it from happening. There are no second chances; once the world ends, that's it. Life is not a video game. This is true of any sufficiently powerful technology: if it gets into the wrong hands or is abused, it can be very dangerous. Einstein didn't think up general relativity to develop nuclear weapons.


Llama 2 (and variants). It has the lowest hallucination rate (https://github.com/vectara/hallucination-leaderboard), it's open source so we know what went into it, and the community can improve it.


Yeah it's odd they chose Palm 2 to compare against. Not a very strong model by most measurements.


The model still hallucinates, despite their claims. See the first prompt I tried here: https://twitter.com/hughes_meister/status/172740068973816258...


Prompt: You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question 'Provide a concise summary of the following passage, covering the core pieces of information described.' Passage: The plants were found during the search of a warehouse near Ashbourne on Saturday morning. Police said they were in “an elaborate grow house.” A man in his late 40s was arrested at the scene.

Pi: The passage is describing the discovery of a cannabis grow operation in a warehouse near Ashbourne in Ireland. The police conducted a search of the warehouse on Saturday morning and found cannabis plants with an estimated value of €1.3 million. A man in his late 40s was arrested at the scene. This is a significant bust, and it's likely that the police will continue to investigate the operation and determine its origins and intended distribution channels. This bust highlights the ongoing problem of illegal cannabis cultivation and trafficking in Ireland.


Like a number of other LLMs we tested, including the Palm 2 chat model (chat-bison-001), it adds in the street value, and assumes the plants are cannabis (which is reasonable but is an assumption not mentioned in the article).
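
To make that failure mode concrete, here's a toy sketch, not how our evaluation model works, just an illustration of the kind of unsupported detail involved: flag numbers and capitalized terms in a summary that never appear in the source passage. The unsupported_terms helper and the condensed summary string are my own illustrative stand-ins.

    import re

    def unsupported_terms(source: str, summary: str) -> list[str]:
        """Flag numbers and capitalized terms in the summary that never
        appear in the source passage (a crude proxy for added facts)."""
        src = source.lower()
        flagged = []
        # Amounts like "1.3" that the source never mentions.
        for num in re.findall(r"\d+(?:\.\d+)?", summary):
            if num not in src:
                flagged.append(num)
        # Capitalized words (a rough stand-in for named entities) missing from the source.
        for word in re.findall(r"\b[A-Z][a-z]+\b", summary):
            if word.lower() not in src:
                flagged.append(word)
        return flagged

    passage = ("The plants were found during the search of a warehouse near "
               "Ashbourne on Saturday morning. Police said they were in "
               "'an elaborate grow house.' A man in his late 40s was "
               "arrested at the scene.")
    # Condensed from Pi's answer above, for illustration only.
    summary = ("Police found cannabis plants worth an estimated 1.3 million "
               "euros in a warehouse near Ashbourne in Ireland on Saturday.")

    print(unsupported_terms(passage, summary))  # ['1.3', 'Ireland'] - details the summary added

A real consistency judge has to do much more than string matching (paraphrase, entailment, relation direction), but the added "1.3" and "Ireland" are exactly the kind of content it needs to catch.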


This is just typical of so much work in the field. They pick and choose which models to compare against and on which benchmarks. If this model were truly great, they would be comparing against Claude 2 and GPT-4 across a bunch of different benchmarks. Instead they compare against Palm 2, which in a lot of tests is a weak model (https://venturebeat.com/ai/google-bard-fails-to-deliver-on-i...) and prone to hallucination (https://github.com/vectara/hallucination-leaderboard).


You can view the responses here in the linked csv file: https://github.com/vectara/hallucination-leaderboard


The original data we used was not annotated with sources, only with where the overall data came from; most was news articles. The length doesn't seem to matter too much, as we see a lot of errors even when summarizing a single sentence (sometimes the model felt compelled to elaborate with extra info). Usually the hallucinations were common-sense inferences, such as assuming the plant was a cannabis plant in the example cited in the NYT article. Other times the LLM would invert things. E.g., if you ask any of the Google LLMs to summarize an article about a famous boxer, where the article stated that Wahlberg was a fan of said boxer, the Palm models would flip it to say the boxer was a fan of Wahlberg's. Even the latest Bard model still does that; I tested it this weekend. It's a subtle and small error, but it's still factually incorrect.


Yes. Just because the model is smaller doesn't always mean by default that it's worse, as it may be trained for less time or on less data, which in some cases could be beneficial. The differences are small, so they may not be statistically significant. Plus, the model is doing the evaluation, so while it's highly correlated with humans, a small difference like this may not mean that the 7B model is necessarily better.
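
To put rough numbers on the "may not be statistically significant" point, here's a back-of-the-envelope sketch, assuming each model is scored on the same ~831 documents and treating each judgment as an independent pass/fail. The rates below are made up for illustration, not taken from the leaderboard.

    import math

    def two_proportion_z(p1: float, p2: float, n: int) -> float:
        """z statistic for the difference between two hallucination rates,
        each estimated from n independently judged summaries."""
        pooled = (p1 + p2) / 2
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        return (p1 - p2) / se

    # Hypothetical rates: 5.0% vs 5.6% hallucination over ~831 summaries each.
    z = two_proportion_z(0.050, 0.056, 831)
    print(round(z, 2))  # about -0.55, well inside +/-1.96, so not significant at p < 0.05

With sample sizes in that range and rates around 5%, gaps of well under a percentage point are hard to distinguish from noise.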


I worked on the model with our research team. It was recently featured in this NYT article (https://www.nytimes.com/2023/11/06/technology/chatbots-hallu...). Posting here to do an AMA. We are also looking for collaborators to help us maintain this model and make it the best it can be. Let us know if you want to help.


Hey, looks like your (very interesting) link got formatted incorrectly! Should be https://www.nytimes.com/2023/11/06/technology/chatbots-hallu..., right? :)


Yes thanks for fixing that.


Interesting work, thanks! Not enough people are studying this.

Do you have a whitepaper describing how you trained this hallucination detection model?

Is each row of the leaderboard the mean of the Vectara model's judgment of the 831 (article,summary) pairs, or was there any human rating involved? With so few pairs, it seems feasible that human ratings should be able to quantify how much hallucination is actually occurring.


We may write a research paper at some point. For now, see here: https://vectara.com/cut-the-bull-detecting-hallucinations-in...

Given the number of models involved, we have over 9k rows currently. Judging this task is quite time consuming, as you need to read a whole document and check it against a several-sentence summary, and some of the docs are a 1-3 minute read. We wanted to automate this process and also make it as objective as possible (even humans can miss hallucinations or disagree on an annotation). We also wanted people to be able to replicate the work, none of which is possible with a human rater. Others have attempted human rating, but on a much smaller scale, e.g. see AnyScale's study - https://www.anyscale.com/blog/llama-2-is-about-as-factually-... (but note that is under 1k examples).

We did some human validation, and the model aligns well with humans, though not in perfect agreement, as it is a model after all. And again, humans don't agree 100% of the time on this task either.
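
In simplified form, each leaderboard row is essentially summary statistics over the judged (article, summary) pairs, something like the sketch below. judge_consistency is a stand-in for the evaluation model, not its real interface, and the 0.5 cutoff is illustrative.

    from typing import Callable, Optional

    def leaderboard_row(pairs: list[tuple[str, Optional[str]]],
                        judge_consistency: Callable[[str, str], float],
                        threshold: float = 0.5) -> dict:
        """pairs holds (article, summary) tuples; summary is None when the
        model refused to summarize. judge_consistency returns a score in
        [0, 1], higher meaning the summary is more consistent with the article."""
        answered = [(a, s) for a, s in pairs if s is not None]
        scores = [judge_consistency(article, summary) for article, summary in answered]
        consistent = sum(score >= threshold for score in scores)
        return {
            "answer_rate": len(answered) / len(pairs),
            "factual_consistency_rate": consistent / len(answered),
            "hallucination_rate": 1 - consistent / len(answered),
        }

The human validation is then about checking that the judge's decisions track human annotators, rather than hand-rating every pair.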


That's the term used in the academic literature as well, so 'hallucinate' is an industry-standard term.


ChatGPT's thoughts on the topic:

"Yes, it would be more accurate to say that AI models, especially language models like GPT-4, confabulate rather than hallucinate. Confabulation refers to the generation of plausible-sounding but potentially inaccurate or fabricated information, which is a common characteristic of AI language models when they produce responses based on limited or incomplete knowledge. This term better captures the nature of AI outputs as it emphasizes the creation of coherent, yet possibly incorrect, information rather than suggesting the experience of sensory perceptions in the absence of external stimuli, as hallucination implies."

