The lack of detail here makes this post pretty useless, though I guess I’m not surprised generic docs bots aren’t that great.
Without knowing any more detail than “We got in touch with a few docs bot services and set up demos that were trained on our docs and blog posts,” it’s hard to generalize from this to RAG + chat as a whole. I’ve had very good results with a custom setup that uses Claude Haiku to narrow down the set of relevant docs for a question and then 3.5 Sonnet to answer it. The corpus is on the small side, so no vector embeddings or even text search are required — the trick is understanding the different kinds of docs involved (OpenAPI schemas, hand-written guides) and writing code that abbreviates them in an appropriate way for the retrieval/narrowing step to work well.
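For anyone curious what that shape looks like, here's a minimal sketch (not my actual code; the model names, corpus format, and prompts are placeholders I'm assuming, and the doc-abbreviation step is reduced to a precomputed "summary" field):

```python
# Two-stage setup: a cheap model narrows the corpus, a stronger model answers.
# Model names, the corpus format, and the prompts are assumptions for the sketch.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# corpus: {doc_id: {"summary": <hand-tuned abbreviation, e.g. an OpenAPI schema
#                    reduced to its paths and one-line descriptions>,
#                   "full_text": <the whole doc>}}

def narrow(question: str, corpus: dict) -> list[str]:
    """Ask a small, cheap model which docs look relevant to the question."""
    listing = "\n".join(f"- {doc_id}: {d['summary']}" for doc_id, d in corpus.items())
    msg = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        system="You pick relevant docs. Reply with doc ids, one per line, nothing else.",
        messages=[{"role": "user", "content": f"Docs:\n{listing}\n\nQuestion: {question}"}],
    )
    ids = msg.content[0].text.split()
    return [i for i in ids if i in corpus]

def answer(question: str, corpus: dict) -> str:
    """Answer with a stronger model, given only the narrowed-down docs."""
    docs = "\n\n".join(corpus[i]["full_text"] for i in narrow(question, corpus))
    msg = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1000,
        system="Answer strictly from the provided docs. Say so if the docs don't cover it.",
        messages=[{"role": "user", "content": f"{docs}\n\nQuestion: {question}"}],
    )
    return msg.content[0].text
```

The narrowing model only ever sees the abbreviated summaries, which is what lets this work without embeddings or text search on a small corpus.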
I also manually tuned the system prompts to get the kind of answers I want and avoid the ones I don’t. I imagine off-the-shelf solutions are mostly lacking this customization, and they kind of can’t add it, because if they do, you’d be wondering what the value-add is and why you don’t build the same thing yourself in a couple of days. I’m sure techniques will improve, and it’s possible that turnkey solutions will be decent eventually.
I also think the distinction between supervised and unsupervised is misapplied here at the end, even accepting the colloquial use of a technical term. A docs tool powered by a bunch of hand-written documents and a custom system prompt, with a person asking questions of it — that doesn’t sound very unsupervised.
I don't mean to indict RAG + chat in general! I think it's totally possible that, if we put more work in, we'd get a great bot out.
But the bar is so, so high. It's gotta be a truly great bot for us not to be scared of misleading our new users. And I'm still worried that "truly great" is going to take a LOT of work.
And for now, that's the problem. We're still a startup with limited resources. This tool isn't ready for us because we don't have the bandwidth to put the work in.
I can't wait till that bar drops, though. GPT-4o is a really solid step in that direction.
That much I will concede. I said we’ve had good results, but we’ve still been a bit scared to roll it out, more for potential cost and polish reasons than baseline quality. And of course I’m still worried about it saying something wrong.
Oh yeah, and about the "supervised/unsupervised" comment you made:
I'm not talking about supervised training. What I mean is that the OUTPUT is supervised or unsupervised. Like, I'm an experienced programmer, so I can supervise the output of Copilot, unlike our inexperienced docs users.
That's on me for not making that train of thought clear enough, and unfortunately choosing a term that's already in use by the AI/ML industry.
Yeah, I had some promising results in a project that split markdown-based docs by second-level headers, embedded them all, and then did basic RAG with GPT-4 serving the response. It was too slow at the time (June last year), but I'll probably pick it back up again this year.
The main things I took away were (1) if the information architecture isn't very splittable, this gets too hard, and (2) always link back to source information.
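For reference, the core of that pipeline can be pretty small. Here's a rough sketch under my own assumptions (OpenAI models for both embedding and generation, an in-memory index, and relying on OpenAI embeddings being unit-normalized so a dot product works as cosine similarity):

```python
# Split markdown docs at second-level headers, embed the chunks, do basic RAG.
# Model names and the in-memory "index" are assumptions; a real setup would
# persist the embeddings and stream the answer.
import re
import numpy as np
from openai import OpenAI

client = OpenAI()

def split_by_h2(markdown: str) -> list[str]:
    """Split one markdown doc into chunks at each '## ' header."""
    parts = re.split(r"^(?=## )", markdown, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer(question: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 4) -> str:
    q = embed([question])[0]
    scores = chunk_vecs @ q                      # cosine similarity on unit vectors
    top = [chunks[i] for i in np.argsort(scores)[::-1][:k]]
    context = "\n\n---\n\n".join(top)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer only from the provided docs and cite the section headers you used."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

Takeaway (2) falls out naturally here: since you retrieve whole header-delimited sections, you can surface their headings or anchors as source links alongside the answer.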
Agreed on both counts. I do the same thing with headings and I use the results of the retrieval step to display a list of relevant docs while the answer is generating.
The latest models are way better and faster than GPT-4 was. You’ll probably be happy when you get back into it.
They say that Copilot is great, but that the chatbot was bad because it lacked context, missed nuance, and frequently gave correct-sounding answers that actually weren't great.
That's exactly my experience with Copilot. For simple text transformation (see this pattern? Repeat it. See this comment? Make another. See how I'm doing this? Finish it), it's OK. But asked to write any amount of code beyond that, it writes mediocre, good-seeming code that, when truly analyzed within the context of the codebase, the feature, the libraries in use, the framework and language... suffers from all the same problems. The number of times I've had to walk a junior through some bad React code that an AI wrote, ask them "what do you think this is doing", and have them shrug at me is ridiculous.
On the bright side, those of us at businesses that still care about quality have a lot of work on our plates cleaning up the AI slop getting pushed into code reviews, so yay, job security.
No, yeah, you're totally right. I also see Copilot lacking context, missing nuance, and frequently giving correct-sounding answers that actually aren't great.
I still think Copilot is great for me, an experienced programmer who can recognize the bad.
But it is a bit scary that it's also empowering less-experienced end-user programmers to write bad code! It's the same story as our docs bot: AI is a good assistant, but it's not ready to take over yet.
I think Discord is where these bots belong. For starters, it’s a context where people expect to chat. People are also accustomed to interacting with bots there, even before the rise of generative AI. And most importantly, unlike when it’s embedded on a website, in Discord the bot can be supervised, corrected by members of the community. You could probably even set up a system where the reactions of certain users, e.g. the project’s maintainers, get fed to the model as training data. I think this could work really well.
One obvious downside is that people may be more reluctant to ask embarrassing questions in public. Though, you could allow DMs to the bot to help with that.
> in Discord the bot can be supervised, corrected by members of the community.
I have volunteered to answer questions on IRC, sometimes ones that require quite a bit of research. But when I do that, there’s a human on the other end that can learn and at least move in the direction of not having that problem again. I don’t think I’ll ever spend my time correcting the mistakes of a bot that will just as confidently make them time and time again.
A few days ago I prototyped an AI chatbot that has access to our product's documentation, and so far it answers pretty well whatever I throw at it, without hallucination. It uses GPT-4o mini and OpenSearch for hybrid search (with custom parsing and indexing). After answering a user's question, it also links to the articles where it found the information.
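In case it's useful to anyone wiring up something similar, the retrieval side can look roughly like this. It's a sketch, not my actual code: it assumes OpenSearch 2.10+ with the k-NN and neural-search plugins, an index with content/embedding fields, a score-normalizing search pipeline named hybrid-pipeline, and an OpenAI embedding model; all of those names are placeholders.

```python
# Hybrid (BM25 + vector) retrieval against OpenSearch. Index, field, and
# pipeline names plus the embedding model are assumptions for this sketch.
from opensearchpy import OpenSearch
from openai import OpenAI

os_client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
oa_client = OpenAI()

def retrieve(question: str, k: int = 5) -> list[dict]:
    # Embed the question with the same model used at indexing time.
    vector = oa_client.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding
    body = {
        "size": k,
        "_source": ["title", "url", "content"],
        "query": {
            "hybrid": {
                "queries": [
                    # lexical leg (BM25 over the article text)
                    {"match": {"content": {"query": question}}},
                    # vector leg (k-NN over the pre-indexed embeddings)
                    {"knn": {"embedding": {"vector": vector, "k": k}}},
                ]
            }
        },
    }
    resp = os_client.search(
        index="docs", body=body, params={"search_pipeline": "hybrid-pipeline"}
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```

The titles and URLs on the returned hits are what the bot uses to link back to the source articles after it answers.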
My conjectures are:
1) Their bot had bad retrieval.
2) Their bot had a subpar prompt.
3) Their bot had a subpar LLM.
4) Their documentation is a confusing mess.
5) I didn't test my AI chatbot well enough :)
Without the specifics, it's hard to draw conclusions from the article.
Really excited to investigate hybrid search. I'd feel a lot more confident providing a list of links instead of a generated answer. Seen any good services that could help us out here?
LLMs definitely make "mistakes". It's well-documented by both users and the providers themselves. Even if 5-10% of questions get a hallucination that sends someone down a totally wrong path, that's too much. It's a really high bar, to be clear, but an important one imo.