Show me the use cases you have supported in production. Then I might read all 30 pages praising the dozens (soon to be hundreds?) of “best practices” to build LLMs.
Hi, Hamel here. I'm one of the co-authors. I'm an independent consultant and not all clients allow me to talk about their work.
However, I have two that do, which I've discussed in the article. These are two production use cases that I have supported (which again, are explicitly mentioned in the article):
Eugene Yan works with LLMs extensively at Amazon and uses that to inform his writing: https://eugeneyan.com/writing/ (However he isn't allowed to share specifics about Amazon)
We use LLMs in dozens of different production applications for critical business flows. They allow for a lot of dynamism in flows that aren’t amenable to direct quantitative reasoning or structured workflows. Double-digit percentages of our growth in the last year are entirely due to them. The biggest challenges are the toolchain, limits on inference capacity, and developer understanding of the abilities, limits, and techniques for using LLMs effectively.
I often see these messages from the community doubting that this is real, but LLMs are a powerful tool in the tool chest. I think most companies just aren’t yet staffed with engineers skilled and creative enough to really take advantage of them, or willing to fund basic research and from-first-principles toolchain creation. That’s OK. But it’s foolish to assume this is all hype like crypto was. The parallels are obvious, but the foundations are different.
No one is saying that all of AI is hype. It clearly isn't.
But the fact is that today LLMs are not suitable for use cases that need accurate results. And there is no evidence or research suggesting this will change anytime soon. Maybe ever.
There are very strong parallels to crypto in that (a) people are starting with the technology and trying to find problems, and (b) there is a cult-like atmosphere where non-believers are seen as anti-progress and anti-technology.
Yeah, I think a key point is that LLMs in business are not generally useful alone. They require classical computing techniques to really be powerful. Accurate computation is a well-established field, and you don’t need an LLM to do optimization or math or even deductive logical reasoning. That’s a waste of their power, which is typically abstract, semantic, abductive “reasoning” and natural language processing. Overlay this with constraints and structure, and augment it with optimizers, solvers, etc., and you get a form of computing that was impossible five years ago and has only become practical in the last nine months.
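The division of labor described here can be sketched in a few lines. Everything below is illustrative: `stub_llm` stands in for a real model call, and the plan schema is invented for the example; the point is only that the LLM handles the fuzzy language, and exact computation stays classical.

```python
# Sketch of "LLM for semantics, classical code for math".
# stub_llm is a placeholder for a real model call that returns validated JSON.

def stub_llm(request: str) -> dict:
    """Pretend LLM: turns fuzzy language into a structured plan."""
    if "cheapest" in request:
        return {"task": "optimize", "objective": "min", "field": "price"}
    return {"task": "aggregate", "op": "sum", "field": "price"}

def execute(plan: dict, rows: list) -> int:
    """Exact, deterministic execution -- no LLM involved."""
    values = [r[plan["field"]] for r in rows]
    if plan["task"] == "optimize" and plan["objective"] == "min":
        return min(values)  # classical optimization, trivially
    if plan["task"] == "aggregate" and plan["op"] == "sum":
        return sum(values)
    raise ValueError("plan not understood")

rows = [{"sku": "a", "price": 3}, {"sku": "b", "price": 7}]
print(execute(stub_llm("find the cheapest item"), rows))   # 3
print(execute(stub_llm("total price of all items"), rows))  # 10
```

A real system would validate the model's plan against a schema before executing it, which is exactly the "constraints and structure" overlay described above.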
On the crypto stuff, yeah, I get it, especially if you’re not in the weeds of its use. A lot of people formed opinions from GPT-3.5, Gemini, Copilot, and other crappy experiences and haven’t kept up with the state of the art. The rate of change in AI is breathtaking and, I think, hard to comprehend for most people. The recent mess of crypto, and the fact that grifters grift, also hurts. But people who doubt -are- stuck in the past. That’s not necessarily their fault, it might not even apply to their career or lives in the present, and the flaws are enormous, as you point out. But it’s such a remarkably powerful new mode of compute that, in combination with all the other powerful modes of compute, it is changing everything and will continue to, especially if next-generation models keep improving as they seem likely to.
That text applies to basically every new technology. The point is that you can't predict its usefulness in 20 years from that.
To me it still looks like a hammer made completely of rubber. You can practice to get some good hits, but it is pretty hard to get something reliable. And a beginner will basically just bounce it around. But it is sold as a rescue for beginners.
I didn't see anything in the article that indicated the authors believed that those who don't see use cases for LLMs are anti-progress or anti-technology. Is that comment related to the authors of this article, or just a general grievance you have unrelated to this article?
> We use LLMs in dozens of different production applications for critical business flows. They allow for a lot of dynamism in flows that aren’t amenable to direct quantitative reasoning or structured workflows. Double-digit percentages of our growth in the last year are entirely due to them. The biggest challenges are the toolchain, limits on inference capacity, and developer understanding of the abilities, limits, and techniques for using LLMs effectively.
That sounds like corporate buzzword salad. It doesn't tell much as it stands, not without at least one specific example to ground all those relative statements.
Hi, Hamel here. I'm one of the co-authors. I'm an independent consultant and not all clients allow me to talk about their work.
However, I have two that do, which I've discussed in the article. These are two production use cases that I have supported (which again, are explicitly mentioned in the article):
Eugene Yan works with LLMs extensively at Amazon and uses that to inform his writing: https://eugeneyan.com/writing/ (However he isn't allowed to share specifics about Amazon)
You've linked to a query generator for a custom programming language and a 1-hour video about LLM tools. The cynic in me feels like the former could probably be done by chatgpt off the shelf.
But those do not seem to be real world business cases.
Can you expand a bit more on why you think they are? We don't have hours to spend reading, and you say you've been allowed to talk about them.
So can you summarise the business benefits for us, which is what people are asking for, instead of linking to huge articles?
> The cynic in me feels like the former could probably be done by chatgpt off the shelf.
Hello! I'm the owner of the feature in question who experimented with chatgpt last year in the course of building the feature (and working with Hamel to improve it via fine-tuning later).
Even today, it could not work with ChatGPT. To generate valid queries, you need to know which subset of a user's dataset schema is relevant to their query, which makes it as much a retrieval problem as a generation problem.
Beyond that, though, the details of "what makes a good query" are quite tricky and subtle. Honeycomb as a querying tool is unique in the market because it lets you arbitrarily group and filter by any column/value in your schema without pre-indexing and without any cost w.r.t. cardinality. And so there are many cases where you can quite literally answer someone's question, but there are multitudes of ways you can be even more helpful, often by introducing a grouping that they didn't directly ask for.

For example, "count my errors" is just a COUNT where the error column exists, but if you group by something like the HTTP route, the name of the operation, etc. -- or the name of a child operation and its calling HTTP route for requests -- you end up actually showing people where and how these errors come from.

In my experience, the large majority of power users already do this themselves (it's how you use HNY effectively), and the large majority of new users who know little about the tool simply have no idea it's this flexible. Query Assistant helps them with that, and they have a pretty good activation rate when they use it.
Unfortunately, ChatGPT and even just good old fashioned RAG is often not up to the task. That's why fine-tuning is so important for this use case.
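To make the "count my errors" example above concrete, here is a rough illustration of the literal answer versus the enriched query that adds a grouping the user didn't ask for. This is not Honeycomb's actual query format; the structure and column names like `error` and `http.route` are assumptions for the sketch.

```python
# Literal answer to "count my errors": just a COUNT where the error column exists.
naive_query = {
    "calculations": [{"op": "COUNT"}],
    "filters": [{"column": "error", "op": "exists"}],
}

# The more helpful version: same count, plus a breakdown that shows
# *where* and *how* the errors come from.
enriched_query = {
    "calculations": [{"op": "COUNT"}],
    "filters": [{"column": "error", "op": "exists"}],
    "breakdowns": ["http.route", "name"],
}

print(enriched_query["breakdowns"])  # ['http.route', 'name']
```

The fine-tuned model's job, on this reading, is learning when an extra breakdown like this helps rather than just answering literally.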
Thanks for the reply. Huge fan of Honeycomb and the feature. I spent many years in observability and built some of the largest log platforms in use. Tracing is the way of the future, and I hope to see you guys eat that market. I did some executive tech-strategy work on observability at a megacorp; it's really hard to unwedge metrics and logs, but I did my best when it was my focus. Good luck, and thanks for all you're doing over there.
They think they are real business use cases, because real businesses use them to solve their use cases. They know that chatgpt can't solve this off the shelf, because they tried that first and were forced to do more in order to solve their problem.
There's a summary for ya! More details in the stuff that they linked if you want to learn. Technical skills do require a significant time investment to learn, and LLM usage is no different.
I’ve listed plenty in my comment history. I don’t generally feel compelled to trot them all out all the time - I don’t need to “prove” anything and if you think I’m lying that’s your choice. Finally, many of our uses are trade secrets and a significant competitive advantage so I don’t feel the need to disclose them to the world if our competitors don’t believe in the tech. We can keep eating their lunch.
Processing high volumes of unstructured data (text)… we’re using a STAG architecture.
- Generate targeted LLM micro summaries of every record (ticket, call, etc.) continually
- Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule
- Proactively explain each report row by identifying what’s unusual about it and LLM summarizing a subset of the microsummaries.
- Push the result to webhook
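The pipeline in the list above can be sketched roughly as follows. This is a toy stand-in, not the commenter's system: `summarize` is a placeholder for the LLM micro-summary call, `score` collapses the "layers of regex, embeddings, and scoring enrichments" into one regex, and the webhook push is omitted.

```python
import re

def summarize(record: str) -> str:
    """Placeholder for an LLM micro-summary of one ticket/call/record."""
    return record[:40]  # a real system would call a model here

def score(summary: str) -> int:
    """Toy stand-in for the regex/embedding/scoring layers:
    flag anything mentioning refunds as 'worth attention'."""
    return 1 if re.search(r"\brefund\b", summary, re.I) else 0

def run_batch(records: list) -> list:
    """One scheduled pass: summarize, score, keep only flagged rows.
    A real pipeline would then LLM-explain each flagged row and
    push the result to a webhook."""
    flagged = []
    for r in records:
        s = summarize(r)
        if score(s) > 0:
            flagged.append({"summary": s})
    return flagged

flagged = run_batch([
    "Customer wants a refund for order 123",
    "Password reset question",
])
print(len(flagged))  # 1
```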
Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.
Another is preventing LLMs from adding intro or conclusion text.
> Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.
(Plug) I shipped a dedicated OpenAI-compatible API for this, jsonmode.com a couple weeks ago and just integrated Groq (they were nice enough to bump up the rate limits) so it's crazy fast. It's a WIP but so far very comparable to JSON output from frontier models, with some bonus features (web crawling etc).
We actually built an error-tolerant JSON parser to handle this. Our customers were reporting exactly the same issue: trying a bunch of different techniques to get more usefully structured data out.
> Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.
How is this a struggle for you, let alone a significant barrier? JSON adherence with a well-thought-out schema hasn't been a worry in a while, between improved model performance and the various grammar-based constraint systems.
> Another is preventing LLMs from adding intro or conclusion text.
Also trivial to work around with pre-filling and stop tokens, or just extremely basic text parsing.
Also, I'd recommend writing out "Stream-Triggered Augmented Generation", since the term is so rarely used that it might as well be made up, from the POV of someone trying to understand the comment.
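The "basic text parsing" workaround mentioned above can be as small as pulling the first balanced JSON object out of a chatty completion. A rough sketch (deliberately naive: it would miscount braces that appear inside JSON strings):

```python
import json

def extract_json(text: str) -> dict:
    """Return the first balanced {...} object in text, ignoring any
    intro or conclusion chatter the model wrapped around it."""
    start = text.index("{")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    raise ValueError("no balanced JSON object found")

reply = 'Sure! Here is your JSON:\n{"status": "ok"}\nLet me know if you need more.'
print(extract_json(reply))  # {'status': 'ok'}
```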
Asking even a top-notch LLM to output well-formed JSON simply fails sometimes. And when you’re running LLMs at high volume in the background, you can’t use the best available model until the last mile.
You work around it with post-processing and retries. But it’s still a bit brittle given how much happens downstream without supervision.
Constrained output with GBNF or JSON is much more efficient and less error-prone. I hope nobody outside of hobby projects is still using error/retry loops.
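For readers who haven't run into GBNF: it is the grammar format llama.cpp uses to constrain generation. A minimal grammar that forces the model to emit exactly an object of the form `{"count": <integer>}` looks roughly like this (rule names are arbitrary):

```
root ::= "{" ws "\"count\"" ws ":" ws int ws "}"
int  ::= "-"? ("0" | [1-9] [0-9]*)
ws   ::= [ \t\n]*
```

With this in place, tokens that would break the structure simply cannot be sampled, so no retry loop is needed.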
Constraining output means you don’t get to use ChatGPT or Claude, though, and now you have to run your own stuff. Maybe that’s OK for some folks, but it’s really annoying for others.
You're totally right, I'm in my own HPC bubble. The organizations I work with create their own models and it's easy for me to forget that's the exception more than the rule. I apologize for making too many assumptions in my previous comment.
Out of curiosity- do those orgs not find the loss of generality that comes from custom models to be an issue? e.g. vs using Llama or Mistral or some other open model?
I do wonder why, though. Constraining output based on logits is a fairly simple and easy-to-implement idea, so why is this not part of e.g. the OpenAI API yet? They don't even have to expose it at the lowest level, just use it to force valid JSON in the output on their end.
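The logit-based constraining mentioned here really is simple in principle: before sampling each token, set the logits of every token the grammar disallows to negative infinity, so only valid tokens can be chosen. A toy illustration with an invented seven-token vocabulary:

```python
import math

VOCAB = ["{", "}", '"count"', ":", "0", "1", "2"]

def mask(logits: list, allowed: set) -> list:
    """Disallowed tokens get -inf, so they can never be sampled."""
    return [l if tok in allowed else -math.inf
            for tok, l in zip(VOCAB, logits)]

def greedy(logits: list) -> str:
    return VOCAB[logits.index(max(logits))]

# Suppose the grammar says the next token must be a digit. Even though the
# raw logits favor "}", masking forces a digit.
raw = [0.1, 2.0, 0.1, 0.1, 0.5, 0.9, 0.2]
constrained = mask(raw, allowed={"0", "1", "2"})
print(greedy(raw))          # }
print(greedy(constrained))  # 1
```

The hard part in practice is not the masking itself but tracking, at every step, which token set the grammar currently allows.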
It’s significantly easier to output an integer than a JSON object with a key-value structure where the value is an integer and everything else is exactly as desired.
That's because you've dumbed down the problem. If it was just about outputting one integer, there would be nothing to discuss. Now add a bunch more fields, add some nesting and other constraints into it...
The more complexity you add, the less likely the LLM is to give you a valid response in one shot. It’s still going to be easier to get the LLM to supply values to a fixed schema than to get the LLM to give both the answers and the schema.
The best available models actually have the fewest knobs for JSON schema enforcement (i.e. OpenAI's JSON mode, which can technically still produce incorrect JSON).
If you're using anything less, you should have a grammar that enforces exactly which tokens are allowed to be output. Fine-tuning can help too, in case you're worried about the effects of constraining the generation, but in my experience that's not really a thing.
I only became aware of it recently and therefore haven’t done more than play with in a fairly cursory way, but unstructured.io seems to have a lot of traction and certainly in my little toy tests their open-source stuff seems pretty clearly better than the status quo.
> Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule
This is really interesting, is there any architecture documentation/articles that you can recommend?
I'm late to this party, but here's a post I wrote about it. This is more motivation but we are working on technical posts/papers for release. Happy to field emails in the meantime if this is timely for you.
We have a company mail, fax, and phone room that receives thousands of pages a day that now sorts, categorizes, and extracts useful information from them all in a completely automated way by LLMs. Several FTEs have been reassigned elsewhere as a result.
It certainly has use cases, just not as many as the hype led people to believe.
For me:
- Regex: ChatGPT is the best multi-million regex parser to date.
- Grammar and semantic checking: it's a very good revision tool; it has helped me a lot of times, especially when writing in non-native languages.
- Artwork inspiration: not only visual inspiration, in the case of image generators, but descriptive inspiration as well. The verbosity of some LLMs can help describe things in more detail than a person would.
- General coding: while your mileage may vary on this one, it has helped me a lot at work building things in languages I'm not very familiar with. Just snippets, nothing big.
I have a friend who uses ChatGPT for writing quick policy statement for her clients (mostly schools). I have a friend who uses it to create images and descriptions for DnD adventures. LLMs have uses.
The problem I see is: how can an "application" be anything but a little window onto the base abilities of ChatGPT, and so effectively offer nothing more to an end user? The final result still has to be checked, and regular end users have to write their own prompts.
Edit: I should also say that anyone who's designing LLM apps that, rather than being end-user tools, are effectively gatekeepers to getting action or "a human" from a company deserves a big "f* you", 'cause that approach is evil.
I think it comes down to relatively unexciting use cases that have a high business impact (process automation, RPA, data analysis), not fancy chatbots or generative art.
For example, we focused on the boring and hard task of web data extraction.
Traditional web scraping is labor-intensive, error-prone, and requires constant updates to handle website changes. It's repetitive and tedious, but couldn't be automated due to the high data diversity and many edge cases. This required a combination of rule-based tools, developers, and constant maintenance.
We're now using LLMs to generate web scrapers and data transformation steps on the fly that adapt to website changes, automating the full process end-to-end.
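The "LLM generates the scraper" idea can be sketched as below. Everything here is an assumption for illustration: `stub_llm_selectors` stands in for a model prompted with a page sample and asked for field-to-selector mappings, and the `(tag, class)` spec format is invented; the key point is that the model produces the extraction spec, while the actual extraction stays deterministic.

```python
from html.parser import HTMLParser

def stub_llm_selectors(html_sample: str) -> dict:
    """Placeholder: a real system would prompt a model with the page
    sample and parse its JSON answer. When the site changes, you
    re-ask the model instead of hand-fixing selectors."""
    return {"price": ("span", "price"), "title": ("h1", "title")}

class Extractor(HTMLParser):
    """Apply a field -> (tag, class) spec with ordinary parsing code."""
    def __init__(self, spec):
        super().__init__()
        self.spec, self.out, self._field = spec, {}, None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        for field, (t, c) in self.spec.items():
            if tag == t and c in cls.split():
                self._field = field

    def handle_data(self, data):
        if self._field:
            self.out[self._field] = data.strip()
            self._field = None

html = '<h1 class="title">Widget</h1><span class="price">$9.99</span>'
ex = Extractor(stub_llm_selectors(html))
ex.feed(html)
print(ex.out)  # {'title': 'Widget', 'price': '$9.99'}
```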
If you’re interested in trying one of the LLM applications I have in prod, check out https://hex.tech/product/magic-ai/ It has a free limit every month so you can give it a try and see how you like it. If you have feedback after using it, we’re always very interested to hear from users.