Are you trying to suggest that this is an example of a planned economy? Maybe you should look at definitions of planned vs market economies. You still have design and regulation in a market economy.
Why would a country like China or India care whether it were a viable treatment? Unless a country wants to use its population as lab rats, it takes money and scientists to actually confirm a treatment is safe and effective.
Right, this result seems meaningless without a human clinician control.
I'd very much like to see clinicians randomly selected from BetterHelp, paid to interact with the LLM patient in the same way, and scored by the LLM judge, exactly as the current methodology does. And see what score they get.
Ideally this should be done blind (I don't know if BetterHelp allows therapy through a text chat interface?), where the therapist has no idea it's for a study and so isn't trying to "do better" than they would for any average client.
Because while I know a lot of people for whom therapy has been life-changing, I also know of a lot of terrible and even unprofessional therapy experiences.
The results are not meaningless, but they are not comparing humans against LLMs. The goal is to have something that can be used to test LLMs on realistic mental health support.
The main points of our methodology are:
1) prove that it is possible to simulate patients with an LLM. Which we did.
2) prove that an LLM-as-a-judge can effectively score conversations along several dimensions, similar to how clinicians are also evaluated. Which we also did: we show that the average correlation with human evaluators is medium-high.
Given 1) and 2), we can then benchmark LLMs, and as you see, there is plenty of room for improvement. We did not claim anything regarding human performance... it's likely that human performance also needs to improve :) that's another study
So the results are meaningful in terms of establishing that LLM therapeutic performance can be evaluated.
But not meaningful in terms of comparing LLMs with human clinicians.
So in that case, how can you justify the title you used for submission, "New benchmark shows top LLMs struggle in real mental health care"?
How are they struggling? Struggling relative to what? For all your work shows, couldn't they be outperforming the average human? Or even if they're below that, couldn't they still have a large net positive effect with few negative outcomes?
I don't understand where the negative framing of your title is coming from.
LLMs have room for improvement (we show that their scores are medium-low on several dimensions).
Maybe the average human also has lots of room for improvement. One thing does not necessarily depend on the other.
the same way we can say that LLMs still have room for improvement on a specific task (let's say mathematics) while the average human is also bad at mathematics...
We don't make any claims about human therapists. Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy. Showing this is the first step to improving them.
But you chose the word "struggle". And now you say:
> Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy.
That implies they're not currently good at therapy. But you haven't shown that, have you? How are you defining that a score of 4 isn't already "good"? How do you know that isn't already correlated with meaningfully improved outcomes, and therefore already "good"?
Everybody has room for improvement if you say 6 is perfection and something isn't reaching 6 on average. But that doesn't mean everybody's struggling.
I take no issue with your methodology. But your broader framing, and title, don't seem justified or objective.
> Right, this result seems meaningless without a human clinician control.
> I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Does it really matter? Per the OP:
>>> Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20).
I'd assume a real therapy session has far more "turns" than 20-40, and if model performance starts low and gets lower as conversations lengthen, it's reasonable to expect it would be worse than a human (who typically doesn't become increasingly unhinged the longer you talk to them).
> Betterhelp is a nightmare for clients and therapists alike. Their only mission seems to be in making as much money as possible for their shareholders. Otherwise they don't seem at all interested in actually helping anyone. Stay away from Betterhelp.
So taking it as a baseline would bias any experiment against human therapists.
Yes, it absolutely does matter. Look at what you write:
> I'd assume
> it's reasonable to expect
The whole reason to do a study is to actually study as opposed to assume and expect.
And for many of the kinds of people engaging in therapy with an LLM, BetterHelp is precisely where they are most likely to go due to its marketing, convenience, and price. It's where a ton of real therapy is happening today. Most people do not have a $300/hr. high-quality therapist nearby that is available and that they can afford. LLMs need to be compared, first, to the alternatives that are readily available.
And remember that all therapists on BetterHelp are licensed, with a master's or doctorate, and meet state board requirements. So I don't understand why that wouldn't be a perfectly reasonable baseline.
> I love how the top comment on that Reddit post is an affiliate link to an online therapy provider.
Posted 6 months after the post and all the rest of the comments. It's some kind of SEO manipulation. That reddit thread ranked highly in my Google search about Betterhelp being bad, so they're probably trying to piggyback on it.
I’m not against affiliate links. I’m just pro-disclosure, especially for something as important as therapy, and it seems like maybe you should mention you make $150 for each person that signs up.
This is a good point. We have not tested the clinicians, but I believe they would not score each other perfectly: we observed some disagreement between the scores as well, which reflects differing opinions among clinicians.
It is nice to have an accurate measure of things and a human baseline would be additionally helpful too.
Many things can be useful before they reach the level of world's best. Although with AI, non-intuitive failure modes must be taken into consideration too.
Rails tries to integrate more tightly with the front end, which has caused a lot of churn over the years. Django projects from 10 years ago are still upgradable in a day or two. Rails does include some nice stuff, though, but I much prefer Django's code-first database models to Rails' ActiveRecord.
Those Django models are a pain to work with if you have to access the database with any tool other than the original Django app. The only sane way to design a database managed by Django models and migrations is to avoid any inheritance between models, or you'll end up with a number of tables, each one adding a few fields. The Django ORM will join them for you, but you are on your own if you ever have to write queries with some other tool.
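For readers who haven't hit this: schematically, Django's multi-table inheritance puts the parent's fields in one table and only the child's extra fields in a second table keyed by the child's primary key. A minimal sketch in plain sqlite3 (table and column names are illustrative, not Django's exact generated output):

```python
import sqlite3

# Two tables standing in for an inherited model pair: the parent holds
# the shared fields, the child holds only its extra column plus a
# pointer to the parent row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_place (
    id INTEGER PRIMARY KEY,
    name TEXT,
    address TEXT
);
CREATE TABLE app_restaurant (
    place_ptr_id INTEGER PRIMARY KEY REFERENCES app_place(id),
    serves_pizza INTEGER
);
INSERT INTO app_place VALUES (1, 'Luigis', '12 Main St');
INSERT INTO app_restaurant VALUES (1, 1);
""")

# The ORM hides this JOIN; any other tool has to write it by hand.
row = conn.execute("""
    SELECT p.name, p.address, r.serves_pizza
    FROM app_restaurant r
    JOIN app_place p ON p.id = r.place_ptr_id
""").fetchone()
print(row)
```

With two or three levels of inheritance, every hand-written query against the child grows another JOIN, which is exactly the maintenance burden described above.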
I do agree that Rails' asset stuff has been a giant pain over the years and has not kept up well. On the other hand, some apps that adopted a separate Rails API and a separate (for example, React) frontend have been fine. You're right, though, that their opinion here added more headaches than necessary!
I’d agree in that I never liked things like CoffeeScript, but I think that Rails’ frontend solution since 7.0, Hotwire, has been excellent.
Being able to sprinkle just enough JavaScript onto server-rendered HTML works really well. That you can now use it in iOS and Android apps too makes it a simpler alternative to React, IMHO.
I still much prefer server rendering in a monolith than dealing with GraphQL, backend for frontend and the complexity of micro services and distributed transactions.
BTW Hotwire and Hotwire Native are also options for Django too.
Will there be a large-dose trial? I imagine that’s going to be more difficult, as the weight-loss effect of GLP-1 means you can’t include frail patients in your target group. And GLP-1 at high doses also has some unfavourable side-effects: nausea, vomiting, diarrhea, constipation.
It doesn't say if a million drones are going to be purchased from a defense contractor. Hopefully it goes to a commercial US drone company that makes drones for consumers, film, inspections, etc., with an order of a million military-hardened drones from the Government. There would be an expectation that they could tool up to many millions in a time of conflict.
Defense contractors already cover small batches of super-specialized drones.
As a long-time Django user, I would not use Django for this. Django async is probably never the right choice for a green-field project. I would still pick FastAPI/SQLAlchemy over Express and PostHog. There is no way 15 different Node ORMs are going to survive in the long run, plus Drizzle and Prisma seem to be the leaders for now.
FastAPI/SQLAlchemy won’t be more scalable than a typical Django setup. The real bottleneck is the threading model, not the few microseconds the framework spends before handing off to user code. Django running under uWSGI with green threads can outperform Go-based services in some scenarios, largely thanks to how efficient Python’s C ABI is compared to Go.
Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
If one says, "we don't use an ORM", you will incrementally create helper functions for pulling the data into your language to tweak the data or to build optional filters and thus will have an ad hoc, informally-specified, bug-ridden, slow implementation of half of an ORM.
There is a time and place for direct SQL code and there is a time and place for an ORM. Normally I use an ORM that has a great escape hatch for raw SQL as needed.
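The creep is easy to sketch. A toy example (all names made up) of the kind of helper that accumulates optional filters, parameter binding, and row-to-dict mapping until it is, in effect, a small query builder:

```python
import sqlite3

def find_users(conn, *, name=None, min_age=None, order_by=None):
    """The kind of helper that grows one keyword argument at a time
    until it has reimplemented half of an ORM's query builder."""
    sql = "SELECT id, name, age FROM users"
    clauses, params = [], []
    if name is not None:
        clauses.append("name = ?")
        params.append(name)
    if min_age is not None:
        clauses.append("age >= ?")
        params.append(min_age)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if order_by is not None:
        # ...and now you need identifier validation/escaping too
        sql += f" ORDER BY {order_by}"
    cols = ["id", "name", "age"]
    return [dict(zip(cols, row)) for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                 [("ada", 36), ("bob", 17), ("cy", 44)])
print(find_users(conn, min_age=18, order_by="age"))
```

None of this is wrong, but multiply it across twenty entities and you have the ad hoc, informally-specified half-ORM the quote predicts.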
I've always used SQL directly since I stopped using ORMs, and it didn't result in a halfway implemented ORM. Maybe back when there was no jsonb for your blob o' fields cases, it was different.
But yeah don't do a high level lang's job in C or C++
The main advantage of an ORM isn’t query building but its deep integration with the rest of the ecosystem.
In Django, you can change a single field in a model, and that update automatically cascades through to database migrations, validations, admin panels, and even user-facing forms in the HTML.
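A toy miniature of that single-source-of-truth idea (this is not Django's actual machinery, just an illustration of why the cascade is valuable): one field definition drives both validation and the rendered form, so changing the definition updates both at once.

```python
# Toy illustration only: mimics the single-source-of-truth pattern that
# Django implements via model fields, migrations, admin, and forms.
FIELDS = {
    "email": {"type": str, "max_length": 254, "required": True},
    "age":   {"type": int, "required": False},
}

def validate(data):
    """Check submitted data against the one shared field definition."""
    errors = {}
    for name, spec in FIELDS.items():
        value = data.get(name)
        if value is None:
            if spec["required"]:
                errors[name] = "required"
            continue
        if not isinstance(value, spec["type"]):
            errors[name] = "wrong type"
        elif "max_length" in spec and len(value) > spec["max_length"]:
            errors[name] = "too long"
    return errors

def render_form():
    """The same definition drives the user-facing form, so adding a
    field to FIELDS updates validation and HTML together."""
    widget = {str: "text", int: "number"}
    return "\n".join(
        f'<input name="{name}" type="{widget[spec["type"]]}"'
        f'{" required" if spec["required"] else ""}>'
        for name, spec in FIELDS.items()
    )

print(validate({"email": "a@b.co"}))
print(render_form())
```

Django does this at a much larger scale (migrations and the admin panel are derived from the same model metadata), which is what makes a one-line field change propagate everywhere.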
I'd have to try this for myself before judging it. Apple's CoreData tried and miserably failed to do this, and I wasn't fond of the Laravel ORM either, but Django is probably a better example than those.
The steam generator that the fusion generator connects to might be more expensive than solar at this point. Even if fusion cost nothing and had infinite amounts of fuel, there would be no customers for its energy on a sunny afternoon.
People are overexcited about sodium-ion batteries. They are at least years away from price parity. The super-low numbers floating around are absolute fantasy until production is in the tens of gigawatt-hours at least. Their real value is as a hedge on lithium prices. If large battery manufacturers can trivially reconfigure their lines to make sodium-ion batteries, that will be a giant check on large lithium price spikes.
Not until they actually make them "in volume". They could be ramping up volumes for years and years until they hit that price. When they start producing them, I would bet anything the initial run will not be $19/kWh.
Fair enough, I think that price is a while in the future but from another article:
>In the meantime, CATL’s rival BYD said that its sodium-ion batteries have made progress in reducing cost and are already on track to be on par with lithium iron phosphate battery cost next year and even 70% less in the long run. The Chinese battery maker broke ground on a 30 GWh sodium-ion battery factory earlier this year.
It seems as if half the questions are political hot-button issues. While slightly interesting, this does not represent how these AIs would do on drier news items. Some of these questions are more appropriate for deep-research modes than quick answers, since even legitimate news sources are filled with opinions on the actual answers.