Are you trying to suggest that this is an example of a planned economy? Maybe you should look at definitions of planned vs market economies. You still have design and regulation in a market economy.
Why would a country like China or India care whether it were a viable treatment? Unless a country wants to use its population as lab rats, it takes money and scientists to actually confirm a treatment is safe and effective.
Right, this result seems meaningless without a human clinician control.
I'd very much like to see clinicians randomly selected from BetterHelp, paid to interact with the LLM patient in the same way, and scored by the LLM judge, exactly as the current methodology does. And see what score they get.
Ideally this should be done blind (I don't know if BetterHelp allows therapy through a text chat interface?), where the therapist has no idea it's for a study and so isn't trying to "do better" than they would for any average client.
Because while I know a lot of people for whom therapy has been life-changing, I also know of a lot of terrible and even unprofessional therapy experiences.
The results are not meaningless, but they are not comparing humans against LLMs. The goal is to have something that can be used to test LLMs on realistic mental health support.
The main points of our methodology are:
1) prove that it is possible to simulate patients with an LLM. Which we did.
2) prove that an LLM-as-a-judge can effectively score conversations along several dimensions, similar to how clinicians are also evaluated. Which we also did: we show that the average correlation with human evaluators is medium-high.
Given 1) and 2), we can then benchmark LLMs, and as you see, there is plenty of room for improvement. We did not claim anything regarding human performance... it's likely that human performance also needs to improve :) that's another study
So the results are meaningful in terms of establishing that LLM therapeutic performance can be evaluated.
But not meaningful in terms of comparing LLMs with human clinicians.
So in that case, how can you justify the title you used for submission, "New benchmark shows top LLMs struggle in real mental health care"?
How are they struggling? Struggling relative to what? For all your work shows, couldn't they be outperforming the average human? Or even if they're below that, couldn't they still have a large net positive effect with few negative outcomes?
I don't understand where the negative framing of your title is coming from.
LLMs have room for improvement (we show that their scores are medium-low on several dimensions).
Maybe the average human also has lots of room for improvement. One thing does not necessarily depend on the other.
the same way we can say that LLMs still have room for improvement on a specific task (let's say mathematics) while the average human is also bad at mathematics...
We don't make any claims about human therapists. Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy. Showing this is the first step to improving them.
But you chose the word "struggle". And now you say:
> Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy.
That implies they're not currently good at therapy. But you haven't shown that, have you? How are you defining that a score of 4 isn't already "good"? How do you know that isn't already correlated with meaningfully improved outcomes, and therefore already "good"?
Everybody has room for improvement if you say 6 is perfection and something isn't reaching 6 on average. But that doesn't mean everybody's struggling.
I take no issue with your methodology. But your broader framing, and title, don't seem justified or objective.
> Right, this result seems meaningless without a human clinician control.
> I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Does it really matter? Per the OP:
>>> Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20).
I'd assume a real therapy session has far more "turns" than 20-40, and if model performance starts low and gets lower as conversations lengthen, it's reasonable to expect it would be worse than a human (who typically doesn't become increasingly unhinged the longer you talk to them).
> Betterhelp is a nightmare for clients and therapists alike. Their only mission seems to be in making as much money as possible for their shareholders. Otherwise they don't seem at all interested in actually helping anyone. Stay away from Betterhelp.
So taking it as a baseline would bias any experiment against human therapists.
Yes, it absolutely does matter. Look at what you write:
> I'd assume
> it's reasonable to expect
The whole reason to do a study is to actually study as opposed to assume and expect.
And for many of the kinds of people engaging in therapy with an LLM, BetterHelp is precisely where they are most likely to go due to its marketing, convenience, and price. It's where a ton of real therapy is happening today. Most people do not have a $300/hr. high-quality therapist nearby that is available and that they can afford. LLMs need to be compared, first, to the alternatives that are readily available.
And remember that all therapists on BetterHelp are licensed, with a master's or doctorate, and meet state board requirements. So I don't understand why that wouldn't be a perfectly reasonable baseline.
> I love how the top comment on that Reddit post is an affiliate link to an online therapy provider.
Posted 6 months after the post and all the rest of the comments. It's some kind of SEO manipulation. That reddit thread ranked highly in my Google search about Betterhelp being bad, so they're probably trying to piggyback on it.
I’m not against affiliate links. I’m just pro-disclosure, especially for something as important as therapy, and it seems like maybe you should mention you make $150 for each person that signs up.
This is a good point. We have not tested the clinicians, but I believe they would not score each other perfectly: we observed some disagreement between the scores as well, which reflects differing opinions among clinicians.
It is nice to have an accurate measure of things and a human baseline would be additionally helpful too.
Many things can be useful before they reach the level of world's best. Although with AI, non-intuitive failure modes must be taken into consideration too.
Rails tries to integrate more tightly with the front end, which has caused a lot of churn over the years. Django projects from 10 years ago are still upgradable in a day or two. Rails does include some nice stuff, though, but I much prefer Django's code-first database models to Rails' ActiveRecord.
Those Django models are a pain to work with if you have to access the database with any tool other than the original Django app. The only sane way to design a database managed by Django models and migrations is to avoid any inheritance between models, or you'll end up with a number of tables, each one adding a few fields. The Django ORM will join them for you, but you are on your own if you ever have to write queries with some other tool.
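For readers who haven't hit this: schematically, Django's multi-table inheritance puts the parent's fields in one table and only the child's extra fields in a second table keyed by the child's primary key. A minimal sketch in plain sqlite3 (table and column names are illustrative, not Django's exact generated output):

```python
import sqlite3

# Two tables standing in for an inherited model pair: the parent holds
# the shared fields, the child holds only its extra column plus a
# pointer to the parent row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_place (
    id INTEGER PRIMARY KEY,
    name TEXT,
    address TEXT
);
CREATE TABLE app_restaurant (
    place_ptr_id INTEGER PRIMARY KEY REFERENCES app_place(id),
    serves_pizza INTEGER
);
INSERT INTO app_place VALUES (1, 'Luigis', '12 Main St');
INSERT INTO app_restaurant VALUES (1, 1);
""")

# The ORM hides this JOIN; any other tool has to write it by hand.
row = conn.execute("""
    SELECT p.name, p.address, r.serves_pizza
    FROM app_restaurant r
    JOIN app_place p ON p.id = r.place_ptr_id
""").fetchone()
print(row)
```

With two or three levels of inheritance, every hand-written query against the child grows another JOIN, which is exactly the maintenance burden described above.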
I do agree that Rails' asset stuff has been a giant pain over the years and has not kept up well. On the other hand, some apps that adopted a separate Rails API and a separate (for example, React) frontend have been fine. You're right, though, that their opinion here added more headaches than necessary!
I’d agree in that I never liked things like CoffeeScript, but I think that Rails’ frontend solution since 7.0, Hotwire, has been excellent.
Being able to sprinkle just enough JavaScript onto server-rendered HTML works really well. That you can now use it in iOS and Android apps too makes it a simpler alternative to React, IMHO.
I still much prefer server rendering in a monolith than dealing with GraphQL, backend for frontend and the complexity of micro services and distributed transactions.
BTW Hotwire and Hotwire Native are also options for Django too.
Will there be a large-dose trial? I imagine that’s going to be more difficult, as the weight-loss effect of GLP-1 means you can’t include frail patients in your target group. And GLP-1 at high doses also has some unfavourable side-effects: nausea, vomiting, diarrhea, constipation.
It doesn't say if a million drones are going to be purchased from a defense contractor. Hopefully it goes to a commercial US drone company that makes drones for consumers, film, inspections, etc., with an order of a million military-hardened drones from the Government. There would be an expectation that they could tool up to many millions in a time of conflict.
Defense contractors already cover small batches of super-specialized drones.
As a long-time Django user, I would not use Django for this. Django async is probably never the right choice for a green-field project. I would still pick FastAPI/SQLAlchemy over Express and PostHog. There is no way 15 different Node ORMs are going to survive in the long run, plus Drizzle and Prisma seem to be the leaders for now.
FastAPI/SQLAlchemy won’t be more scalable than a typical Django setup. The real bottleneck is the threading model, not the few microseconds the framework spends before handing off to user code. Django running under uWSGI with green threads can outperform Go-based services in some scenarios, largely thanks to how efficient Python’s C ABI is compared to Go.
Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
If one says, "we don't use an ORM", you will incrementally create helper functions for pulling the data into your language to tweak the data or to build optional filters and thus will have an ad hoc, informally-specified, bug-ridden, slow implementation of half of an ORM.
There is a time and place for direct SQL code and there is a time and place for an ORM. Normally I use an ORM that has a great escape hatch for raw SQL as needed.
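The creep is easy to sketch. A toy example (all names made up) of the kind of helper that accumulates optional filters, parameter binding, and row-to-dict mapping until it is, in effect, a small query builder:

```python
import sqlite3

def find_users(conn, *, name=None, min_age=None, order_by=None):
    """The kind of helper that grows one keyword argument at a time
    until it has reimplemented half of an ORM's query builder."""
    sql = "SELECT id, name, age FROM users"
    clauses, params = [], []
    if name is not None:
        clauses.append("name = ?")
        params.append(name)
    if min_age is not None:
        clauses.append("age >= ?")
        params.append(min_age)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if order_by is not None:
        # ...and now you need identifier validation/escaping too
        sql += f" ORDER BY {order_by}"
    cols = ["id", "name", "age"]
    return [dict(zip(cols, row)) for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                 [("ada", 36), ("bob", 17), ("cy", 44)])
print(find_users(conn, min_age=18, order_by="age"))
```

None of this is wrong, but multiply it across twenty entities and you have the ad hoc, informally-specified half-ORM the quote predicts.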
I've always used SQL directly since I stopped using ORMs, and it didn't result in a halfway implemented ORM. Maybe back when there was no jsonb for your blob o' fields cases, it was different.
But yeah don't do a high level lang's job in C or C++
The main advantage of an ORM isn’t query building but its deep integration with the rest of the ecosystem.
In Django, you can change a single field in a model, and that update automatically cascades through to database migrations, validations, admin panels, and even user-facing forms in the HTML.
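A toy miniature of that single-source-of-truth idea (this is not Django's actual machinery, just an illustration of why the cascade is valuable): one field definition drives both validation and the rendered form, so changing the definition updates both at once.

```python
# Toy illustration only: mimics the single-source-of-truth pattern that
# Django implements via model fields, migrations, admin, and forms.
FIELDS = {
    "email": {"type": str, "max_length": 254, "required": True},
    "age":   {"type": int, "required": False},
}

def validate(data):
    """Check submitted data against the one shared field definition."""
    errors = {}
    for name, spec in FIELDS.items():
        value = data.get(name)
        if value is None:
            if spec["required"]:
                errors[name] = "required"
            continue
        if not isinstance(value, spec["type"]):
            errors[name] = "wrong type"
        elif "max_length" in spec and len(value) > spec["max_length"]:
            errors[name] = "too long"
    return errors

def render_form():
    """The same definition drives the user-facing form, so adding a
    field to FIELDS updates validation and HTML together."""
    widget = {str: "text", int: "number"}
    return "\n".join(
        f'<input name="{name}" type="{widget[spec["type"]]}"'
        f'{" required" if spec["required"] else ""}>'
        for name, spec in FIELDS.items()
    )

print(validate({"email": "a@b.co"}))
print(render_form())
```

Django does this at a much larger scale (migrations and the admin panel are derived from the same model metadata), which is what makes a one-line field change propagate everywhere.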
I'd have to try this for myself before judging it. Apple's CoreData tried and miserably failed to do this, and I wasn't fond of the Laravel ORM either, but Django is probably a better example than those.
The steam generator that the fusion generator connects to might be more expensive than solar at this point. Even if fusion cost nothing and had infinite amounts of fuel, there would be no customers for its energy on a sunny afternoon.
People are overexcited about sodium-ion batteries. They are at least years away from price parity. The super-low numbers floating around are absolute fantasy until production is in the tens of gigawatt-hours at least. Their real value is as a hedge on lithium prices. If large battery manufacturers can trivially reconfigure their lines to make sodium-ion batteries, that will be a giant check on large lithium price spikes.
Not until they actually make them "in volume". They could be ramping up volumes for years and years until they hit that price. When they start producing them, I would bet anything the initial run will not be $19/kWh.
Fair enough, I think that price is a while in the future but from another article:
>In the meantime, CATL’s rival BYD said that its sodium-ion batteries have made progress in reducing cost and are already on track to be on par with lithium iron phosphate battery cost next year and even 70% less in the long run. The Chinese battery maker broke ground on a 30 GWh sodium-ion battery factory earlier this year.
It seems as if half the questions are political hot-button issues. While slightly interesting, this does not represent how these AIs would do on drier news items. Some of these questions are more appropriate for deep-research modes than quick answers, since even legitimate news sources are filled with opinions on the actual answers.