The association between pathogens and cancer is under-appreciated, mostly due to limitations in detection methods.
For instance, it is not uncommon for cancer studies to design assays around non-oncogenic strains, or for assays to use primer sequences with binding sites mismatched to a large number of NCBI GenBank genomes.
Another example: studies relying on The Cancer Genome Atlas (TCGA), a rich database for cancer investigations. TCGA made a deliberate tradeoff: it standardized quantification of eukaryotic coding transcripts at the cost of excluding non-poly(A) transcripts like EBER1/2 and other viral non-coding RNAs -- thus potentially understating viral presence.
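As a toy illustration of the primer point (not any study's actual pipeline; in practice you'd run BLAST or Primer-BLAST against GenBank, and the primer and genome strings below are made up):

    # Naive check: how many genomes still contain a binding site for a primer,
    # allowing a small number of mismatches. Reverse complements are ignored
    # for simplicity; real pipelines use alignment tools, not substring scans.
    def binds(primer: str, genome: str, max_mismatches: int = 1) -> bool:
        k = len(primer)
        for i in range(len(genome) - k + 1):
            mismatches = sum(a != b for a, b in zip(primer, genome[i:i + k]))
            if mismatches <= max_mismatches:
                return True
        return False

    primer = "ATGCGTACGTTAGC"  # hypothetical primer
    genomes = {
        "strain_A": "TTATGCGTACGTTAGCAA",  # exact binding site
        "strain_B": "TTATGCGTACGATAGCAA",  # one mismatch
        "strain_C": "TTAAGGCCTTAAGGCCTT",  # no usable site
    }
    hits = [name for name, seq in genomes.items() if binds(primer, seq)]
    print(f"{len(hits)}/{len(genomes)} genomes have a binding site")  # 2/3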
A more accurate title: "Are Cornell Students Meritocratic and Efficiency-Seeking? Evidence from 271 MBA students and 67 Undergraduate Business Students."
This topic is important and the study interesting, but the methods exhibit the same generalizability bias as the famous Dunning-Kruger study.
The referenced MBA students -- and by extension, the elites -- only reflect 271 students across two years, all from the same university.
By analyzing biased samples, we risk misguided discourse on a sensitive subject.
Even Google DeepMind's relabeled MedQA dataset, created for MedGemini in 2024, has flaws.
Many healthcare datasets/benchmarks contain dirty data because accuracy incentives are absent and few annotators are qualified.
We had to pay Stanford MDs to annotate 900 new questions to evaluate frontier models and will release these as open source on Hugging Face for anyone to use. They cover VQA and specialties like neurology, pediatrics, and psychiatry.
If labs want early access, please reach out. (Info in profile.) We are finalizing the dataset format.
Unlike general LLMs, where noise is tolerable and sometimes even desirable, training on incorrect/outdated information may cause clinical errors, misfolded proteins, or drugs with off-target effects.
Complicating matters, shifting medical facts may invalidate training data and model knowledge. What was true last year may be false today. For instance, in April 2024 the U.S. Preventive Services Task Force reversed its longstanding advice and now urges biennial mammograms starting at age 40 -- down from the previous benchmark of 50 -- for average-risk women, citing rising breast-cancer incidence in younger patients.
This is true for every subfield I have been working on for the past 10 years. The dirty secret of ML research is that Sturgeon's law applies to datasets as well - 90% of the data out there is crap. I have seen NLP datasets with hundreds of citations that were obviously worthless as soon as you put the "effort" in and actually looked at the samples.
100% agreed. I also advise you not to read many cancer papers, particularly ones investigating viruses and cancer. You would be horrified.
(To clarify: this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth. Hug cancer researchers. They have accomplished an incredible amount while being handcuffed and tasked with decoding the most complex operating system ever designed.)
> this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth
Are scientists not writing those papers? There may be bad incentives, but scientists are responding to those incentives.
That is axiomatically true, but both harsh and useless, given that (as I understand from HN articles and comments) the choice is "play the publishing game as it is" vs "don't be a scientist anymore".
I agree, but there is an important side-effect of this statement: it's possible to criticize science without criticizing scientists. Or at least without criticizing rank-and-file scientists.
There are many political issues where activists claim "the science has spoken." When critics respond by saying, "the science system is broken and is spitting out garbage", we have to take those claims very seriously.
That doesn't mean the science is wrong. Even though the climate science system is far from perfect, climate change is real and human-made.
On the other hand, some of the science on gender medicine is not as established as medical associations would have us believe (though this might change in a few years). But that doesn't stop reputable science groups from making false claims.
If we’re not going to hold any other sector of the economy personally responsible for responding to incentives, I don’t know why we’d start with scientists. We’ve excused folks working for Palantir around here - is it that the scientists aren’t getting paid enough for selling out, or are we just throwing rocks in glass houses now?
Valid critique, but one addressing a problem above the ML layer, at the human layer. :)
That said, your comment has an implication: in which fields can we trust data if incentives are poor?
For instance, many Alzheimer's papers were undermined after journalists unmasked foundational research as academic fraud. Which conclusions are reliable and which are questionable? Who should decide? Can we design model architectures and training to grapple with this messy reality?
These are hard questions.
ML/AI should help shield future generations of scientists from poor incentives by maximizing experimental transparency and reproducibility.
Apt quote from Supreme Court Justice Louis Brandeis: "Sunlight is the best disinfectant."
Not an answer, but a contributory idea: meta-analysis. There are plenty of strong meta-analyses out there, and one of the things they tend to do is weight the methodological rigour of the papers along with their overlap with the combined question being analyzed. Could we use this weighting explicitly in the training process?
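A rough sketch of what that explicit weighting could look like during training (per-sample cross-entropy scaled by a hypothetical rigor score per source paper; nothing here is an established method, just the mechanics):

    import torch
    import torch.nn.functional as F

    def rigor_weighted_loss(logits, labels, rigor_scores):
        # per-sample loss, scaled by how methodologically rigorous the source paper is
        per_sample = F.cross_entropy(logits, labels, reduction="none")
        weights = rigor_scores / rigor_scores.sum()  # normalize so the scale stays stable
        return (weights * per_sample).sum()

    # dummy batch: 4 samples from 4 papers, 3 classes
    logits = torch.randn(4, 3)                  # in practice: model(inputs)
    labels = torch.tensor([0, 2, 1, 1])
    rigor = torch.tensor([0.9, 0.4, 0.7, 0.2])  # hypothetical appraisal scores in [0, 1]
    print(rigor_weighted_loss(logits, labels, rigor))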
Thanks. This is helpful. Looking forward to more of your thoughts.
Some nuance:
What happens when the methods are outdated/biased? We highlight a potential case in breast cancer in one of our papers.
Worse, who decides?
To reiterate, this isn’t to discourage the idea. The idea is good and should be considered, but doesn’t escape (yet) the core issue of when something becomes a “fact.”
Scientists are responding to the incentives of a) wanting to do science, and b) wanting that science to benefit the public. There was one game in town to do this: the American public grant scheme.
This game is being undermined and destroyed by infamous anti-vaxxer, non-medical expert, non-public-policy expert RFK Jr.[1] The disastrous cuts to the NIH's public grant scheme are likely to amount to $8,200,000,000 ($8.2 billion USD) in terms of years of life lost.[2]
So, should scientists not write those papers? Should they not do science for public benefit? These are the only ways to not respond to the structure of the American public grant scheme. It seems to me that, if we want better outcomes, then we should make incremental progress to the institutions surrounding the public grant scheme. This seems far more sensible than installing Bobby Brainworms to burn it all down.
If you download datasets for classification from Kaggle or CIFAR, or for search ranking from TREC, it is the same. Typically 1-2% of the judgements in that kind of dataset are just wrong, so if you are aiming for the last few points of AUC you have to confront that.
I still want to jump off a bridge whenever someone thinks they can use the twitter post and movie review datasets to train sentiment models for use in completely different contexts.
To elaborate, errors go beyond data and reach into model design. Two simple examples:
1. Nucleotides are a form of tokenization and encode bias. They're not as raw as people assume. For example, classic FASTA treats modified and canonical C as identical, yet the difference may alter gene expression -- akin to "polish" vs. "Polish". (See the toy sketch after this list.)
2. Sickle-cell anemia and other diseases are linked to nucleotide differences. These single nucleotide polymorphisms (SNPs) mean hard attention for DNA matters and single-base resolution is non-negotiable for certain healthcare applications. Latent models have thrived in text-to-image and language, but researchers cannot blindly carry these assumptions into healthcare.
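Toy sketch of point 1, with a made-up extended alphabet where "M" stands for 5-methyl-C (classic FASTA has no such symbol, so the distinction simply vanishes at the tokenization step):

    # Character-level DNA tokenizers: the FASTA alphabet collapses methylated C
    # into plain C, so two functionally different sequences get identical token ids.
    VOCAB_FASTA = {c: i for i, c in enumerate("ACGT")}
    VOCAB_EXTENDED = {c: i for i, c in enumerate("ACGTM")}  # "M" = 5mC (hypothetical symbol)

    def tokenize(seq, vocab):
        # symbols missing from the vocab fall back to canonical C, mimicking
        # how standard FASTA records only the canonical base
        return [vocab.get(base, vocab["C"]) for base in seq]

    canonical = "ACGCT"
    methylated = "ACGMT"  # same sequence, but the fourth base is methylated

    print(tokenize(canonical, VOCAB_FASTA) == tokenize(methylated, VOCAB_FASTA))        # True: difference erased
    print(tokenize(canonical, VOCAB_EXTENDED) == tokenize(methylated, VOCAB_EXTENDED))  # False: difference kept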
There are so many open questions in biomedical AI. In our experience, confronting them has prompted (pun intended) better inductive biases when designing other types of models.
We need way more people thinking about biomedical AI.
> What was true last year may be false today. For instance, ...
Good example of a medical QA dataset shifting but not a good example of a medical "fact" since it is an opinion. Another way to think about shifting medical targets over time would be things like environmental or behavioral risk factors changing.
Anyways, thank you for putting this dataset together; we certainly need more third-party benchmarks with careful annotation. I think it would be wise to segregate tasks between factual observations of data, population-scale opinions (guidelines/recommendations), and individual-scale opinions (prognosis/diagnosis). Ideally there would eventually be some formal taxonomy for this, like OMOP CDM; maybe there already is in some dusty corner of PubMed.
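To make the suggestion concrete, a hypothetical tagging scheme along those lines (field names and categories are mine, not an existing standard):

    from dataclasses import dataclass
    from enum import Enum

    class ClaimType(Enum):
        FACTUAL_OBSERVATION = "factual"    # e.g., lab reference ranges, imaging findings
        POPULATION_OPINION = "guideline"   # e.g., screening recommendations
        INDIVIDUAL_OPINION = "judgement"   # e.g., diagnosis/prognosis for one case

    @dataclass
    class BenchmarkItem:
        question: str
        answer: str
        claim_type: ClaimType
        as_of: str  # date the answer was last verified, since guidelines shift

    item = BenchmarkItem(
        question="At what age does the USPSTF recommend starting biennial mammography?",
        answer="40",
        claim_type=ClaimType.POPULATION_OPINION,
        as_of="2024-04-30",
    )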
What if there is significant disagreement within the medical profession itself? For example, isotretinoin is prescribed for acne in many countries, but in other countries the drug is banned or access is restricted due to adverse side effects.
Wouldn't one approach be to just ensure the system has all the data: relevance to the different systems being addressed, side effects, and legal constraints? Then, when making a recommendation, it can account for all factors, not just prior use cases.
If you agree that ML starts with philosophy, not statistics, this is but one example highlighting how biomedicine helps model development, LLMs included.
Every fact is born an opinion.
This challenge exists in most, if not all, spheres of life.
I think an often overlooked aspect of training data curation is the value of accurate but oblique data. Much of the “emergent capabilities” of LLMs comes from information embedded in the data: implied or inferred semantic information that is not readily obvious. Extracting this highly useful information, in contrast to specific factoids, requires a lot of off-axis images of the problem space, like a CT scan of the field of interest. The value of adjacent, oblique datasets should not be underestimated.
I noticed this when adding citations to wikipedia.
You may find a definition of what a "skyscraper" is, by some hyperfocused association, but you'll get a bias towards a definite measurement like "skyscrapers are buildings between 700m and 3500m tall", which might be useful for some data mining project but is not at all what people mean by it.
The actual definition is not in any one source but in the way the word is used across other sources, like "the Manhattan skyscraper is one of the most iconic skyscrapers". In the aggregate you learn what it is, but that usage isn't very citable on its own, which gives WP that pedantic bias.
Synthetic data generation techniques are increasingly being paired with expert validation to scale high-quality biomedical datasets while reducing annotation burden - especially useful for rare conditions where real-world examples are limited.
I think their question is a good one, and not being taken charitably.
Let's take the medical assistant example.
> Medical assistants are unlicensed, and may only perform basic administrative, clerical and technical supportive services as permitted by law.
If they're labelling data as "tumor" or "not tumor", with any agency over the process, does that fit within their unlicensed scope? Or would that labelling be closer to a diagnosis?
What if the AI is eventually used to diagnose, based on data that was labeled by someone unlicensed? Should there need to be a "chain of trust" of some sort?
I think the answer to liability will be all on the doctor agreeing/disagreeing with the AI...for now.
To answer this, I would think we should consider other cases where someone could practice medicine without legally doing so. For example, could they tutor a student and help them? Go through unknown cases and make judgements, explaining their reasoning? As long as they don't oversell their experience in a way that might be considered fraud, I don't think this would be practicing medicine.
It does open something of a loophole. Oh, I wasn't diagnosing a friend, I was helping him label a case just like his as an educational experience. My completely IANAL guess would be that judges would look on it based on how the person is doing it, primarily if they are receiving any compensation or running it like a business.
But wait... the example the OP was talking about is doing it like a business and likely doesn't have any disclaimers properly sent to the AI, so maybe that doesn't help us decide.
A bit simpler, but if they are training the AI to answer law questions or medical questions (specific to a case, and not general), then that's what I would argue is unlicensed practice.
Of course it's the org and not the individual who would be practicing, as labelling itself is not practicing.
The author is a respected voice in tech and a good proxy of investor mindset, but the LLM claims are wrong.
They are not only unsupported by recent research trends and general patterns in ML and computing, but also by emerging developments in China, which the post even mentions.
Nonetheless, the post is thoughtful and helpful for calibrating investor sentiment.
Agreed. There is deep potential for ML in healthcare. We need more contributors advancing research in this space. One opportunity as people look around: many priors merit reconsideration.
For instance, genomic data that may seem identical may not actually be identical. In classic biological representations (FASTA), canonical cytosine and methylated cytosine are both collapsed into the letter "C" even though differences may spur differential gene expression.
What's the optimal tokenization algorithm and architecture for genomic models? How about protein binding prediction? Unclear!
There are so many open questions in biomedical ML.
The openness-impact ratio is arguably as high in biomedicine as anywhere else: if you help answer some of these questions, you could save lives.
Hopefully, awesome frameworks like this lower barriers and attract more people.
I'd love to hear more of your thoughts re open questions in biomedical ML. You sound like you have a crisp, nuanced grasp of the landscape, which is rare. That would be very helpful to me, as an undergrad in CS (with bio) trying to crystallize research to pursue in bio/ML/GenAI.
Thanks, but no one truly understands biomedicine, let alone biomedical ML.
Feynman's quote -- "A scientist is never certain" -- is apt for biomedical ML.
Context: imagine the human body as the most devilish operating system ever: 10b+ lines of code (more than merely genomics), tight coupling everywhere, zero comments. Oh, and one faulty line may cause death.
Are you more interested in data, ML, or biology (e.g., predicting cancerous mutations or drug toxicology)?
Biomedical data underlies everything and may be the easiest starting point because it's so bad/limited.
We had to pay Stanford doctors to annotate QA questions because existing datasets were so unreliable. (MCQ dataset partially released, full release coming).
For ML, MedGemma from Google DeepMind is open and at the frontier.
Biology mostly requires publishing, but still there are ways to help.
After sharing preferences, I can offer a more targeted path.
ML first, then Bio and Data. Of course, interconnectedness runs high (e.g., I just read about ML for non-random missingness in med records), and data is the foundational bottleneck/need across the board.
More like an alarming anecdote. :) Google did a wonderful job relabeling MedQA, a core benchmark, but even they missed some errors (e.g., question 448 in the test set remains wrong according to Stanford doctors).
For ML, start with MedGemma. It's a great family. 4B is tiny and easy to experiment with. Pick an area and try finetuning.
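If it helps, a minimal LoRA fine-tuning skeleton with Hugging Face transformers/peft/trl might look roughly like this. Everything below is a placeholder sketch: the model id is assumed from the model card (MedGemma is gated, and the 4B checkpoint is multimodal, so check the card for the right model class and chat template), and the dataset path and hyperparameters are made up.

    # Hypothetical LoRA supervised fine-tune; treat ids, paths, and params as placeholders.
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    model_id = "google/medgemma-4b-it"  # assumed HF id; requires accepting the license
    train_ds = load_dataset("json", data_files="my_cases.jsonl", split="train")
    # each jsonl row should carry a "text" field with the prompt + answer already formatted

    trainer = SFTTrainer(
        model=model_id,  # trl loads the model and tokenizer for you
        train_dataset=train_ds,
        peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear",
                               task_type="CAUSAL_LM"),
        args=SFTConfig(
            output_dir="medgemma-lora",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=8,
            num_train_epochs=1,
        ),
    )
    trainer.train()
    trainer.save_model("medgemma-lora")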
Note the new image encoder, MedSigLIP, which leverages another cool Google model, SigLIP. It's unclear if MedSigLIP is the right approach (open question!), but it's innovative and worth studying for newcomers. Follow Lucas Beyer, SigLIP's senior author and now at Meta. He'll drop tons of computer vision knowledge (and entertaining takes).
For bio, read 10 papers in a domain of passion (e.g., lung cancer). If you (or AI) can't find one biased/outdated assumption or method, I'll gift a $20 Starbucks gift card. (Ping on Twitter.) This matters because data is downstream of study design, and of course models are downstream of data.
Thank you both for an illuminating thread. Comments were concise, curious, and dense with information. Most notably, there was respectful disagreement and a levelheaded exchange of perspective.
To provide more color on cancers caused by viruses, the World Health Organization (WHO) estimates that 9.9% of all cancers are attributable to viruses [1].
Cancers with established viral etiology or strong association with viruses include: cervical and oropharyngeal cancers (HPV), hepatocellular carcinoma (HBV and HCV), Burkitt lymphoma and nasopharyngeal carcinoma (EBV), Kaposi sarcoma (HHV-8/KSHV), adult T-cell leukemia/lymphoma (HTLV-1), and Merkel cell carcinoma (Merkel cell polyomavirus).
Nvidia (NVDA) generates revenue with hardware, but digs moats with software.
The CUDA moat is widely underappreciated and misunderstood. Dethroning Nvidia demands more than SOTA hardware.
OpenAI, Meta, Google, AWS, AMD, and others have long failed to eliminate the Nvidia tax.
Without diving into the gory details, the simple proof is that billions were spent on inference last year by some of the most sophisticated technology companies in the world.
They had the talent and the incentive to migrate, but didn't.
In particular, OpenAI spent $4 billion, 33% more than on training, yet still ran on NVDA. Google owns leading chips and leading models, and could offer the tech talent to facilitate migrations, yet still cannot cross the CUDA moat and convince many inference customers to switch.
People are desperate to quit their NVDA-tine addiction, but they can't for now.
[Edited to include Google, even though Google owns the chips and the models; h/t @onlyrealcuzzo]
The CUDA moat is largely irrelevant for inference. The code needed for inference is small enough that there are e.g. bare-metal CPU only implementations. That isn't what's limiting people from moving fully off Nvidia for inference. And you'll note almost "everyone" in this game are in the process of developing their own chips.
My company recently switched from A100s to MI300s. I can confidently say that in my line of work, there is no CUDA moat. Onboarding took about a month, but afterwards everything was fine.
Alternatives exist, especially for mature and simple models. The point isn't that Nvidia has 100% market share, but rather that they command the most lucrative segment and none of these big spenders have found a way to quit their Nvidia addiction, despite concerted efforts to do so.
For instance, we experimented with AWS Inferentia briefly, but the value prop wasn't sufficient even for ~2022 computer vision models.
The calculus is even worse for SOTA LLMs.
The more you need to eke out performance gains and ship quickly, the more you depend on CUDA and the deeper the moat becomes.
Google was omitted because they own the hardware and the models, but in retrospect, they represent a proof point nearly as compelling as OpenAI. Thanks for the comment.
Google has leading models operating on leading hardware, backed by sophisticated tech talent who could facilitate migrations, yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.
Yes, training plays a crucial role. This is where companies get shoehorned into the CUDA ecosystem, but if CUDA were not so intertwined with performance and reliability, customers could theoretically switch after training.
Both matter quite a bit. The first-mover advantage obviously rewards OEMs in a first-come, first-serve order, but CUDA itself isn't some light switch that OEMs can flick and get working overnight. Everyone would do it if it was easy, and even Google is struggling to find buy-in for their TPU pods and frameworks.
Short-term value has been dependent on how well Nvidia has responded to burgeoning demands. Long-term value is going to be predicated on the number of Nvidia alternatives that exist, and right now the number is still zero.
It's unclear why this drew downvotes, but to reiterate, the comment merely highlights historical facts about the CUDA moat and deliberately refrains from assertions about NVDA's long-term prospects or that the CUDA moat is unbreachable.
With mature models and minimal CUDA dependencies, migration can be justified, but this does not describe most of the LLM inference market today nor in the past.
Nadella is a superb CEO, inarguably among the best of his generation. He believed in OpenAI when no one else did and deserves acclaim for this brilliant investment.
But his "below them, above them, around them" quote on OpenAI may haunt him in 2025/2026.
OAI or someone else will approach AGI-like capabilities (however nebulous the term), fostering the conditions to contest Microsoft's straitjacket.
Of course, OAI is hemorrhaging cash and may fail to create a sustainable business without GPU credits, but the possibility of OAI escaping Microsoft's grasp grows by the day.
Coupled with research and hardware trends, OAI's product strategy suggests the probability of a sustainable business within 1-3 years is far from certain but also higher than commonly believed.
If OAI becomes a $200b+ independent company, it would be against incredible odds given the intense competition and the Microsoft deal. PG's cannibal quote about Altman feels so apt.
To address the downvotes, this comment isn't guaranteeing OAI's success. It merely notes the remarkably elevated probability of OAI escaping Nadella's grip, which was nearly unfathomable 12 months ago.
Even after breaking free, OAI must still contend with intense competition at multiple layers, including UI, application, infrastructure, and research. Moreover, it may need to battle skilled and powerful incumbents in the enterprise space to sustain revenue growth.
While the outcome remains highly uncertain, the progress since the board fiasco last year is incredible.
Why wouldn't it be? In many ways they've clearly lost the fight. They're much smaller and less supported than the entities they intend to regulate. There is a known revolving door problem between federal and commercial employment. The natural mission of regulating food _and_ drugs is no longer sensible in our current social and political environment.
> The natural mission of regulating food _and_ drugs is no longer sensible in our current social and political environment.
Speaks volumes about the state of the USA given that y'all's regulations on food are so lax that the topic already tanked an agreement with the EU (TTIP) as well as a bilateral agreement with the UK (the one the Brexiteers proclaimed would be possible once Brexit came, but still isn't there).
The most obvious differences are washing eggs, washing chicken carcasses with chlorine and prophylactic (or worse, growth-stimulating) usage of antibiotics. All of that is banned here, but allowed in the US - mostly to mask the horrible sanitary and working conditions in farms and slaughterhouses. I'm not going to act like European slaughterhouses are paradises because they are everything but that, but nowhere near the levels of horror from the US.
When even regulation to prevent the worst of the worst isn't feasible any more, frankly I'd say your system has failed entirely.
Ag lobbies are one thing (and they're pretty problematic as well, not shying away from extortion and some IMHO are even bordering on terrorism), but rest assured our populations absolutely and vocally do not want chlorinated chickens, nor do we want GMO food.
This is a good question and encapsulates the challenges of food and drug regulation.
Yes and no. At certain concentrations, many safe compounds become dangerous in humans.
Even at tiny doses, foods like peanuts may be safe for the vast majority yet lethal for a minority.
Given how devilishly heterogeneous the human race is, the ideal solution provides safety testing at the individual level, not the population level [0]. But this is years away until computational and biological breakthroughs arrive.
[0] Population level is a misnomer. FDA trial sizes below.
Keep in mind that very commonly accepted safe substances such as carrots or water also become lethal at high enough doses, for all humans. It's often stated, and it can be tiring to hear, but the dose really does make the poison.
And context matters. Folks with this or that kidney problem can die from basically nothing, while the median bloke doesn't even notice that their piss is a tad darker.
That's an excellent point, and of course individuals can have pretty stark differences in metabolism, with the alcohol flush reaction being a great and highly visible example.
At a high enough dose, sure. But study after study over the last decade has shown coffee to have a positive effect on everything from diabetes to Alzheimer's.
Not for me. Coffee, anything caffeinated actually, makes me sick like I have the flu. The older I get, the more sensitive I become. I can't even eat chocolate anymore because of the caffeine content.
Also, I don't think most people drink more than 2 cups of coffee per day -- if I know someone who drinks more than 2 or 3 I'd think they have a problem.
I'm not a medical professional, but my understanding is that at least for the anti-Alzheimer's effect, it is due to the caffeine, specifically the effect it has on dilating capillaries in the brain.
Genuine question: does that mean when people die after consuming way too much caffeine, it's not because of "caffeine toxicity" per se, but because the effects of the caffeine put too much strain on their body?
Nobody is dying from the caffeine in coffee (like 50-100mg per cup). According to the NIH, LD50 of caffeine is 150-200mg/kg (so say 10g for a small person). That's like 100 cups of coffee. Even with espresso that's hard to imagine.
It would need to be in powder form or concentrated in something far beyond natural levels of coffee/tea/matcha/etc.
1 cup filter coffee can be 170mg or more. And LD50 isn’t really relevant here, even LD1 levels are deadly to hundreds of millions of people. It’s entirely possible for what some might consider a “normal” amount of coffee to be deadly to many. See other comment for espresso calculations.
LD50 is (an estimate) of the 50th percentile (i.e., 50% chance of dying), but that doesn't mean it's linear. It _certainly_ doesn't mean that 1% of people will die at 2% of that value, which I think is what you're implying.
The lowest example of a lethal dose I can find in the literature is 57mg/kg. Caffeine overdoses are so rare that we don't know the true distribution, but it's clearly not the case that millions of people will die from a few coffees.
Your other comment calculated the lethal dose as *a gallon of espresso*. That's like 125 shots. That is not a remotely normal amount of coffee. It would take multiple people over an hour to make that much espresso for you.
---
Edit: I can't reply, but "LD1" isn't a group of people and you can't just claim it's 1% of the population. LD50 doesn't imply anything about the population distribution or how it varies by person. It refers to a particular experimental set up (or estimate from a natural experiment) in which 50% of the subjects died after a certain dosage.
For example, the LD50 of falling is ~50ft. Some people will be more susceptible to dying by falling a certain distance than others, but there are many other factors involved and it makes no sense to say someone is in 1% of falling-death-probability.
I agree that LD50 doesn't tell you everything you'd want to know, like the lowest possible dose that might kill someone. There might be people who are extremely sensitive to a substance, or situations in which it's particularly dangerous (in combination with other substances or another health condition, for example). For something safe and widely used like caffeine, I'd expect that the vast majority of people would experience roughly similar toxicity (say, within 2x of the median) with a tiny population of outliers; but you can't just assume that there's 1% of the population that's drastically more sensitive.
That’s not what I was implying at all I have no clue how you arrived at that. I’m saying an LD1 does exist – it’s the dose that would be fatal to 1% of a population (and further a LD0.1 and 0.0001 exist). These doses are lower than the LD50, fatal to millions, and approach what some would consider normal. For instance: https://www.nbcnews.com/news/amp/ncna759716
The cooldown period built into HN is there for a reason: taking time to reflect on messages and do any necessary background research makes for better discussions than impulsively saying whatever is top of mind. I suggest you use this time to understand what an LD50 actually is and how the concept generalizes.
(cc @dang, seeing this growing trend of people misunderstanding the missing reply button and evading the timer via edits, perhaps UI affordances could be developed to better introduce folks to the feature?)
"It's entirely possible" isn't the way to think about estimating risk because it assumes the risk goal is zero (ie any risk > 0 means the outcome is "possible"). A dose greater than LD50 means "more probable than not" of dying, absent additional information, which is a more appropriate framing.
Similarly with caffeine content "can be". All kinds of variables like roasting time affect the dose. But the semi-standardized dose for a cup of coffee is about 100mg. Related to your link, that was a much larger cup of coffee for comparison. If you normalize it to the standard coffee size, it comes to 100mg caffeine, so right in line with what would be expected.
Classic Bayesian error. The population in question here isn’t the globe, but rather people who have already died from caffeine-related causes. Naturally the rate of increased caffeine sensitivity amongst those folks will be different from the population at large.
I never said random selection from the global population. The point still holds, even if sampling from only those who have died from caffeine: the LD1 dose, by definition, is still safe for almost all of the population. That’s why arguing about LD1 or LD.0001 isn’t particularly useful and comes across as overly pedantic.
Also, FWIW the LD50 can be calculated with censored populations (ie not all subjects have died.) Think about it: if I administer a dose that kills half but leaves the other half living, the LD50 remains unchanged even if I continue increasing the dosage until all have died (or not). LD50 does not require a complete set.
LD50 for espresso is roughly 1 gallon per 50kg body mass. I wouldn’t want to, but I could drink 2 gallons of water without significant issue. If we accept that some people will naturally have a lower tolerance (and that espresso isn’t the strongest drink in the world), it’s not hard to see a caffeine overdose itself being fatal.
(based on 36ml espresso having 110mg caffeine, LD50 caffeine is 150-200mg/kg)
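Back-of-envelope check of that claim (hypothetical 50 kg person, using the numbers above):

    body_kg = 50
    mg_per_shot, ml_per_shot = 110, 36
    for ld50_mg_per_kg in (150, 200):
        lethal_mg = ld50_mg_per_kg * body_kg
        shots = lethal_mg / mg_per_shot
        liters = shots * ml_per_shot / 1000
        print(f"{ld50_mg_per_kg} mg/kg -> {shots:.0f} shots, {liters:.2f} L (~{liters / 3.785:.2f} US gal)")
    # prints roughly 68 shots / 2.45 L (0.65 gal) and 91 shots / 3.27 L (0.86 gal),
    # so "roughly 1 gallon per 50 kg" is the right order of magnitude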
Let’s go back to the OP, which asked about coffee. A quick search shows the LD50 for coffee is about 118 cups. At 6 oz per cup, that’s roughly 21 liters. The LD50 for water is listed as 6 liters (below what you’d drink “without significant issue” btw). So someone is much more likely to reach the LD50 for water well before caffeine when drinking coffee.
Are there other caffeine delivery mechanisms that differ? Of course, but that’s not what the OP asked. The question was about the toxicity of coffee. That’s why it’s not worth arguing when something like caffeine powder accounts for the majority of ODs. Likewise, there’s going to be variation in toxicity between individuals, but those numbers are intended to generalize to a population.
Panera used to sell a drink that contained close to the FDA maximum recommended daily quantity of caffeine, and also allowed free refills. Several people died, and sales were halted after some wrongful death lawsuits.
This isn't to say that caffeine is dangerous. Danger isn't an intrinsic property of a substance but rather an emergent property of the context in which it is used. (This is why the schedules of the Controlled Substance Act are inherently stupid.)
Yes, but I get the impression that a lot of people have a limit on the number of cups they can drink per day (usually below 5), after which they get various symptoms.
Enjoy the rabbit hole. :)