
Whenever someone uses the term "bias" in ML in anything other than its statistical sense (E[\hat y - y]), it's helpful to mentally replace it with "opinions I disagree with" and see if the argument still makes sense.
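Concretely, that statistical sense is just the average signed error of the predictions. A toy numpy sketch, with made-up data, purely to illustrate:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(loc=5.0, scale=2.0, size=100_000)         # true values
    y_hat = y + rng.normal(loc=0.3, scale=1.0, size=y.size)  # predictions with a systematic +0.3 offset

    bias = np.mean(y_hat - y)  # sample estimate of E[y_hat - y]; ~0.3 here, ~0 would be unbiased
    print(bias)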


Well, if the goal is to produce an example of how one might train a language model, I think it's fine to ignore algorithmic bias issues.

But when people are talking about making e.g. AI chatbots to help teach in classes, or customer service bots, I think it's perfectly reasonable to imagine that research effort is useful to ensure the bot isn't acting like Tay did...

(And more broadly, if you're trying to use an ML model to e.g. screen candidates for hiring, you better be damn sure it's not causing discriminatory hiring practices.)


I think it's fine to build NLP models with any desired property you like, including "not leaking vulgar opinions that offend the courtly manners of your society." Makes plenty of sense.

But I do wish people would be more frank and self-aware about these purposes, rather than fig-leafing them as "ethics" or "fairness". By the time you're deliberately omitting the US Congressional Record on grounds of problematicity, it's worth asking yourself whether you're making your bot polite or performing damnatio memoriae.


Bias in the statistical sense is usually E[\hat beta - beta]. By which I mean: there's a specific aspect of the process I'm trying to estimate. The whole field of causal inference is built on the fact that if you do things naively, you can mix your signals, which is how linear regression gives you biased or unbiased coefficients depending on the setting. Sometimes you need something like IV (instrumental variables), because just plugging in your data will tell you that ambulances are bad: riding in one predicts that the patient is more likely to die, even after conditioning on everything else you've catalogued.
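A toy version of that ambulance story, numpy only and with made-up numbers (the confounder here is "severity"; the true effect of the ambulance is +2.0):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000
    u = rng.normal(size=n)                      # unobserved confounder: severity
    z = rng.normal(size=n)                      # instrument: affects x, not y directly
    x = 0.8 * u + 0.7 * z + rng.normal(size=n)  # "ambulance ride"
    y = 2.0 * x - 6.0 * u + rng.normal(size=n)  # true effect of x is +2.0

    # Naive regression of y on x alone: the coefficient comes out negative,
    # i.e. "ambulances are bad".
    X1 = np.column_stack([np.ones(n), x])
    print(np.linalg.lstsq(X1, y, rcond=None)[0][1])

    # If you could condition on the confounder, the bias would go away...
    X2 = np.column_stack([np.ones(n), x, u])
    print(np.linalg.lstsq(X2, y, rcond=None)[0][1])   # ~2.0

    # ...but u is usually unobserved, which is where IV / two-stage least
    # squares comes in: regress x on the instrument, then y on the fitted x.
    Xz = np.column_stack([np.ones(n), z])
    x_hat = Xz @ np.linalg.lstsq(Xz, x, rcond=None)[0]
    Xiv = np.column_stack([np.ones(n), x_hat])
    print(np.linalg.lstsq(Xiv, y, rcond=None)[0][1])  # ~2.0 again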

It’s not opinions I disagree with; it’s aspects of the model’s behavior I don’t want, which is the statistical sense.


Bias in prediction, rather than parameter estimation, is a perfectly well established sense of the term. In particular, people doing language modeling are practically never concerned with identifiability, because you can't pick out one weight out of a trillion parameter model and say what it ought to be in the limit of infinite data.


But when people use the term "bias" in NLP, that’s what they’re talking about: they don’t want the model to pick up behavior it ought not have. It’s a case of omitted variable bias causing things like the word analogy issues you hear about, not an issue of bias in predicting the masked word.
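The standard demonstration, for anyone who hasn’t seen it, is to probe the analogy arithmetic directly. A sketch assuming gensim and the pretrained "glove-wiki-gigaword-100" vectors are available:

    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use

    # "man is to doctor as woman is to ___?" -- the completions and their
    # ranking here are the kind of thing the NLP bias literature points at.
    print(vectors.most_similar(positive=["woman", "doctor"], negative=["man"], topn=3))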


If you create a model that outputs text, and that output contains opinions you disagree with, people are still going to act as if you agree, because it was you who created it.

So "we can't use Reddit because lots of stuff on Reddit is offensive to somebody" often does make sense.



