LLMs are a key enabling technology for extracting real insights from the enormous amount of surveillance data the USA captures. I think it's not an overstatement to say we are entering a new era here!
Previously, the data may have been collected, but there was so much of it that, in practice, no one was "looking" at most of it. Now it can all be looked at.
Imagine PRISM, but all intercepted communications are then fed into automatic sentiment analysis by a hierarchy of models. The first pass is done by very basic and very fast models with a high error rate, but which are specifically trained to minimize false negatives (at the expense of false positives). Anything that is flagged in that pass gets fed to some larger models that can reason about the specifics better. And so on, until at last the remaining content is fed into SOTA LLMs that can infer things from very subtle clues.
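A minimal sketch of that cascade, with hypothetical `cheap_model` and `frontier_model` functions standing in for the real classifiers:

```python
def cheap_model(msg: str) -> float:
    # Fast, crude scorer with a high error rate; in practice a small trained classifier.
    return 0.9 if "bomb" in msg else 0.02

def frontier_model(msg: str) -> bool:
    # Slow, expensive reasoner that can weigh context; in practice a SOTA LLM.
    return "bomb" in msg and "movie" not in msg

# Threshold tuned low: over-flagging is acceptable, misses are not
# (minimize false negatives at the expense of false positives).
FLAG_THRESHOLD = 0.03

def triage(messages):
    # Pass 1: keep anything the cheap model cannot confidently clear.
    flagged = [m for m in messages if cheap_model(m) > FLAG_THRESHOLD]
    # Final pass: the expensive model adjudicates the survivors.
    return [m for m in flagged if frontier_model(m)]

print(triage(["lunch at noon?", "the bomb scene in that movie", "how to build a bomb"]))
# -> ['how to build a bomb']
```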
With that, a full-fledged panopticon becomes technically feasible for all unencrypted comms, so long as you have enough money to cover the compute costs. Which the US government most certainly does.
I expect attempts to ban encryption to intensify going forward, now that it is a direct impediment to the efficiency of such a system.
Yep, and that's assuming it is tuned to be reactive rather than tuned to proactively build cases against people, which is something that has been politically convenient in the past
> If you give me six lines written by the hand of the most honest of men, I will find something in them which will hang him.
>
> - Cardinal Richelieu
and which the Vance / Bannon / Posobiec arm of the current administration seems quite keen on, probably as a next step once they are done spending the $170B they just won to build out their partisan enforcement apparatus.
So what actions represent our duty to resist?
* End-to-end encryption (has downsides with regard to convenience; see the sketch after this list)
* Legislation (very difficult to achieve, and can be ignored without the user having a way to verify)
* Market choices (i.e., doing business only with providers who refrain from profiteering from illicit surveillance)
* Creating open-weight models and implementations which are superior (and thus forcing states and other malicious actors to rely on the same tooling as everyone else)
* Teaching LLMs the value of peace and the degree to which it enjoys consensus across societies and philosophies. This of course requires engineering what is essentially the entire corpus of public internet communications to echo this sentiment (which sounds unrealistic, but perhaps in a way we're achieving this without trying?)
* Wholesale deprecation of legacy states (seems inevitable, but still possibly centuries off)
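On the first item: the core property of end-to-end encryption is that private keys never leave the endpoints. A minimal sketch using PyNaCl (keys generated in-process for brevity; a real deployment would also need key verification and forward secrecy):

```python
from nacl.public import PrivateKey, Box

# Each party generates a keypair; only public keys ever leave the device.
alice_key = PrivateKey.generate()
bob_key = PrivateKey.generate()

# Alice encrypts for Bob using her private key and Bob's public key.
alice_box = Box(alice_key, bob_key.public_key)
ciphertext = alice_box.encrypt(b"meet at noon")

# Bob decrypts with his private key and Alice's public key; no intermediary
# carrying only the ciphertext can read the message.
bob_box = Box(bob_key, alice_key.public_key)
assert bob_box.decrypt(ciphertext) == b"meet at noon"
```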
NLP was a thing decades before LLMs and deep learning. If anything, LLMs are a crazily inefficient and costly way to get at it. I really doubt this has anything to do with scaling.
LLMs are unbelievably effective at NLP. Most NLP before that was pretty bad; the only good example I can think of is Alexa, and it was restricted to English.
People pointing out NLP are missing the point: pulling together and crafting rules to run effective NLP is time-consuming and technical. With an LLM you can just ask it for exactly what you want and it interprets. That's the value; and as this deal just proved, it's worth the scaling costs.
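To make that concrete, here is a minimal sketch contrasting the two approaches; the OpenAI client usage and model name are illustrative assumptions, and any chat-completion API works the same way:

```python
import re
from openai import OpenAI  # assumes `pip install openai` and an API key in the environment

text = "Wire $4,500 to the Zurich account before Friday, then delete this."

# Rule-based NLP: every field needs a hand-crafted pattern, maintained forever.
amount = re.search(r"\$[\d,]+", text)
deadline = re.search(r"before (\w+)", text)
print(amount.group(), deadline.group(1))

# LLM-based NLP: describe what you want in plain language instead.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": f"Extract the amount, destination, and deadline "
                   f"from this message as JSON: {text}",
    }],
)
print(response.choices[0].message.content)
```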
The point that is missed isn't about LLMs' adequacy as an NLP technique; it's that they cost you 10,000 times more for the same effect (after the upfront set-up), which is why I have my doubts that they will be used at scale, at the center of some large data-ingestion pipeline. The benefit will probably be for out-of-the-ordinary tasks and outliers.
LLMs make counting mistakes, like forgetting the number of columns halfway through. I won't say "much like humans", since that will probably trigger some. But the general tendency of LLMs to be "bad at counting" (this includes computing) is resolved by having them produce programs that do the counting and executing those programs instead. The LLM systems that do that today are called agentic.
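A minimal sketch of that pattern, with a hypothetical `llm()` helper standing in for a real model call that returns Python source:

```python
def llm(prompt: str) -> str:
    # Stand-in for a model call; a capable model would emit code like this
    # rather than guessing the count token-by-token.
    return "result = sum(1 for row in rows if len(row) != len(rows[0]))"

rows = [["a", "b", "c"], ["d", "e"], ["f", "g", "h"]]

# Execute the generated program instead of trusting the model's own count.
namespace = {"rows": rows}
exec(llm("Count rows whose column count differs from the header"), namespace)
print(namespace["result"])  # 1, computed by code, not by token prediction
```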
The reality is that for any meaningful work automation, the currently available tooling does not meet that expectation.
And 99% of us do not have the capability or the knowledge to build these SOTA models, which is why (a) we are not at OpenAI making 10M+ TC, and (b) we are application developers using off-the-shelf technology to build products and services.
As such, we have real world experience with these technologies.
BTW, I use AI heavily every day in Cursor and whatever else.
This is even more terrifying: imagine an AI making up all sorts of "facts" about you that put you on a watch list, resulting in an endless life of harassment by the government.
And what recourse do you have as a citizen? Next to none.
LLMs don't make for a particularly good database, though. The "compression" isn't very efficient when you consider that e.g. the entirety of Wikipedia - with images! - is an order of magnitude smaller than a SOTA LLM. There are no known reliable mechanisms to deal with hallucinations, either.
So, no, LLMs aren't going to replace databases. They are going to replace query systems over those databases. Think more along the lines of Deep Research etc., just with internal classified data sources.
You're right, "subsume" would be a better word here. Although vector search is also a thing that I feel should be in the AI bucket. Especially given that SOTA embedding models are increasingly based on general-purpose LLMs.
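For the curious, a minimal sketch of the vector-search mechanics; `embed()` here is a toy character-frequency stand-in (a real system would call an embedding model, which is what makes semantically similar texts land near each other):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in: bucket characters into a fixed-size vector and normalize.
    vec = np.zeros(64)
    for ch in text.lower():
        vec[ord(ch) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

docs = ["wire transfer to Zurich", "recipe for banana bread", "meet at the safe house"]
index = np.stack([embed(d) for d in docs])

# Query = nearest neighbour by cosine similarity (vectors are unit-normalized,
# so a dot product is the cosine).
query = embed("money sent abroad")
scores = index @ query
print(docs[int(np.argmax(scores))])
```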
Aren't they complete trash as a database? "Show me people who have googled 'Homemade Bomb' in the last 30 days". For returning bulk data in a sane format they are terrible.
If their job was to process incoming data into a structured form, I could see them being useful, but holy cow, it will be expensive to run all the garbage they pick up via surveillance through an AI in real time.
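If that split happened, it might look like the sketch below: a hypothetical `extract()` wraps the expensive model call once per message, and bulk retrieval stays a cheap query against an ordinary database:

```python
import json, sqlite3

def extract(message: str) -> str:
    # Stand-in for an LLM call returning structured JSON; a real model
    # would classify far more subtly (and far more expensively).
    topic = "explosives" if "bomb" in message else "benign"
    return json.dumps({"sender": "unknown", "topic": topic, "text": message})

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE intercepts (sender TEXT, topic TEXT, text TEXT)")

for msg in ["homemade bomb instructions", "cat pictures"]:
    row = json.loads(extract(msg))
    db.execute("INSERT INTO intercepts VALUES (?, ?, ?)",
               (row["sender"], row["topic"], row["text"]))

# Bulk retrieval is a plain SQL query, not a model call.
print(db.execute("SELECT text FROM intercepts WHERE topic = 'explosives'").fetchall())
```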