Hacker News | danabramov's comments

For me, the scary moment was seeing Grokipedia show up as one of the “sources” in a random Claude query a few days ago. Even if people don’t explicitly choose to use it, poisoning the well is working.


Yeah I would love it if I could put in some guardrails for this type of stuff, eg never use Grokipedia or Reddit as authoritative sources (I currently put that in my personalization prompt).


Why not Reddit? Of course, it's far from a paragon of truth, and it's getting more manipulated and overtaken by bots by the minute, but adding "reddit" to my search queries (in search engines, that is) had always been my go-to for finding answers and threads by actual people, especially if we're talking about technical advice or getting an opinion on something. It's like finding forum threads on something I'm interested in.


“Data voids”


[flagged]


It's very weird to assume good intentions or trustworthy info from Grokipedia but then hold up Wikipedia as "heavily poisoned". Your questions are based on a lot of assumptions that aren't widely shared.


I’m leaning paid shill tbh.


It's democracy (as flawed as that can be) vs. one individual billionaire's personal AI.


Nothing about Wikipedia is democratic.


>> "For many people it's obvious"

[who?][citation needed]

https://en.wikipedia.org/wiki/Template%3AWho

Having standards for this stuff matters even if you don't always like the result.[0]

[0] https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Word...

>> "Weasel words are words and phrases aimed at creating an impression that something specific and meaningful has been said, when in fact only a vague or ambiguous claim has been communicated. A common form of weasel wording is through vague attribution, where a statement is dressed with authority, yet has no substantial basis. Phrases such as those above present the appearance of support for statements but can deny the reader the opportunity to assess the source of the viewpoint. They may disguise a biased view. Claims about what people say, think, feel, or believe, and what has been shown, demonstrated, or proved should be clearly attributed."


Citation: go look at how many people are submitting edits to the articles. That is "many" people.


It's not my job to provide support for your point. Examples. It's trivial to cite them if they're in the edit log.


How very Wikipedian of you.

Go to grokipedia.com and look at the edit log yourself. It's two clicks. If you care about truth you'll do it.


Bluesky doesn’t ask them to understand DNS, it just gives them a free subdomain to start with. This isn’t very different from how Gmail gives you a gmail.com address. But you can also move it to your own domain later and obviously it’s possible to build user-friendly interfaces for that.


Right, that's exactly my point; Bluesky provides a centralized alternative to using your own domain name. (And moving from a did:plc to something decentralized is no easier than moving from mastodon.social or similar large instance to a smaller one!)

To be clear, I actually think it's a good idea to let people associate their own domains with their accounts, but I find it frustrating that people act like ATProto is the only or first example of open social protocols, as this TFA does.


Correct me if I'm wrong but I think another important aspect of AT is Lexicons, i.e. there's an officially suggested way to do schemas, and application authors are encouraged to create and distribute schemas for their apps. Data is grouped in the repo by its schema too.


True. Scuttlebot had msg types too.


I think partially it's because momentum is picking up in the AT ecosystem.

Since all data lives in a single conceptual space, you start seeing community services like https://constellation.microcosm.blue/ (backlinks without running your own index), https://slices.network/ (indexes data you want and gives you a GraphQL/REST endpoint), independent relays (https://atproto.africa/), and so on.

To give you an example, https://slices-teal-relay.bigmoves.deno.net/ is a demo of Slices showing the latest teal.fm records (like Last.fm scrobbles). The thing is, teal.fm is not even launched as an app. It's just its developers already listen to music through it, the records get created in their repos, and so any other developer can aggregate their records and display them.

It's a web of realtime hyperlinked JSON created by different apps. That's exciting.


I wonder if one could perhaps create an RSS reader app by piggybacking on ATProto purely for broadcasting change events, with the feed themselves still located at their traditional address? Basically reviving Google Reader on ATProto infra.


thank you for sharing these links, if it isn't too much to ask, could you or someone reading this compile a list of useful / interesting ATproto endpoints for HN?


I wrote an intro to AT that should be broadly accessible and states the problems before the solutions. You might find it helpful: https://overreacted.io/open-social/


The author wanted to take down their account (to take a break) so this is actually working as designed. The takedown was issued from the author’s repository (which they control), and the downstream app server acknowledged the request.


AT model is very different from Mastodon or email. It’s much closer spiritually to RSS and plain old web.

Mastodon is “many copies of the same app emailing each other”. There’s no global shared view of the network so you can’t have features like globally accurate like counts, shared identity, global search, algorithmic feeds across instances, etc.

On the other hand, in AT, the idea is just that apps aggregate information from different repos. So each application’s server has information aggregated from the entire network. Everybody sees the same consistent information; apps exist to separate experiences rather than communities.

For example, Tangled (https://tangled.org) and Leaflet (https://leaflet.pub) are AT apps, but they’re nothing like “mastodon servers”. These are complete apps that implement different experiences but on the same global network.

Crucially, normal people don’t need to “buy into” the protocol stuff with AT. Most Bluesky users don’t know what AT is and don’t care about it; they’re just using the app. There’s interesting crossovers you can do (each AT app sees each other AT app’s public data) which do bleed into the user experience (eg my Tangled avatar is actually populated from Bluesky) but overall apps compete on their merit with centralized apps.

Hope that makes sense. See https://overreacted.io/open-social/ for a longer article I wrote about AT with visual explanations.


> It’s much closer spiritually to RSS and plain old web

What do you mean by this? ATProto requires a giant indexing database that has access to every post in the network. Mastodon is more like a feed reader—you only get notified about the posts you care about. How is needing a giant database that knows about every RSS feed in the world closer to the plain old web?


>What do you mean by this?

RSS is a way to aggregate data from many sites into one place. AT lets you do the same, but with bells and whistles (the data is signed and typed, and there's a realtime stream in addition to pulling on demand). If you're forced to describe AT via existing technologies, AT is basically like RSS for typed JSON in Git over HTTP or WebSockets that scales to millions of users.

It is completely up to you what you decide to index. If you want to build an app that listens to records of "Bluesky post" type that are created only by people you follow, you absolutely can.
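As a rough illustration of that kind of selective indexing, here is a minimal sketch that keeps only newly created posts from followed accounts. The event shape is an assumption modelled loosely on Jetstream's JSON events, and the DIDs and the `posts_from_follows` helper are made up for the example.

```python
from typing import Iterable, Iterator

POST = "app.bsky.feed.post"  # the "Bluesky post" record type

def posts_from_follows(events: Iterable[dict], follows: set[str]) -> Iterator[dict]:
    """Yield only newly created posts written by accounts in `follows`."""
    for ev in events:
        commit = ev.get("commit") or {}
        if (
            ev.get("kind") == "commit"
            and commit.get("operation") == "create"
            and commit.get("collection") == POST
            and ev.get("did") in follows
        ):
            yield ev

# Hypothetical sample events (assumed shape, invented DIDs).
events = [
    {"did": "did:plc:alice", "kind": "commit",
     "commit": {"operation": "create", "collection": POST, "record": {"text": "hi"}}},
    {"did": "did:plc:bob", "kind": "commit",
     "commit": {"operation": "create", "collection": POST, "record": {"text": "yo"}}},
    {"did": "did:plc:alice", "kind": "commit",
     "commit": {"operation": "create", "collection": "app.bsky.feed.like", "record": {}}},
]

kept = list(posts_from_follows(events, follows={"did:plc:alice"}))
```

Everything that does not match is simply dropped, so the local index stays as small as the follow list.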

See https://bsky.app/profile/why.bsky.team/post/3m2fjnh5hpc2f (which runs locally and indexes posts relevant to you) and https://reddwarf.whey.party/ (which doesn't have a database at all and pulls data from original servers on demand + using https://constellation.microcosm.blue/ for some queries).

The reason you don't see more of these is because an isolated experience is... well, isolated. So people are less interested in running something like this compared to, say, a whole new AT app. But AT can scale down to Mastodon-like use cases too.

>ATProto requires a giant indexing database that has access to every post in the network.

Only if you want to index every post, i.e. if you want to run a full-scale social app for millions of users. As an app builder, you get to choose what you index.

For a start, you probably only want to store the records relevant to your app. For example, I doubt that Tangled (https://tangled.org/), which is an AT app, has a database with every Bluesky post. That seems absurd because Tangled is focused on a completely different use case — a social layer around Git. So Tangled only indexes records like "Tangled repo", "Tangled follow", "Tangled star", and so on.

Naturally, Tangled wants to index all posts related to Tangled — that's just how apps work. If you wanted to build a centralized app, you'd also want it to contain the whole database of what you want the app to show. This isn't specific to AT, that's just common sense—to be able to show every possible post on demand with aggregated information (such as like counts), you have to index that information, hit someone else's index, or fetch posts from the source (but then you won't know the aggregated like counts).

That said — if you want to build a copy of a specific app (like Bluesky) but filtered down to just the people you follow (with no global search, algorithmic feeds, etc), you absolutely can, as I've linked earlier. Or you can build something hybrid relying on global caches, or some other subset of the network (say, last 2 weeks of posts). How you do indexing is up to you. You're the developer here.


> The reason you don't see more of these is because an isolated experience is... well, isolated.

I don't understand why you become isolated once you've built your own app, is it because the bluesky firehose has to decide to index posts I make on my server? I guess I'm asking how does an application decide which sources to index from, just anyone advertising that they are serving that lexicon? Why then would I become isolated by virtue of hosting only data I want to host/indexing only feeds I care to index?

(Thanks in advance I do want to grok this...)


Hmm, no, that’s not what I meant. Let me try to break it down a bit.

There’s really two main kinds of nodes in the system. Hosting servers and app servers. They’re completely unrelated and completely decoupled. It’s like Dropbox vs apps that put data in your Dropbox.

A hosting server stores your personal data. This is similar to having a Git repository with data from all social apps. Or like a Dropbox folder. That’s usually called a “PDS” — a personal data server. Running one is extremely cheap since it’s only your data. It is also optional (eg Bluesky provides AT hosting for free). But this is not an app — it’s literally like Git hosting. Just the data (for all apps).

Then you have app backends. Those are just normal servers. They’re what you’d typically think of as web applications. The Bluesky app is one of them. An application server listens to events from all known hosting servers and updates its local database with whatever it’s interested in from the stream. For example, the Bluesky application server updates its local database to put all “post created”, “like created” etc events from all hosting servers into its database that it can query.
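To make that concrete, here is a minimal sketch of an app server folding events into a local database and answering an aggregate query from it. The event shape, the table layout, and the `apply_event` and `like_count` helpers are assumptions for illustration, not how the Bluesky app server is actually written.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (uri TEXT PRIMARY KEY, author TEXT, text TEXT)")
db.execute("CREATE TABLE likes (uri TEXT PRIMARY KEY, subject TEXT)")

def apply_event(ev: dict) -> None:
    """Fold one 'record created' event into the app's local tables."""
    c = ev["commit"]
    uri = f"at://{ev['did']}/{c['collection']}/{c['rkey']}"
    if c["collection"] == "app.bsky.feed.post":
        db.execute("INSERT OR REPLACE INTO posts VALUES (?, ?, ?)",
                   (uri, ev["did"], c["record"]["text"]))
    elif c["collection"] == "app.bsky.feed.like":
        db.execute("INSERT OR REPLACE INTO likes VALUES (?, ?)",
                   (uri, c["record"]["subject"]["uri"]))
    # Any other record type is ignored: this app is not interested in it.

def like_count(post_uri: str) -> int:
    """Aggregate computed from the local index, not from any single repo."""
    row = db.execute("SELECT COUNT(*) FROM likes WHERE subject = ?", (post_uri,)).fetchone()
    return row[0]

POST_URI = "at://did:plc:alice/app.bsky.feed.post/p1"
# Hypothetical events arriving from three different hosting servers.
for ev in [
    {"did": "did:plc:alice", "commit": {"collection": "app.bsky.feed.post",
                                        "rkey": "p1", "record": {"text": "hello"}}},
    {"did": "did:plc:bob", "commit": {"collection": "app.bsky.feed.like",
                                      "rkey": "l1", "record": {"subject": {"uri": POST_URI}}}},
    {"did": "did:plc:carol", "commit": {"collection": "app.bsky.feed.like",
                                        "rkey": "l1", "record": {"subject": {"uri": POST_URI}}}},
    {"did": "did:plc:dave", "commit": {"collection": "com.example.star",
                                       "rkey": "s1", "record": {}}},
]:
    apply_event(ev)

total_likes = like_count(POST_URI)
```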

So as an app author you have a lot of freedom for what to build:

- You can build a new app that only listens to records of your app’s type. So naturally it would only index your app’s users’ content. Which is presumably not much.

- You can take an app server for an existing app (if it’s open source) and run it yourself. But then of course if this app has a million users, you need to decide which records you want. Do you want to index them all (like the original app)? Do you want to index a subset? Which subset? It could be historical (eg the last two weeks of posts, the last week of likes etc). Or it could be by proximity (only profiles, posts and likes within one follow from you). Or something else. You decide what to store.

- You can also build something hybrid — an app that remixes data from multiple apps. And you can fetch data from hosting servers without storing it (but this doesn’t give you aggregation) or fetch aggregated data from community indexes (if the aggregation you need already exists and is provided by someone else).
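The "historical subset" option from the list above could look something like the sketch below. The retention windows and the `should_keep` helper are hypothetical; they only illustrate that the policy is an ordinary decision made in the app's own code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical policy: how far back to keep each record type.
RETENTION = {
    "app.bsky.feed.post": timedelta(weeks=2),
    "app.bsky.feed.like": timedelta(weeks=1),
}

def should_keep(collection: str, created_at: datetime, now: datetime) -> bool:
    """Keep a record only if its type is indexed and it is recent enough."""
    window: Optional[timedelta] = RETENTION.get(collection)
    if window is None:
        return False  # record types without a policy are not indexed at all
    return now - created_at <= window

now = datetime(2025, 10, 15, tzinfo=timezone.utc)
ten_days_ago = now - timedelta(days=10)

keep_post = should_keep("app.bsky.feed.post", ten_days_ago, now)
keep_like = should_keep("app.bsky.feed.like", ten_days_ago, now)
```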

Hope this makes sense.

(As a performance optimization, instead of aggregating from millions of repositories individually, you’d listen to a stream that combines them. That’s called “relay”. Relays are mostly dumb websocket retransmitters and don’t have any app-specific logic. Bluesky runs one, Blacksky runs their own, and it would generally cost $30/mo to run one today. Any hosting server can ask any relay to crawl it. Any relay may also choose to crawl a new hosting server if it encounters links to content on that server. Relays are common infra and you shouldn’t expect there to be a lot of them. App servers choose which relay to listen to, if at all.)
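As a toy model of that fan-in, the sketch below merges several per-host event streams into one without looking inside the records. Ordering by a `time_us` field is an assumption made here to keep the output deterministic, and `fan_in` is a made-up name; a real relay is a long-running network service, not an in-memory merge.

```python
import heapq
from typing import Iterable, Iterator

def fan_in(*host_streams: Iterable[dict]) -> Iterator[dict]:
    """Merge per-host event streams (each already time-ordered) into one stream.

    There is no app-specific logic here: events are passed through untouched.
    """
    return heapq.merge(*host_streams, key=lambda ev: ev["time_us"])

# Hypothetical streams from two hosting servers.
host_a = [
    {"did": "did:plc:alice", "time_us": 1},
    {"did": "did:plc:alice", "time_us": 4},
]
host_b = [
    {"did": "did:web:bob.example", "time_us": 2},
]

order = [ev["did"] for ev in fan_in(host_a, host_b)]
```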

---

Now answering your specific questions:

>I don't understand why you become isolated once you've built your own app

If you've built an app that looks like Bluesky, but only you and your friends' posts/likes show up, is that much better than just using Bluesky? My point is that usually this isn't a differentiator and feels kind of pointless. You might as well just curate your Following feed on Bluesky. So people don't do that often.

>is it because the bluesky firehose has to decide to index posts I make on my server?

This seems like a misconception; moving your data (to your own hosting) is a completely separate thing from creating an app. See the distinction above. You can move your hosting to a different hosting server, but this wouldn't affect your experience in the Bluesky app at all. The Bluesky application server would simply start ingesting your posts from your new server instead once it gets notified about your account move.

>I guess I'm asking how does an application decide which sources to index from, just anyone advertising that they are serving that lexicon?

Typically an application just listens to a relay (like the one hosted on Bluesky) which already retransmits events from all known repositories. If you operate your own repository, you can send a "request crawl" command to Bluesky's relay, and it will index you. This is kind of similar to a website getting picked up by Google search. Links may also do it but a "request crawl" is the explicit way. See https://pdsls.dev/jetstream?instance=wss%3A%2F%2Fjetstream1.... for a live feed of the relay operated by Bluesky (it's not specific to the Bluesky app).
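For reference, a "request crawl" is an ordinary HTTP call to the relay's `com.atproto.sync.requestCrawl` endpoint. The sketch below only builds the request and does not send it; the relay base URL and hostname are placeholders, and `request_crawl` is a hypothetical helper name.

```python
import json
from urllib.request import Request

def request_crawl(relay_base: str, pds_hostname: str) -> Request:
    """Build (but do not send) a request asking a relay to crawl a hosting server."""
    return Request(
        url=f"{relay_base}/xrpc/com.atproto.sync.requestCrawl",
        data=json.dumps({"hostname": pds_hostname}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder hosts, not real services.
req = request_crawl("https://relay.example", "pds.example.com")
# To actually send it: urllib.request.urlopen(req)
```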

>Why then would I become isolated by virtue of hosting only data I want to host/indexing only feeds I care to index?

Hosting data !== indexing, again these are separate things.

Hosting your own data doesn't make you isolated — it is pretty much indistinguishable in the apps. You don't see where someone's data is hosted since in the app it all appears seamlessly aggregated.

Creating an app that only shows 0.000001% of the network's content when there's already an app that shows 100% of the same content is what I call isolating. I'm just not sure what it accomplishes since the network is still shared. So this isn't very compelling to most app builders. What's compelling is usually building completely new experiences. Although some people do experiment with more "limited" Bluesky clones.


Thanks for the patient explanation. It surprises me that an aggregator would simply start distributing from any server that announces it has content for that application. Moderation without false positives must be a beast.


The way I think about it, ingesting a stream of records from an arbitrary server is not any different to ingesting a series of <form> POST requests from someone’s computer. It doesn’t make moderation different.

Moderation in AT is layered. Hosting servers do their own moderation but it’s very minimal (just trying to catch illegal content early). Relay operators also have levers to stop broadcasting from specific nodes if they’re problematic (but again, this is reserved for either extreme illegal content or for network abuse). Most of what you’d think of as moderation happens at the app server level, which is the same as in non-AT apps. The app server can easily choose to not serve a certain user’s posts even if they exist upstream at their hosting.

One wrinkle is that AT goes a step further and extracts moderation primitives (“labelers”) as a separate thing — for example, you can ingest Bluesky’s moderation decisions from a separate service (and the Bluesky app server listens to the same service). This makes moderation composable, and also lets someone make a fork of Bluesky that “listens” to a different moderation authority.
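A minimal sketch of what "composable" means here: the same posts and the same pool of labels give different results depending on which labelers the app chooses to trust. The label fields (`src`, `uri`, `val`) follow my understanding of AT labels, and the DIDs, URIs, and `visible` helper are made up for the example.

```python
def visible(posts: list[dict], labels: list[dict],
            trusted: set[str], hidden: set[str]) -> list[dict]:
    """Drop posts carrying a hidden label value from a trusted labeler."""
    blocked = {
        label["uri"]
        for label in labels
        if label["src"] in trusted and label["val"] in hidden
    }
    return [p for p in posts if p["uri"] not in blocked]

posts = [{"uri": "at://did:plc:alice/app.bsky.feed.post/a"},
         {"uri": "at://did:plc:bob/app.bsky.feed.post/b"}]
labels = [
    {"src": "did:plc:labelerA", "uri": posts[0]["uri"], "val": "spam"},
    {"src": "did:plc:labelerB", "uri": posts[1]["uri"], "val": "spam"},
]

# Two apps (or two forks of one app) listening to different authorities.
seen_by_app_1 = visible(posts, labels, trusted={"did:plc:labelerA"}, hidden={"spam"})
seen_by_app_2 = visible(posts, labels, trusted={"did:plc:labelerB"}, hidden={"spam"})
```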


This would make sense if there weren't so many features—like Blocks, DMs, followers-only posts, etc—that were reliant on the AppView enforcing a single global view of the world. I agree that I do think the AT model does have good properties but right now too much of it is reliant on this single shared global app view

But thanks for the link to Konbini! That looks really exciting and promising and I would love to start using it if I can run it completely decoupled from Bluesky infrastructure.


I think it's only reliant on them to the extent that you want to build copies of the same exact experience, which I personally don't find very interesting. I think a much more compelling story is not, say, "a clone of Bluesky with a Bluesky DM folder", but, say, "a Spaces-like product that closely integrates with Bluesky (for posting) and is also listed as a stream on Streamplace".

I agree that some information seems important to know, like blocks. (Although in different apps it's reasonable to expect blocks to be app-specific.) Blocks are public on Bluesky though, for this exact reason. DMs are a disconnected service but the eventual idea is some kind of E2E (https://www.germnetwork.com/ is also building something now). Follower-only things could work through some variation of private state mechanism (see https://pfrazee.leaflet.pub/3lzhmtognls2q, https://pfrazee.leaflet.pub/3lzhui2zbxk2b).

>I would love to start using it if I can run it completely decoupled from Bluesky infrastructure.

You could use Blacksky's relay as the input source (https://atproto.africa/), or run your own relay. The only piece you'd then depend on is PLC registry (since it resolves PLC identity). Bluesky is in the process of separating it into a separate entity in Switzerland, but if that's a hard goal, I guess you could forbid `did:plc` identities in your app (vast majority of users) and only ingest data about `did:web` ones? Or do you feel OK about PLC resolution?
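If you went the `did:web`-only route, the ingest filter and identity lookup could be sketched roughly as below. It assumes hostname-only `did:web` identifiers, whose DID document lives at `/.well-known/did.json` on that host; the function names and sample DIDs are hypothetical.

```python
from urllib.parse import unquote

PREFIX = "did:web:"

def is_did_web(did: str) -> bool:
    """True for identities that resolve over plain HTTPS rather than via PLC."""
    return did.startswith(PREFIX)

def did_web_document_url(did: str) -> str:
    """Where the DID document for a hostname-only did:web is served."""
    if not is_did_web(did):
        raise ValueError(f"not a did:web identity: {did}")
    identifier = did[len(PREFIX):]
    if ":" in identifier:
        raise ValueError("path-based did:web is not handled in this sketch")
    host = unquote(identifier)  # a port, if any, is percent-encoded as %3A
    return f"https://{host}/.well-known/did.json"

incoming = ["did:plc:abc123", "did:web:alice.example", "did:web:localhost%3A3000"]
accepted = [d for d in incoming if is_did_web(d)]
doc_url = did_web_document_url("did:web:alice.example")
```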


i'm very curious about tangled. i'm building a new thing (tl;dr: an e2e testing and monitoring service) and hope to add more distributed/decentralized functionality into its core. i had been leaning heavily towards using nostr at the core, but it's nice to see atproto-based examples i can learn from, too.


What questions do you have about AT? I agree its docs are mostly “bad” and hard to understand. I find the actual tech approachable so happy to answer more concrete questions.

Tools like http://pdsls.dev in particular can be helpful to see how things fit together.


i think it really is as simple as boiling it down into a doc that looks like nip-1 and saying, "this is the absolute minimum amount you need to understand and implement to start sending messages on an AT-based network." -- not from a user perspective, but from an average developer perspective.

i know eventually i'd need to implement a ton more than the absolute bare minimum, but my gut-feeling "average developer brain" says nostr's absolute minimum feels smaller than AT's absolute minimum. i guess i'm looking for an AT doc for devs that shows the absolute minimum for creating a client that is equally approachable as NIP-1.


Thanks, that’s helpful. I’ll see if I can write something in that spirit later.


thanks. also, fwiw, i'm also a very happy AT user (@hugs.bsky.social) besides also being a happy nostr user.

i appreciate bsky's focus on user ux and community building and look forward to seeing more sharing of ideas between nostr and AT.

edit to add: the way to nerd-snipe my brain into wanting to make stuff with AT (or any future protocol) is to focus on a quick-start or tutorial showing the absolute minimal client to send one message.

once i can do that... i'm ready to learn all the rest of the vocabulary and server-side stuff, but not until i can send one simple message from a barely functional minimal client.


Would it be ok to use a library or is the requirement to keep it to raw primitives like curl?


i like seeing a bit of the raw, low-level protocol first. a few curl examples are perfect for understanding what’s really happening under the hood. once i get that, i'm happy to use a library to handle all the edge cases.

but starting with a library tutorial makes me wonder how many stacks of turtles are being hidden. if i can see the turtles upfront, i'll appreciate what the library does for me -- and i'll have a better sense of how to debug when things break.
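For a sense of what the raw layer looks like, here is a sketch of the single HTTP request behind "send one message": a `com.atproto.repo.createRecord` call that writes one post record into a repo. It only builds the request and does not send it. The hosting server URL, DID, and token are placeholders; in practice the access token would come from a prior `com.atproto.server.createSession` call, which is omitted here.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request

def create_post_request(pds: str, access_jwt: str, did: str,
                        text: str, now: datetime) -> Request:
    """Build (but do not send) the request that writes one post into a repo."""
    record = {
        "$type": "app.bsky.feed.post",
        "text": text,
        "createdAt": now.isoformat().replace("+00:00", "Z"),
    }
    body = {"repo": did, "collection": "app.bsky.feed.post", "record": record}
    return Request(
        url=f"{pds}/xrpc/com.atproto.repo.createRecord",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {access_jwt}",
        },
        method="POST",
    )

# Placeholder values; nothing here talks to a real server.
req = create_post_request(
    pds="https://pds.example.com",
    access_jwt="PLACEHOLDER_TOKEN",
    did="did:plc:example",
    text="hello from a barely functional minimal client",
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
sent_body = json.loads(req.data)
```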


Absolutely. I think it’s a great constraint actually. I have a few other pieces in the backlog but I’ll keep this one in mind.

This isn’t quite what you want but should illuminate at least the “fetch on demand” part in detail: https://overreacted.io/where-its-at/


yeah, that looks like a good base for a simplified remix. thanks!


everything we need to authenticate the msg should exist in the msg. sig => timestamp, hash


The docs are bad, sadly that’s true.



Having a hard time understanding what that website is showing (eg. what is a lexicon?). Could you explain?

Is it meant to be some kind of filtered sampling from the stream of ATProto events?


Yes.

Think of these as names of tables (or collections) in a distributed database. Or as type definitions. Or as app-defined data formats.

Each lexicon is a schema for a model. So you’re looking at a list of such “types” — a “repo”, a “follow”, a “star” etc.

There’s a “Tangled repo” lexicon, a “Bluesky post” lexicon, a “Leaflet publication” lexicon, and so on. Lexicons are specified and evolved by developers of each concrete application. Other apps can use those type definitions to read or write that kind of data.
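To give a feel for "type definitions", here is a sketch of a tiny record lexicon and a deliberately naive conformance check. The `com.example.star` lexicon is invented for illustration, and `conforms` only checks required fields and a few primitive types; real Lexicon validation covers much more.

```python
# A hypothetical lexicon for a made-up "star" record type.
STAR_LEXICON = {
    "lexicon": 1,
    "id": "com.example.star",
    "defs": {"main": {"type": "record", "record": {
        "type": "object",
        "required": ["subject", "createdAt"],
        "properties": {
            "subject": {"type": "string"},
            "createdAt": {"type": "string"},
        },
    }}},
}

PY_TYPES = {"string": str, "integer": int, "boolean": bool}

def conforms(record: dict, lexicon: dict) -> bool:
    """Naive check: required fields are present and known fields have the right type."""
    schema = lexicon["defs"]["main"]["record"]
    if any(key not in record for key in schema["required"]):
        return False
    for name, prop in schema["properties"].items():
        if name in record and not isinstance(record[name], PY_TYPES[prop["type"]]):
            return False
    return True

good = {"subject": "at://did:plc:alice/com.example.repo/r1",
        "createdAt": "2025-01-01T00:00:00Z"}
ok = conforms(good, STAR_LEXICON)
```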

See https://www.pfrazee.com/blog/why-not-rdf#lexicon for a short intro.

The UFO tool samples the global event stream and keeps stats on which lexicons are showing up in it (i.e. types of JSON that are being created on the network). You can expand the “samples” tab to show a few concrete JSON blobs so you get the idea of what the data represents.
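The tallying idea can be sketched in a few lines. This is not the UFO tool's actual code; the event shape is an assumption modelled loosely on Jetstream's JSON events, and `lexicon_stats` is a made-up helper.

```python
from collections import Counter
from typing import Iterable

def lexicon_stats(events: Iterable[dict]) -> Counter:
    """Count how often each record type (collection) shows up in a sample."""
    return Counter(
        ev["commit"]["collection"]
        for ev in events
        if ev.get("kind") == "commit"
    )

# Hypothetical sample from the event stream.
sample = [
    {"kind": "commit", "commit": {"collection": "app.bsky.feed.post"}},
    {"kind": "commit", "commit": {"collection": "app.bsky.feed.like"}},
    {"kind": "commit", "commit": {"collection": "app.bsky.feed.post"}},
    {"kind": "identity"},  # non-commit events carry no record and are skipped
]

stats = lexicon_stats(sample)
```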


Note that it's CBOR being sent over WebSockets (afaik).


The firehose is CBOR, jetstream is JSON (which more people use)

I was mainly speaking more generally than atproto. Most APIs talk JSON these days

