This is a great search engine. I entered "Persian transliteration," since that's what I was working on today. It sent me to the readme for a program written in 1996,[0] which takes a Latin-script transliteration of some Persian text, and generates ASCII art that resembles the way that text would be written in Persian script. Useful? Eh… Delightful? 100%. It would never have occurred to me that such a program would exist.
I've been following the project for a while now, and while I don't use it yet, we need it and more like it to succeed if we ever hope to loosen Google's chokehold on the web.
Best of luck to you and to the project!
I'm curious about a few things:
1. What's your (planned) business model?
2. Have you tried asking for sponsorships, either from companies or individuals? You should have an easy way for people to donate. I'm sure you'd have some support there, especially if your day job situation is unstable.
3. Is it just you working on it right now? Have you considered open sourcing it to get community contributions, or hiring more devs (once donations pick up or maybe someone would be willing to work on it on their free time as you do)? I can imagine that writing a search engine is a gargantuan effort, and doing it alone must be close to impossible.
Dunno. In general I don't have a lot of faith in the profitability of search engines. Ads can work if you're Google-scale. The other option is subscriptions, but in that case you need to be really good, and my search engine just isn't, outside of some areas. That's actually one of my bigger design problems: how to let people understand which queries are likely to be useful. It looks like Google, and people assume it has the affordances of Google. It doesn't, and if you go in with those assumptions, you'll be disappointed.
My model, as far as I've planned one, is just to keep the operation as cheap as possible and subsist on donations and maybe partnerships with other search engines. A big part of what I'm exploring is ways of doing as much as possible with low power hardware. I think rather than indexing 1 billion documents, 90% of which are garbage that will never be a good search result for any query ever, if I can index 100 million 50% of which are potentially good hits, then maybe that goes a decent way.
> 2. Have you tried asking for sponsorships, either from companies or individuals? You should have an easy way for people to donate. I'm sure you'd have some support there, especially if your day job situation is unstable.
I haven't really been fishing for this. I honestly didn't foresee having to change jobs like I am right now. I do have a donations page from before, but all of this was fairly sudden, so I haven't really gone over that whole process all too much yet.
> 3. Is it just you working on it right now? Have you considered open sourcing it to get community contributions, or hiring more devs (once donations pick up or maybe someone would be willing to work on it on their free time as you do)? I can imagine that writing a search engine is a gargantuan effort, and doing it alone must be close to impossible.
It's been just me up until now. Solo work can be ridiculously efficient when beginning a new project, especially when doing the sort of exploratory programming this has been. I also haven't felt I have had enough bandwidth to manage an open source project. But I am approaching a point where it's becoming a bit much to do all by myself, especially given this isn't my only project.
So I am considering open sourcing it or bringing more people in; I just need to think a bit about a good format for such a collaboration. It's relatively high maintenance and requires manual operations to keep going. As it stands, a lot of the code isn't trivially testable, and running it (even with few documents) requires large language models and so on.
There is a tasteful spot between FOSS engine and ad-support that it feels like Marginalia could capture. It would likely need a "CEO" type who wasn't also an MBA type to make this work.
I’ve really enjoyed using Marginalia search over the last few months whenever I’m looking for honest-to-goodness “small-web” content. It’s always a breath of fresh air, provided I don’t look too niche.
I took the opportunity to pitch in a few dollars to the Patreon each month. I hope this time & a bit of extra funds from folks is helpful in finding what direction you feel driven to take search in.
Thanks for making a really neat web thing in a web less and less full of neat things!
Using Marginalia always reminds me just how much we have lost since the golden age (2000-2010) of the internet. Thanks for bringing it back in a small way.
Couldn't HN do a fundraising campaign? HN, according to the Wikipedia page, has been created as a place to preserve the eternal september [1] https://en.wikipedia.org/wiki/Eternal_September
Marginalia is therefore something we must protect at all costs.
Users of non-mainstream browsers are now being required to solve CAPTCHAs.^1 If you hit the Cloudflare CAPTCHA for HTML results, it appears JSON results are still available. The rate limit is unpublished.
curl https://api.marginalia.nu/public/example.com
1. Or users of mainstream browsers that turn off JavaScript. Yikes.
Hope the author finds a new job that satisfies his requirements. Maybe he will get some help from fans of marginalia.nu.
It was nice to see an on premises, non-commercial search engine project. Proof of what's possible and the quality of results that can be achieved on a relatively small budget.
The CAPTCHAs should be limited to only some IP ranges. I saw a spike in botspam a while back and had to dial up the bot mitigation a bit to be able to maintain service. Will reduce it when they give up.
I understand botspam on the content side where they just generate keyword-"rich" pages trying for SEO, but it sounds like they're running searches on your search engine in this case, and I don't understand the purpose of that.
I'm honestly not entirely sure what they are trying to do.
My best guess is possibly it's an attempt at manipulating a suggestions algorithm, but I don't even persist query data so it's a bit of a waste. But maybe they think I'm backed by Google or Bing?
> it can be difficult to find the sort of queries that produce interesting and worthwhile results.
This is a hard problem to solve. I’ve worked at a couple of places that have built products on top of fulltext search provided by Elasticsearch or MySQL. Both companies struggle with the interface - users find it hard to understand, and hard to craft good searches. What’s the state of the art here? Is it Google with a magic text box? What if people need more control?
I think design of search is still sort of stuck in 1998.
The problem is that an empty search box does nothing to tell you what could fruitfully be typed into it. Good type-ahead suggestions can help assist the query formulation a bit, but that's a band-aid at best.
You could reformulate basically any incidence of a search engine returning bad results as the visitor entering a bad query. The emphasis has been almost entirely on improving the capabilities of the search engine to interpret bad queries, rather than working with the interface design to offer the user an intuition for what is a good one.
Realistically you need to approach it from both ends, but I think the work has been fairly lopsided in search and discovery.
Yeah I think I agree with you. The other place I've run into difficulty is explaining results. When you have complex searches and they're excluding or including certain results, you often want to know which part of the search is responsible. Because often searches are long-lived (outside of Google) and people need to tune them over time, but it can be really hard to figure out what to change.
IIRC for ES, at least, "explain" exists, but it's very hard to interpret.
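For what it's worth, here is a rough sketch of one way to make that output easier to read. As far as I know, running a search with "explain": true attaches an `_explanation` object to each hit, and that object is a nested tree of `value`, `description` and `details`. The helper name `flatten_explanation` and the sample tree below are made up for illustration; the sample is hand-written, not real Elasticsearch output.

```python
from typing import Any, Dict, List


def flatten_explanation(node: Dict[str, Any], depth: int = 0) -> List[str]:
    """Turn a nested explanation tree into indented 'score  description' lines."""
    line = f"{'  ' * depth}{node['value']:.3f}  {node['description']}"
    lines = [line]
    # Each child explains one part of the parent's score.
    for child in node.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines


# Hypothetical, hand-written fragment in the value/description/details shape.
sample = {
    "value": 1.5,
    "description": "sum of:",
    "details": [
        {"value": 1.0, "description": "weight(title:persian)", "details": []},
        {"value": 0.5, "description": "weight(body:transliteration)", "details": []},
    ],
}

if __name__ == "__main__":
    print("\n".join(flatten_explanation(sample)))
```

Printing the tree one clause per line, with its contribution next to it, at least makes it possible to see which part of a long query is pulling a result in or pushing it out.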
I am a Google search fan, but I recently started falling back to Marginalia because some technical search queries yield very good search results. I don't want this search engine to die.
As someone developing alternative search engines myself, I hope the project pulls through. Good luck finding another job that gives you similar freedoms!
In a very different way, I'm also involved in a search-related project. (edited to add: also going solo on my project) If you ever want to bounce ideas around, I'd totally be up for that.
Related: you mention other sources than Common Crawl for WARC data. Is there a list of those somewhere?
It's also an all-around useful format as you can produce it from wget and other common tools. But the big reason I'm moving toward something relatively homomorphic to WARCs is to be able to (in the future) publish my own crawls.
Thanks for that link. I've done a bit of work with the Common Crawl data (and proposed moving to ZSTD with a proof of concept and performance metrics in C a few years ago).
I'll send you an email later this weekend to connect.
I wish you well and we welcome what you are doing with marginalia. As you know search needs a shakeup. One vital approach to a real shakeup is true independence of crawler and index. If it's any encouragement, Marc our founder started Mojeek as a hobby project back in 2004.
Best wishes to you, Marginalia developer.
[0]: http://www.payvand.com/gerdsooz/README.html