Hacker News | abelanger's comments

I mentioned this towards the bottom of the post, but to reiterate: we're extremely grateful to Laurenz for helping us out here, and his post on this is more than worth checking out: https://www.cybertec-postgresql.com/en/partitioned-table-sta...

(plus an interesting discussion in the comments of that post on how the query planner chose a certain row estimate in the specific case that Laurenz shared!)

The other thing I'll add is that we still haven't figured out:

1. An optimal ANALYZE schedule for parent partitions; we're opting to over-analyze rather than under-analyze at the moment, because it seems like our query distribution might change quite often.

2. Whether double-partitioned tables (we have some tables partitioned by time first, and by an enum value second) need ANALYZE on the intermediate tables, or whether the top-level parent and bottom-level child tables are enough. So far, just the top-level and leaf tables seem good enough.
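
To make point 1 concrete: autovacuum analyzes the leaf partitions but never the partitioned parents, so parent-level statistics only change when something runs ANALYZE on them explicitly. A minimal sketch of that kind of scheduled job (psycopg, with placeholder table names and connection string; not our actual setup):

    # Refresh statistics on the partitioned parents on a schedule.
    # Table names and the connection string are placeholders.
    from psycopg import connect, sql

    PARENT_TABLES = ["tasks", "task_events"]  # hypothetical partitioned parents

    with connect("postgresql://localhost/app", autocommit=True) as conn:
        for table in PARENT_TABLES:
            # ANALYZE on a partitioned table samples its leaves to build the
            # parent-level stats, so running it too often gets expensive;
            # hence the scheduling trade-off in point 1.
            conn.execute(sql.SQL("ANALYZE {}").format(sql.Identifier(table)))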


I'd consider myself pretty familiar with postgres partitioning, and I've even worked with systems that emulated partitioning with complex dynamic SQL in stored procs before it was supported natively.

But TIL: I didn't realize you could do multiple levels of partitioning in modern postgres. I found this old blog post that touches on it: https://joaodlf.com/postgresql-10-partitions-of-partitions.h...

Something that stresses me is the number of partitions - we have some weekly partitions that have a long retention period, and whilst it hasn't become a problem yet, it feels like a ticking time bomb as the years go on.

Would a multi-level partitioning scheme of, say, year -> week be a feasible way to sidestep the issue of growing partition counts?
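
For concreteness, this is roughly what the DDL for a year -> week scheme would look like (a sketch with made-up table and column names, run via psycopg): the yearly partitions are themselves declared PARTITION BY RANGE, and the weekly tables hang off them as leaves.

    # Sketch of a two-level year -> week scheme (made-up names).
    import psycopg

    STATEMENTS = [
        # Top level: partition by the timestamp.
        """CREATE TABLE events (
               id         bigint GENERATED ALWAYS AS IDENTITY,
               created_at timestamptz NOT NULL,
               payload    jsonb
           ) PARTITION BY RANGE (created_at)""",
        # Level 1: one partition per year, itself partitioned on the same key.
        """CREATE TABLE events_2025 PARTITION OF events
               FOR VALUES FROM ('2025-01-01') TO ('2026-01-01')
               PARTITION BY RANGE (created_at)""",
        # Level 2: weekly leaves; pruning discards whole years before it ever
        # looks at the weeks underneath them.
        """CREATE TABLE events_2025_w01 PARTITION OF events_2025
               FOR VALUES FROM ('2025-01-01') TO ('2025-01-08')""",
    ]

    with psycopg.connect("postgresql://localhost/app") as conn:
        for stmt in STATEMENTS:
            conn.execute(stmt)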


Sure, I'll bite. Task-level idempotency is not the problem that durable execution platforms are solving. The core problem is the complexity that arises when one part of your async job becomes distributed: the two common dimensions are distributed runtime (compute) and distributed application state.

Let's just take the application state side. If your entire async job can be modeled as a single database transaction, you don't need a durable execution platform; you need a task queue with retries. Our argument at Hatchet is that this covers many (perhaps most) async workloads, which is why the durable task queue is the primary entrypoint to Hatchet, and durable execution is only a feature for more complex workloads.

But once you start to distribute your application state - for example, different teams building microservices which don't share the same database - you have a new set of problems. The most difficult edge case here is not the happy path with multiple successful writes; it's distributed rollbacks: a downstream step fails and you need to undo the upstream step in a different system. In these systems, you usually introduce an "orchestrator" task which catches failures and figures out how to roll back the upstream systems in the right way.

It turns out these orchestrator functions are hard to build, because there are so many failure scenarios. This is why durable execution platforms place some constraints on the orchestrator function, like determinism: to reduce the failure scenarios to a set that's easy to reason about.

There are scenarios other than distributed rollbacks that lead to durable execution; it turns out to be a useful and flexible model for program state. But this is a common one.
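
To make the distributed-rollback shape concrete, here's a stripped-down orchestrator in plain Python (hypothetical billing/inventory clients, no durable execution library involved). Everywhere inside the except block is a point where a crash leaves the two systems permanently inconsistent, which is the gap durable execution tries to close:

    # Toy orchestrator with a compensation step (hypothetical services,
    # not any particular platform's API).
    def place_order(order_id: str, inventory, billing) -> None:
        reservation = inventory.reserve(order_id)   # write in system A
        try:
            billing.charge(order_id)                # write in system B
        except Exception:
            # Distributed rollback: undo the upstream write in a different
            # system. If the process dies right here, the reservation is
            # never released unless something records progress durably and
            # resumes the compensation later.
            inventory.release(reservation)
            raise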


The "constraining functions to only be durable" idea is really interesting to me and would solve the main gotcha of the article.

It'd be an interesting experiment to take memory snapshots after each step in a workflow, which an API like Firecracker might support, but likely adds even more overhead than current engines in terms of expense and storage. I think some durable execution engines have experimented with this type of system before, but I can't find a source now - perhaps someone has a link to one of these.

There's also been some work, for example in the Temporal Python SDK, to replace the asyncio event loop so that regular calls like `sleep` work as durable calls instead, to reduce the risk to developers. I'm not sure how well this generalizes.
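
For reference, this is roughly how that surfaces in the Temporal Python SDK (names from memory, so check their docs for exact signatures): inside a workflow, a plain `asyncio.sleep` is picked up by Temporal's replacement event loop and becomes a durable timer instead of an in-process wait.

    # Rough sketch based on the Temporal Python SDK docs/samples; the point
    # is the asyncio override, not the exact API surface.
    import asyncio
    from datetime import timedelta
    from temporalio import activity, workflow

    @activity.defn
    async def send_reminder(email: str) -> None:
        print(f"sending reminder to {email}")

    @workflow.defn
    class ReminderWorkflow:
        @workflow.run
        async def run(self, email: str) -> None:
            # Looks like a normal sleep, but inside a workflow this becomes
            # a durable timer that survives worker restarts and replays.
            await asyncio.sleep(24 * 60 * 60)
            await workflow.execute_activity(
                send_reminder,
                email,
                start_to_close_timeout=timedelta(seconds=30),
            )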


> It'd be an interesting experiment to take memory snapshots after each step in a workflow, which an API like Firecracker might support...I think some durable execution engines have experimented with this type of system before, but I can't find a source now - perhaps someone has a link

Trigger.dev currently uses CRIU, but I recall reading on HN (https://news.ycombinator.com/item?id=45251132) that they're moving to MicroVMs. Their website (https://feedback.trigger.dev/p/make-runs-startresume-faster-...) suggests that they're using Firecracker specifically, but I haven't seen anything beyond that. It would definitely be interesting to hear how it works out, because I'm not aware of another Durable Execution platform that has done this yet.


Ok, I'm not an expert here - you most likely are - but just my 2 cents on your response: I would very much argue for not making this magic. E.g.:

> take memory snapshots after each step in a workflow

Don't do this. Just give people explicit boundaries of where their snapshots occur, and what is snapshotted, so they have control over both durability and performance. Make it clear to people that everything should be in the chain of command of the snapshotting framework: e.g. no file-local or global variables. This is already how people program web services, but somehow nobody leans into it.

The thing is, if you want people to understand durability but you also hide it from them, it will actually be much more complicated to understand and work with a framework.

The real golden ticket I think is to make readable intuitive abstractions around durability, not hide it behind normal-looking code.
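
To put a shape on what those explicit boundaries could look like, here's a toy sketch (not any existing framework's API): the only thing that survives a crash is what you hand back from a named step, and everything else - locals, globals, open connections - is assumed to be gone.

    # Toy illustration of explicit snapshot boundaries (not a real framework).
    # Each named step's result is persisted; on a re-run, finished steps are
    # skipped and only their stored results feed into later code.
    import json, os

    class Checkpointer:
        def __init__(self, path: str):
            self.path = path
            self.done = json.load(open(path)) if os.path.exists(path) else {}

        def step(self, name: str, fn, *args):
            if name in self.done:
                return self.done[name]       # replaying: skip the side effect
            result = fn(*args)               # the side effect happens here
            self.done[name] = result
            with open(self.path, "w") as f:  # the explicit durability boundary
                json.dump(self.done, f)
            return result

    # Usage with stand-in steps: anything not returned from a step is
    # deliberately *not* durable, which is the whole point.
    ckpt = Checkpointer("/tmp/signup-123.json")
    user_id = ckpt.step("create_user", lambda: 42)
    ckpt.step("send_welcome", lambda uid: f"welcomed user {uid}", user_id)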

Please steal my startup.


FWIW, I think the memory snapshotting idea isn't going to work for most stacks for a few different reasons, but to speak more broadly on API design for durable execution systems, I agree completely. One of the issues with Temporal, and with Hatchet in its current state, is that they abstract concepts that are essential for the developer to understand, like what it means for a workflow to be durable, while the developer is building the system. So you end up discovering a bunch of weird behaviors like "non-determinism error" when you start testing these systems without a good grasp of the fundamentals.

We're investing heavily in separating out some of these primitives that are separately useful and come together in a DE system: tasks, idempotency keys and workflow state (i.e. event history). I'm not sure exactly what this API will look like in its end state, but idempotency keys, durable tasks and event-based histories are independently useful. This is only true of the durable execution side of the Hatchet platform, though; I think our other primitives (task queues, streaming, concurrency, rate limiting, retries) are more widely used than our `durableTasks` feature because of this very problem you're describing.


> I think our other primitives (task queues, streaming, concurrency, rate limiting, retries) are more widely used than our `durableTasks` feature because of this very problem you're describing

Indeed, I'm happy to hear you say this.

I think it should be the other way around: if durable tasks are properly understood, it's actually the queues/streaming/concurrency/rate limits/retries that can be abstracted away and ignored.

Funny, I never realised this before.


> The thing is, if you want people to understand durability but you also hide it from them, it will actually be much more complicated to understand and work with a framework.

> The real golden ticket I think is to make readable intuitive abstractions around durability, not hide it behind normal-looking code.

It's a tradeoff. People tend to want to use languages they are familiar with, even at the cost of being constrained within them. A naive DSL would not be expressive enough for the Turing completeness one needs, so effectively you'd need a new language/runtime. It's far easier to constrain an existing language than to write a new one, of course.

Some languages/runtimes are easier to apply durable/deterministic constraints to (e.g. WASM, which is deterministic by design, and JS, which has a tiny stdlib that just needs a few things like time and rand replaced), but they still don't take the ideal step you mention - putting the durable primitives and their benefits/constraints in front of the dev clearly.
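
As a toy illustration of "just needs a few things like time and rand replaced" (a record/replay sketch, not how any real engine implements it): nondeterministic values are recorded on the first execution and read back on replay, so the code takes the same path both times.

    # Toy record/replay of nondeterministic calls (illustrative only).
    import random
    import time

    class Recorder:
        def __init__(self, history=None):
            self.replaying = history is not None
            self.history = history if history is not None else []
            self.pos = 0

        def _value(self, produce):
            if self.replaying:            # replay: reuse the recorded value
                value = self.history[self.pos]
                self.pos += 1
                return value
            value = produce()             # live run: record the value
            self.history.append(value)
            return value

        def now(self) -> float:
            return self._value(time.time)

        def rand(self) -> float:
            return self._value(random.random)

    live = Recorder()
    first = (live.now(), live.rand())
    replayed = Recorder(history=live.history)
    assert (replayed.now(), replayed.rand()) == first  # deterministic replay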


Just to continue the idea: you wouldn't be constraining or tagging functions; you would relinquish control to a system that closely guards how you produce side effects. E.g., doing a raw HTTP request from a task is prohibited, not intercepted.

Hah, well, I'll avoid _talking to_ vendors - more specifically, I'll avoid talking to salespeople selling a technical product until we're pretty deep into the product. I do tend not to use vendors that don't have a good self-serve path or a mechanism for getting my technical questions answered.

If anyone needs commands for turning off the CF proxy for their domains and happens to have a Cloudflare API token, here's how.

First you can grab the zone ID via:

    curl -X GET "https://api.cloudflare.com/client/v4/zones" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" | jq -r '.result[] | "\(.id) \(.name)"'

And a list of DNS records using:

    curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json"

Each DNS record will have an ID associated with it. Finally, patch the relevant records:

    curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" --data '{"proxied":false}'
Copying from a sibling comment - some warnings:

- SSL/TLS: You will likely lose your Cloudflare-provided SSL certificate. Your site will only work if your origin server has its own valid certificate.

- Security & Performance: You will lose the performance benefits (caching, minification, global edge network) and security protections (DDoS mitigation, WAF) that Cloudflare provides.

- This will also reveal your backend internal IP addresses. Anyone can find permanent logs of public IP addresses used by even obscure domain names, so potential adversaries don't necessarily have to be paying attention at the exact right time to find it.


Also, for anyone who only has an old global API key lying around instead of the more recent tokens, you can set:

  -H "X-Auth-Email: $EMAIL_ADDRESS" -H "X-Auth-Key: $API_KEY"

instead of the Bearer token header.

Edit: and in case you're like me and thought it would be clever to block all non-Cloudflare traffic hitting your origin... remember to disable that.


This is exactly what we've decided we should do next time. Unfortunately we didn't generate an API token so we are sitting twiddling our thumbs.

Edit: seems like we are back online!


Took me ~30 minutes but eventually I was able to log in, get past the 2FA screen and change a DNS record.

I surely missed having a valid API token today.


I'm still trying.

Still can't load the Turnstile JS :-/


Turnstile is back up (for now). Go refresh. I just managed to make an API key and turn off proxied DNS.


Install the Tweak Chrome extension, MITM yourself, and force the JS to load from somewhere else.


I'm able to generate keys right now through WARP. Login takes forever, but it is working.


Awesome! I did it via the Terraform provider, but for anyone else without access to the dashboard this is great. Thank you!


If anyone needs the internet to work again (or to get into your CF dashboard to generate API keys): if you have Cloudflare WARP installed, turning it on appears to fix otherwise broken sites. Maybe using 1.1.1.1 does too, but flipping the radio box was faster. Some parts of sites are still down, even after tunneling in to CF.


Super helpful, thanks!

Looks like I can get everywhere I couldn't before, except my Cloudflare dash.


It's absurdly slow (multiple minutes for the login page to fully load so the login button is pressable, due to the captcha...), but I was able to log into the dashboard. It's throwing lots of errors once inside, but I can navigate around some of it. YMMV.

My profile (including API tokens) and websites pages all work; the accounts tab above websites on the left does not.


Good advice!

And no need for -X GET to make a GET request with curl, it is the default HTTP method if you don’t send any content.

If you do send content with, say, -d, curl will do a POST request, so no need for -X then either.

For PATCH though, it is the right curl option.


Thanks for this! Just expanded on it a bit and published a write-up here so it's easier to find in the future: https://www.coryzue.com/writing/cloudflare-dns/


I would advise against this action. Just ride the crash.


If people knew how to play the 5-hour-long game, they wouldn't have been using Cloudflare in the first place.


Hatchet (https://hatchet.run) | New York City (IN-PERSON | REMOTE) | Full-time

We're hiring for a number of engineering positions to help us with development on our open-source, distributed task queue: https://github.com/hatchet-dev/hatchet.

We launched on HN several times last year; you can check out our launches here https://news.ycombinator.com/item?id=39643136 and here https://news.ycombinator.com/item?id=40810986. We're two second-time YC founders in this for the long haul. Since those launches, we've grown to process over 50 million tasks/day on our cloud platform.

As an early engineer at Hatchet, you'll be responsible for contributing across the entire codebase. We'll compensate accordingly and with high equity. All team members are technical and contribute code.

Stack: Typescript/React, Go and PostgreSQL.

To apply, email alexander [at] hatchet [dot] run, and include the following:

1. Tell us about something impressive you've built.

2. Ask a question or write a comment about the state of the project. For example: a file that stood out to you in the codebase, a Github issue or discussion that piqued your interest, a general comment on distributed systems/task queues, or why our code is bad and how you could improve it.

Or apply online here: https://www.ycombinator.com/companies/hatchet-run/jobs

If you don't think you are a good fit for one of the roles, but you're interested in developer tools or infrastructure, please reach out anyway - we'll still consider your application!


One thing I've been wondering recently: has the experience of using software (specifically web apps) been getting better? It seems like a natural consequence of significantly increased productivity would be fewer buggy websites and apps, more intuitive UIs, etc.

Linear was a very early-stage product I tested a few months after their launch, and I was genuinely blown away by the polish and experience relative to their team size. That was in 2020, pre-LLMs.

I have yet to see an equally polished and impressive early-stage product in the past few years, despite claims of 10x productivity.


Agents depend heavily on the quality of their individual components, so it's pretty obvious that demo agents are going to be incredibly unstable. You need the success rate of each individual component to be near 100%, or you need to build in a mechanism for corrective action (one of the things that Claude Code does particularly well).


(we haven't looked too deeply into agent-kit, so this is based on my impression from reading the docs)

At a high level, in Pickaxe agents are just functions that execute durably, and you write the function for their control loop - with agent-kit, agents execute in a fully "autonomous" mode where they automatically pick the next tool. In our experience this isn't how agents should be architected (you generally want them to be more constrained than that, even for somewhat autonomous agents).
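
To illustrate the difference (a rough sketch, not Pickaxe's or agent-kit's actual API): in the control-loop style you own the loop, decide what the model sees, and decide which tools are even on the table at each step, instead of handing the model the full tool list and letting it drive.

    # Rough sketch of a hand-written control loop; call_model and the tool
    # functions are hypothetical stand-ins, not a real SDK.
    def run_agent(task: str, call_model, tools: dict, max_steps: int = 5):
        context = [f"task: {task}"]
        for _ in range(max_steps):
            # You choose the context and the allowed tools here, rather than
            # the model picking from everything autonomously.
            action, arg = call_model(context, allowed=list(tools))
            if action == "finish":
                return arg
            if action not in tools:
                context.append(f"error: '{action}' is not allowed at this step")
                continue
            context.append(f"{action} -> {tools[action](arg)}")
        raise RuntimeError("agent exceeded its step budget")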

Also to compare Inngest vs Hatchet (the underlying execution engines) more directly:

- Hatchet is built for stateful container-based runtimes like Kubernetes, Fly.io, Railway, etc. Inngest is a better choice if you're deploying your agent into a serverless environment like Vercel.

- We've invested quite a bit more in self-hosting (https://docs.hatchet.run/self-hosting), open source (MIT-licensed) and benchmarking (https://docs.hatchet.run/self-hosting/benchmarking).

Can also compare specific features if there's something you're curious about, though the feature sets are very overlapping.


Definitely understand the frustration - the difficulty of Hatchet being general-purpose is that being performant for every use case can be tricky, particularly when combining many features (concurrency, rate limiting, priority queueing, retries with backoff, etc.). We should be more transparent about which combinations of use cases we're focused on optimizing.

We spent a long time optimizing the single-task FIFO use case, which is what we typically benchmark against. Performance for that pattern is I/O-bound at > 10k/s, which is a good sign (we just need better disks). So a pure durable-execution workload should run very performantly.

We're focused on improving multi-task and concurrency use-cases now. Our benchmarking setup recently added support for those patterns. More on this soon!


Hatchet is not stable.

