Sharing this piece from our youngest team member at Firetiger.
It's a really nice first-hand account from a junior engineer entering the workforce just as the industry goes through the coding-agent revolution.
Like many developers, we've built our fair share of workflows that export data to 3rd-party services. They always start simple: pull data, hit an API, job done! Then the problems show up. We hit API limits, services go down, and those quick-and-dirty workflows become a major source of headaches.
The knee-jerk reaction is often to add a queue! Sure, it helps for a while. But queues introduce their own complexity: handling failures, managing retries, maintaining visibility... It's a band-aid, not a cure, and we've been wrestling with this problem for too long!
In this blog post, we'll break down:
- Why queues fall short when building truly resilient integrations
- The core principles behind building scalable, fault-tolerant async workflows
- Practical techniques that go beyond the limitations of queues
If you're done with fragile systems and want to level up your integration game, this one's for you!
The goal of this article was to lay out the concepts and give insight into the technical machinery, so you're right, it didn't contain a lot of concrete examples.
Celery is a popular solution that's commonly used for this. However, it comes with a complex framework for describing workflows, which can be difficult to master, especially for less experienced engineers.
With Dispatch, you just write code as if you were writing a local program, and our scheduler manages the distribution and state of the execution. Testing the code, in particular, becomes really simple.
Having a hosted scheduler also means that you don't need to deploy any extra infrastructure like RabbitMQ or Redis.
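To make that concrete, here's a tiny sketch of the idea (the `durable` decorator below is a made-up stand-in, not the real Dispatch SDK API): the workflow logic stays an ordinary function you can call and unit-test locally.

```python
# Hypothetical stand-in for a scheduler decorator -- the real SDK has its own.
# The point: the decorated logic remains an ordinary, locally callable function.
def durable(fn):
    fn.dispatch = fn  # a real SDK would submit the call to a remote scheduler
    return fn

@durable
def export_record(record: dict) -> str:
    # Plain, locally testable logic: no queue or broker plumbing in sight.
    payload = {"id": record["id"], "value": record["value"] * 2}
    return f"exported:{payload['id']}"

# Testing is just calling the function directly:
result = export_record({"id": 7, "value": 21})  # -> "exported:7"
```

Because the function doesn't know it's distributed, your unit tests don't need a broker, a worker pool, or any mocks for queue infrastructure.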
Your comment on this being solved by inventing new programming languages like Erlang is right on point.
Our take is that distributed coroutines can be brought to general-purpose programming languages like Python so that engineering teams can adopt these features incrementally in their existing applications instead of adding a whole new language or framework.
In my opinion, the value is in being able to reuse the software tools and processes you're familiar with; major shifts are rarely the right call.
Yes, today the deployment model for production is to connect to a cloud service for the scheduling. You still run the code yourself, but the SDK needs to connect to the backend.
Distributed coroutines are a primitive to express transactional workflows that may last longer than the initial request/response that triggered them (think any form of async operation). While the distribution allows effective use of compute resources, capturing the state of coroutines and their progress is the key addition that enables the execution of workflows and guarantees completion.
A load balancer can help distribute new jobs across a fleet, but even the shortest of jobs can become "long running" when it hits timeouts, rate limits, and other transient errors. You quickly need a scheduler to effectively orchestrate the retries without DDoS-ing your systems, and need to keep track of the state to carry jobs to completion.
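For illustration, here's a minimal client-side sketch of the kind of retry policy a scheduler centralizes: exponential backoff with full jitter, so a fleet of retrying clients spreads out instead of hitting the upstream in lockstep. All names here are my own, nothing Dispatch-specific:

```python
import random
import time

def backoff_delays(base=0.5, cap=30.0, attempts=6):
    """Exponential backoff with "full jitter": each delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)], so clients that failed at the same
    moment don't all retry at the same moment too."""
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(op, delays=None):
    """Retry `op` on timeouts. A real scheduler would checkpoint state and
    resume later rather than sleeping in-process like this."""
    last = None
    for delay in (delays if delays is not None else backoff_delays()):
        try:
            return op()
        except TimeoutError as exc:
            last = exc
            time.sleep(delay)
    raise RuntimeError("gave up after retries") from last
```

The jitter is the part people forget: without it, every client that hit the rate limit together retries together, which is exactly the self-inflicted DDoS described above.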
Combine a scheduler (like Dispatch) with a primitive like distributed coroutines, and you've got a powerful foundation to create distributed applications of all kinds without seeing complexity skyrocket.
OK, from what I understand, it's similar to what we do as well, except Dispatch adds magic while we do it all manually. We have an event-based system: instead of await points, we fire events which are stored inside an AMQP broker. The broker has N consumers on different nodes which take new jobs as they arrive. Retries/circuit breakers etc. are added manually (via a Go library), and if a job/event handler fails, it's re-added to the AMQP queue (someone else will process it later). Inside event handlers/job processors we also enjoy Go's built-in local scheduler (so I/O calls don't block entire cores).
I can see the benefit that with Dispatch, logic is simpler to read and write as just ordinary functions, while in our approach, we have to scatter it around various event handlers/job processors. However, I still like that in our approach, event handlers/job processors are entirely stateless (the only state is the jobs/event payloads). I've found it to be good for scalability and reliability + easier to reason about, compared to passing around internal coroutine state.
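For readers following along, a toy version of the stateless pattern described above might look like this (names are made up, and I'm sketching in Python even though the real system is Go + AMQP): the payload carries all the state, and a failed job is re-enqueued for another consumer.

```python
import json
import queue

jobs = queue.Queue()  # stand-in for the AMQP broker

def handle_export(payload):
    job = json.loads(payload)  # the payload carries ALL the state
    if job["attempt"] >= 3:
        raise RuntimeError("too many attempts; would go to a dead-letter queue")
    return "sent:%s" % job["id"]

def consume_one():
    payload = jobs.get()
    try:
        return handle_export(payload)
    except RuntimeError:
        job = json.loads(payload)
        job["attempt"] += 1
        jobs.put(json.dumps(job))  # re-add so another consumer picks it up
        return None
```

The nice property: any consumer can pick up any payload, because nothing lives in process memory between jobs.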
Yes, that sounds very similar indeed. We've launched Dispatch because this is a universal problem that engineering teams end up having to reinvent over and over.
Dispatch can also handle the "one-off" jobs you describe, where you don't need to track the coroutine state. In a way, it's a subset/special case of the distributed coroutine (just like functions are a special case of coroutines with no yield point).
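Python's generators make that function-vs-coroutine distinction concrete: a plain function has no yield points, while a generator exposes suspension points where a scheduler could, in principle, snapshot state and resume later. A small illustrative sketch (not Dispatch code):

```python
def plain_job(x):
    # A one-off job: runs start-to-finish, nothing to checkpoint.
    return x * 2

def multi_step_job(x):
    # A coroutine-style job: each yield is a point where a scheduler
    # could capture state and resume elsewhere.
    x = yield ("step1", x * 2)
    x = yield ("step2", x + 1)
    return x

# Driving the generator to completion mirrors what a scheduler does:
g = multi_step_job(3)
step, val = g.send(None)  # ("step1", 6)
step, val = g.send(val)   # ("step2", 7)
try:
    g.send(val)
except StopIteration as done:
    result = done.value   # 7
```

A function is just the degenerate case of this loop: zero yields, so the "scheduler" has nothing to track.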
> Go already has that. However, what are the advantages of rescheduling an already running coroutine on a different machine?
Say your coroutine is performing a transactional operation with multiple steps, but for some reason the second step starts getting rate-limited by an upstream API, so you enter a retry loop to reschedule the operation a bit later. What if your program stops then? Without capturing the state somewhere, it remains volatile in memory, and the operation would be lost when the program stops.
This scenario occurs in a wide range of systems, from simple async jobs for email delivery to complex workflows with fan-out and fan-in.
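A bare-bones way to see why durable state matters: checkpoint progress after each step, so a crash-and-restart skips completed work instead of losing or redoing it. This sketch uses a local JSON file as the "durable store" purely for illustration; none of it is Dispatch's actual mechanism:

```python
import json
import os
import tempfile

def run_workflow(state_path, steps):
    """Run steps in order, checkpointing progress to durable storage after
    each one, so a restart resumes from the last completed step."""
    done = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)["done"]
    for i, step in enumerate(steps):
        if i < done:
            continue  # completed before the crash; don't redo it
        step()
        with open(state_path, "w") as f:
            json.dump({"done": i + 1}, f)

# Simulate a crash between step 2's work and its checkpoint:
log = []
path = os.path.join(tempfile.mkdtemp(), "state.json")
def step1(): log.append("step1")
def crashing_step2(): log.append("step2-attempt"); raise RuntimeError("crash")
try:
    run_workflow(path, [step1, crashing_step2])
except RuntimeError:
    pass
def step2(): log.append("step2")
run_workflow(path, [step1, step2])  # restart: step1 is NOT re-run
```

With only in-memory state, the restart would have to start over from step 1 (or worse, the job would simply vanish).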
> Isn't it expensive to serialize coroutines and transfer them between machines?
We only capture the local scope of the call stack, so the amount of data that we need to record is pretty small and stays on the order of magnitude of the data being processed (e.g., the request payload).
> I wonder what observability is like. If a coroutine crashes, what will its stack trace look like?
We maintain stack traces across await points of coroutine calls, even if the concurrent operations are scheduled on different program instances. When an error occurs, the stack traces show the entire code path that led to the error.
There are definitely a lot of opportunities for creating powerful introspection tools, but since Dispatch is just code, you can also use your typical metrics, logs, and traces to instrument your distributed coroutines.
> We only capture the local scope of the call stack, so the amount of data that we need to record is pretty small and stays on the order of magnitude of the data being processed
What if the local stack references a deep object hierarchy? The classical "I wanted a banana, but got the gorilla holding the banana and the entire jungle."
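(For instance, with plain `pickle`, a small value that happens to reference a big structure drags the whole graph into the snapshot:)

```python
import pickle

jungle = {"trees": list(range(100_000))}         # big shared structure
gorilla = {"holding": "banana", "home": jungle}  # a "local" that references it

banana_only = pickle.dumps("banana")
banana_via_gorilla = pickle.dumps(gorilla)  # serializes the jungle too

# The captured state is orders of magnitude larger than the value we wanted.
```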
> What if your program stops then? Without capturing the state somewhere, it remains volatile in memory and the operation would be lost when the program stops.
A local variable can indeed end up referencing a large part of the program state. Some of this can be detected and automated to avoid putting unnecessary burden on developers, but it's also on our roadmap for Dispatch to create tools that help developers understand how their application will behave once it starts getting distributed.
This is why solutions in this space have often treated creating a new programming language as a necessary step to avoid those issues altogether. We think it's important to bring these capabilities to mainstream languages; we can't always start from scratch with a new language.
Regarding where the state is stored: we keep everything in S3, and we plan to let users provide their own object store if they need to keep the application state in a storage medium that they own.