Saga Pattern #3: Thinking Serverless

8 min read · Feb 27, 2023

If you prefer, you can read this article in Portuguese here

In the previous articles, we covered the main concepts of the Saga pattern (first article) and the lack-of-isolation problem (second article).

So far, we’ve focused only on Saga pattern theory because that’s essentially what it is: a concept. This means we can implement it with whatever technology we like, and the best choice is the one that best suits your business. As it turns out, there’s no business involved in this article (sad, I know 😥), so all the languages and tools used here are for exploratory purposes.

Enough talk, let’s get hands-on!

Several examples of sagas were given in the previous articles, and it would be impractical to implement them all. Therefore, only the “Cancel Trip with Time Limit Reorganized” saga will be implemented here:

“Cancel Trip with Time Limit Reorganized” saga

It’s worth mentioning that this choice also eases some implementation decisions, especially those about reducing the risks of ACD transactions by leveraging the pessimistic view technique.

Saga Pattern via Choreography

I really like spoilers, so I’ll show you the diagram representing the Saga pattern implementation when using choreography:

Explaining the image, we have:

Trigger (A): it defines how each local transaction is triggered. For most transactions, the trigger is an event published to SNS, except for the first transaction, which is triggered by a user request.
Local transactions as Lambda (B): each local transaction is a Lambda function whose code is written in Python.
Database (C): following the Saga pattern, each service owns a database, which can be accessed while a local transaction executes.
Decision made after executing the local transactions (D): for sagas coordinated by choreography, each service is responsible for indicating when a transaction is over. If the transaction succeeds, an event is published to the topic of the corresponding service. If the transaction fails and exceeds the retry attempts of the Lambda function, the event is sent to the DLQ for future processing.
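To make the four pieces above more concrete, here is a minimal in-memory sketch of the choreography flow. It is only an illustration, not the AWS implementation: the `Bus` class stands in for SNS, its `dlq` list stands in for the DLQ, retries are left out, and the topic names, the 24-hour rule, and the T2 handler are all hypothetical.

```python
from collections import defaultdict


class Bus:
    """In-memory stand-in for SNS topics plus a DLQ (illustration only)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic name -> handlers
        self.dlq = []                         # events whose handler failed

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # (A) Trigger: every handler subscribed to the topic receives the event.
        for handler in list(self.subscribers[topic]):
            try:
                handler(self, event)
            except Exception:
                # (D) Failure path: park the event (retries omitted here).
                self.dlq.append({"topic": topic, "event": event})


def check_time_limit(bus, event):
    # (B) Local transaction T1, with a hypothetical 24-hour rule.
    if event["hours_until_trip"] < 24:
        raise ValueError("time limit exceeded")
    # (D) Success path: publish to the topic that triggers the next service.
    bus.publish("t2-topic", {**event, "t1": "done"})


completed = []


def next_transaction(bus, event):
    # Hypothetical T2: only records that it was triggered.
    completed.append(event)


bus = Bus()
bus.subscribe("t1-topic", check_time_limit)   # stands in for the user request
bus.subscribe("t2-topic", next_transaction)

bus.publish("t1-topic", {"trip_id": 1, "hours_until_trip": 48})  # reaches T2
bus.publish("t1-topic", {"trip_id": 2, "hours_until_trip": 2})   # ends in the DLQ
```

Notice that no component knows the whole workflow: each handler only knows which topic to publish to when it finishes, which is the essence of choreography.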

Cool… But how can we convert this diagram into a real Serverless application?

That’s our next step!

⚠️ Just as it’s impractical to implement all the sagas discussed, it’s also impractical to show how to create every resource in the diagram in a single article. To keep the example simple, the next steps explain how to create the resources required for the “Check Time Limit (T1)” local transaction.

1: CLI Configuration

First things first: we’ll use the AWS CLI to create the resources, so the user credentials must be configured:

Note that there are two extra commands exporting the variables AWS_ACCOUNT_ID and AWS_REGION; they help avoid repeating this information in the next commands.

2: IAM Configuration

The Lambda function needs to interact with the topic and DLQ, so it’s required to define the following policy and trust policy:
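As a rough sketch of what such documents might contain, the Python snippet below builds a trust policy that lets the Lambda service assume the role and a permissions policy that allows publishing to one topic and sending to one queue. The resource names `t2-topic` and `t1-dlq` and the fallback account and region values are assumptions made for illustration; a real role would usually also need permissions for logging and database access.

```python
import json
import os

# Assumption: these mirror the AWS_ACCOUNT_ID / AWS_REGION variables exported
# in the CLI step; the fallback values are placeholders.
ACCOUNT_ID = os.environ.get("AWS_ACCOUNT_ID", "123456789012")
REGION = os.environ.get("AWS_REGION", "us-east-1")

# Hypothetical resource names.
TOPIC_NAME = "t2-topic"
DLQ_NAME = "t1-dlq"

# Trust policy: who is allowed to assume the role (the Lambda service).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy: what the function may do once it has the role.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sns:Publish",
            "Resource": f"arn:aws:sns:{REGION}:{ACCOUNT_ID}:{TOPIC_NAME}",
        },
        {
            "Effect": "Allow",
            "Action": "sqs:SendMessage",
            "Resource": f"arn:aws:sqs:{REGION}:{ACCOUNT_ID}:{DLQ_NAME}",
        },
    ],
}

trust_policy_json = json.dumps(trust_policy, indent=2)
permissions_policy_json = json.dumps(permissions_policy, indent=2)
```

The two JSON strings can be saved to files and passed to the CLI when creating the role and attaching the policy.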

Once the policies are defined, create the role:

3: Database creation

As mentioned before, each service owns a database. Therefore, we also need to create this resource in AWS:

You may have noticed that PostgreSQL was chosen as the underlying database engine: there’s no particular reason for that choice, so feel free to pick the engine that best suits your business.

4: Topic creation

All local transactions are triggered by events, so SNS was chosen. For the T1 transaction, the initial trigger is the user, but T1 must still publish to a topic once the event is processed. So, we need to create the topic to which the transaction publishes (in this case, the topic responsible for triggering the T2 transaction):

As with PostgreSQL, there’s no particular reason for choosing SNS as the trigger for local transactions. You see what I mean, right? Use the tool that best suits your business.

5: DLQ creation

Although Lambda functions handle intermittent errors by retrying events, some errors are permanent. That’s when dead letter queues (DLQs) come in handy: all events that exceed the retry attempts are sent to the DLQ for future processing. Therefore, we need to create this resource:
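The retry-then-DLQ behavior can be sketched in a few lines of plain Python. This is a simulation and not Lambda itself: `invoke_with_retries` and both handlers are hypothetical names, and the default of two retries mirrors what Lambda does by default for asynchronous invocations (the real service also waits between attempts, which is skipped here).

```python
def invoke_with_retries(handler, event, dead_letter_queue, max_retries=2):
    """Call handler(event); after 1 + max_retries failures, park the event."""
    attempts = 0
    last_error = None
    while attempts <= max_retries:
        attempts += 1
        try:
            return handler(event)
        except Exception as error:
            last_error = error  # remember why the latest attempt failed
    dead_letter_queue.append(
        {"event": event, "error": str(last_error), "attempts": attempts}
    )
    return None


calls = {"flaky": 0}


def flaky_handler(event):
    # Intermittent error: fails twice, then succeeds.
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TimeoutError("database temporarily unavailable")
    return "ok"


def broken_handler(event):
    # Permanent error: retrying never helps.
    raise ValueError("malformed event")


dlq = []
flaky_result = invoke_with_retries(flaky_handler, {"trip_id": 1}, dlq)
broken_result = invoke_with_retries(broken_handler, {"trip_id": 2}, dlq)
```

Only the event handled by `broken_handler` ends up in `dlq`; the flaky one recovers on its third attempt and never reaches the queue.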

6: Common layer creation (Lambda Layer)

To provide a uniform way to access the database and avoid code duplication, a Lambda Layer is created as a common layer shared by all Lambda functions. This layer can be created in the following way:
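One detail worth illustrating is the archive layout: for Python runtimes, the contents of a layer must live under a top-level python/ folder so that they become importable. The sketch below packages a single shared module that way using only the standard library. The module name `common_db`, its `database_settings` function, and the environment variable names are hypothetical; the real shared code would hold the actual database helpers.

```python
import io
import zipfile

# Hypothetical shared module; names and defaults are assumptions.
COMMON_MODULE = '''\
import os


def database_settings():
    """Read connection settings from environment variables."""
    return {
        "host": os.environ.get("DB_HOST", "localhost"),
        "port": int(os.environ.get("DB_PORT", "5432")),
        "dbname": os.environ.get("DB_NAME", "route"),
    }
'''


def build_layer_zip(module_name, source):
    """Return the bytes of a layer archive holding one module under python/."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Layers are extracted under /opt, and Python runtimes look for
        # modules in /opt/python, hence the python/ prefix.
        archive.writestr(f"python/{module_name}.py", source)
    return buffer.getvalue()


layer_bytes = build_layer_zip("common_db", COMMON_MODULE)

# Inspect the archive to confirm the layout before publishing it.
with zipfile.ZipFile(io.BytesIO(layer_bytes)) as archive:
    layer_files = archive.namelist()
```

Writing `layer_bytes` to a .zip file yields an archive that can be published as a layer version; functions using the layer could then run `import common_db`.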

7: Lambda function creation

We’re finally able to deploy our code! 🥳

A Lambda function requires a default entry method when it’s triggered. Here, we’re using the AWS default def lambda_handler(event, context):
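A minimal sketch of what a handler for T1 could look like is shown below. It is not the article's original code: the 24-hour limit, the event fields, the `NEXT_TOPIC_ARN` variable, and the extra `publish` parameter (added so the handler can be exercised locally with a fake publisher) are all assumptions, and database access through the common layer is left out.

```python
import json
import os

TIME_LIMIT_HOURS = 24  # hypothetical business rule


def publish_to_sns(topic_arn, message):
    # boto3 is available in the Lambda Python runtime; it is imported lazily
    # so the handler can be run locally without it.
    import boto3

    boto3.client("sns").publish(TopicArn=topic_arn, Message=message)


def lambda_handler(event, context, publish=publish_to_sns):
    # Assumption: the event carries a trip id and the hours left before the trip.
    if event["hours_until_trip"] < TIME_LIMIT_HOURS:
        # Simplification: raising marks the invocation as failed, so an
        # asynchronous invocation is retried and then sent to the DLQ.
        raise ValueError("cancellation requested after the time limit")

    # Success: publish to the topic that triggers the next local transaction.
    topic_arn = os.environ.get(
        "NEXT_TOPIC_ARN", "arn:aws:sns:us-east-1:123456789012:t2-topic"
    )
    message = json.dumps({"trip_id": event["trip_id"], "status": "TIME_LIMIT_OK"})
    publish(topic_arn, message)
    return {"statusCode": 200, "body": message}
```

Lambda itself only passes `event` and `context`, so the default publisher is used in AWS. Locally, something like `lambda_handler({"trip_id": 7, "hours_until_trip": 48}, None, publish=print)` exercises the success path without touching SNS.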

Well, you already know, there’s no…

particular reason for choosing Python. I know, it’s clear as a bell 😑

Alright… Once you understand the code, just create the Lambda function:

After this step, the Lambda created will be something like:

Note that there’s no trigger defined for the T1 transaction: its initial trigger is the user, and each application can define the best interface for this trigger using resources like API Gateway, Load Balancer, SQS, or SNS, among others. For exploratory purposes, the initial trigger here is running the Lambda function from the AWS Console.

Furthermore, it’s worth validating that the DLQ configuration is correct. It’s available in the console under Configuration -> Asynchronous invocation.

Voilà! 🥳

You’re ready to test the Saga pattern via choreography!

Saga Pattern via Orchestration

With choreography, in addition to creating each resource, we must also define how the resources interact with each other (for example, which topic is responsible for triggering a Lambda function). In orchestration, this management is delegated to another service and, as we’re using AWS, we can leverage AWS Step Functions to define the steps required to trigger each Lambda. Because of this, there are relevant differences between the choreography and orchestration diagrams:

In the above diagram, the previous explanations of the local transactions as Lambda (B) and the database (C) are still valid. However, the trigger (A) and the decision made after executing the local transactions (D) are affected when Step Functions is used:

Trigger (A): the trigger is still responsible for defining how each local transaction (Lambda) is triggered, but now it’s AWS’s responsibility to decide which underlying resource is used to achieve this, and we don’t need to control these resources explicitly. Therefore, we no longer need an SNS topic.
Decision made after executing the local transactions (D): as Step Functions is the orchestrator, our only responsibility is to define the steps of the workflow, and AWS is in charge of controlling the execution (both on success and on failure).

Cool, but how can we configure this magic?

1: AWS Console Configuration

Unlike the previous steps, for orchestration we’ll use the console instead of the CLI to leverage the Step Functions visual editor. As each AWS account has its own details, it’s infeasible to cover all possible configurations here, but keep in mind that the user in this stage must have, at least, permissions to access the console and edit workflows in Step Functions.

2: Reuse the resources created before

The steps required for database, DLQ, common layer, and Lambda function creation are still valid here and can be reused.

Optionally, you might want to create a new Lambda function to compare the previous configuration and this new one. It’s up to you.

3: Create the workflow in Step Functions

Search for Step Functions in the AWS Console and click on State Machines to see all the workflows already created in your account:

Select the type that best suits your needs; if you’re also here for exploratory purposes, the Standard option is enough:

The following page is the Step Functions’ visual editor:

On the left side, choose AWS Lambda: Invoke and drag it into the editor. This opens a window on the right, in which only the name of the function (in this case route-check-time:$LATEST) and the name of the step itself (in this case T1: Check Time) need to be configured right now:

Repeat the previous action for all Lambda functions required until you have a workflow similar to this:

We also need to define which action must be taken in case of errors. For that purpose, select one of the steps and open the Error Handling tab on the right side:

Now, select Add new catcher and set the error to States.ALL. This tells Step Functions that the catcher’s fallback step must be triggered in case of any error.

The action to take is sending a message to the DLQ. For that, use the Fallback state option with the value Add new state to create a new empty step in the workflow:

On the left side, search for SQS, select the Amazon SQS: SendMessage option, and drag it into the editor. After you drop the new step, a configuration window opens on the right side, in which only the SQS queue URL and the step name need to be filled in:

Repeat the previous action for all DLQs until you have a workflow similar to this:

As all steps are now defined, just click on Next. On the next page, review the generated JSON and click on Next. On the last page, specify the workflow configuration (such as name, permissions, and log level) and finish by clicking on Create state machine.
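For reference, the JSON shown on the review page is written in Amazon States Language. The Python sketch below builds a definition for a single step with its catch-all fallback, roughly in the shape the visual editor produces. The queue URL and the name of the fallback step are hypothetical, and in the full workflow T1 would point to the next step with a Next field instead of ending the execution.

```python
import json

# Function and step names come from the walkthrough; the queue URL and the
# fallback step name are hypothetical placeholders.
FUNCTION_NAME = "route-check-time:$LATEST"
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/t1-dlq"

definition = {
    "Comment": "Sketch of one saga step with a catch-all fallback",
    "StartAt": "T1: Check Time",
    "States": {
        "T1: Check Time": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": FUNCTION_NAME, "Payload.$": "$"},
            # States.ALL matches any error, so every failure goes to the fallback.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "T1: Send to DLQ"}],
            "End": True,  # simplification: the real workflow continues to T2
        },
        "T1: Send to DLQ": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage",
            "Parameters": {"QueueUrl": DLQ_URL, "MessageBody.$": "$"},
            "End": True,
        },
    },
}

definition_json = json.dumps(definition, indent=2)
```

Comparing this structure with the generated JSON is a quick way to confirm that every Lambda step has a catcher and that each catcher points to the intended SQS step.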

If you’ve created a new Lambda function, you can see the differences in its configuration once you’ve finished creating the Step Functions workflow. Consider the route-check-time function:

Take a look at the destination configuration and note that it’s now empty: this is due to the fact that Step Functions is responsible for defining the whole workflow execution without requiring an explicit configuration in the Lambda function.

The same happens with the DLQ configuration: