Connecting Steps in a Sagemaker Pipeline DAG

Casey Whorton · Published in Nerd For Tech · Sep 21, 2023

A small change can fix this DAG

In this post, I’ll be discussing a small code addition to any Sagemaker Pipeline that gives the Pipeline’s diagram a fully connected look when using a LambdaStep. I don’t know about you, but it bothers me when there are no directional arrows logically connecting steps in a diagram where there should be. I recently added an AWS Lambda function to the beginning of a Sagemaker Pipeline and was unsure of the mechanism for getting these pipeline steps to show up connected in the Directed Acyclic Graph (DAG).

AWS Lambda functions can offer lightweight, flexible, and customizable options for preprocessing in machine learning pipelines. I personally don’t see them used often in Sagemaker Pipeline documentation and examples; instead, I tend to see a Sagemaker ProcessingStep with a supplied Python file containing the data transformations. (That may be a preferred pattern for Sagemaker, but there may be situations where a custom Lambda function could offer something that the ProcessingStep cannot.)

But what if a Lambda function already exists that performs the data preprocessing that we want, or we want to benefit from a serverless solution? Luckily, a LambdaStep is a type of Sagemaker Pipeline step that can fit into any pipeline.

Let’s say we have a Lambda function like the one seen below:
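Here is a minimal sketch of what such a handler might look like. The bucket, object keys, and the dropna() transformation are hypothetical placeholders, and the handler assumes pandas is available to the function (for example, via a Lambda layer); any preprocessing logic works, as long as the handler returns the values we want downstream steps to consume.

```python
import json

import boto3
import pandas as pd

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # Hypothetical locations -- substitute your own bucket and keys
    bucket = "my-ml-bucket"
    raw_key = "raw/data.csv"
    train_key = "processed/train.csv"

    # Read the raw data from S3
    raw_obj = s3.get_object(Bucket=bucket, Key=raw_key)
    df = pd.read_csv(raw_obj["Body"])

    # Example preprocessing: drop rows with missing values
    df = df.dropna()

    # Write the processed training data back to S3
    s3.put_object(Bucket=bucket, Key=train_key, Body=df.to_csv(index=False))

    # The returned dictionary is what a LambdaStep can expose to later steps
    return {
        "statusCode": 200,
        "body": json.dumps("Preprocessing complete"),
        "train_path": f"s3://{bucket}/{train_key}",
    }
```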

This Lambda function performs some preprocessing and writes the data to an object on S3; call that location train_path. Calling it directly from a Sagemaker Pipeline is easy, too: for a basic execution, all you need is the Lambda function’s ARN and the Sagemaker execution role ARN:
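A minimal LambdaStep definition might look like the sketch below; the two ARNs are hypothetical placeholders for your own.

```python
from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import LambdaStep

# Hypothetical ARNs -- substitute your own
lambda_function_arn = "arn:aws:lambda:us-east-1:111122223333:function:preprocess-data"
execution_role_arn = "arn:aws:iam::111122223333:role/SagemakerExecutionRole"

# Point the helper at the existing Lambda function
preprocess_lambda = Lambda(
    function_arn=lambda_function_arn,
    execution_role_arn=execution_role_arn,
)

step_lambda = LambdaStep(
    name="PreprocessData",
    lambda_func=preprocess_lambda,
)
```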

You can create a model training step individually and run it after defining the Sagemaker Pipeline. It will still run without referencing any other step’s properties, but it will appear by itself in the DAG, which gives the wrong impression when you share the DAG with an audience. We can define a training step and hard-code the model’s training data location on S3 using something like the code below, but this loses the benefit of the custom Lambda function or of any other preprocessing step that dynamically creates training data or artifacts.
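A sketch of that hard-coded version, where the XGBoost image, instance type, and S3 paths are stand-ins for your own estimator configuration:

```python
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

# Hypothetical estimator -- substitute your own image, role, and instance type
training_image_uri = image_uris.retrieve(
    framework="xgboost", region="us-east-1", version="1.7-1"
)
xgb_estimator = Estimator(
    image_uri=training_image_uri,
    role=execution_role_arn,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/model-artifacts/",
)

# Hard-coded S3 path: this runs, but ignores whatever the Lambda step produced
step_train = TrainingStep(
    name="TrainModel",
    estimator=xgb_estimator,
    inputs={
        "train": TrainingInput(
            s3_data="s3://my-ml-bucket/processed/train.csv",
            content_type="text/csv",
        )
    },
)
```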

Manually adding the training data path to the TrainingStep technically works, but it is not a dynamic approach.

What’s worse is that when you view the Sagemaker Pipeline during or after its execution, the LambdaStep is not connected to any other step. Steps are linked in the DAG by creating a dependency between them, and that means using the properties of each step.

According to the documentation, passing a property or output of one defined step as the input to another pipeline step creates the dependency (https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#step-dependencies).

For a LambdaStep, you can specify any values to be passed from the Lambda function to subsequent pipeline steps by defining them in the dictionary that the Lambda function returns. In this example, I have a Lambda function that performs some data preprocessing, saves the processed data in another S3 location, and returns three values: a status code, a response body, and the training data path that I want to use in the model training step. (See the code gist at the top of the post.)

Let’s define a LambdaOutput for each of the Lambda function’s return values and list them in the LambdaStep, so one or more of them can be picked up by subsequent steps:
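A sketch of the updated step definition, with one LambdaOutput per key in the handler’s return dictionary shown earlier:

```python
from sagemaker.workflow.lambda_step import (
    LambdaOutput,
    LambdaOutputTypeEnum,
    LambdaStep,
)

# One LambdaOutput per key in the Lambda function's return dictionary;
# the output_name must match the key name exactly
output_status_code = LambdaOutput(
    output_name="statusCode", output_type=LambdaOutputTypeEnum.String
)
output_body = LambdaOutput(
    output_name="body", output_type=LambdaOutputTypeEnum.String
)
output_train_path = LambdaOutput(
    output_name="train_path", output_type=LambdaOutputTypeEnum.String
)

step_lambda = LambdaStep(
    name="PreprocessData",
    lambda_func=preprocess_lambda,
    outputs=[output_status_code, output_body, output_train_path],
)
```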

The only big changes here are the addition of three LambdaOutput parameters representing what the Lambda function returns, and a list of those parameters passed to the LambdaStep’s outputs argument.

By using a property of the LambdaStep when defining the TrainingStep, I was able to get the steps to appear as connected in the Sagemaker Pipeline DAG. See the gist and image below:
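A sketch of the updated TrainingStep, reusing the hypothetical estimator from earlier:

```python
step_train = TrainingStep(
    name="TrainModel",
    estimator=xgb_estimator,
    inputs={
        "train": TrainingInput(
            # Reference the "train_path" key returned by the Lambda function;
            # this property reference is what creates the dependency and
            # draws the arrow in the DAG
            s3_data=step_lambda.properties.Outputs["train_path"],
            content_type="text/csv",
        )
    },
)
```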

Now the TrainingStep refers to something from the LambdaStep

There are two things to notice:

  1. Instead of hard-coding the training data filepath, we used a property of the LambdaStep as an input. This is what creates the dependency.
  2. We reference the key from the key-value pair returned by the Lambda function (train_path), not the Python variable name of the LambdaOutput.

The results are great! Instead of having the LambdaStep (or any other step) floating by itself in the DAG, it is now directly connected to the model training step. Similarly, the CreateModelStep uses the training step’s ModelArtifacts property to connect the training step to the create model step.
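A sketch of that connection, with the model container and instance type as hypothetical placeholders:

```python
from sagemaker.inputs import CreateModelInput
from sagemaker.model import Model
from sagemaker.workflow.steps import CreateModelStep

model = Model(
    image_uri=training_image_uri,
    # Referencing the training step's ModelArtifacts property creates the
    # dependency, so the DAG draws an arrow from TrainModel to CreateModel
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    role=execution_role_arn,
)

step_create_model = CreateModelStep(
    name="CreateModel",
    model=model,
    inputs=CreateModelInput(instance_type="ml.m5.large"),
)
```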

For all other pipeline step types (TrainingStep, ProcessingStep, TuningStep, etc.), there are different properties that can be used to create dependencies between pipeline steps; the documentation linked in the resources section lists them. Similar to the pattern seen here with the LambdaStep, finding out how to access and pass on the relevant step properties is the key to maintaining a fully connected DAG for your Sagemaker Pipelines. The steps can then be assembled into a pipeline, as in the sketch below.
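Note that there is no explicit ordering here; the property references above are enough for Sagemaker to infer the dependencies and render a fully connected DAG:

```python
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="PreprocessTrainCreateModel",
    steps=[step_lambda, step_train, step_create_model],
)

# Create or update the pipeline definition, then kick off an execution
pipeline.upsert(role_arn=execution_role_arn)
execution = pipeline.start()
```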

Resources

https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#data-dependency-property-reference
