August 22, 2022
Infrastructure is key to Paradigm, and optimizing our AWS setup is key to achieving low latency for our clients. Below we take a deep dive into one particular problem our engineers have solved, and we are open-sourcing our solution!
To work around an AWS limitation, namely that ECR repositories cannot be created on the fly, we use an AWS Lambda function triggered by failure events logged in AWS CloudTrail to create missing ECR repos.
Here at Paradigm we make heavy use of Kubernetes and microservices. A significant portion of our infrastructure is in AWS, which means that developers and infrastructure engineers interact with AWS technologies such as Elastic Kubernetes Service (EKS) and Elastic Container Registry (ECR), among others.
ECR holds the Docker container images with microservice code. A developer or preferably a CI/CD process pushes those images to our private ECR registry. These images are later used to create service instances in the EKS cluster.
The general idea is that a single repository holds many variants of a service image, and the tag identifies each unique image version. The Docker image URL for ECR repositories uses the following format:
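As an illustration, an ECR image URI has the general shape <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<repo_name>:<tag>. The short Python sketch below assembles one from its parts. The account ID, region, and tag are placeholders, and the helper name ecr_image_uri is hypothetical.

```python
def ecr_image_uri(account_id: str, region: str, repo_name: str, tag: str) -> str:
    """Build an ECR image URI from its parts."""
    # The registry host is derived from the AWS account ID and region.
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    # The repository name may contain slashes, e.g. "backend/proxy".
    return f"{registry}/{repo_name}:{tag}"


# Placeholder account ID, region, and tag; "backend/proxy" is the repo name.
print(ecr_image_uri("123456789012", "us-east-1", "backend/proxy", "v1.2.3"))
# prints 123456789012.dkr.ecr.us-east-1.amazonaws.com/backend/proxy:v1.2.3
```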
For a more detailed breakdown of the image URI scheme format see the Appendix.
Unlike other registries, AWS ECR does not automatically create a repository on a push event (e.g. on docker push). The image repo (<repo_name>, or backend/proxy above) has to be created before it can accept pushes. Once the repo exists, any tagged image variant can be pushed to it. This is in contrast to popular image registries such as Docker Hub, Quay.io, JFrog, and others, which conveniently create a repository on push if one doesn’t already exist.
This means that when creating a new microservice, developers require additional AWS permissions to create a repository to hold the images for that microservice, before the images can be pushed. This is an annoyance for developers and an annoyance for the infrastructure team.
But what if we could change that? Could we figure out a way to create ECR repositories automatically?
It turns out we can!
For the solution we tie together a few pieces of AWS wizardry:
- AWS CloudTrail logs an event when an attempt is made to push a docker image to a repository that doesn’t exist.
- An AWS EventBridge rule triggers an AWS Lambda function on the above CloudTrail log event.
- The AWS Lambda function makes an API call to create the missing repository.
Let’s get into some detail. AWS CloudTrail aggregates events for almost everything that happens inside AWS. These event logs have a lot of useful data, such as event name, source, resource, and other details, depending on the type of event. When something or someone attempts to push to a repository that doesn’t exist, an event named “InitiateLayerUpload” is logged in CloudTrail with a “RepositoryNotFoundException” error. We can create an EventBridge rule to watch for these events. This EventBridge rule triggers our AWS Lambda function to create the desired ECR repository.
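To make the flow above more concrete, here is a minimal Python sketch of what the EventBridge event pattern and the matching logic might look like. It is not taken from the released module. The pattern fields follow the usual shape of CloudTrail API-call events delivered through EventBridge, and reading the repository name from requestParameters is an assumption about the event payload. The names EVENT_PATTERN and missing_repo_name are hypothetical.

```python
import json
from typing import Optional

# Assumed EventBridge event pattern: match failed InitiateLayerUpload calls
# that CloudTrail recorded with a RepositoryNotFoundException error.
EVENT_PATTERN = {
    "source": ["aws.ecr"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ecr.amazonaws.com"],
        "eventName": ["InitiateLayerUpload"],
        "errorCode": ["RepositoryNotFoundException"],
    },
}


def missing_repo_name(event: dict) -> Optional[str]:
    """Return the repository name from a matching event, or None otherwise."""
    detail = event.get("detail") or {}
    if detail.get("eventName") != "InitiateLayerUpload":
        return None
    if detail.get("errorCode") != "RepositoryNotFoundException":
        return None
    # Assumption: the failed request's parameters carry the repository name.
    return (detail.get("requestParameters") or {}).get("repositoryName")


# The pattern would be serialized to JSON when defining the rule.
print(json.dumps(EVENT_PATTERN, indent=2))
```

Re-checking the event name and error code inside the function is redundant when the rule already filters on them, but it keeps the Lambda safe if it is ever invoked with an unexpected event.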
Since we use Terraform to provision our AWS resources, we can create a Terraform module with all the necessary resources to set it up.
In Terraform we create some resources:
- “aws_cloudwatch_log_group” to store our Lambda logs
- “aws_iam_role” with appropriate “aws_iam_role_policy” to permit our Lambda to perform desired actions on the ECR repos
- “aws_lambda_function” itself, with some Python code relying on the “boto3” (AWS SDK for Python) to do all the work
- “aws_cloudwatch_event_rule” (EventBridge rule), with “aws_cloudwatch_event_target” to kick off our Lambda on rule firing
All of these resources become part of our Terraform module, allowing us to group and package them for reuse. The module takes input through a number of variables to allow some customization, such as the Lambda log retention period, image tag mutability, repository lifecycle policy, scan-on-push setting, repo tags, and, finally, a list of ECR repo prefixes. The repo prefixes enforce a repo naming pattern by preventing the Lambda function from creating repositories with unapproved names.
With all of this in place, our ECR repositories are created on demand before the docker push operation fails, because the Docker CLI automatically retries a failed push up to 4 times. While the Docker CLI is retrying, the Lambda function does its thing, one of the retries succeeds, and users hardly notice the transient failure.
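The prefix check and the repository settings described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the module's actual code. The function names and parameters are hypothetical, and the ECR client is passed in so the logic can be exercised without AWS credentials. In a real Lambda the client would typically be boto3.client("ecr"), whose create_repository and put_lifecycle_policy calls accept the keyword arguments used here.

```python
from typing import Dict, Iterable, Optional


def is_allowed(repo_name: str, prefixes: Iterable[str]) -> bool:
    """Check the repository name against the approved prefixes."""
    return any(repo_name.startswith(prefix) for prefix in prefixes)


def create_repo_if_allowed(
    ecr_client,
    repo_name: str,
    prefixes: Iterable[str],
    *,
    tag_mutability: str = "MUTABLE",
    scan_on_push: bool = True,
    tags: Optional[Dict[str, str]] = None,
    lifecycle_policy: Optional[str] = None,
) -> bool:
    """Create the repository if its name is approved; return True if created."""
    if not is_allowed(repo_name, prefixes):
        return False  # Unapproved name: do nothing.
    kwargs = {
        "repositoryName": repo_name,
        "imageTagMutability": tag_mutability,
        "imageScanningConfiguration": {"scanOnPush": scan_on_push},
    }
    if tags:
        # ECR expects tags as a list of Key/Value pairs.
        kwargs["tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    ecr_client.create_repository(**kwargs)
    if lifecycle_policy:
        # The lifecycle policy is passed as a JSON document in a string.
        ecr_client.put_lifecycle_policy(
            repositoryName=repo_name, lifecyclePolicyText=lifecycle_policy
        )
    return True
```

Because a single push can emit several failure events (one per layer upload attempt), a production handler would likely also need to tolerate the repository already existing by the time it runs. Note too that a prefix such as backend (without a trailing slash) would also match a name like backend-tools, so prefixes ending in a slash give a tighter match.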
We’re excited to release this solution today as an open source Terraform module. It’s available at https://github.com/tradeparadigm/terraform-aws-ecr-repo-lambda. Contributions are both welcome and encouraged!
If you’re interested in working in an environment where sharing such things is both possible and appreciated, we’re hiring!
The following Backus–Naur-like grammar describes a valid ECR image URI, as needed to push an image to an ECR repo: