4 Apr, 2023
Hopefully, it’s all deployed as infrastructure as code, but that doesn’t help you keep track of infrastructure and systems once they are deployed. You have other monitoring and alerting tools for that. Some are built into AWS, such as CloudWatch, Inspector, AWS Config and Trusted Advisor (along with many more), and they can be a great help in understanding how your infrastructure and systems are behaving.
One AWS tool I’ve used previously to monitor specific AWS events that I’m interested in is Amazon EventBridge. EventBridge is a serverless service that uses events to connect application components together. It aims to provide a simple and consistent way to ingest, filter, transform, and deliver events. You can publish your own applications’ events onto EventBridge, and the events emitted by AWS services in your own account are published to it as well. From a Lambda function being invoked, to a message arriving in a queue, to an auto-scaling group being triggered, it’s all captured.
The use case we’ll explore today using EventBridge is a problem I had at a previous company. I was in the cloud/platform team, and there were development teams who weren’t owning their applications once they were deployed in AWS. They would hit deploy on their CI/CD tool, and if it went green and no alerts went off, that was enough for them to be satisfied. What we would sometimes find is things happening on the AWS side, like ECS tasks continuously failing, that indicated something was wrong with the application, but the teams didn’t know about it. We wanted a way to surface this information easily, without having to tell teams to log into the AWS console or, even worse, having an outage or an incident because an issue had been occurring behind the scenes for a long time.
In this use case, teams deployed into ECS, so naturally all events relating to tasks/services failing were being captured in EventBridge. We just had to figure out which events we cared about (we found out there were a lot we didn’t). The first step was creating a rule to capture these. A rule is basically a filter on an EventBridge event bus that describes which events you care about, i.e. which AWS service the events come from and what triggers them.
During this discovery process, we found it easier to create some rules manually while we were getting the hang of which events we wanted, because the UI has some tools to help step you through. You can use an EventBridge schema, an event pattern or a custom pattern to define which events you want the rule to capture. This screenshot is from the event pattern option, where you can use the AWS service dropdown to pick the service you want and see the event options. It’s not visible on the screenshot, but all AWS events are available on the ‘default’ event bus.
To get information about ECS service failures, we ended up with the following event pattern:
"ECS Service Action"
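As a rough sketch of what a pattern built around that detail type could look like, here it is as a Python dictionary, along with a tiny local matcher to show how a pattern filters events. The `aws.ecs` source is an assumption about the rest of the pattern, the sample event is made up, and `matches` is a hypothetical, heavily simplified stand-in for EventBridge's own matching logic (which supports prefixes, numeric ranges and more).

```python
import json

# Only the "ECS Service Action" detail type comes from the rule described above;
# the "aws.ecs" source is an assumption added for illustration.
ECS_SERVICE_ACTION_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Service Action"],
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every key in the pattern must exist in the event,
    and the event's value must be one of the values listed in the pattern.
    Nested dictionaries in the pattern are matched recursively."""
    for key, allowed in pattern.items():
        value = event.get(key)
        if isinstance(allowed, dict):
            if not isinstance(value, dict) or not matches(allowed, value):
                return False
        elif value not in allowed:
            return False
    return True


# Made-up event carrying only the fields the pattern inspects.
sample_event = {
    "source": "aws.ecs",
    "detail-type": "ECS Service Action",
    "detail": {"eventType": "WARN"},
}

print(json.dumps(ECS_SERVICE_ACTION_PATTERN, indent=2))
print(matches(ECS_SERVICE_ACTION_PATTERN, sample_event))
```

The JSON printed by the first `print` is the shape you would paste into the custom pattern editor when creating the rule.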
Filtering on the service actions meant we had events with a whole bunch of data (which ECS cluster, timing, service name), and it also let us pick out the warning/failure events that we actually cared about (there were a whole lot of information events that we did not care about). These events had useful event types and names, for example:
The service is unable to start tasks successfully. This could be for a number of reasons (and often application logs would help diagnose further).
We had EC2-backed ECS clusters, and if teams had set up incorrect autoscaling rules, new containers couldn’t be placed because of resource constraints (and the event would tell us which resource was constrained, e.g. RESOURCE:CPU or RESOURCE:MEMORY).
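To make the filtering concrete, here is a minimal sketch of how a consumer might keep only the warning and error events and pull out the constrained resource. The field names (`eventType`, `eventName`, `clusterArn`, `reason`) and the event names follow the AWS documentation for ECS service action events as best I understand it, so treat them as assumptions. The ARNs, the service name and the `summarise` helper are made up for illustration.

```python
from typing import Optional

# Assumed severity values; informational events are dropped.
ACTIONABLE_TYPES = {"WARN", "ERROR"}


def summarise(event: dict) -> Optional[dict]:
    """Return a small summary for warning/error events, or None for the rest."""
    detail = event.get("detail", {})
    if detail.get("eventType") not in ACTIONABLE_TYPES:
        return None
    resources = event.get("resources", [])
    return {
        "severity": detail["eventType"],
        "name": detail.get("eventName", "UNKNOWN"),
        # ARNs end in ".../<name>", so the last path segment is the readable part.
        "cluster": detail.get("clusterArn", "unknown").rsplit("/", 1)[-1],
        "service": resources[0].rsplit("/", 1)[-1] if resources else "unknown",
        # Only present on some events, e.g. "RESOURCE:CPU" or "RESOURCE:MEMORY".
        "reason": detail.get("reason"),
    }


# Made-up example of a placement failure on a memory-constrained cluster.
placement_failure = {
    "source": "aws.ecs",
    "detail-type": "ECS Service Action",
    "resources": ["arn:aws:ecs:us-west-2:111122223333:service/demo-cluster/checkout"],
    "detail": {
        "eventType": "ERROR",
        "eventName": "SERVICE_TASK_PLACEMENT_FAILURE",
        "clusterArn": "arn:aws:ecs:us-west-2:111122223333:cluster/demo-cluster",
        "reason": "RESOURCE:MEMORY",
    },
}

# Made-up informational event that should be ignored.
steady_state = {
    "source": "aws.ecs",
    "detail-type": "ECS Service Action",
    "resources": ["arn:aws:ecs:us-west-2:111122223333:service/demo-cluster/checkout"],
    "detail": {"eventType": "INFO", "eventName": "SERVICE_STEADY_STATE"},
}

print(summarise(placement_failure))
print(summarise(steady_state))
```

In practice you could push the severity filter into the rule's event pattern instead, so the informational events never reach your code at all.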
Once you have your rules defined, you get to decide where the events go. AWS has some good documentation on EventBridge destinations, along with partner integrations including Zendesk, Splunk, Sumo Logic, Slack and Datadog. You can also use events to trigger things within your AWS account, e.g. a Lambda function, or send them across accounts, but for our use case we wanted to get the event out of AWS to a place where the development teams would more easily see it. We chose Slack, because it was something the development teams used regularly, and we wanted these notifications to be easily visible and quick to action.
We wrote a small Slack bot that would pull the reason for the ECS failure out of the event and send it as a Slack message. As we learned more about the events, and picked up some useful tips for resolving them, we would include these in the Slack messages/warnings we sent to teams. We found it really effective in encouraging developers to take more ownership of the operation of their applications in a cloud environment, as opposed to before, when these failures were still happening but were invisible to the development teams. The outcome was a lot more discussion between developers and the cloud/platform team about service health (which was exactly what we wanted), and it resulted in finding lots of failures in ECS deployments before they caused incidents.
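The bot itself isn't reproduced here, but a minimal sketch of the idea, assuming a Slack incoming webhook is used for delivery, might look like this. The `TIPS` table, the function names and the message wording are all hypothetical, and the webhook URL is something you would supply yourself.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical troubleshooting tips keyed by the failure reason.
TIPS = {
    "RESOURCE:CPU": "Check the cluster's autoscaling rules and the task's CPU reservation.",
    "RESOURCE:MEMORY": "Check the cluster's autoscaling rules and the task's memory reservation.",
}


def build_slack_payload(service: str, cluster: str, event_name: str,
                        reason: Optional[str] = None) -> dict:
    """Build the JSON body for a Slack incoming webhook (a single 'text' field)."""
    lines = [f":warning: ECS service `{service}` on cluster `{cluster}` reported `{event_name}`."]
    if reason:
        lines.append(f"Reason: `{reason}`")
        tip = TIPS.get(reason)
        if tip:
            lines.append(f"Tip: {tip}")
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to the webhook and return the HTTP status code.
    Not called in this sketch, since it needs a real webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


payload = build_slack_payload(
    service="checkout",
    cluster="demo-cluster",
    event_name="SERVICE_TASK_PLACEMENT_FAILURE",
    reason="RESOURCE:MEMORY",
)
print(payload["text"])
```

Keeping the message building separate from the sending makes it easy to add new tips as you learn more about each failure, without touching the delivery code.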
What I really wanted to emphasise with this use case is the power of having access to the events that happen in your AWS account. However, with great power comes great responsibility! Building random bots that send Slack messages based on arbitrary values is never useful, but if you have specific problems in your AWS account that could be solved by making them visible in other systems… consider EventBridge.