
22 July, 2025

Uplifting Observability in preparation for scale with Blinq


Background

Blinq is a platform for digital business cards, allowing individuals and teams to share their contact information seamlessly. The platform enables users to create customisable digital business cards, which can be shared through channels such as NFC, email signatures, Apple Wallet and widgets. The company strives to simplify networking by offering features like contact context enrichment, easy updates and CRM integration, improving sales and customer engagement in a secure, eco-friendly and efficient manner.

The Challenge

Blinq is at an exciting point in the scale-up lifecycle, having recently received a significant injection of funding while its user base grows rapidly. Expanding the team needed to translate into increased development velocity, yet bringing fresh eyes to a platform carries inherent risks. How could Blinq help the new engineers rapidly develop a comprehensive mental model of the application?

Blinq recognised they could move faster and grow more effectively by improving observability coverage, especially as they look to drive more aggressive feature rollouts to aid customer acquisition. To support that goal, the Blinq engineering team committed to reducing Blinq's mean time to detection by expanding and enriching its telemetry data.

A co-sourced SRE team of Blinq and Midnyte City engineers was assembled to prioritise the observability uplift. The work was approached iteratively, embedding the small SRE squad into product engineering teams. This delivered quick wins that improved Blinq's existing observability practices and accelerated the adoption of an observability culture, because the squad worked with the wider engineering team on areas and issues relevant and familiar to each team's product domain.

By prioritising application observability during this critical phase of organisational growth, Blinq avoided trading customer happiness for the development speed the business requires.

The Solution

Addressing platform stability was the highest priority. Blinq’s leadership recognised that beyond shipping impactful customer features, they needed to allocate engineering bandwidth toward team growth and observability maturity as well.

One goal was to make application telemetry comprehensive enough to know an incident is happening before customers experience degraded service, using Real User Monitoring (RUM) across three surfaces (a minimal instrumentation sketch follows the list):

  • Mobile client

  • Browser client

  • Cloudflare Workers
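As a concrete illustration, browser-side RUM instrumentation is typically a small initialisation snippet. The sketch below uses Datadog's browser RUM SDK (Datadog being the tool Blinq consolidated on); the application ID, client token and service name are placeholders, not Blinq's real configuration.

```typescript
import { datadogRum } from '@datadog/browser-rum';

// Placeholder values only - not Blinq's actual configuration.
datadogRum.init({
  applicationId: '<APPLICATION_ID>',
  clientToken: '<CLIENT_TOKEN>',
  site: 'datadoghq.com',
  service: 'web-client',        // hypothetical service name
  env: 'production',
  sessionSampleRate: 100,       // capture every session while establishing a baseline
  trackUserInteractions: true,  // record clicks and taps as RUM actions
  trackResources: true,         // capture asset and XHR/fetch timings
});
```

Once the SDK is initialised, page views, resource timings and frontend errors flow into the same backend as server-side telemetry, which is what makes the single-pane-of-glass view described below possible.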

Telemetry instrumentation was implemented with the engineering teams who owned these application components, and dashboards were created for visibility. Dashboards became an effective tool for fostering an ongoing culture of observability: giving application owners telemetry data created a feedback loop that individuals and teams used to carry insights into the next sprint.
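Dashboards themselves can be defined as code rather than assembled by hand, which makes example dashboards easy for teams to copy and adapt. A minimal sketch using Datadog's Node.js API client; the dashboard title and metric query are hypothetical:

```typescript
import { client, v1 } from '@datadog/datadog-api-client';

// Reads DD_API_KEY and DD_APP_KEY from the environment.
const configuration = client.createConfiguration();
const dashboards = new v1.DashboardsApi(configuration);

// Hypothetical dashboard: one timeseries widget tracking p95 request latency.
dashboards
  .createDashboard({
    body: {
      title: 'API - Golden Signals (example)',
      layoutType: 'ordered',
      widgets: [
        {
          definition: {
            type: 'timeseries',
            title: 'p95 request latency (ms)',
            requests: [{ q: 'p95:http.request.duration_ms{env:production}' }],
          },
        },
      ],
    },
  })
  .then((resp) => console.log(`Created dashboard ${resp.id}`));
```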

The pattern of identifying an area for uplift, working with the owners of that code to implement telemetry instrumentation, and then creating example dashboards from the resulting data proved a powerful approach to organisational observability uplift. The clearest indicator of its effectiveness was that engineering teams quickly began building their own dashboards from the uplifted telemetry data, demonstrating a shift in their understanding of application observability and an organisation-wide increase in maturity.

Dashboards became Blinq's "Single Pane of Glass", replacing the need to consult several sources of information. When telemetry data is sent to one tool (and contains the necessary information), that tool becomes the single repository for understanding the application's behaviour.

Increasing platform stability and observability maturity gave Blinq’s engineering team more bandwidth to learn about observability and explore new practices for connecting engineering effort to business outcomes.

Next, the team configured alerts on the newly available golden signal metrics. Infrastructure metrics like CPU or memory utilisation are often the most readily accessible. While these can be indicators of at-risk business outcomes (if CPU utilisation is high for a long time, something may be wrong and affecting customer experience), they are lagging indicators at best, and noise that escalates alert fatigue at worst. When everything is an alert, nothing is.

Mature observability involves deriving metrics from application telemetry that more closely model customer happiness, and therefore business outcomes. If a CPU utilisation alert is a true indicator of customers having a degraded experience, how long were they negatively impacted before the alert fired? Conversely, if CPU utilisation is high but there are no downstream symptoms affecting user experience, then no business outcome is at risk and the resources are simply being used effectively. A minimal sketch of a symptom-based alert follows.
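A symptom-based alert targets what users feel (latency, errors) rather than what machines feel (CPU). A sketch using Datadog's Node.js API client; the metric name, threshold and notification handle are illustrative, not Blinq's actual monitors:

```typescript
import { client, v1 } from '@datadog/datadog-api-client';

const configuration = client.createConfiguration();
const monitors = new v1.MonitorsApi(configuration);

// Alert on a user-facing symptom: sustained p95 latency above 500 ms.
monitors
  .createMonitor({
    body: {
      name: 'p95 request latency is elevated',
      type: 'metric alert',
      query: 'avg(last_10m):p95:http.request.duration_ms{env:production} > 500',
      message:
        'p95 latency has exceeded 500ms for 10 minutes - customers are likely ' +
        'seeing slow responses. @slack-oncall', // hypothetical notification handle
      options: { thresholds: { critical: 500 } },
    },
  })
  .then((resp) => console.log(`Created monitor ${resp.id}`));
```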

The four Golden Signals are Latency, Traffic, Errors, and Saturation. For organisations early in their observability maturity journey, Latency and Errors are effective starting points.

Using latency as an example: once application telemetry is comprehensive enough to derive HTTP response times per request, that data can be aggregated into a metric. For a given period of time, what were the HTTP response times for user requests across all endpoints in the application, or across specific important endpoints? This is a far more effective model of user happiness than CPU utilisation or other "machine happiness" metrics: a 500ms increase in response time can lead to a 20% reduction in user traffic, a clear indicator of business outcomes at risk. A sketch of capturing such a metric from a Cloudflare Worker follows.
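On one of the surfaces instrumented above, a Cloudflare Worker can time each request and ship the duration to Datadog's metric intake. This is a minimal sketch: the metric name and secret binding are hypothetical, and a production setup would batch points or use distribution metrics for accurate percentiles.

```typescript
// A pass-through Cloudflare Worker that records per-request response times.
// Assumes a DD_API_KEY secret binding; the metric name is hypothetical.
export interface Env {
  DD_API_KEY: string;
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const start = Date.now();
    const response = await fetch(request); // forward to the origin
    const durationMs = Date.now() - start;

    // Report the timing after the response is returned, so users aren't delayed.
    ctx.waitUntil(reportLatency(request, response, durationMs, env));
    return response;
  },
};

// Sends a single gauge point to Datadog's v2 metric intake (type 3 = gauge).
async function reportLatency(req: Request, res: Response, durationMs: number, env: Env): Promise<void> {
  await fetch('https://api.datadoghq.com/api/v2/series', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'DD-API-KEY': env.DD_API_KEY },
    body: JSON.stringify({
      series: [
        {
          metric: 'http.request.duration_ms', // hypothetical metric name
          type: 3,
          points: [{ timestamp: Math.floor(Date.now() / 1000), value: durationMs }],
          tags: [`path:${new URL(req.url).pathname}`, `status:${res.status}`],
        },
      ],
    }),
  });
}
```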

Once Blinq’s application telemetry data was sufficiently comprehensive, and key dashboards tracking application signal-derived metrics like Latency and Error Rate were created and used effectively by the engineering team, the next step towards better observability maturity was to implement Service Level Objectives (SLOs).

The co-sourced Midnyte City and Blinq team ran a series of workshops with the operators of key pieces of application functionality and collaboratively defined SLOs on the new application-signal-based metrics. An example SLO might be: "Within a rolling 7-day window, 95% of response times on our important endpoints should not exceed 500ms". The workshop format is effective for setting initial SLOs, promoting a shared understanding of the observability data and tools, and reducing information silos by having team members consider which application signals matter for each service.

With SLOs in place, Blinq can track application performance over time and use data to drive work prioritisation. If a release increases response times on those key endpoints, that shows up as "burning down" the SLO's error budget. Product managers and teams can use that information to decide whether to continue with planned feature work or to re-prioritise and address the flawed release before business outcomes are at risk. The arithmetic behind the budget is simple, as the sketch below shows.
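A back-of-the-envelope sketch for the example SLO above; the sample data and function names are hypothetical:

```typescript
// Example SLO: over a rolling 7-day window, 95% of responses on important
// endpoints must complete within 500 ms.
const SLO_TARGET = 0.95;
const THRESHOLD_MS = 500;

interface Sample {
  endpoint: string;
  latencyMs: number;
}

function sloStatus(samples: Sample[]) {
  const good = samples.filter((s) => s.latencyMs <= THRESHOLD_MS).length;
  const compliance = good / samples.length;

  // The error budget is the fraction of requests allowed to breach the
  // threshold: 1 - 0.95 = 5%. Each slow request "burns" part of it.
  const budget = 1 - SLO_TARGET;
  const burned = (samples.length - good) / samples.length;

  return {
    compliance,                                               // e.g. 0.97
    budgetRemaining: Math.max(0, (budget - burned) / budget), // 1 = untouched, 0 = exhausted
  };
}

// Toy data: a release pushing more requests past 500 ms lowers budgetRemaining,
// signalling that reliability work should be prioritised over new features.
console.log(sloStatus([
  { endpoint: '/cards', latencyMs: 120 },
  { endpoint: '/cards', latencyMs: 480 },
  { endpoint: '/contacts', latencyMs: 950 }, // breaches the threshold
]));
```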

The Result

Through better application observability, Blinq was able to identify areas for application performance improvement, catch degrading trends much faster, significantly reduce mean time to restore, and ultimately increase platform resilience. The iterative approach gave the SRE team regular opportunities to engage constructively with the wider engineering team and showcase the usefulness of telemetry and an observability culture. This helped uplift observability maturity and capability right across the group, driving behaviour change through curiosity and a desire for efficiency. Product engineers embedded SRE principles into their daily workflows while SREs deepened their domain expertise, embracing Blinq's "SRE-as-a-coach" model, where reliability is a shared accountability rather than something owned by a single SRE team.

As maturity grew, teams increasingly identified and addressed incidents before they impacted the business or customers. The feedback loop from telemetry data improved delivery hygiene and continues to help the organisation scale as new engineers join.

At the conclusion of the engagement, dashboards and SLOs were in place and Blinq was using Datadog as its single pane of glass. The culture of observability had become well entrenched, and Blinq was set to continue progressing observability maturity, breaking down information silos, and tying development effort directly to business outcomes, allowing the company to keep growing at pace while delivering high-impact customer features.

Testimonial

Engaging Midnyte City’s SRE resources was a game‑changer for Blinq. Their expert team partnered with our SRE and Product engineers to elevate our observability maturity, giving us end‑to‑end visibility into critical services and features. Within a short period, we confidently identified degraded signals to reduce our mean time to detection and mean time to restore from undesired app behaviour, dramatically reducing the risk of impact from incidents to our customers.

Beyond tooling and automation, Midnyte City partnered with our SRE team to successfully apply our ‘SRE‑as‑a‑coach’ model which empowered our product engineers to internalise SRE best practices. Our teams now design and ship high‑impact in‑app features with reliability built in, rather than retrofitting it afterward. The result is a faster, more resilient Blinq application and a consistently outstanding user experience, exactly the foundation we needed as we continue to grow at a high pace.

Partnering with Midnyte City meant that their team of experts quickly became deeply ingrained and valued members of our already high-collaboration engineering culture. Midnyte City will be our go‑to partner for future projects, and we look forward to collaborating with them again.

Ilonke Pretorius, Senior Engineering Manager at Blinq

Contact us

If you would like to speak to someone about similar challenges in your team or organisation, reach out below to schedule a time.
