It’s payroll time on a Friday. Do you know what your services are doing?
The answer at Justworks is yes, but how we have arrived there has changed over the years.
Today, Justworks products are based on a service-oriented architecture, with any user request passing through a handful of services before returning a response. Engineers track service performance using a combination of logs, traces, and metrics to prove that services are meeting a high performance bar.
From Monolith to Microservices
For its first decade, Justworks offered a single product (the PEO) built in a monolith. Every request was serviced by a single machine, so error attribution and performance monitoring were relatively straightforward. Log aggregation and error tracking provided sufficient visibility into the system.
In the past few years Justworks has grown rapidly, leading to multiple new products to support small businesses. Between Payroll, Global Hiring, Advisor Services, and more, Justworks has moved from a monolith to a service-oriented architecture (SOA). A single request now traverses multiple machines, which makes connecting execution across the system more complex. Issues in a downstream service can cascade into any upstream service. Logs still provide detail about execution on a single machine, but logs alone lack the distributed context needed to understand system behavior.
The service-oriented system had the same observability needs, but they were no longer met by the tools used by the monolith. Justworks needed:
- To understand system behavior for any operation. This now included operations that extended to multiple services.
- To track new errors and regressions. Errors in the SOA are not necessarily confined to a single runtime call stack, because their effects can span a network of requests.
- To ensure application performance. In a collection of services owned by a handful of teams, product-level performance is a distributed responsibility.
In 2010, Google pioneered the concept of distributed tracing in the Dapper paper. The general idea is that a trace records all of the work done in a system on behalf of a given initiator, and a trace is made up of spans, which represent basic units of work. Since Dapper, distributed tracing has become a key tool for observability in distributed systems, and it is how Justworks has met the needs above.
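To make the idea concrete, a single trace for one user request might look like the sketch below. The operation and span names are purely illustrative, not actual Justworks services:

trace: POST /payroll/run (520ms)
  span: gateway.request (520ms)
    span: authz.check_access (8ms)
      span: postgres.query (3ms)
    span: payroll.calculate (480ms)

Each nested span is a unit of work performed on behalf of its parent, so the tree shows both where time was spent and which service did the work.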
To understand system behavior across services, Justworks uses distributed traces which are correlated with service logs and metrics. The combination of these sources results in a holistic understanding of the entire execution of any request made to Justworks. Traces provide a distributed context to any system behavior, which is vital to identifying and solving issues in a service-oriented architecture.
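In practice, that correlation typically means each log line carries the identifiers of the trace it was emitted under, so an engineer can jump from a trace to its logs and back. A hypothetical structured log entry might look like:

{"level": "error", "message": "authorization denied", "service": "authz", "trace_id": "4bf92f3577b34da6", "span_id": "00f067aa0ba902b7"}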
To track errors over time, the Customer Identity and Access Management (CIAM) team uses an error tracking system that aggregates unexpected errors into issues by attaching metadata to traces and logs. Monitors are configured to alert the team when new issues or regressions arise, providing real-time notifications of service disruptions. This allows for fast response during on-call and helps the team prioritize fixes based on error frequency and severity. Over time, issues accumulate rich context as related occurrences are grouped, making ongoing service maintenance more efficient and informed.
Ensuring performance across a distributed system becomes a distributed responsibility. Establishing service-level performance objectives provides a contract of sorts between client and server.
The next section will dive a bit deeper into setting these objectives.
Setting Performance Expectations
The CIAM team at Justworks provides a platform for securely identifying users, gating access to data, and complying with anti-money laundering regulations. As products migrated from bespoke solutions for these problems to the CIAM platform, the need for visibility into platform performance grew:
- In order for products to serve customers, they need to remain available and correct. Those products now depend on CIAM services, so CIAM services need to remain available and correct.
- For the best customer experience, interaction with the products should be responsive. To avoid adding latency to the upstream services that call them, CIAM services need to maintain a strict level of performance.
These needs are broad and ill-defined, but they can be summarized by a need to provide a specified level of service. In observability, Google popularized the concept of service-level indicators (SLIs) and service-level objectives (SLOs) to more precisely define a “level of service”. The general idea is that an SLI is a metric that is meaningful to your service, and an SLO is a target for an SLI. To define SLIs and SLOs, it’s useful to first understand a client’s need, derive an SLO from that, and work backwards to choose an SLI that provides a good measurement for that SLO.
Let’s look at defining SLOs for the CIAM authorization service. Comparing the observed latency of authorization requests (from tracing data) with client feedback, the team found that clients were happy with responsiveness if authorization requests were served in under 30ms. Setting an SLO of 30ms on p90 latency (the SLI) satisfies the client need for a responsive API in the common case. But what about the tail? The team analyzed p99 latency over the past couple of months, found that the authorization service’s p99 hovered around 110ms, and set an SLO of 150ms as a baseline for catching performance drift. With both the p90 and p99 SLOs in place, the common case and the tail can be monitored to ensure consistent performance.
Low-latency responses are important to providing good service, but serving requests without errors is at least as important. In order for the authorization service to be treated as a reliable dependency, it needs to serve virtually all requests without error. Here the SLI is success rate, and the SLO was set at 99.9%:
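success rate = (requests served without error / total requests) × 100%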
While the CIAM authorization service uses a request-response pattern, defining SLIs for other service types, such as asynchronous batch jobs or data consistency services, requires a different approach, often focusing on metrics like completion rates, processing time for queues, or data freshness.
CIAM connects SLOs to monitors in order to track error budgets in real time. An error budget is the amount by which an SLO is allowed to be missed over its window. If the remaining error budget drops below 5%, on-call is alerted and steps are taken to avoid exhausting the budget. This triggers a high-priority discussion, potentially shifting priorities to address the underlying issue. The team may decide that service resources need to be scaled up, that a recent change must be rolled back, or even that a deployment freeze is necessary to ensure service-level objectives are met. A direct link between the error budget and operational decisions is vital to maintaining reliability.
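As a rough illustration of what an error budget looks like in practice (the numbers here are illustrative, not CIAM’s actual traffic):

A 99.9% success-rate SLO over a 30-day window allows 0.1% of requests to fail.
30 days is roughly 43,200 minutes, so the budget is about 43 minutes of total failure (or the equivalent number of failed requests spread across the window).
With 5% of that budget remaining, only about 2 minutes of failure are left before the SLO itself is breached.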
The Value of Service-Level Objectives
Establishing SLOs has elevated the service quality of the CIAM platform. They’ve also provided strategic value to the organization:
- We can track long-term service health. It’s easy for a service to slowly drift into poor performance as features are added. One real example: the team noticed that the available error budget for one of our services was shrinking week over week. This was a signal that performance was gradually degrading, and it prompted an investigation. That investigation found a database index was missing for a common query. Adding that index ensured the service kept meeting its SLOs and improved the experience for our clients.
- SLOs have facilitated a stronger relationship with our client teams. Clients have a clear view of our service performance, which gives them confidence when depending on our services and a concrete starting point for their own decisions. If the SLOs we provide don’t meet a client’s needs, then we can have a discussion with them on how to meet those needs.
- They’ve provided the team an objective metric on which to base tech debt investment. If we are meeting SLOs, then our services probably don’t need significant investment into performance. There are reasons other than performance to address tech debt, such as maintainability or security improvements, but SLOs give a clear signal on whether or not to invest in performance.
Repeating the Signal
As Justworks scaled, the CIAM platform hit early SOA growing pains. To share learnings across the organization, error tracking and SLO monitoring were implemented in a repeatable way. Using a Terraform provider, CIAM created modules that define parameterized error-tracking monitors and SLOs as code:
variable "service" {type = string
}
variable "env" {
type = string
}
variable "slack_channel_prod" {
type = string
}
variable "team" {
type = string
}
module "authz_monitor_prod" {
source = "modules/apm"
version = "1.5.4"
service = var.service
env = var.env
team = var.team
operation = "trace.express.request"
notifyees = [var.slack_channel_prod]
# Alert if latency is >10x p90 SLO because it's an anomaly
p90_latency_critical = 0.300
# Alert if a new issue or regression occurs
issue_rate_toggle = true
new_issue_toggle = true
}
module "authz_slo_prod" {
source = "modules/service-level-objectives"
version = "1.5.4"
service = var.service
env = var.env
team = var.team
notifyees = [var.slack_channel_prod]
p90_latency_threshold_secs = 0.030
p99_latency_threshold_secs = 0.150
success_rate_threshold = 99.9
}
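Under the hood, each module wraps a handful of provider resources. As a minimal sketch of what the SLO module might create, assuming the Datadog Terraform provider and its APM trace metrics (the resource and queries below are illustrative, not CIAM’s actual module internals):

resource "datadog_service_level_objective" "authz_success_rate" {
  name        = "${var.service} success rate (${var.env})"
  type        = "metric"
  description = "Share of authorization requests served without error"
  tags        = ["team:${var.team}", "env:${var.env}"]

  # Metric-based SLI: successful requests divided by total requests
  query {
    numerator   = "sum:trace.express.request.hits{service:${var.service},env:${var.env}}.as_count() - sum:trace.express.request.errors{service:${var.service},env:${var.env}}.as_count()"
    denominator = "sum:trace.express.request.hits{service:${var.service},env:${var.env}}.as_count()"
  }

  # 99.9% success target over a rolling 30-day window
  thresholds {
    timeframe = "30d"
    target    = 99.9
  }
}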
Applying these modules creates the monitors, SLOs, and supporting resources for each service. To further simplify service health observability, the CIAM team leveraged dashboard Powerpacks (templated groups of widgets). These widgets can be dropped into any team or service dashboard to surface issues, SLOs, request rates, p90 latency, and resource usage, all at a glance.
These modules and widgets, paired with how-to documentation, empower any team with tracing data to quickly set up service performance monitoring. By packaging powerful observability tools in a simple, easy-to-adopt format, we’ve seen natural uptake across teams at Justworks, alongside strong support from senior leadership.