We build software with the intention of powering the core work of our customers. If we want Streak to be at the center of business, the very first feature we need to offer is reliability. For us to be successful, we need you to be successful, and availability is where that begins.
We recently had a service interruption from 5:30 AM to 8:45 AM PST on Wednesday, October 2nd. During that time, Streak was unreliable or unusable for most customers. This didn’t meet our standards, and we know it didn’t meet yours. We apologize for the interruption to your business, and we want to give you context on what happened and what we’re doing to make sure this doesn’t happen again.
We’ve been making significant changes to our infrastructure so we can better support collaboration features in Streak. Some of those features have already launched (such as adding an email to more than one pipeline), and more are coming in the near future. In support of imminent upgrades, we ran a data migration on the evening of Tuesday, October 1. This migration added a new layer of permissions to give our users better control over the sharing of their emails, and involved creating a permissions record for every email added to a box.
The permissions code in question had been tested in our continuous integration environment and with a small number of example accounts in production. The data migration ran without issue, and Streak was working as expected Tuesday evening.
The incident started at 5:30 AM PST Wednesday morning. Our automated monitoring system correctly detected a service degradation. However, the specific alert that fired had been noisy recently because of another migration we had run the week before, and it did not page our on-call engineer.
Our support team came online at 6:45 AM PST and manually paged the on-call engineer at 6:55 AM, starting our engineering response. While we worked to identify the root engineering cause, the support team replied to every user who had emailed email@example.com or had been able to access our live chat.
While the on-call engineering team could diagnose the immediate symptom of user-visible errors from API requests and API timeouts, they ran into multiple issues that delayed finding the root cause of the incident.
At 7:45 AM, the support team followed up with users who had reached out to us on Twitter and posted to updates.streak.com. At the same time, the on-call engineer was able to inspect a server that was having issues. We observed that unoptimized code was overwhelming the server when it tried to load multiple users with many boxed emails simultaneously. Each server handles many users at the same time, so this affected many other teams as well. Streak clients retry their requests when a server is overloaded, so the retried requests landed on other servers and overloaded them in the same way.
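To illustrate why immediate retries can spread an overload, here is a minimal sketch of a common mitigation: retrying with capped exponential backoff and random jitter, so that clients wait progressively longer and do not all retry at the same moment. This is a general technique, not a description of how Streak clients behave or of the fix we deployed. The names `ServerOverloaded`, `backoff_ceiling`, and `call_with_backoff`, along with the default delays, are hypothetical and chosen only for illustration.

```python
import random
import time


class ServerOverloaded(Exception):
    """Hypothetical error a client raises when a server reports overload."""


def backoff_ceiling(attempt, base=0.5, cap=30.0):
    # Upper bound (in seconds) on the wait before retry number `attempt`
    # (0-based). It doubles on each attempt and never exceeds `cap`.
    return min(cap, base * (2 ** attempt))


def call_with_backoff(request, max_retries=5, sleep=time.sleep, rng=random.random):
    """Call `request`, retrying on overload with capped, jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except ServerOverloaded:
            if attempt == max_retries:
                raise  # give up rather than retrying forever
            # Full jitter: wait a random time in [0, ceiling) so that many
            # clients do not all retry at the same instant.
            sleep(rng() * backoff_ceiling(attempt))
```

The `sleep` and `rng` parameters are injected only so the behavior can be exercised without real waiting. In normal use the defaults apply, and a caller would wrap a single request, for example `call_with_backoff(lambda: fetch_boxes())`, where `fetch_boxes` is again a hypothetical function.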
After verifying the root cause with an engineer familiar with the permissions changes, we deployed an updated version that fixed the unoptimized code. The deployment itself was delayed because the server was overwhelmed. Service returned to normal around 8:45 AM PST.
Our automated monitoring didn’t perform as it should. We are going to tune our automated alerts, add more engineers to the on-call rotation, and establish a regular cadence of reviewing monitoring updates to ensure that we’re prioritizing improvements of noisy alerts.
Context around the production migration was not adequately shared. We’ve updated our process around events that affect production to ensure that on-call engineers have the context they need to debug issues.
Monitoring and debugging shortfalls delayed our response to the incident. Over the coming quarter, we’re going to invest in our deployment and monitoring stack to ensure that it works as expected during incidents.
We focused too heavily on replying to users using our firstname.lastname@example.org help channel in a 1:1 manner. In response, we’ve already deployed a Status Page that shows any current incidents as well as historic ones. In addition, we’re creating a communication playbook for any incidents going forward to make sure we provide continual updates to our customers.
We want to apologize again for this outage. Looking ahead, we can’t wait to show you the major improvements we’ve been working on. We’re confident the new features will save time and energy for your team. As we roll them out, we’ll make sure the new changes don’t affect what you originally came for: rock-solid CRM, directly inside Gmail.