Google started the week with a major outage that took down Gmail, Drive, and all other Workspace apps. As promised, Google has now published a detailed explanation of the outage and the steps it will take to prevent future incidents.
At a high level, the issue stems from ongoing work to update Google’s account authentication system. While that effort was underway, parts of the previous system were “left in place.” Those older components incorrectly reported usage as 0, and Google had instituted a grace period that delayed the impact of that error.
That remedial fix expired and led automated systems to respond to the error as if it were real. Since usage appeared to be at 0, capacity for the identity management system was scaled down. While safety checks were in place, they were not designed to cover the specific problem.
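The failure mode described above can be illustrated with a minimal sketch. It is not Google's implementation: the `QuotaState` type, the `next_capacity` function, and the `min_floor` safeguard are all hypothetical names chosen for illustration. The sketch only shows how an erroneous usage report of 0 stays harmless during a grace period and then drives capacity down once the grace period ends, unless a check covers that specific case.

```python
from dataclasses import dataclass


@dataclass
class QuotaState:
    """Hypothetical view of one service's allocation (illustrative only)."""
    capacity: int          # units currently allocated to the service
    grace_days_left: int   # days before usage reports are enforced


def next_capacity(state: QuotaState, reported_usage: int, min_floor: int = 0) -> int:
    """Return the capacity an automated system would set next.

    min_floor is an assumed safeguard: capacity is never reduced below it,
    even when the reported usage is lower.
    """
    if state.grace_days_left > 0:
        # Inside the grace period the report is ignored, so nothing changes.
        return state.capacity
    # After the grace period the report is treated as real usage.
    return max(reported_usage, min_floor)


# During the grace period, a bogus report of 0 has no effect.
during_grace = next_capacity(QuotaState(capacity=100, grace_days_left=1), reported_usage=0)

# Once the grace period expires, the same report scales capacity to 0.
after_grace = next_capacity(QuotaState(capacity=100, grace_days_left=0), reported_usage=0)

# A floor that covers the zero-usage case keeps some capacity in place.
with_floor = next_capacity(
    QuotaState(capacity=100, grace_days_left=0), reported_usage=0, min_floor=50
)
```

In this sketch the grace period hides the bad report rather than correcting it, which is why the problem surfaces only when the period runs out.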
The issue started impacting users at 3:47 a.m. PT and engineers were alerted a minute later. “Workspace apps were down for the duration of the incident” since they rely on the impacted infrastructure to make sure you’re logged in, authenticated, and authorized to see content, like emails and documents.
At 4:08 a.m. PT, the root cause and a potential fix were identified, and quota enforcement was disabled in one datacenter at 4:22 a.m. This quickly improved the situation, and at 4:27 a.m. the same mitigation was applied to all datacenters, which returned error rates to normal levels by 4:33 a.m.
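The sequence of trying the mitigation in one datacenter before applying it everywhere follows a common staged-rollout pattern. The sketch below is a generic illustration of that pattern, not Google's tooling: `staged_rollout`, `apply_fix`, and `is_healthy` are hypothetical names, and the datacenter labels are placeholders.

```python
from typing import Callable, Iterable


def staged_rollout(
    datacenters: Iterable[str],
    apply_fix: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Apply a fix to the first datacenter, then to the rest only if it helped.

    Returns the datacenters that received the fix.
    """
    remaining = list(datacenters)
    if not remaining:
        return []
    first, rest = remaining[0], remaining[1:]
    apply_fix(first)
    if not is_healthy(first):
        # The fix did not improve things, so stop before touching the rest.
        return [first]
    for dc in rest:
        apply_fix(dc)
    return [first, *rest]


# Example run with placeholder datacenter names and a fix that always helps.
fixed: list[str] = []
rolled_out = staged_rollout(["dc-a", "dc-b", "dc-c"], fixed.append, lambda dc: True)
```

Limiting the first step to a single location keeps the blast radius small if the proposed fix turns out to be wrong.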
The company laid out plans to review, improve, and evaluate its systems to prevent similar issues. Google ended its outage explanation with an apology:
We would like to apologize for the scope of impact that this incident had on our customers and their businesses. We take any incident that affects the availability and reliability of our customers extremely seriously, particularly incidents which span multiple regions.