[Cloud](EN) Case study of incident response for cloud service provider (AWS, GCP, Azure) API usage issue

Case study of incident response


Environment and Prerequisite

  • Cloud Service Provider(AWS, GCP, Azure)
  • API


Background

  • I wrote this post because there was an issue that API of Cloud Service Provider used in running service did not function properly.
  • I used word ‘CSP’ to refer to the Cloud Service Provider in one of AWS, GCP, Azure to avoid pick one specific provider.


Timeline

  • 5/20 08:30: Receive Slack Issue
  • 5/20: Investigate reason of issue
  • 5/21: Open support case after find cause
  • 5/22: Issue resolved in CSP
  • 5/23: Issue resolved


Issue and Cause

  • Some of exported history were not returned when using CSP’s API


Impact

  • An incident occurred in daily data collection which leads to subsequent failures in the connected pipeline and services.


Act History

  • Manually create fin file which represents raw data collection is done after issue occured
  • Open a support case due to a CSP issue.


Resolve

  • Issue resolved after CSP’s API issue is resolved


Retrospective

  • I thought there will be no issue in CSP or other third party software. However I changed my mind that CSP also can have issues.
  • I studied incident response of Kakao issue which Kakao’s services were down due to a data center fire. The relevant YouTube video link is referenced in below.


Reference