Case study of incident response
Environment and Prerequisites
- Cloud Service Provider (AWS, GCP, or Azure)
- API
Background
- I wrote this post because an API of the Cloud Service Provider used by a running service did not function properly.
- I use the term ‘CSP’ to refer to the Cloud Service Provider (one of AWS, GCP, or Azure) to avoid naming a specific provider.
Timeline
- 5/20 08:30: Received an issue report on Slack
- 5/20: Investigated the cause of the issue
- 5/21: Opened a support case after finding the cause
- 5/22: Issue resolved on the CSP side
- 5/23: Issue resolved
Issue and Cause
- Some of the exported history records were not returned by the CSP’s API
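The post does not say how the missing records were noticed. As a minimal sketch, assuming an independent list of expected record IDs is available (for example from the export job itself), a gap like this could be detected by comparing that list against what the API returned. The function name `find_missing` and the `id` field are hypothetical and not part of any CSP's API.

```python
def find_missing(expected_ids, returned_records, key="id"):
    """Return the expected IDs that are absent from the API response.

    expected_ids: iterable of IDs we believe were exported (assumed input).
    returned_records: list of dicts as returned by the API (assumed shape).
    """
    returned_ids = {record[key] for record in returned_records}
    return sorted(set(expected_ids) - returned_ids)


if __name__ == "__main__":
    expected = ["export-1", "export-2", "export-3"]
    returned = [{"id": "export-1"}, {"id": "export-3"}]
    missing = find_missing(expected, returned)
    if missing:
        print(f"{len(missing)} record(s) missing from API response: {missing}")
```

A check like this only helps when the expected list comes from a source other than the API being checked.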
Impact
- Daily data collection failed, which led to subsequent failures in the connected pipeline and services.
Action History
- After the issue occurred, manually created the fin file, which marks raw data collection as done.
- Opened a support case with the CSP about the issue.
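The manual fin file step can be sketched roughly as follows. This is only an illustration: the directory layout, the `<date>.fin` naming scheme, and the function name `write_fin_marker` are assumptions, since the post does not describe how the actual pipeline names or locates its fin file.

```python
from datetime import date
from pathlib import Path


def write_fin_marker(data_dir: Path, day: date) -> Path:
    """Create an empty marker file signalling that raw collection for `day` is done."""
    data_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one marker per collection date.
    marker = data_dir / f"{day.isoformat()}.fin"
    # touch() is idempotent here, so re-running the step is safe.
    marker.touch(exist_ok=True)
    return marker


if __name__ == "__main__":
    path = write_fin_marker(Path("raw_data"), date.today())
    print(f"Created marker: {path}")
```

Downstream jobs that wait for the marker would then proceed as if collection had finished, so it is worth creating it only once the incomplete data is known and accepted.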
Resolution
- The issue was resolved after the CSP fixed the problem with its API.
Retrospective
- I used to think that a CSP or other third-party software would not have issues. This incident changed my mind: a CSP can have issues too.
- I studied the incident response to the Kakao outage, in which Kakao’s services went down due to a data center fire. The relevant YouTube video is linked below.