In the development process, SREs provide developers with stable and performant CI and release pipelines and development environments to facilitate frequent delivery of new product features. In production, SREs perform Tier 1 on-call and incident management functions, supporting a high-throughput platform that processes more than 15 billion events per day. To ensure the reliability of this environment for our customers, SREs work closely with developers and product managers to understand service level objectives, think through failure scenarios, and design systems that balance cost with reliability objectives. Additionally, SREs collaborate with the Information Security team to ensure that cloud infrastructure is properly secured and that sufficient controls are in place to meet our compliance goals with respect to industry standards such as SOC 2.
Role Responsibilities
Write high-quality infrastructure-as-code that automates the provisioning, deployment, scaling, and monitoring of infrastructure to ensure that it is reliable and performant
Write maintainable code for product functionality with a primary emphasis on operations, scale, resiliency, and monitoring
Work with other engineers to ensure that new services are well-designed, properly monitored, and have well-defined SLIs and achievable SLOs
Debug production issues, learn to mitigate them quickly, and find ways to prevent them
Maintain runbooks for manual tasks and replace those runbooks with automation whenever possible
Proactively track our capacity, quotas, and other performance limits to plan for growth
Participate in a 24×7 on-call rotation to handle product availability issues as well as urgent customer support escalations
Qualifications
Experience working with cloud infrastructure using tools such as Ansible or Terraform
Programming skills in a language such as Go or Python, and a willingness to learn new languages as needed
Ability to think and talk about systems in terms of possible failure modes, bottlenecks, etc.
Ability to write clear and concise English-language documentation of processes for incident runbooks and release processes
Good number sense for discussing performance analysis, cost analysis, and operational metrics
Preferred Qualifications
Experience designing, analyzing, and troubleshooting distributed systems
Experience maintaining Kubernetes clusters in a production environment
Previous experience as a Site Reliability Engineer, DevOps Engineer, or similar role