About The Position

The Senior DevOps Engineer, Cloud Infrastructure, leads a team dedicated to developing, deploying, and scaling cloud infrastructure that’s secure, reliable, and optimized for high performance. This hands-on role combines strategic oversight with technical leadership, supporting both project initiatives and operational excellence across cloud environments.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field, or an equivalent combination of education and experience.
  • 5+ years of experience in DevOps, Site Reliability Engineering, or Cloud Infrastructure roles.
  • 3+ years of experience with AWS services (EC2, S3, ELB, VPC, IAM) or equivalent cloud environments, with a strong understanding of AWS best practices.
  • 3+ years of experience running Linux-based production systems, with in-depth knowledge of Linux operating systems.
  • Kubernetes: Hands-on experience managing, deploying, and troubleshooting Kubernetes clusters.
  • Scripting Languages: Proficiency in Bash, Python, or other scripting languages, used for automation and infrastructure management.
  • Infrastructure Automation: Expertise with tools like Ansible, Terraform, or CloudFormation to deploy and manage infrastructure at scale.
  • Monitoring and Observability: Experience with monitoring technologies such as Grafana, Prometheus, and Alertmanager to maintain visibility into system health.
  • Version Control: Proficiency in Git and experience with platforms like GitLab or GitHub for collaborative code management.
  • Curiosity and Initiative: You’re curious, unafraid to ask “why,” and proactive in exploring solutions and innovative ideas.
  • High Availability Mindset: You prioritize resilience and reliability in everything you design and deploy.
  • Must have the legal right to work in the U.S.

Responsibilities

  • Infrastructure as Code: Design and implement infrastructure as code to build and deploy cloud solutions effectively.
  • Full-Service Lifecycle Management: Improve service life cycles, from design through deployment, operation, and refinement, focusing on reliability and scalability.
  • Monitor and Maintain Services: Ensure live services run smoothly by measuring and monitoring availability, latency, and overall system health, proactively identifying areas for improvement.
  • Scale with Automation: Scale systems sustainably through automation and push for enhancements that improve reliability, performance, and operational efficiency.
  • Optimize Infrastructure Costs: Drive initiatives to optimize infrastructure for cost-effectiveness without compromising performance or security.
  • Incident Response and Postmortems: Lead sustainable incident response efforts and conduct blameless postmortems to ensure continuous improvement and resilience.
  • Tool Selection and Evaluation: Bring informed opinions on and hands-on experience with CI/CD and deployment orchestration tools such as GitLab and Argo CD, and guide the team toward best practices.
  • AWS Expertise: Leverage and enhance AWS environments to support current and future infrastructure needs, staying informed on new services and practices.