About The Position

We’re an AI-first global tech company with 25+ years of engineering leadership, 2,000+ team members, and 500+ active projects powering Fortune 500 clients, including HBO, Microsoft, Google, and Starbucks. From AI platforms to digital transformation, we partner with enterprise leaders to build what’s next. What powers it all? Our people are ambitious, collaborative, and constantly evolving.

About the Client

The company has been building solutions for mobile apps, effortless payment, business travel, and advertising since 1992. The customer is developing a mobility platform that allows operators to manage their vehicles and drivers efficiently, regulators to be informed and establish guidelines, service providers to deliver sustainable solutions, and riders to have an effortless transit experience.

Requirements

  • 7+ years of experience, specializing in Kubernetes and AWS
  • Strong programming skills in at least one major language (Ruby, Java, Go, Python, .NET, or similar)
  • Solid understanding of concurrency, runtime behavior, and performance optimization
  • Hands-on experience with Docker and containerized workloads
  • Strong Kubernetes expertise (Deployments, StatefulSets, Services, Ingress, Helm, troubleshooting, autoscaling)
  • Strong AWS experience (EC2, EKS, RDS, S3, IAM, VPC, Load Balancers, CloudWatch)
  • Experience designing infrastructure for high availability and disaster recovery
  • Experience with CI/CD pipelines and Infrastructure as Code (Terraform, CloudFormation, Pulumi, or similar)
  • Experience with RabbitMQ or similar messaging systems (Kafka, SQS, Pulsar, etc.)
  • Strong understanding of relational databases (MySQL/PostgreSQL), including query optimization, replication, and failover strategies
  • Familiarity with NoSQL and in-memory databases (Redis, DynamoDB, MongoDB)
  • Experience with distributed systems, microservices, capacity planning, and fault tolerance
  • Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, ELK/OpenSearch, OpenTelemetry)
  • Strong understanding of Linux systems and networking fundamentals (TCP/IP, DNS, HTTP/HTTPS, TLS, load balancing)
  • Experience with SRE practices, including SLOs/SLIs/SLAs, load testing, resilience testing, and incident management
  • Strong communication skills and ability to collaborate across engineering teams
  • Calm and effective during incidents with an ownership mindset

Nice To Haves

  • Experience operating production systems written in Ruby, Java, or other major platforms
  • Framework experience such as Ruby on Rails, Spring Boot, or similar
  • Experience operating high-traffic SaaS platforms
  • Cost optimization in cloud environments
  • Chaos engineering practices
  • Experience mentoring junior engineers

Responsibilities

  • Design, build, and operate reliable, scalable distributed systems
  • Improve system availability, performance, and resilience
  • Automate infrastructure, deployments, and operational processes
  • Diagnose and resolve production issues
  • Lead upgrades and migrations with minimal or zero downtime
  • Participate in on-call rotations and incident response
  • Collaborate closely with development teams to improve operability
  • Drive best practices around monitoring, alerting, and capacity planning
  • Reduce operational toil through automation
  • Contribute to incident management, post-mortems, disaster recovery strategies, and continuous reliability improvements

Benefits

  • International projects
  • In-office, hybrid, or remote flexibility
  • Medical healthcare
  • Recognition program
  • Ongoing learning & reimbursement
  • Well-being program
  • Team events & local benefits
  • Sports compensation
  • Referral bonuses
  • Top-tier equipment provision