Sr Site Reliability Engineer

Ericsson
Santa Clara, CA
Hybrid

About The Position

Design, architect, and implement cloud infrastructure for EWS, optimizing Kubernetes platforms for cloud-native workloads across compute, storage, and networking layers. Build and maintain automation tooling for infrastructure provisioning, monitoring, and operations using Infrastructure as Code practices. Develop and operate SRE systems leveraging Large Language Models (LLMs) for AI-driven operations and intelligent automation. Implement and maintain cloud security practices including vulnerability scanning, security monitoring, compliance automation, and incident response. Provide SRE on-call support covering North American time zones. Collaborate across research, product development, architecture, and service teams on AI solutions for 5G/6G and IoT systems, including predictive operations, anomaly detection, and intelligent network analysis.

Requirements

  • Cloud-Native Infrastructure: Deep hands-on experience operating production-grade Kubernetes environments, including troubleshooting, performance tuning, and capacity planning
  • Linux Systems: Strong Linux administration skills including systemd, networking (bridges, VLANs, routing), storage, and performance optimization
  • Automation & IaC: Proficiency in Python, Bash, and Go; experience with Ansible, Terraform, or similar automation frameworks
  • Containerization: Solid understanding of container runtimes (containerd, Docker), image management, and container orchestration
  • Networking: Experience with Kubernetes networking (CNI, Cilium, Calico), load balancing, and service mesh concepts
  • Storage Systems: Hands-on experience with cloud-native storage solutions (Ceph, NFS, object storage) and Kubernetes storage concepts (CSI, StorageClass, PVC)
  • Observability: Experience with monitoring, logging, and alerting tools (Prometheus, Grafana, Loki, or similar)
  • CI/CD: Experience with GitLab CI, Jenkins, or similar platforms for building automated pipelines
  • AI-Powered Development: Experience with AI-assisted coding tools (GitHub Copilot, Cursor, or similar) and LLM-powered automation
  • Bare Metal & Virtualization: Knowledge of bare metal provisioning and virtualization technologies (KVM, libvirt)
  • Secret Management: Familiarity with secret management solutions (Vault, OpenBao)
  • CNCF Ecosystem: Understanding of CNCF landscape and ability to evaluate emerging cloud-native technologies
  • Database Operations: Experience with database systems (MySQL, Redis, PostgreSQL)
  • Bachelor's degree in Computer Science, Software Engineering, or related field (or equivalent practical experience)
  • 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
  • Strong problem-solving skills with a curious, hands-on approach to learning new technologies
  • Excellent collaboration and communication skills for working across distributed teams

Responsibilities

  • Design, architect, and implement cloud infrastructure for EWS
  • Optimize Kubernetes platforms for cloud-native workloads across compute, storage, and networking layers
  • Build and maintain automation tooling for infrastructure provisioning, monitoring, and operations using Infrastructure as Code practices
  • Develop and operate SRE systems leveraging Large Language Models (LLMs) for AI-driven operations and intelligent automation
  • Implement and maintain cloud security practices including vulnerability scanning, security monitoring, compliance automation, and incident response
  • Provide SRE on-call support covering North American time zones
  • Collaborate across research, product development, architecture, and service teams on AI solutions for 5G/6G and IoT systems, including predictive operations, anomaly detection, and intelligent network analysis