Senior Site Reliability Engineer

Ericsson • Santa Clara, CA

15h • Hybrid

About The Position

Design, architect, and implement cloud infrastructure for EWS, optimizing Kubernetes platforms for cloud-native workloads across compute, storage, and networking layers. Build and maintain automation tooling for infrastructure provisioning, monitoring, and operations using Infrastructure as Code practices. Develop and operate SRE systems leveraging Large Language Models (LLMs) for AI-driven operations and intelligent automation. Implement and maintain cloud security practices including vulnerability scanning, security monitoring, compliance automation, and incident response. Provide SRE on-call support covering North American time zones. Collaborate across research, product development, architecture, and service teams on AI solutions for 5G/6G and IoT systems, including predictive operations, anomaly detection, and intelligent network analysis.

Requirements

Cloud-Native Infrastructure: Deep hands-on experience operating production-grade Kubernetes environments, including troubleshooting, performance tuning, and capacity planning
Linux Systems: Strong Linux administration skills including systemd, networking (bridges, VLANs, routing), storage, and performance optimization
Automation & IaC: Proficiency in Python, Bash, and Go; experience with Ansible, Terraform, or similar automation frameworks
Containerization: Solid understanding of container runtimes (containerd, Docker), image management, and container orchestration
Networking: Experience with Kubernetes networking (CNI, Cilium, Calico), load balancing, and service mesh concepts
Storage Systems: Hands-on experience with cloud-native storage solutions (Ceph, NFS, object storage) and Kubernetes storage concepts (CSI, StorageClass, PVC)
Observability: Experience with monitoring, logging, and alerting tools (Prometheus, Grafana, Loki, or similar)
CI/CD: Experience with GitLab CI, Jenkins, or similar platforms for building automated pipelines
AI-Powered Development: Experience with AI-assisted coding tools (GitHub Copilot, Cursor, or similar) and LLM-powered automation
Bare Metal & Virtualization: Knowledge of bare metal provisioning and virtualization technologies (KVM, libvirt)
Secret Management: Familiarity with secret management solutions (Vault, OpenBao)
CNCF Ecosystem: Understanding of CNCF landscape and ability to evaluate emerging cloud-native technologies
Database Operations: Experience with database systems (MySQL, Redis, PostgreSQL)
Bachelor's degree in Computer Science, Software Engineering, or related field (or equivalent practical experience)
5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
Strong problem-solving skills with a curious, hands-on approach to learning new technologies
Excellent collaboration and communication skills for working across distributed teams

Responsibilities

Design, architect, and implement cloud infrastructure for EWS
Optimize Kubernetes platforms for cloud-native workloads across compute, storage, and networking layers
Build and maintain automation tooling for infrastructure provisioning, monitoring, and operations using Infrastructure as Code practices
Develop and operate SRE systems leveraging Large Language Models (LLMs) for AI-driven operations and intelligent automation
Implement and maintain cloud security practices including vulnerability scanning, security monitoring, compliance automation, and incident response
Provide SRE on-call support covering North American time zones
Collaborate across research, product development, architecture, and service teams on AI solutions for 5G/6G and IoT systems, including predictive operations, anomaly detection, and intelligent network analysis
