Sr. DevOps

Archer
San Jose, CA

About The Position

As a Senior DevOps Engineer, you will be a key contributor to our infrastructure strategy, focusing on automation, stability, and performance across both cloud and on-premise environments. You will drive best practices in CI/CD, configuration management, and monitoring, with a specific focus on optimizing the deployment and operation of large language models (LLMs) and related technologies.

Requirements

  • 5+ years of professional experience in a DevOps, SRE, or infrastructure engineering role.
  • Deep expertise in containerization and orchestration, specifically Kubernetes (design, deployment, and troubleshooting) and Docker.
  • Strong proficiency in managing infrastructure in both cloud (e.g., AWS, GCP, Azure) and on-premise environments.
  • Expert-level administration skills in Linux and strong working knowledge of Windows Server environments.
  • Proven experience with Infrastructure as Code (IaC) and Configuration Management tools (e.g., Terraform, Ansible).
  • High proficiency in scripting and automation using Python and Bash.
  • Extensive experience with monitoring and observability platforms, especially Datadog (or comparable tools such as Prometheus/Grafana or New Relic).
  • Hands-on experience deploying and managing Large Language Model (LLM) technologies, such as using LiteLLM or OpenRouter, or setting up and managing LLM serving endpoints.

Nice To Haves

  • Experience with specific Kubernetes distributions (e.g., K3s, Rancher, OpenShift).
  • Familiarity with network configuration, firewalls, and security best practices for hybrid environments.
  • Experience in MLOps workflows and related tools (e.g., MLflow, Kubeflow).
  • Certifications such as CKA, CKAD, or relevant cloud provider certifications.

Responsibilities

  • Design, deploy, and manage highly available, scalable infrastructure using Kubernetes and Docker across public cloud (e.g., AWS, GCP, Azure) and on-premise data centers.
  • Develop and maintain robust Infrastructure as Code and configuration management solutions (e.g., Terraform, Ansible) for consistent environment provisioning and management.
  • Implement and manage CI/CD pipelines to facilitate rapid, reliable, and automated software releases.
  • Administer and troubleshoot operating systems, encompassing both Linux and Windows environments.
  • Implement and optimize observability practices using monitoring tools like Datadog for logging, tracing, and alerting.
  • Spearhead the operational deployment, scaling, and maintenance of LLM infrastructure, leveraging tools like LiteLLM, OpenRouter, or similar LLM orchestration/gateway technologies.
  • Automate repetitive tasks and system operations using scripting languages, primarily Bash and Python.
  • Collaborate closely with development, MLOps, and security teams to ensure infrastructure supports product requirements and compliance standards.
  • Participate in an on-call rotation to ensure service reliability and responsiveness to incidents.