About The Position

Join a platform team of data scientists, researchers, and other experienced engineers, where you will actively participate in setting the technical direction and ensuring the performance, reliability, and scalability of AI systems. We are looking for a Software Engineer to join our platform team, focused on Kubernetes and cloud-native systems, including Red Hat OpenShift and other managed Kubernetes platforms (GCP GKE, AWS EKS, Azure AKS). This role involves extending Kubernetes/OpenShift capabilities (e.g., building operators, controllers, CRDs, and webhooks) and supporting machine learning workloads in production, including exposure to OpenShift AI products. The ideal candidate has a strong understanding of how Kubernetes/OpenShift manages workloads and resources internally and can troubleshoot, optimize, and scale cloud-native systems.

Requirements

  • 5+ years of software engineering experience, including building and maintaining large-scale, cloud-native systems.
  • Proficiency in Golang or Python, with experience extending Kubernetes or OpenShift using operators, controllers, CRDs, or webhooks. Experience with other managed Kubernetes platforms (GKE, EKS, AKS) is also acceptable.
  • Strong knowledge of Kubernetes/OpenShift control plane internals, including how the system schedules, manages, and reconciles workloads across clusters.
  • Experience supporting ML workloads, such as integrating model serving, autoscaling, or data processing pipelines, including OpenShift AI products.
  • Strong coding and problem-solving skills, with experience troubleshooting, optimizing, and debugging complex systems.

Nice To Haves

  • 2+ years of machine learning experience, including developing, deploying, or monitoring models in production.
  • Hands-on experience with ML frameworks like TensorFlow or PyTorch, and building automated ML pipelines.
  • Familiarity with MLOps tools (e.g., Kubeflow Pipelines, MLflow, Airflow, or KServe) to automate ML workflows.
  • Knowledge of Large Language Models (LLMs) and RAG systems, including LangChain or vector databases (e.g., Pinecone, FAISS).
  • Experience working in cloud environments (AWS, GCP, Azure), integrating services with CI/CD pipelines and observability tools.

Responsibilities

  • Design scalable Kubernetes platforms to support ML workloads and Large Language Models (LLMs).
  • Provide client support for deploying AI/ML workloads, including Re-ranking and Embedding as a Service (RAGaaS).
  • Contribute to the development of end-to-end ML pipelines, from data ingestion to deployment.
  • Implement MLOps best practices for model monitoring, logging, and maintenance in production.
  • Collaborate with data scientists to optimize model performance and integration.
  • Contribute to data governance and security standards for AI/ML workloads.
  • Develop automation tools to streamline infrastructure management and improve efficiency.
  • Stay updated on the latest trends in AI/ML to influence platform enhancements.

Benefits

  • U.S. employees are offered benefits, subject to Cisco’s plan eligibility rules, which include medical, dental and vision insurance, a 401(k) plan with a Cisco matching contribution, paid parental leave, short and long-term disability coverage, and basic life insurance.
  • Employees may be eligible to receive grants of Cisco restricted stock units, which vest following continued employment with Cisco for defined periods of time.
  • U.S. employees are eligible for paid time away as described below, subject to Cisco’s policies: 10 paid holidays per full calendar year, plus 1 floating holiday for non-exempt employees, 1 paid day off for the employee’s birthday, a paid year-end holiday shutdown, and 4 paid days off for personal wellness, determined by Cisco.
  • Non-exempt employees receive 16 days of paid vacation time per full calendar year, accrued at a rate of 4.92 hours per pay period for full-time employees.
  • Exempt employees participate in Cisco’s flexible vacation time off program, which has no defined limit on how much vacation time eligible employees may use (subject to availability and some business limitations).
  • 80 hours of sick time off are provided on the hire date and each January 1st thereafter, and up to 80 hours of unused sick time may be carried forward from one calendar year to the next.
  • Additional paid time away may be requested to deal with critical or emergency issues for family members.
  • Optional 10 paid days per full calendar year to volunteer.
  • For non-sales roles, employees are also eligible to earn annual bonuses subject to Cisco’s policies.