Site Reliability Engineer

Cayuse Holdings
Cedar Park, TX
$120,000 - $160,000
Hybrid

About The Position

The Site Reliability Engineer ensures the reliability, performance, and scalability of the systems and services critical to our operations. The role bridges development and operations by applying a software engineering approach to system administration challenges, and it requires a blend of software development expertise and operational excellence to meet evolving business and technical goals. The position aligns with Cayuse’s core values of Innovation, Excellence, Collaboration, Adaptability, and Integrity by fostering technical solutions that meet customer needs, promoting teamwork, and prioritizing quality in deliverables.

Requirements

  • Exceptional interpersonal skills with the ability to communicate in a clear, professional, and articulate manner.
  • Exceptional verbal and written communication skills.
  • Excellent organizational, analytical, and problem-solving skills with high-level attention to detail.
  • Ability to analyze systems and procedures.
  • Strong multitasking skills with the ability to manage multiple design streams across concurrent work efforts.
  • Must be self-motivated and able to work well both independently and on a multi-functional team.
  • Ability to handle sensitive and confidential information appropriately.

Nice To Haves

  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • Experience setting up and managing Site Reliability Engineering frameworks in medium to large-scale organizations.
  • Familiarity with agile methodologies, ITIL processes, and reliability-focused engineering practices such as Chaos Engineering and SLOs (Service Level Objectives).
  • Certification in relevant areas such as cloud computing or DevOps is a plus.
  • Proven experience in system reliability engineering, DevOps, system administration, or software development.
  • Strong understanding of distributed systems, networking, and system architectures.
  • Proficiency with monitoring and observability tools (e.g., Grafana, Prometheus, Datadog, Splunk) to track performance and system health.
  • Hands-on experience with automation tools (e.g., Terraform, Ansible) and coding/scripting languages, such as Python, Go, or Bash.
  • Solid understanding of containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Experience working with cloud platforms (e.g., AWS, Azure, Google Cloud) and the ability to design scalable cloud-based architectures.
  • Knowledge of CI/CD pipelines, source control tools (e.g., Git), and configuration-as-code principles.

Responsibilities

  • Ensure the high availability and reliability of critical systems and services across production and development environments.
  • Monitor and improve system latency, performance, and overall efficiency through proactive measures and tuning.
  • Conduct performance benchmarking and capacity analysis, and optimize workloads for scalability and cost-effectiveness.
  • Act as the first line of defense for incidents, managing and resolving emergencies and minimizing downtime.
  • Implement post-incident reviews and root cause analyses to enhance system reliability and prevent recurring issues.
  • Manage system changes, ensuring they are properly planned, tested, and implemented with minimal risk.
  • Develop automation tools and scripts to eliminate manual, repetitive tasks and improve operational efficiency.
  • Enhance CI/CD pipelines and deployment processes to ensure rapid and stable software delivery.
  • Use Infrastructure as Code (IaC) principles to build, maintain, and scale system infrastructure.
  • Design and implement robust monitoring and alerting solutions to detect anomalies and resolve issues proactively.
  • Leverage data from monitoring tools to gain insights into system behavior, user impact, and opportunities for improvement.
  • Collaborate closely with development teams to embed reliability principles into the application development lifecycle.
  • Advocate for best practices in the design, development, and operation of software applications.
  • Support the handover of new applications and features into production through documentation, training, and readiness reviews.
  • Other duties as assigned.

Benefits

  • Medical, Dental and Vision Insurance; Wellness Program
  • Flexible Spending Accounts (Healthcare, Dependent Care, Commuter)
  • Short-Term and Long-Term Disability options
  • Basic Life and AD&D Insurance (Company Provided)
  • Voluntary Life and AD&D options
  • 401(k) Retirement Savings Plan with matching after one year
  • Paid Time Off