Assoc Engineer SRE

Optimum
Town of Oyster Bay, NY
3d

About The Position

We are Optimum, a leader in the fast-paced world of connectivity, and we're on the hunt for enthusiastic professionals to join our team! We understand that connectivity isn't just a luxury anymore – it's a necessity that empowers lives, fuels businesses, and drives innovation. A career at Optimum means you'll be enabling progress and enhancing lives by providing reliable, high-speed connectivity solutions that keep the world connected. We owe our success to our amazing product, commitment to our people and the connections we make in every community. If you are resourceful, collaborative, team-oriented and passionate about delivering consistent excellence, Optimum is the Company for you! We are Optimum!

Job Summary

As a Site Reliability Engineer I, you are the frontline engine of our hybrid platform. This role is focused on service continuity and active incident response. You will work shifts to provide support coverage, perform real-time debugging, and keep our GCP and On-Premises Unix/Linux systems running at all times.

The Mission: Real-Time Reliability

Your mission is to maintain 100% platform visibility. You will be the primary responder to our observability stack, moving beyond simple monitoring to active debugging and remediation. You will handle the "heavy lift" of shift-based support calls and system health checks, ensuring that technical debt is addressed and service disruptions are mitigated before they impact the business.

At Optimum, we're fueled by our four core pillars: Taking Ownership, Upholding Transparency, Creating Community, and Demonstrating Expertise. Our commitment to empowering employees to take responsibility and embrace proactive problem-solving underpins Taking Ownership. Upholding Transparency is at the core of our culture, with open and honest communication fostering trust among our dedicated team and loyal customers. Creating Community is more than a goal; it's our daily commitment to fostering an environment of collaboration, innovation, and positivity. Demonstrating Expertise is a promise we uphold through continuous learning and engagement with our customers to consistently deliver top-quality products and services.

These pillars not only shape our culture but define Optimum as a place of excellence, trustworthiness, and thriving community, and we invite you to be a part of our journey. If you have the drive to succeed and are ready to embark on a thrilling career, seize this opportunity today, and join our winning team, so together, we'll shape the future of connectivity.

Requirements

  • Bachelor's degree in Telecommunications, Computer Engineering, or related technical field.
  • 0-2 years of experience in mobile network operations or systems engineering roles.
  • OS Internals: Foundational command-line proficiency in Linux (RHEL/Ubuntu) and Unix (Solaris/AIX). Ability to troubleshoot CPU/Memory/Disk bottlenecks.
  • Debugging Skills: Familiarity with log analysis tools (Loki) and the ability to correlate metrics (Prometheus) to find root causes.
  • Cloud & Containers: Basic understanding of GCP (Compute Engine, GKE) and Kubernetes (restarting pods, viewing logs, checking ingress).
  • Kafka Awareness: Basic understanding of Kafka topics and the ability to monitor consumer group health.
  • Automation Exposure: Ability to run and verify Ansible playbooks and Terraform plans.
  • Communication: Excellent verbal communication for handling support calls and providing clear updates during high-pressure incidents.

Responsibilities

  • Shift-Based Support & Triage: Act as the primary technical point of contact during your shift. Manage the support queue, answer urgent infrastructure calls, and provide initial triage for all system anomalies.
  • Active Debugging: Investigate and resolve service issues across the stack. This includes debugging Kubernetes pod failures, resolving Kafka consumer lag, and troubleshooting Unix/Linux system errors using logs (Loki) and traces (Tempo).
  • Hybrid Platform Maintenance: Execute routine standardization tasks and health audits for Unix (Solaris/AIX) and Linux (RHEL/Ubuntu) environments to prevent environment drift.
  • Infrastructure Stewardship (DC Support): Perform on-site "Smart Hands" support in our Bethpage data center, including hardware reboots, component swaps, and verifying physical power/network redundancy.
  • Unified Observability: Maintain the "single pane of glass" (Prometheus/Grafana). Create and tune alerts to ensure the engineering team is notified of critical issues while minimizing "alert fatigue."
  • Escalation & Post-Mortems: Follow strict escalation paths to SRE2/SRE3 leads. Assist in complex outage mitigation. Contribute detailed timelines and log data to blameless post-mortems.