IT Operations Senior Manager

FanDuel, Atlanta, GA

About The Position

FanDuel is looking for a dynamic Senior Manager, IT Operations (AIOps & Incident Automation) to lead a globally distributed 24/7 IT Operations function and a human-on-the-loop team focused on automating end-to-end incident management across our products (Sportsbook, Casino, Fantasy, Retail, Racing, and more). This role combines hands-on technical leadership with people management to reduce operational toil and improve reliability through AIOps, workflow orchestration, runbook automation, and data-driven prevention. You will ensure automation is safe and auditable, with the right human oversight for high-impact decisions.

Reporting to the Sr. Director, Tech Ops, you will partner closely with Engineering, SRE, and Service Management to shift operational ownership left, improve production readiness, and drive preventative actions that reduce incident frequency and customer impact.

In addition to the specific responsibilities outlined in this posting, employees may be required to perform other duties as assigned by the Company. This ensures operational flexibility and allows the Company to meet evolving business needs.

Requirements

  • Bachelor’s or master’s degree in Computer Science or Engineering, or equivalent practical experience, is preferred.
  • 7+ years of experience in production operations (IT Ops, SRE, NOC, or similar), including 5+ years leading people and/or managers in a 24/7 environment, is preferred.
  • Experience improving reliability through automation and operational excellence, including incident lifecycle improvements and post-incident prevention.
  • Hands-on experience designing automation and workflows using scripting or programming (e.g., Python), APIs, and orchestration tools.
  • Strong understanding of observability (monitoring, logging, tracing), alerting strategy, and incident response best practices.
  • Experience partnering with Engineering/SRE to drive shift-left initiatives and influence service ownership, production readiness, and on-call standards.
  • Comfortable with AIOps concepts (event correlation, anomaly detection, noise reduction) and human-on-the-loop oversight for automated decisioning.
  • Excellent communication skills, including the ability to translate complex technical issues to non-technical stakeholders and senior leaders.
  • Strong judgment under pressure with a bias for action, accountability, and continuous learning.
  • Strong understanding of cloud services and modern infrastructure (e.g., AWS, Google Cloud, Azure), including containerized and distributed systems.

Nice To Haves

  • Experience implementing or operating AIOps/incident orchestration platforms and integrating them with ITSM, paging, and collaboration tools.
  • Experience with Service Management and incident tooling (e.g., ServiceNow, Jira, PagerDuty/Opsgenie) and building automation around them via APIs.
  • Experience in regulated or compliance-driven environments (e.g., SOX, SOC 2) with strong documentation and audit practices.
  • Familiarity with GenAI/LLM-assisted operations (prompting, evaluation, guardrails) and an interest in safely scaling automation in production.
  • Willingness to work nights, weekends, and holidays if necessary, and to be on call for key events and major incidents.

Responsibilities

  • Lead and develop a team of Technical Operations Engineers, setting clear expectations for 24/7 coverage, quality, and customer impact.
  • Own the AIOps and incident automation roadmap, including event correlation, alert noise reduction, auto-triage, automated communications, and runbook execution.
  • Drive preventative actions through trend analysis, problem management, recurring incident elimination, and strong follow-through on post-incident action items.
  • Implement and continuously improve ITIL-aligned incident, problem, and change practices with a focus on speed, clarity, and learning.
  • Act as an escalation point for major incidents (P1/P2) and coordinate real-time response, stakeholder communications, and executive updates.
  • Partner with Engineering and SREs to shift left: strengthen production readiness, on-call hygiene, runbooks, alert quality, and self-service remediation patterns.
  • Define and improve observability and operations analytics (metrics/logs/traces), ensuring actionable alerting and clear service health signals.
  • Track and report on key operational metrics (MTTD/MTTR, uptime, alert volume, automation coverage, incident recurrence, toil reduction, SLA/SLO performance).
  • Establish guardrails for AI and automation (human approval workflows, auditability, rollback plans, and change control) appropriate for a regulated environment.
  • Manage third-party providers and tooling integrations, enforcing SLAs and continuously improving reliability of the end-to-end operational toolchain.

Benefits

  • We offer amazing benefits above and beyond the basics. We have an array of health plans to choose from (some as low as $0 per paycheck) that include programs for fertility and family planning, mental health support, and fitness benefits.
  • We offer generous paid time off (PTO and sick leave), annual bonus and long-term incentive opportunities (based on performance), a 401(k) with up to a 5% match, commuter benefits, pet insurance, and more. See FanDuel Total Rewards for the full list of benefits.
  • Benefits differ across location, role, and level.
  • This role may offer the following benefits: medical, vision, and dental insurance; life insurance; disability insurance; a 401(k) matching program; among other employee benefits.
  • This role may also be eligible for short-term or long-term incentive compensation, including, but not limited to, cash bonuses and stock program participation.
  • This role includes paid personal time off and 14 paid company holidays.
  • FanDuel offers paid sick time in accordance with all applicable state and federal laws.