About The Position

The Senior Platform Engineer designs, builds, modernizes, and operates the enterprise compute, virtualization, storage, and backup platform across plants, data centers, offices, cloud environments, and remote users. This role owns the compute and resilience platform end to end, including architecture, automation, capacity management, disaster recovery, and operational performance. The position emphasizes Infrastructure as Code, automation-first practices, AI-enabled operations, disaster recovery readiness, and reduction of technical debt to deliver resilient, scalable, and secure compute services aligned to enterprise strategy. The full salary range for this position is $111,200 – $166,800. However, our current budget for a new hire is $111,200 – $150,000, depending on the candidate's specific experience and skills.

Requirements

  • Bachelor’s degree in a related field or equivalent practical experience
  • 8+ years of progressive experience in infrastructure or platform engineering, including at least 5 years in a senior-level role within a defined technology domain
  • Senior-level hands-on experience designing and operating enterprise technology platforms in multi-site and hybrid environments
  • Strong understanding of high availability, resiliency, and enterprise architecture principles
  • Experience with Infrastructure as Code, configuration management, or templated infrastructure practices
  • Advanced automation and scripting experience, such as PowerShell, Python, Terraform, Ansible, or similar
  • Experience with monitoring, logging, telemetry, and AIOps or AI-assisted monitoring platforms
  • Strong understanding of security principles, access controls, and compliance considerations
  • Experience supporting hybrid environments integrating on-premise infrastructure, cloud platforms, and SaaS services
  • Ability to operate production systems with strict uptime and reliability requirements
  • Experience documenting architecture, standards, and operational procedures
  • Familiarity with AI-assisted monitoring or analytics platforms that improve operational effectiveness
  • Ability to communicate complex technical concepts clearly to both technical and non-technical audiences
  • Strong written communication skills for architecture documentation, standards, and executive-level summaries
  • Ability to evaluate complex technical environments, synthesize system data, assess risk, and make sound architectural and operational decisions in dynamic or ambiguous situations
  • Strong analytical thinking, structured problem-solving, and the ability to balance business impact with technical considerations

Nice To Haves

  • Experience supporting manufacturing or operational technology environments
  • Experience working with MSPs providing NOC or SOC services
  • Familiarity with CI/CD practices for infrastructure
  • Experience leading modernization or transformation initiatives
  • Relevant industry certifications aligned to the platform domain, such as cloud, networking, security, or automation credentials, are preferred and may strengthen candidacy

Responsibilities

  • Own the enterprise compute, virtualization, storage, and backup platforms across plants, warehouses, offices, cloud, and remote environments
  • Design for high availability, fault tolerance, scalability, and rapid recovery
  • Ensure platform reliability supports manufacturing uptime, enterprise operations, and business continuity
  • Serve as technical authority for compute architecture, virtualization standards, storage design, and resilience strategy
  • Drive modernization, standardization, and lifecycle management of servers, hypervisors, storage arrays, and backup platforms
  • Reduce technical debt and eliminate configuration drift
  • Act as a technical mentor and escalation point within the platform domain
  • Design and implement resilient, secure, and scalable compute, virtualization, and storage architectures
  • Define and maintain standards, reference designs, and best practices for server builds, cluster design, hypervisor configuration, and storage layout
  • Lead platform upgrades, hypervisor migrations, storage refreshes, and modernization initiatives
  • Ensure integration with adjacent platforms such as network, security, cloud, identity, data, and applications
  • Support hybrid environments spanning on-premises infrastructure, cloud compute platforms (Azure, AWS), and SaaS workloads
  • Design and maintain high-availability clusters and disaster recovery configurations
  • Define compute and infrastructure configurations using code, templates, or structured configuration management tools
  • Establish version-controlled configurations as the system of record for server builds, hypervisor configurations, and storage policies
  • Enable repeatable, low-risk changes through standardized deployment models
  • Reduce manual changes and operational inconsistencies
  • Contribute to CI/CD practices for infrastructure or platform changes
  • Maintain version-controlled repositories as the authoritative source of platform configuration
  • Automate server provisioning, patching, lifecycle management, validation, recovery, and compliance validation
  • Reduce manual operational effort through scripting and workflow automation
  • Partner with MSPs to ensure consistent execution of backup, recovery, and infrastructure runbooks
  • Improve monitoring signal quality across compute, storage, and virtualization layers
  • Design self-healing or auto-remediation capabilities where appropriate
  • Continuously optimize resource utilization, performance, and capacity planning
  • Ensure compute platform resilience, redundancy, backup, and disaster recovery alignment
  • Own backup, recovery, and disaster recovery design and testing processes
  • Maintain documented recovery procedures and conduct periodic DR exercises
  • Partner with Security teams to maintain compliance, segmentation, access controls, and monitoring standards
  • Support enterprise risk management initiatives related to infrastructure stability, ransomware protection, and business continuity
  • Leverage AI-driven monitoring and analytics to detect anomalies and performance risks
  • Support predictive insights related to compute utilization, storage growth, and failure trends
  • Contribute to AI-assisted incident investigation and root cause analysis where tooling supports it
  • Identify opportunities to reduce alert fatigue and improve operational insight using intelligent tooling
  • Partner closely with the business and other IT teams
  • Provide clear architecture diagrams, standards documentation, and operational runbooks
  • Participate in Tier-2 or Tier-3 escalation and on-call rotations within the platform domain
  • Act as secondary or tertiary responder for critical enterprise outages
  • Support cross-functional initiatives tied to modernization and transformation