Platform Engineer (Reliability)

AdvancedMD•South Jordan, UT

Hybrid

About The Position

AdvancedMD is a unified cloud suite of medical office software hosted on Amazon Web Services (AWS). The suite includes practice management, electronic health records, and patient engagement, and AdvancedMD also offers managed medical billing services for independent practices. AdvancedMD serves a national footprint of 65,000 practitioners across 14,000 practices and 900 independent medical billing companies. The AdvancedMD billing platform processes 8.8M insurance claims every month.

Role Summary

Are you passionate about building reliable, scalable systems that power mission-critical applications? AdvancedMD is seeking a skilled and motivated Platform Engineer (Reliability), or Site Reliability Engineer (SRE), to join our growing ITSecOps organization. In this role, you’ll help bridge the gap between development and operations by applying software engineering principles to infrastructure and operations to improve reliability, performance, and efficiency across our cloud-based SaaS platform.

As a Site Reliability Engineer, you’ll play a key role in ensuring system uptime, performance, and resilience through automation, observability, and proactive capacity planning. You’ll collaborate closely with Product, Engineering, and IT Operations teams to build and maintain reliable cloud-native systems using AWS, Kubernetes, Terraform, and modern monitoring tools.

This is an exciting opportunity to join a healthcare technology leader where your technical expertise, problem-solving skills, and passion for automation will directly enhance the availability and performance of applications used by healthcare professionals nationwide. If you love tackling complex reliability challenges in a fast-paced DevSecOps culture, we’d love to have you on our team.

Requirements

Bachelor’s degree in Computer Science or related field, or equivalent professional experience
3+ years of experience in a technical or operations engineering role in a highly regulated environment
Hands-on experience with cloud platforms, primarily AWS (EC2, RDS, Route53, S3, ECS, Lambda, IAM, VPC, CloudFront); Azure or GCP experience is a plus
Proficiency in one or more scripting or programming languages: PowerShell, Python, Bash, C#, Golang, or TypeScript
Experience managing Windows Server and SQL Server environments; familiarity with Linux administration (Ubuntu)
Experience with Infrastructure as Code (IaC) tools like Terraform, Ansible, or CloudFormation
Knowledge of containerization and orchestration technologies, such as Kubernetes and ArgoCD
Familiarity with source control (Azure DevOps) and work management tools (Jira, Confluence)
Experience with monitoring, APM, and log aggregation tools such as Splunk, Prometheus, Grafana, Nagios, and CloudWatch
Familiarity with distributed tracing concepts and experience using OpenTelemetry to instrument, collect, and analyze telemetry data
Understanding of networking fundamentals, automation frameworks, and DevOps principles
Familiarity with AI tooling and its application in modern development environments to streamline coding and problem solving

Nice To Haves

You approach reliability like an engineer — automating your way out of repetitive tasks and designing systems that heal themselves.
You’re calm under pressure — when incidents occur, you bring structure, communication, and resolution without chaos.
You think in systems — spotting weak points, planning capacity ahead, and improving processes before issues arise.
You thrive in collaboration — partnering with developers, DBAs, and platform teams to deliver measurable improvements in performance and uptime.
You’re a lifelong learner — constantly exploring new tools, AWS services, and reliability practices to keep our systems modern, secure, and efficient.
You’re proactive — you don’t wait for alerts; you anticipate them.

Responsibilities

Ensure proper monitoring, alerting, and observability across production and development environments
Collaborate with Product, Engineering, and IT Operations teams to identify and resolve issues affecting application performance and stability
Design and build self-service tools and automation to reduce manual operational work and improve response times
Participate in Change Management and Incident Review processes, contributing to root cause analysis and long-term fixes
Develop and enhance operational SLOs, SLIs, and SLAs in partnership with engineering teams
Automate scaling and recovery processes to improve system resilience
Support services before they go live through design reviews, capacity planning, and operational readiness assessments
Participate in a shared on-call rotation to ensure 24x7 production system reliability
Continuously evaluate and adopt emerging technologies to optimize performance, cost efficiency, and automation
Contribute to a healthy and collaborative engineering culture through documentation, mentorship, and teamwork