Senior Systems Administrator

Nasuni • Boston, MA

Hybrid

About The Position

We are seeking a Senior Systems Administrator to own and operate the virtualization and internal infrastructure platform that powers Nasuni’s Engineering organization. This role is responsible for ensuring the reliability, scalability, automation, and observability of our enterprise VMware environment (100+ hosts across multiple clusters), CI/CD infrastructure, log aggregation systems, and the hardware stack supporting development workloads.

This is a deeply hands-on infrastructure role for someone with strong VMware expertise and advanced Infrastructure-as-Code capabilities (Packer, Terraform, Ansible) who can modernize systems through automation while maintaining production stability. This is not a management role. It is a senior individual contributor position with clear domain ownership.

This role:

Owns the Engineering virtualization and internal infrastructure platform end-to-end.
Operates with high autonomy within defined infrastructure boundaries.
Makes technical decisions for virtualization, storage, automation, CI/CD infrastructure, and observability tooling.
Leads lifecycle management including upgrades, patching, capacity planning, and reliability improvements.
Designs and implements infrastructure-as-code standards and reusable automation frameworks.
Owns CI/CD execution environments (Jenkins agents, GitHub Actions runners).
Owns log aggregation and monitoring infrastructure.
Partners cross-functionally with Engineering, DevOps, and IT.
Drives modernization within scope but does not define company-wide infrastructure strategy.

Requirements

7+ years of Systems Administration or Infrastructure Engineering experience.
Demonstrated ownership of large-scale, multi-cluster VMware environments (50+ hosts).
Strong production experience with Terraform, Packer, and Ansible.
Experience designing reusable Infrastructure-as-Code modules and automated image pipelines.
Production experience supporting CI/CD platforms (Jenkins, GitHub Actions).
Experience implementing or maintaining log aggregation or observability platforms.
Strong Linux systems administration experience.
Experience managing high-availability production environments.
Strong troubleshooting and root cause analysis capabilities.
Ability to independently manage complex infrastructure domains.

Nice To Haves

Experience with Dell PowerStore or enterprise storage platforms.
Production Kubernetes or container orchestration experience.
Experience supporting internal developer platforms.
Networking experience (switching, fabric, datacenter networking).
Experience integrating observability into CI/CD pipelines.
Hardware lifecycle and datacenter operations experience.
Experience designing automation frameworks adopted across engineering teams.
Track record of reducing infrastructure-related CI/CD failures through platform optimization.
Experience implementing a centralized log aggregation strategy that improved MTTR.
Experience leading virtualization modernization initiatives with measurable impact.
Track record of delivering automation that reduced operational workload by 25% or more.

Responsibilities

Administer and optimize VMware vCenter across 100+ hosts and multiple clusters.
Manage Dell PowerStore backend storage and ensure high availability.
Maintain internal Linux and Windows VM environments.
Oversee physical hardware lifecycle management for virtualization platforms.
Perform lifecycle upgrades, patching, and performance tuning.
Design and maintain reusable Infrastructure-as-Code frameworks using Terraform.
Build and maintain automated VM image pipelines using Packer.
Develop configuration management workflows using Ansible.
Standardize and version infrastructure definitions across environments.
Reduce manual operational tasks through automation and orchestration.
Own and optimize CI/CD execution infrastructure (Jenkins agents, GitHub Actions runners).
Improve reliability, performance, and scalability of build pipelines.
Troubleshoot pipeline execution failures tied to infrastructure constraints.
Partner with engineering teams to integrate infrastructure automation into workflows.
Own and maintain log aggregation and monitoring platforms (e.g., Prometheus, Grafana, Loki, ELK, or equivalent).
Ensure system visibility across compute, storage, and CI/CD workloads.
Implement performance dashboards and alerting standards.
Drive improvements in system observability and incident detection.
Ensure disaster recovery readiness and high availability.
Lead root cause analysis for infrastructure incidents.
Improve system hardening and security posture.
Maintain documentation and operational standards.