Site Reliability Engineer (SRE)

SoftPro•Raleigh, NC

1d•Remote

About The Position

SoftPro is seeking a well-rounded Site Reliability Engineer (SRE) to join our amazing Cloud Operations Team in our Raleigh, NC office or as a remote employee. The team supports our solutions using the latest Microsoft technologies and hosting services, including Microsoft Azure, and encourages your design input and creativity. The ideal candidate will develop in-depth knowledge of the platforms that run our solutions. This position will ensure our SLAs are met through the engineering solutions the team develops.

Requirements

Infrastructure as Code: Ability to provision cloud infrastructure, services and resources using Infrastructure as Code, including developing Terraform modules and using Ansible to define and set the configuration of deployed resources.
Automation: Ability to automate tasks and orchestrate infrastructure using PowerShell, Azure CLI, Ansible, CI/CD and other languages / tools.
Operating Systems: Strong understanding of Linux/Unix command line and administration, and of Windows Server OS administration. Experience building, deploying, patching & managing orchestration platforms (Service Fabric & Kubernetes), VM Scale Sets and other Azure compute workload types.
Proven experience with containerization technologies such as Docker & Kubernetes, and with cloud-native applications.
Strong knowledge of distributed computing concepts and microservices.
Ability to manage work via a ticketing system and source control (for example: Jira / DevOps / TFS / Git repos).
Familiarity with CI/CD pipeline development and associated tools.
Working knowledge of application performance monitoring & alerting platforms / systems (for example, Azure Monitor, Application Insights and AWS CloudWatch, as well as commercial off-the-shelf solutions).
Experience architecting & implementing High Availability / Disaster Recovery Solutions for Cloud Platforms constructed with Public Cloud IaaS & PaaS resources.
Databases: Familiarity with structured & document database technologies (MS SQL Server, MySQL, MongoDB / Azure Cosmos DB).
Experience in support and configuration of reverse proxy technology stacks such as NGINX and HAProxy.
Strong operational knowledge of Identity and Access Management, Authentication & Authorization, and Role-Based Access Control configurations.
Collaboration & Teamwork: Strong customer service skills and the ability to work in a team environment with developers and operations teams. Experience supporting highly available systems & production operations.
Communication: Proven ability to clearly and concisely communicate technical information to both technical and non-technical audiences.
Incident Management: Ability to think clearly and logically under pressure during system outages or impactful events. Able to adapt to the constantly changing landscape of technologies, systems and environments.
Self-driven learner: Must exhibit a high level of logical & analytical skill as well as attention to detail.

Nice To Haves

An understanding of industry-standard authentication protocols such as OpenID Connect (OIDC), OAuth 2.0 and Security Assertion Markup Language (SAML).
Operational experience with and knowledge of the OSI model, software-defined networks, public cloud network infrastructure / technologies, design principles and network security.
Understanding of the SDLC, object-oriented concepts and software development languages.
Audit & Compliance experience.

Responsibilities

Produce and refine observability metrics associated with our platforms and the user experience.
Help refine our service level indicator (SLI) monitoring capabilities with the goal of proactively identifying and resolving issues.
Participate in the development & architecture of our operational systems and solutions, including performance, application health, security, and scalability for all cloud-based environments.
Develop and maintain our Infrastructure as Code used to build alerting and configure monitoring systems.
Build and maintain CI/CD automation pipelines to support the services and systems we provide.
Demonstrate your technical curiosity through your drive to understand, in depth, the services we leverage.
Collaborate with product development and engineering teams in support of future organizational direction.
Provide incident response; triage production issues in our software products, including the underlying cloud infrastructure & service providers.
Define, implement and assess system reporting and monitoring needs.
Participate in team activities (standups, backlog refinements, blameless postmortems, product demonstrations / reviews).
Share your passion for problem solving and continuous improvement with the team.