SRE - Observability

Encora

1d•Remote

About The Position

Are you a technical detective with a passion for scalable, performant, and resilient enterprise applications? Do you thrive in incident management and root cause analysis? Are you excited to collaborate with development teams to enhance application performance and reliability? If so, we want to meet you! Encora is seeking a Senior Application Support Engineer (SRE) to join our dynamic team of consultants. In this role, you’ll lead efforts to ensure the reliability, availability, and performance of mission-critical applications and platforms.

Requirements

2–5 years of experience in Tier 2 or Tier 3 IT support roles (e.g., systems analysis, development, data/reporting).
Proficiency with observability tools (OpenTelemetry, Splunk Cloud/Observability Cloud, AppDynamics, Grafana, Datadog; Splunk preferred).
Familiarity with synthetic testing (Splunk Synthetics, Selenium, etc.).
Familiarity with AWS and/or Kubernetes architecture.
Strong ability to analyze logs and code to resolve Tier 2 issues.
Experience in application-focused support engineering or SRE roles.
Excellent written and verbal communication skills.
Background in DevOps and scripting (Python preferred).
Familiarity with ITIL practices, incident management, and documentation.
Experience with disaster recovery, business continuity, and ServiceNow dashboards.
Comfortable working in Linux environments and shell scripting.

Nice To Haves

Bachelor’s or Master’s degree in Computer Science, Engineering, or related field.
Consulting experience and Agile methodology background.
Proven ability to lead small to medium-sized teams.
Certifications: ITIL Foundation, AWS, Azure, or GCP.
Experience with MuleSoft, Postman, and API testing support.
Experience writing complex SPL (Splunk Search Processing Language) queries for alerting/dashboarding.
Experience with Application Performance Monitoring (APM) and Real User Monitoring (RUM).
Understanding of networking concepts in cloud-native environments (AWS/Kubernetes/OpenShift).

Responsibilities

Be part of a global production operations team responsible for supporting external-facing web applications.
Manage incidents, perform root cause analysis, and implement preventative solutions.
Collaborate with development, infrastructure, and platform engineering teams to improve system reliability.
Work within a global team to enable a 24/7 support model for internal triage, communication, and root cause analysis.
Provide 24/7 support as part of a team for production web applications running on AWS or MuleSoft APIs.
Monitor and troubleshoot issues using observability tools like Splunk.
Create, document, and iterate on application monitoring, including alerts and synthetic tests.
Investigate and expand functionality in Splunk to enable Splunk AI solutions.
Create dashboards and capture metrics to improve visibility and performance.
Respond proactively to system alerts and customer complaints.
Apply industry best practices to support processes.
Participate in a planned on-call rotation.