Observability Engineer

Neuberger Berman•New York, NY

$110,000 - $130,000•Hybrid

About The Position

Neuberger Berman’s Technology team is seeking an Observability Engineer to lead and evolve our observability strategy across cloud and on-premises environments. You will help build and operate a server monitoring platform that continuously (24/7) validates service health across business-critical systems, including external websites and key infrastructure components (e.g., firewalls, OpenShift). You will design and implement end-to-end monitoring solutions spanning logs, metrics, traces, Service Level Objectives (SLOs), synthetic monitoring, and Real User Monitoring (RUM) to improve reliability, accelerate incident response, and deliver clear visibility into service performance.

This is an individual contributor role with strong engineering and scripting expectations; it is not a pure administrator role, though admin experience is helpful. You will partner closely with application, SRE/DevOps, infrastructure, and security teams and act as a champion/evangelist for observability tooling and standards. The environment currently runs OpenView and is migrating to Datadog, with workflows integrating into ServiceNow for incident/ticket routing and escalation.

Requirements

BS/BA in Computer Science, Information Systems, Engineering, or equivalent experience.
5+ years in Observability/APM/SRE/Platform Engineering with a track record of delivering production-grade telemetry and reliability outcomes.
Proficiency operating in both Windows Server and Unix (Linux/Solaris) environments, including service instrumentation, agent/collector deployment, and OS-specific performance analysis.
Strong experience designing and operating distributed tracing, metrics and logging standards, SLOs/error budgets, and actionable alerting using modern observability practices.
Hands-on experience with cloud monitoring across Azure and AWS, integrating platform telemetry into centralized observability solutions.
Hands-on experience with Observability/APM suites (OpenView, AppDynamics, Datadog) and network management tools (Network Node Manager, Network Automation, NetProfiler).
Scripting and automation expertise (e.g., Python, PowerShell, Bash) and familiarity with APIs/SDKs; experience using infrastructure-as-code to manage observability configurations (e.g., Terraform) and configuration formats (e.g., YAML).
Demonstrated ability to reduce alert noise and MTTR through correlation, enrichment, and threshold tuning; experience producing service maps, dependency views, and clear dashboards.
Excellent communication and stakeholder management skills, with the ability to explain technical concepts to non-technical audiences.
Ability to work independently and collaboratively in a fast-paced environment; strong documentation habits and attention to detail.

Nice To Haves

Experience with .NET development (C#), including instrumentation patterns for observability in .NET applications.
Experience in financial services or other regulated industries.
Familiarity with ITSM integrations and CMDB alignment for incident, problem, and change processes.
Exposure to APM and monitoring suites and event correlation approaches; knowledge of network monitoring concepts.
Experience with CI/CD integration, synthetic testing strategies, and performance/capacity analysis for latency-sensitive systems.
Relevant certifications in observability, cloud monitoring, or related platforms.

Responsibilities

Partner closely with application, DevOps engineering, SRE/operations, infrastructure, and security teams to understand reliability goals and translate them into scalable monitoring/observability solutions across cloud and on-prem environments (Windows and Unix).
Design, build, and maintain scalable observability architectures and platforms, with ownership of monitoring capabilities for key applications and services (application ownership).
Develop automated processes to continuously scan and validate uptime/health (24/7) for business-critical services, including external-facing websites and supporting infrastructure.
Implement and optimize telemetry collection, alerting, dashboards, and service views; drive adoption of OpenTelemetry (OTel) and consistent logging/metrics/tracing standards (core logging and platform telemetry alignment).
Define and operationalize SLOs and implement actionable alerting strategies that reduce noise and improve MTTR through correlation, enrichment, and threshold tuning.
Implement and evolve APM capabilities and user experience monitoring, including RUM (Real User Monitoring) and synthetic monitoring approaches.
Integrate observability tooling with incident/problem management processes and ITSM workflows (e.g., Datadog with ServiceNow); support ticket routing/escalation and produce runbooks, post-incident reviews, and executive/operational reporting.
Automate onboarding and configuration for telemetry, dashboards, monitors, and alerts using scripting and infrastructure-as-code; ensure consistency and repeatability across Windows Server and Unix (Linux/Solaris).
Collaborate on platform evolution and cost/scale optimization, continually improving coverage, data quality, developer experience, and overall reliability outcomes.
Champion and evangelize observability practices and tooling adoption across technology teams, helping incorporate new applications/tools into the monitoring platform.

Benefits

We offer a comprehensive package of benefits including paid time off, medical/dental/vision insurance, retirement, life insurance and other benefits to eligible employees.
