AWS), Ferra

25madison•New York, NY

2d•$145,000 - $190,000•Hybrid

About The Position

Ferra is building AI infrastructure for structural steel estimation. We process large-scale construction drawing PDFs, run computer vision + LLM pipelines, and generate structured steel graphs, takeoffs, and export-ready models.

Our system includes:
Multi-stage ML pipelines (CV + LLM)
Asynchronous job processing (SQS-driven workflows)
Large PDF ingestion and document graph processing
Vector-native parsing and algorithmic geometry systems
Graph storage + export services

Role Overview:
We are hiring a Senior Infrastructure Engineer to own uptime, reliability, latency, and scalability across our entire AWS environment. You will ensure our AI/ML pipelines run reliably at scale, without cloud outages, timeouts, networking bottlenecks, or production instability slowing down our algorithm team.

You will build and maintain production-grade AWS architecture that supports:
Large PDF ingestion (100–500+ sheets)
Computer vision pipelines
LLM inference workflows
Distributed job queues
High-volume asynchronous processing

Your mission is to enable the algorithm team to move fast without worrying about infrastructure.

Requirements

8+ years in infrastructure / DevOps / production engineering
Deep AWS expertise (not just “used it” — architected at scale)
Experience running production ML or AI systems
Experience with asynchronous distributed systems
Strong knowledge of: ECS / Fargate, EC2 (including GPU instances), SQS, S3, VPC networking, and IAM best practices
Strong understanding of: containerization (Docker), CI/CD pipelines, Infrastructure as Code, and observability systems
Experience debugging production incidents and designing fault-tolerant systems

Nice To Haves

Prior exposure to GPU workloads at scale, event-driven architectures, or PDF/document-heavy pipelines.
Bonus if you've done this in a startup environment where the infrastructure and the product were both still being figured out.

Responsibilities

Keep things running. You own uptime (99.9%+), observability, incident response, and root cause analysis. When something breaks, you fix it — and make sure it doesn't break the same way twice.
Own the AWS architecture. The stack spans EC2 (including GPU), ECS/Fargate, SQS, Lambda, S3, CloudFront, API Gateway, and RDS/DynamoDB, plus VPC design, IAM, autoscaling, and monitoring. You'll make the architectural calls, not just maintain what's there.
Make ML pipelines reliable. The core workloads are CV, LLM inference, and long-running batch jobs. You'll build the plumbing: retry logic, idempotency, checkpointing, parallel orchestration. Experience with event-driven or DAG-based pipelines is a plus.
Chase down performance problems. Queue bottlenecks, cold starts, LLM latency, runaway costs: you will find and fix them. You should be comfortable debugging at the TCP, TLS, ECS, and IAM levels.
Help the team ship faster. CI/CD, infrastructure-as-code (Terraform/CDK/Pulumi), clean containerization, and proper staging environments. The goal: deployments are boring and "works on my machine" stops being an excuse.