About The Position

Global trade still runs on outdated, manual workflows, and we are fixing that by building AI agents for the logistics industry. Our AI works alongside humans, automating document-heavy tasks so companies can process shipments faster and with fewer errors.

We have moved past the "zero-to-one" phase and achieved clear product-market fit. We are seeing rapid traction, with more than 100% month-over-month revenue growth, and are already deployed with customers processing meaningful operational volume. We've raised $5M from First Round Capital and Pear VC and are now scaling the breadth and depth of our platform. Our deeply technical team comes from Google, LinkedIn, Salesforce, and top schools and AI research labs.

The Role

We are looking for a Production Engineer who lives at the intersection of software development and systems engineering. Your mission is to ensure our production environment is rock-solid, automated, and observable. You will own our CI/CD pipelines, manage our AI infrastructure, and build the internal tools that empower our development team to ship code faster and more reliably.

Requirements

  • Backend Proficiency: Strong experience in at least one backend language (e.g., Python, Go, Java) to contribute to internal tools and understand application logic.
  • Infrastructure as Code (IaC): Hands-on experience with Terraform, CloudFormation, or Ansible.
  • Containerization: Deep knowledge of Docker and orchestration (Kubernetes/ECS).
  • Cloud Platforms: Solid working knowledge of Google Cloud Platform (GCP).
  • CI/CD Tools: Experience with GitHub Actions, GitLab CI, or Jenkins.

Responsibilities

  • Availability: Own the "uptime" of our services. Design and implement self-healing systems to minimize downtime and manual intervention.
  • CI/CD & Deployments: Architect and manage robust deployment pipelines to ensure feature releases are seamless and reversible.
  • AI Infrastructure: Manage specialized pipelines for AI and human-in-the-loop systems.
  • Databases and Compliance: Manage database operations, including performance tuning and backups, and ensure our data practices meet compliance requirements.
  • Scalability: Monitor system performance and proactively scale infrastructure to handle traffic spikes.
  • Monitoring: Build and maintain comprehensive dashboards using tools like Prometheus, Grafana, or Datadog.
  • Alerting: Define and implement "Golden Signals" (Latency, Traffic, Errors, and Saturation) to ensure we know about issues before our customers do.
  • Incident Response: Lead the "Post-Mortem" process: analyze why things broke and write code to ensure they never break the same way twice.
  • Custom Tooling: Use your backend skills (preferably Python) to build internal CLI tools, automated scripts, and status dashboards.
  • Developer Experience: Act as a bridge for the dev team, making "the right way to deploy" the "easiest way to deploy."
© 2024 Teal Labs, Inc