Staff Data Engineer (Agent Systems)

Faraday Future, Gardena, CA

About The Position

As a Staff Data Engineer (Agent Systems) in our Crypto projects, you will design, deliver, and operate the data platform that powers our agentic products—real-time ingestion (on-chain/market/social), feature stores, vectorization & retrieval for RAG, time-series/streaming computation, and ML observability. You’ll define data contracts and SLAs, ensure offline–online consistency, and partner closely with AI Agents, Backend/BFF, and Security/Compliance.

Requirements

  • Bachelor’s degree or above in CS/EE/Math/Stats, or a related field.
  • 7+ years in data engineering with 3+ years building streaming pipelines/feature stores for production systems.
  • Proficient in Python and SQL, plus one of Java/Scala/Go; strong data-modeling and performance-tuning skills.
  • Streaming & batch: Kafka, Flink/Spark (stateful ops, event-time, watermarking, exactly-once), Airflow/Dagster, dbt.
  • Storage: PostgreSQL/MySQL, ClickHouse/BigQuery/Snowflake, NoSQL (MongoDB/DynamoDB/Bigtable/Firestore), Redis, and lakehouse on Amazon S3 or Google Cloud Storage (GCS) (Parquet + Iceberg/Delta).
  • Feature platforms (e.g., Feast) and online feature serving; offline–online consistency validation.
  • Vector retrieval: embeddings pipelines and vector stores (pgvector/FAISS/Milvus); relevance & recency metrics.
  • Ops/Observability: Docker/K8s; data quality/lineage (OpenLineage/Marquez or similar); cost & throughput optimization.
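Several of the retrieval skills above reduce to nearest-neighbor search over embeddings. A minimal, dependency-free sketch of the core idea (a production system would delegate this to pgvector/FAISS/Milvus with HNSW/IVF indexes; the corpus and ids here are hypothetical):

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: List[float], corpus: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k corpus ids most similar to the query embedding."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], corpus, k=2))  # doc_a ranks first, then doc_b
```

A real pipeline would also track the relevance and recency metrics the role calls for, e.g. by storing an indexed-at timestamp alongside each vector.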

Nice To Haves

  • Crypto-signal ingestion (order books, trades, on-chain events); precision arithmetic and idempotent metrics.
  • Privacy/compliance (GDPR/CCPA), tokenization/pseudonymization strategies.
  • Cost/perf tuning (autoscaling, compaction/retention, caching) and SRE collaboration.
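The "precision arithmetic and idempotent metrics" item can be illustrated with a small sketch (class and field names are hypothetical): deduplicate trade events by id so at-least-once replays don't double-count, and accumulate with `Decimal` rather than `float` to avoid binary rounding drift on monetary amounts:

```python
from decimal import Decimal

class IdempotentVolume:
    """Sums trade volume; replayed events (same event_id) are no-ops."""
    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.total = Decimal("0")

    def apply(self, event_id: str, amount: str) -> None:
        if event_id in self._seen:
            return  # duplicate delivery (at-least-once replay): ignore
        self._seen.add(event_id)
        self.total += Decimal(amount)  # exact decimal arithmetic

vol = IdempotentVolume()
for eid, amt in [("t1", "0.1"), ("t2", "0.2"), ("t1", "0.1")]:  # t1 replayed
    vol.apply(eid, amt)
print(vol.total)  # 0.3 exactly, unlike float's 0.30000000000000004
```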

Responsibilities

  • Platform Architecture: Author data contracts and schemas; produce ADRs; design tiered storage (OLTP/OLAP, lakehouse) with governance and lineage.
  • Streaming & Batch Pipelines: Build low-latency streams (Kafka/Flink or equivalent) and robust batch ETL (Airflow/Dagster + dbt); support CDC, replay/backfill, and schema evolution.
  • Feature Store & Online Serving: Provide point-in-time-correct features (near-real-time/time-series); guarantee offline–online parity and latency SLOs.
  • RAG Data Plane: Orchestrate embedding pipelines, chunking/routing, vector DB (pgvector/FAISS/Milvus), HNSW/IVF indexes, and reindexing/TTL strategies.
  • Evaluation & ML Ops: Materialize canonical eval datasets/labels; wire A/B hooks; manage model/feature registries and CI/CD for ML; enable canary rollouts.
  • Data Quality & Observability: Monitor freshness, completeness, duplication, drift/decay; implement lineage and cost/performance guardrails.
  • Security & Compliance: Enforce PII handling, retention, and auditability; implement least-privilege access to datasets and secrets.
  • Collaboration: Work hand-in-hand with DS/Agents/Backend on interfaces, SLAs, and incident RCAs; document playbooks and standards.
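Point-in-time-correct features (from the feature-store responsibility above) mean each training example sees only the latest feature value written at or before its event timestamp, which prevents label leakage. A minimal stdlib sketch, assuming a feature history kept as a list of `(ts, value)` pairs sorted by timestamp:

```python
import bisect
from typing import List, Optional, Tuple

def point_in_time(history: List[Tuple[int, float]], ts: int) -> Optional[float]:
    """Return the feature value as of ts, or None if nothing exists yet.

    Never returns a value written after ts, so offline training joins
    match what online serving would have seen at that moment.
    """
    idx = bisect.bisect_right(history, (ts, float("inf"))) - 1
    return history[idx][1] if idx >= 0 else None

history = [(100, 1.0), (200, 2.0), (300, 3.0)]
print(point_in_time(history, 250))  # 2.0 — the value at ts=300 is still in the future
```

The same lookup rule, applied identically in the offline backfill and the online store, is what makes offline-online parity checks meaningful.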

Benefits

  • Healthcare + dental + vision benefits (free for you; discounted for family)
  • Casual dress code + relaxed work environment
  • Culturally diverse, progressive atmosphere