Main Responsibilities:
- AI-Augmented Data Pipelines: Design and maintain AI-augmented, large-scale data pipelines (billions of images) integrating traditional transformations with ML models (classifiers, embeddings, LLMs) for cleaning and annotation.
- Remote Inference Orchestration: Own the systems that orchestrate remote ML model inference within pipelines, managing batching, retries, and async jobs, and ensuring graceful degradation when models are unavailable.
- Feature Pipelines: Build and maintain scalable pipelines for generating, storing, and serving vector embeddings, including nearest-neighbor index management and quality validation.
- Data Curation at Scale: Source, filter, and curate training datasets using a combination of SQL and model-derived signals (e.g., aesthetic scores, NSFW classifiers), owning the end-to-end data flow and maintaining governance, quality, and compliance.
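To give candidates a concrete sense of the inference-orchestration work described above, here is a minimal, hypothetical sketch of batched remote inference with retries and graceful degradation. All names (`batched_inference`, `infer_fn`, `fallback`) are illustrative, not part of any internal system.

```python
import time


def batched_inference(items, infer_fn, batch_size=32, max_retries=3, fallback=None):
    """Run a remote model over items in batches, retrying failed batches
    with exponential backoff and degrading gracefully to a fallback value."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                results.extend(infer_fn(batch))
                break
            except Exception:
                # transient remote failure: back off briefly, then retry
                time.sleep((2 ** attempt) * 0.01)
        else:
            # graceful degradation: emit fallbacks instead of failing the pipeline
            results.extend([fallback] * len(batch))
    return results
```

In production this pattern would typically be wrapped in async job management and per-batch observability, but the core loop (batch, retry with backoff, degrade) is the same.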
Additional Responsibilities:
- LLM-Assisted Annotation: Design and operate pipelines that use LLMs and vision models for automated annotation of training data, including auditing workflows to measure and improve annotation model performance.
- Tooling & Frameworks: Contribute to shared tooling and frameworks that make it easier for the broader team to build AI-augmented data pipelines — e.g., reusable operators for model invocation, standard patterns for async job management.
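The auditing workflows mentioned above amount to sampling model-produced labels and checking them against human gold labels; a minimal, hypothetical sketch (function and variable names are illustrative only):

```python
import random


def audit_annotations(auto_labels, gold_labels, sample_size=100, seed=0):
    """Estimate annotation-model accuracy by auditing a random sample of
    automatically labeled items against human-reviewed gold labels."""
    rng = random.Random(seed)
    keys = rng.sample(sorted(gold_labels), min(sample_size, len(gold_labels)))
    agree = sum(auto_labels[k] == gold_labels[k] for k in keys)
    return agree / len(keys)
```

Tracking this agreement rate over time is one simple way to measure (and decide when to retrain or re-prompt) the annotation models.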
Qualifications:
- Bachelor's degree or higher in Computer Science, Data Engineering, Machine Learning, or a related STEM field.
- 5+ years of industry experience in data engineering, ML engineering, or a hybrid role involving both data pipelines and model serving/inference.
- Demonstrated track record of building and operating production data pipelines that invoke ML models at scale.
Pay: $50.00 - $55.00 per hour
Work Location: In person