[Remote] Sr. Engineering Manager, MLOps

Remote, USA | Full-time | Posted 2026-06-20

Note: This is a remote position open to candidates in the USA. Quince is a tech company disrupting the retail industry by leveraging AI, analytics, and automation. They are seeking a Senior Engineering Manager, MLOps to build and scale the infrastructure that supports production-grade Machine Learning, ensuring seamless operations for their Data Scientists and AI Researchers.

Responsibilities

  • Define the MLOps Vision & Strategy: Architect a long-term roadmap that transitions ML workflows from manual scripts to a fully automated, self-service platform for all Quince Data Scientists and AI Researchers
  • Own the "Paved Road" for Production: Build and maintain the end-to-end infrastructure for model training, deployment, and serving, ensuring researchers can move from "idea to production" with zero friction
  • Drive Strategic Prioritization: Partner with business leaders to align infrastructure investments with core e-commerce drivers like real-time personalization, dynamic pricing, and inventory forecasting
  • Lead "Build vs. Buy" Evaluations: Make high-judgment decisions on when to leverage cloud-native services (e.g., SageMaker, Vertex AI) versus building custom internal tools to optimize for cost, speed, and flexibility
  • Guarantee System Scalability & Reliability: Oversee the uptime and performance of production ML services, ensuring the stack can handle massive traffic surges and seasonal spikes without degradation
  • Manage Compute Governance & Costs: Direct the optimization of high-cost computational resources, such as GPU clusters and cloud instances, balancing high-performance training needs with fiscal responsibility
  • Recruit and Mentor Top Talent: Build and lead a high-performing team of ML Infra and DevOps engineers, providing technical coaching, career pathing, and performance management
  • Establish MLOps Standards: Drive the adoption of best practices in CI/CD for ML, Infrastructure as Code (IaC), and automated testing to ensure a modular and maintainable system
  • Bridge the Research-Engineering Gap: Act as the primary cross-functional lead, translating the complex needs of AI Researchers into actionable engineering requirements for the infrastructure team
  • Define and Track Velocity Metrics: Establish KPIs for the infrastructure team, such as model deployment frequency, mean time to recovery (MTTR), and infrastructure cost per inference
  • Champion Operational Excellence: Lead root-cause analyses (RCAs) for production failures and foster a culture of accountability where systemic fixes are prioritized over "quick patches"
  • Stay Ahead of the AI Curve: Monitor emerging trends in LLM-ops, vector databases, and real-time feature engineering to ensure Quince’s infrastructure remains competitive and future-proof

Skills

  • 10+ years of industry experience, including 3-5 years in a leadership or management role specifically focused on ML Infrastructure, MLOps, or large-scale Data Platform engineering
  • Proven track record of building and scaling MLOps platforms that support the full model lifecycle, from data ingestion and distributed training to real-time inference and monitoring
  • Deep technical expertise in cloud-native infrastructure (preferably AWS) and orchestration tools like Kubernetes (EKS), Docker, and Infrastructure as Code (Terraform/Pulumi)
  • Hands-on experience with ML frameworks and tooling, such as PyTorch, TensorFlow, Kubeflow, or SageMaker, and a strong opinion on how to integrate them into a cohesive developer experience
  • Expertise in building and managing Feature Stores and high-throughput data pipelines (using tools like Spark, Flink, or Kafka) to ensure data consistency across training and serving
  • Experience partnering with AI Research and Data Science teams to understand their unique workflows and translate research needs into robust, scalable engineering solutions
  • Strong understanding of CI/CD for ML, including automated testing for models, model versioning, and 'blue-green' or 'canary' deployment strategies
  • Demonstrated ability to manage high-cost compute resources, with experience optimizing GPU utilization and cloud spend in a hyper-growth environment
  • Excellence in operational leadership, with a history of driving service availability, performance, and stability through rigorous on-call rotations and root-cause analysis
  • A product-oriented mindset, with the ability to treat infrastructure as a platform and prioritize the roadmap based on researcher velocity and business ROI
  • Exceptional communication and influence skills, capable of navigating ambiguity and building consensus across engineering, product, and data science leadership
  • Kindness and high standards: You move fast and push for excellence, but you do so as a supportive team player who fosters a culture of psychological safety and extreme candor

Benefits

  • Bonus and equity may be provided for eligible roles

Company Overview

  • Quince is an e-commerce company that offers apparel, accessories, home goods, and personal care products through an online platform. It was founded in 2018 and is headquartered in San Francisco, California, USA, with a workforce of 1001-5000 employees. Its website is https://www.quince.com.

Company H1B Sponsorship

  • Quince has a track record of offering H1B sponsorships, with one sponsorship in 2023. Please note that this does not guarantee sponsorship for this specific role.