Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

Intro

Cohere is seeking talented individuals to join its Infrastructure Team in roles such as Storage Engineer, Site Reliability Engineer, and MLOps. The team is responsible for building world-class infrastructure critical to Cohere's success in AI systems, and it values diversity, inclusivity, and a collaborative work environment. If you have experience in engineering, supporting machine learning engineers (MLEs) or data scientists, designing distributed systems with Kubernetes, and working in complex Linux-based environments, this could be the place for you. Cohere offers a range of perks and benefits to support employees' well-being and personal development.

Tasks

Designing, deploying, supporting, and troubleshooting in complex Linux-based distributed computing environments
Running production infrastructure at a large scale
Working with and supporting MLEs or data scientists
Synchronizing data between different cloud providers
Building internal tooling to help data engineers manage costs and optimize the data lifecycle

Requirements

5+ years of engineering experience running production infrastructure at a large scale
Experience with Kubernetes and GPU workloads
Experience with distributed filesystems like Lustre, Weka, or Vast
Experience in designing, deploying, supporting, and troubleshooting in complex Linux-based distributed computing environments

Benefits

Open and inclusive culture
Work on cutting-edge AI research
Weekly lunch stipend
Full health and dental benefits
Parental leave top-up
Personal enrichment benefits
Remote-flexible with offices in multiple locations
6 weeks of vacation
