HPC SRE

Posted 5 days ago

London, Greater London
Any
External
Expires In 3 months

High Performance Compute (HPC) Site Reliability Engineer (Senior Level Role) - Leading AI Company

Remote | United Kingdom (sponsorship not offered)

The client: Our client is a start-up that offers an innovative form of GPU computing infrastructure and cloud-native orchestration solutions to technology and AI firms worldwide. 🧑🏿‍💻👨🏼‍💻👩🏾‍💻

What they need: They are looking for a senior HPC SRE to play a key role in guaranteeing the reliability of their HPC environments. You will have the opportunity to design and build HPC infrastructure, including setting up clusters from scratch. Key skills include HPC architecture, networking technologies (e.g. TCP, UDP, IPv4 or MP-BGP), and network protocols (e.g. RoCE or RDMA).

In this position you will:
- Set up HPC clusters from scratch.
- Take ownership of investigations into high-priority incidents, identify solutions, and prepare Root Cause Analysis (RCA).
- Work in the AI space, where you will get exposure to the latest Machine Learning technology.

In addition, you will have the chance to help grow and enhance the community, and to be seen as a leader within your specific discipline.

Please apply to find out more.
