Posted 4 days ago

Senior Site Reliability Engineer

Realm

📍 London

Information Technology

Job description

<p>High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.</p><p><strong>Role Overview:</strong></p><ul><li>Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.</li><li>Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.</li><li>Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.</li></ul><p><br></p><p><strong>Responsibilities</strong></p><ul><li>Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements</li><li>Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors</li><li>Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost</li><li>Troubleshooting across the full stack, including hardware, networking, and distributed systems</li><li>Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency</li></ul><p><br></p><p>Participation in an on-call rotation required (approximately one week per month).</p><p><br></p><p><strong>Key 
Attributes</strong></p><ul><li>Strong ownership mindset with focus on delivery and accountability</li><li>Experience building maintainable, well-documented systems in complex environments</li><li>Ability to operate effectively in ambiguous and rapidly evolving contexts</li><li>Clear and effective communication skills with collaborative, low-ego approach</li></ul><p><br></p><p><strong>Minimum Requirements</strong></p><ul><li>5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing</li><li>Strong written and verbal communication skills in English</li><li>Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar)</li><li>Programming or scripting experience in Go, Python, or Bash</li><li>Familiarity with infrastructure automation and infrastructure-as-code tools</li><li>Strong technical foundation in computing or related discipline</li></ul><p><br></p><p><strong>Preferred Experience</strong></p><ul><li>Experience operating large-scale machine learning or AI-compute workloads</li><li>Background in multi-tenant distributed systems at scale</li><li>Hands-on experience with data centre or bare-metal infrastructure</li><li>Knowledge of high-performance networking technologies</li><li>Experience managing large-scale storage systems (commercial or open-source)</li></ul><p><br></p><p><strong>Compensation & Benefits</strong></p><ul><li>Competitive salary and equity package</li><li>Retirement or pension contributions aligned with local standards</li><li>Health coverage including medical, dental, and vision</li><li>Generous paid time off policy</li></ul>