High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility. Role Overview:<ul><li>Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.</li><li>Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.</li><li>Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.</li></ul> Responsibilities<ul><li>Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements</li><li>Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors</li><li>Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost</li><li>Troubleshooting across the full stack, including hardware, networking, and distributed systems</li><li>Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency</li></ul> Participation in an on-call rotation required (approximately one week per month). 
Key Attributes<ul><li>Strong ownership mindset with focus on delivery and accountability</li><li>Experience building maintainable, well-documented systems in complex environments</li><li>Ability to operate effectively in ambiguous and rapidly evolving contexts</li><li>Clear and effective communication skills with collaborative, low-ego approach</li></ul> Minimum Requirements<ul><li>5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing</li><li>Strong written and verbal communication skills in English</li><li>Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar)</li><li>Programming or scripting experience in Go, Python, or Bash</li><li>Familiarity with infrastructure automation and infrastructure-as-code tools</li><li>Strong technical foundation in computing or related discipline</li></ul> Preferred Experience<ul><li>Experience operating large-scale machine learning or AI-compute workloads</li><li>Background in multi-tenant distributed systems at scale</li><li>Hands-on experience with data centre or bare-metal infrastructure</li><li>Knowledge of high-performance networking technologies</li><li>Experience managing large-scale storage systems (commercial or open-source)</li></ul> Compensation & Benefits<ul><li>Competitive salary and equity package</li><li>Retirement or pension contributions aligned with local standards</li><li>Health coverage including medical, dental, and vision</li><li>Generous paid time off policy</li></ul>

Senior Site Reliability Engineer

Platform Engineer - hybrid

Site Reliability Engineer

Azure Cloud Engineers - IaC, Automation, Azure Devops

Site Reliability Engineer (SRE)

SC Cleared DevOps Engineer

AI Platform Engineer (DevOps / MLOps Focus)

Platform Engineer

Senior DevOps Engineer

DevOps Engineer

Senior Site Reliability Engineer

Job description