Senior Site Reliability Engineer
Company: Crusoe
Location: San Francisco
Posted on: February 21, 2026
Job Description:
Crusoe’s mission is to accelerate the abundance of energy and
intelligence. We’re crafting the engine that powers a world where
people can create ambitiously with AI, without sacrificing scale,
speed, or sustainability. Be a part of the AI revolution with
sustainable technology at Crusoe. Here, you’ll drive meaningful
innovation, make a tangible impact, and join a team that’s setting
the pace for responsible, transformative cloud infrastructure.

About This Role:
Crusoe is building the most reliable, energy-efficient, AI-optimized
cloud platform, and operational excellence is at the heart of that
mission. As a Site Reliability Engineer focused on Operational
Excellence, you will help ensure the stability, resilience, and
performance of Crusoe’s GPU cloud. This role is ideal for engineers
who thrive in fast-paced environments, enjoy solving operational
problems, and want to grow their technical career while supporting
incident response, reliability, and continuous improvement across a
large-scale distributed platform. You’ll partner closely with
senior SREs, infrastructure engineers, and platform teams to
improve reliability, reduce operational toil, and strengthen
Crusoe’s incident management practices.

What You’ll Be Working On:
- Collaborate with cross-functional teams to define and refine
  availability metrics for Crusoe’s cloud infrastructure, including
  establishing, tracking, and improving SLIs and SLOs.
- Assist in incident response by identifying, diagnosing, and
  resolving service disruptions, and support post-incident processes
  through RCA documentation and participation in post-incident
  reviews.
- Build, operate, and monitor infrastructure health using Crusoe’s
  observability stack (Prometheus, Grafana, Alertmanager,
  OpenTelemetry).
- Identify and communicate reliability risks, performance
  bottlenecks, and early indicators of potential incidents that
  could impact service availability.
- Develop automation and tooling to reduce operational toil,
  minimize manual intervention, and enhance service recovery and
  self-healing capabilities.
- Partner with compute, network, storage, and platform teams to
  improve service resilience and strengthen disaster recovery
  readiness.
- Contribute to knowledge sharing, process improvements, and the
  development of operational best practices across the organization.
- Participate in ongoing training, mentorship, and professional
  development to grow into advanced SRE responsibilities.

What You’ll Bring to the Team:
- 5 years of experience in cloud operations, SRE, or related roles
- Understanding of cloud platforms and infrastructure fundamentals
  (Kubernetes, AWS/GCP, virtualization, distributed systems)
- Familiarity with incident management practices and operational
  frameworks (SRE/ITIL/etc.)
- Experience with monitoring and alerting tools (Prometheus,
  Grafana) or a strong willingness to learn
- Familiarity with infrastructure-as-code and configuration
  management tools such as Terraform and Ansible
- Basic scripting and automation experience (Go, Python, C, C++, or
  similar)
- Strong communication skills, with the ability to clearly
  articulate technical issues to diverse stakeholders
- Ability to stay calm, focused, and effective in fast-moving or
  high-pressure situations
- A growth mindset with enthusiasm for operational excellence,
  reliability engineering, and continuous improvement

Bonus Points:
- Experience with Kubernetes, container orchestration, or
  large-scale distributed systems
- Exposure to change management, operational readiness reviews, or
  structured RCAs
- Familiarity with self-healing systems, automated remediation, or
  event-driven operations
- Interest in scaling AI/HPC infrastructure and solving reliability
  challenges in GPU-heavy environments
- Passion for learning, mentorship, and developing deeper SRE
  capabilities over time

Benefits:
- Industry-competitive pay
- Restricted Stock Units in a fast-growing, well-funded technology
  company
- Health insurance package options that include HDHP and PPO,
  vision, and dental for you and your dependents
- Employer contributions to HSA accounts
- Paid Parental Leave
- Paid life insurance, short-term and long-term disability
- Teladoc
- 401(k) with a 100% match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal
- Company-paid commuter benefit: $300 per month

Compensation:
Compensation will be paid in the range of $172,000 - $209,000, plus
bonus. Restricted Stock Units are included in all offers.
Compensation to be determined by the applicant’s education,
experience, knowledge, skills, and abilities, as well as internal
equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are
made without regard to race, color, religion, disability, genetic
information, pregnancy, citizenship, marital status, sex/gender,
sexual preference/orientation, gender identity, age, veteran
status, national origin, or any other status protected by law or
regulation.