Senior Staff Engineer, Memory Fault Management Architect
Company: Conductor
Location: San Jose
Posted on: January 22, 2025
Job Description:
Senior Staff Engineer, Memory Fault Management Architect
San Jose, California, United States

Please Note: To provide the best candidate experience amidst our high
application volumes, each candidate is limited to 10 applications
across all open jobs within a 6-month period.

Advancing the World's Technology Together

Our
technology solutions power the tools you use every day--including
smartphones, electric vehicles, hyperscale data centers, IoT
devices, and so much more. Here, you'll have an opportunity to be
part of a global leader whose innovative designs are pushing the
boundaries of what's possible and powering the future.

We believe
innovation and growth are driven by an inclusive culture and a
diverse workforce. We're dedicated to empowering people to be their
true selves. Together, we're building a better tomorrow for our
employees, customers, partners, and communities.

Conventional DRAM failure analysis relied on electrical failure
analysis (FA) and physical FA. In the data center era, however, field
failure information is easier to track. Using this data set, the Fault
Management team's role is to identify DRAM failure modes and
abnormalities and to project failure rates.

You will be part of an incubation team working on
in-field telemetry intended to transform the Customer Quality
Experience for Samsung memory products. Fault Management is the
future of quality to minimize system downtime within AI/ML hardware
deployments and workloads of the future. We analyze trends and
patterns from enormous memory fleet telemetry to bucketize failures
and perform virtual root-cause analysis. Telemetry analysis helps
us design solutions to proactively avoid system downtime. We
conduct research and develop both in-house and collaboratively in
the industry with the opportunity to publish our findings through
whitepapers and conferences. We are looking for innovative and
passionate thinkers who can work in a start-up environment and are
excited to shape the future of data centers around the world. Join
us in our mission!

What You'll Do
- Drawing on knowledge of SoC controller and memory operation,
including RAS features, find and recommend better solutions to
reduce the field DRAM failure rate.
- Communicate better ECC schemes to customers based on Samsung
DRAM failure modes (DQ and burst).
- Interface with customers to establish the value add of enabling
in-field fault management architecture.
- Contribute to the standardization of DRAM/HBM failure logging
in the OCP.
- Propose and develop platform RAS (Reliability, Availability,
Serviceability) algorithms for memory fault management, such as page
offlining and hPPR, and conduct POCs with known-failing DIMMs in
real servers and applications.

Location: Hybrid, with at least 3 days in the San Jose, CA office and
the remainder of the time working remotely.

What You Bring
- Bachelor's with 15+ years of relevant industry experience,
Master's with 13+ years, or PhD with 10+ years; experience in
hardware fault management, reliability, data center fleet management,
or a related technical field preferred.
- Knowledge of the platform memory subsystem and platform RAS
(Reliability, Availability, Serviceability) features such as ECC,
page offlining, hPPR, and hardware sparing.
- ECC design, verification, and reverse engineering experience.
- Understanding of the address mapping between CPU and memory.
- Memory controller register modification.
- DRAM and HBM failure mode understanding.
- Excellent communication and interpersonal skills.
- Ability to work independently and as part of a team.
- You're inclusive, adapting your style to the situation and
diverse global norms of our people.
- An avid learner, you approach challenges with curiosity and
resilience, seeking data to help build understanding.
- You're collaborative, building relationships, humbly offering
support and openly welcoming approaches.
- Innovative and creative, you proactively explore new ideas and
adapt quickly to change.

What We Offer

The pay range below is for all
roles at this level across all US locations and functions.
Individual pay rates depend on a number of factors, including the
role's function and location, as well as the individual's
knowledge, skills, experience, education, and training. We also
offer incentive opportunities that reward employees based on
individual and company performance.

This is in addition to our
diverse package of benefits centered around the wellbeing of our
employees and their loved ones. In addition to the usual
Medical/Dental/Vision/401k, our inclusive rewards plan empowers our
people to care for their whole selves. An investment in your future
is an investment in ours.

Equal Opportunity Employment Policy

Samsung
Semiconductor takes pride in being an equal opportunity workplace
dedicated to fostering an environment where all individuals feel
valued and empowered to excel, regardless of race, religion, color,
age, disability, sex, gender identity, sexual orientation,
ancestry, genetic information, marital status, national origin,
political affiliation, or veteran status.

When selecting team
members, we prioritize talent and qualities such as humility,
kindness, and dedication. We extend comprehensive accommodations
throughout our recruiting processes for candidates with
disabilities or long-term conditions, neurodivergent individuals, or
those requiring pregnancy-related support. All candidates scheduled
for an interview will receive guidance on requesting
accommodations.