Senior Staff Engineer, Memory Fault Management Architect

Company: Conductor
Location: San Jose
Posted on: January 22, 2025

Job Description:

San Jose, California, United States

Please Note: To provide the best candidate experience amid our high application volumes, each candidate is limited to 10 applications across all open jobs within a 6-month period.

Advancing the World's Technology Together

Our technology solutions power the tools you use every day, including smartphones, electric vehicles, hyperscale data centers, IoT devices, and so much more. Here, you'll have an opportunity to be part of a global leader whose innovative designs are pushing the boundaries of what's possible and powering the future.

We believe innovation and growth are driven by an inclusive culture and a diverse workforce. We're dedicated to empowering people to be their true selves. Together, we're building a better tomorrow for our employees, customers, partners, and communities.

Conventional DRAM failure analysis relied on electrical FA (failure analysis) and physical FA. In the data center era, however, field failure information is much easier to track. Using this data set, the Fault Management team identifies DRAM failure modes and abnormalities and projects failure rates.

You will be part of an incubation team working on in-field telemetry intended to transform the Customer Quality Experience for Samsung memory products. Fault management is the future of quality, minimizing system downtime in AI/ML hardware deployments and the workloads of the future. We analyze trends and patterns in enormous volumes of memory fleet telemetry to bucketize failures and perform virtual root-cause analysis. Telemetry analysis helps us design solutions that proactively avoid system downtime. We conduct research and development both in-house and in collaboration with industry partners, with the opportunity to publish our findings in whitepapers and at conferences. We are looking for innovative and passionate thinkers who can work in a start-up environment and are excited to shape the future of data centers around the world.

Join us in our mission!

What You'll Do

  • Drawing on knowledge of SoC controllers and memory operation, including RAS features, identify and recommend better solutions to reduce the field DRAM failure rate.
  • Communicate improved ECC schemes to customers based on Samsung DRAM failure modes (DQ and burst).
  • Interface with customers to establish the value of enabling an in-field fault management architecture.
  • Contribute to the standardization of DRAM/HBM failure logging in the OCP (Open Compute Project).
  • Propose and develop platform RAS (Reliability, Availability, Serviceability) algorithms for memory fault management, such as page offlining and hPPR, and conduct POCs (proofs of concept) with known-failing DIMMs in real servers and applications.

Location: Hybrid, with at least 3 days in the San Jose, CA office and the remainder of the time working remotely.

What You Bring
  • Bachelor's degree with 15+ years of relevant industry experience, Master's degree with 13+ years, or PhD with 10+ years of experience in hardware fault management, reliability, data center fleet management, or a related technical field preferred.
  • Knowledge of the platform memory subsystem and platform RAS (Reliability, Availability, Serviceability) features such as ECC, page offlining, hPPR, and hardware sparing.
  • Experience in ECC design, verification, and reverse engineering.
  • Understanding of address mapping between the CPU and memory.
  • Experience modifying memory controller registers.
  • Understanding of DRAM and HBM failure modes.
  • Excellent communication and interpersonal skills.
  • Ability to work independently and as part of a team.
  • You're inclusive, adapting your style to the situation and the diverse global norms of our people.
  • An avid learner, you approach challenges with curiosity and resilience, seeking data to help build understanding.
  • You're collaborative, building relationships, humbly offering support, and openly welcoming different approaches.
  • Innovative and creative, you proactively explore new ideas and adapt quickly to change.

What We Offer

The pay range below is for all roles at this level across all US locations and functions. Individual pay rates depend on a number of factors, including the role's function and location, as well as the individual's knowledge, skills, experience, education, and training. We also offer incentive opportunities that reward employees based on individual and company performance.

This is in addition to our diverse package of benefits centered on the wellbeing of our employees and their loved ones. Beyond the usual Medical/Dental/Vision/401k, our inclusive rewards plan empowers our people to care for their whole selves. An investment in your future is an investment in ours.

Equal Opportunity Employment Policy

Samsung Semiconductor takes pride in being an equal opportunity workplace dedicated to fostering an environment where all individuals feel valued and empowered to excel, regardless of race, religion, color, age, disability, sex, gender identity, sexual orientation, ancestry, genetic information, marital status, national origin, political affiliation, or veteran status.

When selecting team members, we prioritize talent and qualities such as humility, kindness, and dedication. We extend comprehensive accommodations throughout our recruiting processes for candidates with disabilities, long-term conditions, neurodivergent individuals, or those requiring pregnancy-related support. All candidates scheduled for an interview will receive guidance on requesting accommodations.