Sr. System Engineer
Company: Support Revolution
Location: San Jose
Posted on: September 22, 2024
Job Description:
Location: San Jose, California, United States

About Supermicro:
Supermicro is a Top Tier provider of advanced server,
storage, and networking solutions for Data Center, Cloud Computing,
Enterprise IT, Hadoop/Big Data, Hyperscale, HPC and IoT/Embedded
customers worldwide. We are the #5 fastest growing company among
the Silicon Valley Top 50 technology firms. Our unprecedented
global expansion has provided us with the opportunity to offer a
large number of new positions to the technology community. We seek
talented, passionate, and committed engineers, technologists, and
business leaders to join us.

Job Summary:
As a Sr. System Engineer,
you'll be the go-to person to roll out and maintain
business-critical applications and services for Supermicro. You are
also responsible for resolving escalated service issues, coaching
other engineers toward resolutions, and engineering and implementing
complex projects. You will work independently, show the leadership
to drive technical development, and have excellent communication
skills.

Essential Duties and Responsibilities:
Includes the following
essential duties and responsibilities (other duties may also be
assigned):

- Execute comprehensive system-level rack tests on the latest NVIDIA
and AMD GPUs and on ARM-based, Intel Xeon, and AMD EPYC processors,
covering functionality, compatibility, performance, stress, and
reliability testing, using proprietary in-house tools.
- Establish expertise in HPC/AI applications and benchmarks, deliver
training sessions to customers and partners, resolve complex
customer support issues, and build robust processes and procedures
for HPC/AI solutions.
- Conduct proof-of-concept design and testing, providing optimized
benchmarks for HPC/AI applications in a timely manner. Fine-tune
BIOS settings, optimize OS/network configurations, and develop
diverse simulation configurations to enhance efficiency across
various workloads.
- Deliver on-site deployment services, ensuring customer acceptance
verification and providing post-level 1&2 support. Create and
maintain technical documentation, including technical notes, blogs,
and diagrams, to facilitate knowledge dissemination.
- Identify and document hardware and software quality issues, and
collaborate with Product Management and other Engineering teams to
integrate customer feedback into future product enhancements.
- Proactively engage in HPC roadmap development, planning software
and hardware upgrades to sustain exceptional HPC infrastructure
performance.
- Document and analyze test plans, reports, and logs, and actively
contribute to the development of test utilities and automation
scripts to streamline testing processes.
- Travel is required.

Qualifications:
- BS/MS in Electrical Engineering,
Computer Engineering or Computer Science.
- 8+ years of work-related experience in Deep Learning and Machine
Learning.
- 8+ years of Linux/networking debugging/testing or relevant
experience preferred.
- Experience with leading AI/ML frameworks such as PyTorch,
TensorFlow, and ONNX.
- Experience with DevOps or cloud environments, including but not
limited to Docker/containers and Kubernetes.
- Hands-on experience with workload managers/schedulers (Slurm) for
rack/cluster.
- Familiarity with MLPerf Training/Inference benchmarks, LLMs,
HPL-AI, or RCCL/NCCL.
- Programming experience with Windows and Linux shell scripting.
- Strong sense of teamwork and strong communication skills.
- Familiarity with Intel/AMD/NVIDIA development toolkits such as
CUDA, oneAPI, and ROCm is a plus.
- Experience with server/network hardware debugging and
troubleshooting is a plus.
- CCNA, OpenStack, OpenShift, Azure, or AWS is a plus.

Salary Range:
$140,000 - $158,000

The salary offered will depend on several
factors, including your location, level, education, training,
specific skills, years of experience, and comparison to other
employees already in this role. In addition to a comprehensive
benefits package, candidates may be eligible for other forms of
compensation, such as participation in bonus and equity award
programs.

EEO Statement:
Supermicro is an Equal Opportunity Employer
and embraces diversity in our employee population. It is the policy
of Supermicro to provide equal opportunity to all qualified
applicants and employees without regard to race, color, religion,
sex, sexual orientation, gender identity, national origin, age,
disability, protected veteran status or special disabled veteran,
marital status, pregnancy, genetic information, or any other
legally protected status.