Unfortunately, this job posting is expired.
Don't worry, we can still help! Below, please find related information to help you with your job search.
Related keywords
- Site Reliability Engineer
- Senior Site Reliability Engineer
- Lead Site Reliability Engineer
- Remote Site Reliability Engineer
- Principal Site Reliability Engineer
- Senior Associate Site Reliability Engineer
- Crypto Site Reliability Engineer
- Site Reliability Engineer Internship
- Cloud Site Reliability Engineer
- Site Reliability Operations Engineer
Senior Site Reliability Engineer
Company | NVIDIA |
Address | California, United States |
Employment type | Full-time |
Salary | |
Category | Computer Hardware Manufacturing, Software Development, Computers and Electronics Manufacturing |
Expires | 2023-08-25 |
Posted at | 9 months ago |
NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s motivated by outstanding technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. NVIDIA is at the forefront of generative AI models, from language to images. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work.
- Architect, design, and code using your expertise to optimize, deploy and productize services.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for AI problems at hand.
- Lead significant production improvement around tooling, automation, and process.
- Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
- Support and work on groundbreaking Generative AI inferencing workloads running in a globally-distributed heterogeneous environment spanning 60+ edge locations plus all major cloud service providers. Ensure the best possible performance and availability on current and next-generation GPU architectures.
- Monitor and support critical, high-performance, large-scale services running across multiple clouds.
- Participate in the triage & resolution of sophisticated infra-related issues.
- Practice balanced incident response and blameless postmortems.
- Be part of an on-call rotation to support production systems.
- Excellent understanding of cloud environments and technologies, especially AWS, Azure, GCP, or OCI.
- Technical leadership beyond development that includes scoping, requirements capturing, leading and influencing multiple teams of engineers on broad development initiatives.
- Solid understanding of containerization, microservices architecture, and Kubernetes (K8s), including the Kubernetes ecosystem and its best practices.
- Proven strengths in identifying, mitigating, and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.
- 8+ years of experience operating & owning end-to-end availability and performance of mission-critical services in a live-site production environment, either as an SRE or Service Owner.
- Strong understanding of SLOs/SLIs, error budgets, and KPIs, and of configuring them for highly sophisticated services.
- Lead significant production activities, including change management, post-mortem reviews, workflow processes, software design, and delivering software automation in various languages (Python or Go) and technologies (CI/CD, auto-remediation, alert correlation).
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
- Ability to dissect complex problems into simple sub-problems and use available solutions to resolve them.
- Experience with the ELK and Prometheus stacks as a power user and administrator.
- Excellent communication, presentation, social, and analytical skills; the ability to communicate complex concepts and information clearly and persuasively across different audiences and varying levels of the organization.
- Understanding of observability instrumentation techniques and best practices, including OpenTelemetry.
- Experience with StackStorm and similar automation platforms is a bonus.
- Exposure to containerization and cloud-based deployments for AI models.
- Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton.
- Prior experience driving production issues and helping with on-call support.
- Understanding of Deep Learning / Machine Learning / AI.
- Excellent coding skills in Python, Go, or a similar language.