Site Reliability Engineer Jobs

Company

Lawrence Berkeley National Laboratory

Address: San Francisco Bay Area, CA
Employment type: Full-time
Salary: $9,739 - $11,905 a month
Expires: 2023-07-11
Posted: 11 months ago
Job Description

Lawrence Berkeley National Lab’s (LBNL) NERSC (National Energy Research Scientific Computing Center) Division has an opening for a Site Reliability Engineer to join the team.
In this exciting role, you will provide a variety of engineering support services 24x7 for the primary scientific computational facility for the Office of Science in the US Department of Energy (DOE). You will work for the Operations Team (Ops) and ensure that NERSC is accessible, reliable, secure and available to our scientific users.


What You Will Do:

  • Solve problems relating to mission-critical services and create automation to prevent their recurrence, with the goal of automating the response to all routine service conditions.
  • Conduct periodic on-call duties as necessary to support a 24x7 workflow.
  • Provide accurate information in the trouble ticketing system for outages, maintenance, and other incidents so that others can monitor the workflow and protocols properly.
  • Under the guidelines of the group’s project manager, assist with developing and maintaining diagnostic tools used to support the HPC community within NERSC, using programming languages such as C, C++, Python, Java, or Perl. Must have knowledge of standard software development practices.
  • This position supports a 24x7 operation. While this position is for a daytime (8am - 4pm) or swing (4pm - 12am) schedule, off-hours work may be required in unique/emergency situations.
  • Using knowledge of the Facility Operations processes, provide input in the design of software tools, workflows and new procedures that continuously enhance the diagnostic capabilities of the group to ensure the high availability of the HPC services provided by NERSC.
  • Work closely with other NERSC groups to manage maintenance, perform tasks such as upgrades, terminate batch queues, manage diagnostic and notification software, or more generally manage a center-wide outage.
  • Using the guidelines of the book Site Reliability Engineering (O’Reilly, ISBN 978-1-491-92912-4), practice the SRE philosophy in software development and system operations.
  • Assist in the testing and implementation of new diagnostic tools, workflows and new capabilities for providing high availability for the systems in production. Write the documentation necessary for these new tools and train staff in their use.
  • Using the skillset of a junior Linux system administrator (as identified in the following publication), and their working knowledge of the systems the Operations Technology Group has responsibility for, monitor and manage the reliability of the NERSC facility to enable continuous scientific progress of the users in three areas: computation, data storage and the data center environment.

What is Required:

  • Motivated self-starter who can learn emerging technologies that improve data center management, in areas such as Jupyter, Kibana, Functions as a Service, Kubernetes, building management software, evaporative cooling, and power utilization.
  • Knowledge of the processes for standard operating procedures, and best practices for implementation and change management.
  • Strong hands-on knowledge of the Linux shell and working in a command-line (e.g. SSH) environment.
  • Past experience with Incident Management and a good understanding of IT service management.
  • Bachelor’s degree in Computer Science or a similar discipline, or equivalent years of experience.
  • Networking: experience with network theory such as TCP/IP, UDP, ICMP (networking protocols in general), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
  • Strong understanding of monitoring implementations and administration.
  • Demonstrated ability to deliver results on time with high quality.
  • Excellent problem-solving skills with the ability to work on problems of diverse scope. Must be able to think independently, work collaboratively, and contribute to an active intellectual environment. Must show good judgment and the ability to schedule and lead a small group of people and/or projects. Systematic problem-solving approach, coupled with a strong sense of ownership and drive.
  • Strong communication skills and ability to work effectively across multiple business and technical teams.
  • Experience developing tools in programming languages such as C, C++, Perl, Java, and Python, or in a scripting language, with knowledge of standard software development practices.
  • Exposure to Oracle or other high-end storage infrastructure.
  • Minimum of three years of experience with UNIX or Linux, networking, and IT infrastructure environments, and management experience in a distributed-computing environment.
  • Background configuring distributed, server-based, or cluster-based infrastructure supporting a high volume of transactions in a Linux environment. An understanding of VMs and containers and how to manage them, and an understanding of IoT technologies.
  • Minimum of 5 years of related experience, including 3 years as a system administrator or systems engineer in a high-volume, customer-facing environment supporting data clusters, managing the replacement of hardware, and ensuring its continuous availability to the user community. This can include assisting in the deployment of new nodes and internal switches into production, resolving ticket incidents, and working with vendors on hardware warranty replacements. Hands-on experience in a Linux/UNIX environment, or knowledge of or significant exposure to Red Hat Enterprise Linux or a Linux variant.
  • Knowledge of and ability to work on large data communications networks and IT infrastructure supporting highly available systems and applications.

Desired Qualifications:

  • Be able to provide input toward creating new standards and methods for managing large-scale distributed systems.
  • Experience with network security: configuring and maintaining ACLs, and knowledge of firewalls.
  • Experience working in a 24/7 onsite team managing large data centers or other large installations. Working off shift is a lifestyle change that should be considered by the candidate.
  • A certification in a system administration area.
  • Network programming or a network certification.

Want to learn more about Berkeley Lab's Culture, Benefits and answers to FAQs? Please visit: https://recruiting.lbl.gov/


Notes:

  • This position may be subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
  • This is a full-time, career appointment, exempt (monthly paid) from overtime pay.
  • Work may be performed in a hybrid work mode. The primary location for this role is Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA. Work must be performed within the United States.
  • The full salary range for this position is $8,658 to $14,610 per month. The position is expected to pay within a targeted range of $9,739 to $11,905 per month, depending on the candidate's full skills, knowledge, and abilities, including education, certifications, and years of experience.

Based on University of California Policy - SARS-CoV-2 (COVID-19) Vaccination Program and U.S. Federal Government requirements, Berkeley Lab requires that all members of our community obtain the COVID-19 vaccine as soon as they are eligible. As a condition of employment at Berkeley Lab, all Covered Individuals must Participate in the COVID-19 Vaccination Program by providing proof that vaccination requirements have been met or submitting a request for Exception or Deferral. Visit covid.lbl.gov for more information.


Berkeley Lab is committed to Inclusion, Diversity, Equity and Accountability (IDEA) and strives to continue building community with these shared values and commitments. Berkeley Lab is an Equal Opportunity and Affirmative Action Employer. We heartily welcome applications from women, minorities, veterans, and all who would contribute to the Lab's mission of leading scientific discovery, inclusion, and professionalism. In support of our diverse global community, all qualified applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, or protected veteran status.


Equal Opportunity and IDEA Information Links:

Know your rights: see the supplement Equal Employment Opportunity is the Law and the Pay Transparency Nondiscrimination Provision under 41 CFR 60-1.4.