System Reliability Operations Engineer
Company | Disney |
Address | Lake Buena Vista |
Expires | 2023-10-09 |
Posted at | 8 months ago |
Job Summary:
Within Disney Enterprise Technology, the Disney Technology Operations Command Center (DTOC) is a 24x7x365 critical services operation center responsible for service availability, with a primary focus on rapidly responding to, correlating, and reducing the impact of outages. We are accountable for identifying and facilitating the resolution of service-impacting events, and for collaborating with other technology teams to prevent future impact through proactive event management and incident and problem analysis. DTOC drives the execution of the major incident process, including communication to executives and key partners, and owns and implements Crisis Management plans and processes. DTOC also provides ongoing first- and second-level technical support for requests, performs validation procedures for routine system/service checks, and carries out proactive monitoring of significant business events.
System Reliability Operations (SRO) Engineers ensure all processes and functions within our environment operate correctly and efficiently – monitoring, identifying, and coordinating with other technologists across segments to fine-tune system operations and resolve service interruptions. This role is responsible for the end-to-end reliability and operations of IT services and for providing consultation and training to other clients and segments across Disney. SROs consistently and reliably triage reported or automated incidents, apply recovery procedures, and engage domain experts to restore steady-state operations. Additionally, this position will drive service improvement initiatives through proactive monitoring and through enhancement actions that address gaps identified by analytics and problem management.
Responsibilities:
- Proactively identify, diagnose, fix, and resolve infrastructure, application, and IT operations issues in collaboration with other IT support teams
- Implement and maintain technology observability and alerting solutions to provide real-time insights into system health, performance, and compliance
- Effectively apply Problem & Incident Analysis techniques during an incident and post-incident
- Ensure that all DTOC services are designed to deliver the levels of availability required by the business
- Develop, implement, and maintain automation tools and scripts to improve the efficiency and reliability of IT operations and infrastructure
- Perform DR/BCP activities for critical events and emergency onsite response
- Identify and drive service availability improvement opportunities by promoting leading practices
- Supervise the performance and availability of enterprise applications, systems, and infrastructure, ensuring they meet or exceed established service level objectives (SLOs)
- Identify service improvement opportunities through trend analysis, proactive techniques, and after-action reviews
- Address outages in a timely fashion, ensuring work streams progress toward resolution according to department procedures while communicating business impacts
- Analyze and publish operational utilization and service performance metrics
Required:
- 2+ years of incident recovery experience, including demonstrated experience with Service and Event Management tools
- Demonstrated experience in systems integration, application infrastructure support, and middleware operations
- Experience in enterprise IT operations including system administration, application platforms, infrastructure, networking fundamentals, and IT service management
- Experience working in a 24x7 IT operations environment
- Experience with hands-on support of cloud operations (AWS, Google Cloud, Azure)
- BA/BS in Computer Science, Engineering or related field; or equivalent work experience
- Solid understanding of observability, monitoring, and alerting tools (e.g., Splunk, New Relic, Grafana, ELK Stack, Datadog)
- 2+ years of experience supporting converged infrastructure stacks including application, compute, storage, and networking
- Experience with network technologies (WAN/LAN, wireless infrastructure, DNS/DHCP, load balancers, accelerators)
- Proficiency in one or more scripting/automation languages (e.g., Python, PowerShell, Bash, Ruby)
- Experience with x86 hardware technology, Windows, Linux, RISC operating systems, P-Series hardware, SAN, NAS, and data protection technologies
- Strong technology problem-solving and analytical skills, with the ability to quickly diagnose and resolve technical issues
Preferred:
- Master’s degree in a technical field
- Certification(s) in Kepner-Tregoe, ITIL Foundations (V3), operating systems, visualization, and/or hardware platforms