Site Reliability Engineer
As a Site Reliability Engineer (SRE), you will be an integral part of the team at LightEdge Solutions. This position reports to the DevOps Manager and is responsible for the reliable operation of the organization’s systems and services. You will play a key role in defining our monitoring strategy and vision across multiple products, and you will work with a variety of teams to improve the accuracy of our monitoring systems.
Responsibilities
- Monitoring and Observability: Design and implement monitoring solutions to track the performance, availability, and health of various systems and services. Establish robust monitoring frameworks, set up alerts, and analyze system metrics to identify and resolve issues proactively.
- Establish and align metrics, including SLAs, SLOs, and SLIs, to closely tie system performance to business objectives, ensuring that the site reliability engineering efforts support the overall goals and customer satisfaction.
- Use AIOps techniques to automate Incident Management and Response. Develop and maintain automated incident response systems that can detect and mitigate issues automatically. This includes automated incident triaging, remediation, and escalation workflows to minimize manual intervention and improve response times.
- Leverage the IT service management platform’s capabilities to integrate monitoring into incident management, change management, and other operational processes, enhancing the efficiency and effectiveness of site reliability engineering practices.
- Work closely with IT functional owners and SMEs.
- Perform complex systems design, proof of concept, implementation and integration functions.
- Develop detailed designs, and execute and troubleshoot strategic solutions, in support of effective monitoring, alerting, escalation, automation, reporting, and event correlation.
Education and Experience
- 5 years of hands-on experience with enterprise monitoring solutions
- Must possess knowledge of network switches, server hardware, storage, and virtualization technologies
- Understanding of VMware Infrastructure
- Experience working with a variety of monitoring systems such as Zabbix, vRealize Operations Manager, Nagios, and ScienceLogic
- Experience and proficiency in integrating with ServiceNow or similar IT service management platforms.
- Experience with managing automations within a monitoring environment.
- Ability to provide guidance with design, maintenance, and improvements to enterprise level monitoring solutions.
- Excellent verbal and written communication skills, with the ability to present complex ideas and designs to both technical and non-technical stakeholders.
- Experience with design, implementation, and support of monitoring tools in a complex, multi-platform environment.
- Strong understanding of monitoring requirements for storage, network, and compute servers.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.