Site Reliability Engineer
About Jefferson Lab
Join a community with a common purpose of solving the most challenging scientific and engineering problems of our time. The Jefferson Lab campus is located in southeastern Virginia amidst a vibrant and growing technology community. A career at Jefferson Lab is more than a job. You will be part of “big science” and work alongside top scientists and engineers from around the world unlocking the secrets of our visible universe. Managed by Jefferson Science Associates, LLC, Thomas Jefferson National Accelerator Facility is entering an exciting period of mission growth and is seeking new team members ready to apply their skills and passion to have an impact. You could call it work, or you could call it a mission. We call it a challenge. We do things that will change the world.
In this job you will:
- Work closely with the rest of the architecture team to review and influence technology choices and to establish reliability and resilience parameters (e.g., meeting expected availability, failure domain isolation, disaster recovery).
- Ensure the selected software and hardware systems meet those parameters, while also meeting performance expectations and security requirements.
- Evaluate vendor and open-source solutions against established reliability and resilience parameters, develop comparative assessments, and provide technically grounded recommendations to inform architecture decisions and support acquisitions.
- Metrics & Observability: Establish the foundation for system observability by defining initial SLOs/SLIs and by architecting, prototyping, and then implementing comprehensive monitoring, logging, and alerting solutions.
- Lead the design, prototyping and implementation of these solutions including custom automation to eliminate manual operations and further improve facility resilience.
- Performance Engineering: Participate in testing and performance analysis to validate reliability and resilience design decisions and to identify bottlenecks and alternative approaches.
- Establish SRE Team Framework: Define the operational framework, on-call structures, incident response, other operational processes, and staffing plans for the future SRE team, bridging the design-to-operations transition.
Experience
- Required: 10 or more years in SRE (Site Reliability Engineering), DevOps, or Systems Engineering roles
Education
- Required: Bachelor's Degree in Computer Science or related field
- Preferred: Master's Degree in Computer Science or related field
Experience and Education Exchange
- High: Deep experience and understanding of distributed systems principles, failure modes, consensus protocols and self-healing architectures.
- High: Expertise in defining and implementing SLOs, SLIs, and comprehensive monitoring stacks, and experience architecting observability frameworks in greenfield environments (e.g., Prometheus, ELK, OpenTelemetry).
- High: Strong scripting and automation skills (Go, Python, Shell).
- Medium: Deep experience with public cloud environments (AWS, Azure, GCP) and container orchestration (Kubernetes).
- Medium: Experience with configuration management and IaC tools (e.g., Terraform, Puppet, Ansible).
- Medium: Experience with IPv4 and IPv6 networking, high-speed interconnects and data transfer protocols, familiarity with network reliability patterns and software-defined networking (pref)
- Low: Experience with HPC infrastructure and environments (pref)
- Low: Experience leading or mentoring small teams (pref)
- Medical, Dental, and Vision Care Plans
- Flexible Spending Accounts
- Paid Time-off and Leave Programs (Paid Parental, vacation, holidays, and sick leave)
- 401(k) Plan – 9% Lab Contribution; 100% vested
- Flexible Work Arrangements
- Tuition Assistance, Training and Professional Development Programs
- Live near the waterways of the Chesapeake Bay region with access to nearby beaches,