Site Reliability Engineer
Tax Analysts is seeking a Site Reliability Engineer (SRE) to help establish and shape our reliability engineering practice from the ground up. This is a unique opportunity to join a mission-driven organization and play a key role in ensuring the reliability, scalability, and performance of our AWS-hosted business applications.
As part of a cross-functional engineering team, you will work to improve observability, automate operational processes, and lead incident response and continuous improvement efforts. This role is ideal for a mid-level engineer with cloud and software engineering experience who is eager to deepen their expertise in site reliability engineering, learn from senior staff, and help build a culture of reliability.
ESSENTIAL DUTIES AND RESPONSIBILITIES:
- Help define and implement service-level indicators (SLIs) and service-level objectives (SLOs) for cloud-based applications.
- Build, configure, and maintain monitoring, alerting, and dashboarding solutions using AWS CloudWatch, X-Ray, and third-party tools such as DataDome.
- Leverage advanced AWS observability tools (e.g., CloudWatch Synthetics, Contributor Insights) to proactively monitor system health.
- Contribute to the development and implementation of a structured on-call support process as our reliability practice evolves.
- Implement, monitor, and maintain site protection and bot mitigation solutions, including DataDome, to defend against automated attacks and ensure application availability, and analyze their performance during incident postmortems.
- Investigate and resolve incidents, security events, and operational anomalies; perform root cause analysis; and run the postmortem process.
- Identify repetitive or manual operational tasks (‘toil’) and design scripts or automations using AWS Lambda and CloudFormation to improve efficiency and reliability.
- Assist in the maintenance and enhancement of CI/CD pipelines and automated deployment processes.
- Work closely with development, QA, cloud, and DevOps teams to ensure reliability, scalability, and security are integrated into system and application designs.
- Contribute to the documentation of systems, processes, incident learnings, compliance, and reliability best practices.
- Stay current with emerging AWS, SRE, and observability technologies, and make recommendations to adopt new tools or approaches that improve system resilience and operational excellence.
- Participate in the evaluation and rollout of new AWS services and features that can benefit system reliability or team efficiency.
- Perform other related duties as assigned to support the team and organizational objectives.
KNOWLEDGE & SKILLS:
- Strong analytical, troubleshooting, and problem-solving abilities.
- Hands-on experience with AWS CloudWatch (metrics, logs, dashboards, alarms) for proactive monitoring and alerting.
- Familiarity with AWS X-Ray for distributed tracing and in-depth troubleshooting of microservices architectures.
- Experience leveraging tools like CloudWatch Synthetics and Contributor Insights for canary testing and log analytics.
- Knowledge of AWS CloudTrail for auditing and investigating API calls and security events.
- Experience using AWS Athena for ad-hoc querying and analysis of logs during incident investigations and postmortems.
- Proficiency with AWS CloudFormation for reliable and repeatable infrastructure provisioning.
- Experience automating operational tasks and workflows using AWS Lambda or similar event-driven services.
- Understanding of AWS services such as API Gateway, CloudFront, and Elastic Load Balancer (ELB) to ensure availability, scalability, and optimal performance of distributed systems.
- Experience working with site protection and bot mitigation solutions (such as DataDome or Cloudflare).
- Working knowledge of scripting or programming languages such as Python, Bash, or Node.js for automation and tooling.
- Excellent communication and documentation skills; ability to collaborate effectively with cross-functional teams.
- Eagerness to learn and adopt new tools, technologies, and best practices in cloud reliability and operations.
EDUCATION & EXPERIENCE:
- Bachelor’s degree in computer science, engineering, or a related field; equivalent professional experience considered.
- 3+ years of professional experience in cloud engineering, DevOps, infrastructure, or observability roles (AWS required).
- Experience implementing SRE principles (prior work in an SRE role is a plus).
- Experience with monitoring, incident response, or reliability work in a production environment.
- Experience working in an Agile development environment, collaborating within cross-functional teams.
- Eagerness to help establish and improve site reliability practices while learning and applying best practices.
BENEFITS:
- Health/Dental/Vision
- 401K: Immediately vested
- Tuition assistance
- Qualified employer under the Public Service Loan Forgiveness (PSLF) program
- Generous Paid Time Off
- Dog-friendly office
- Private gym onsite
- Medical, Dental, Vision Insurance
- Health Savings Account (HSA)
- Flexible Spending Account (FSA)
- Employee Assistance Program (EAP)
- Life and AD&D Insurance
- Disability Insurance
- Pet Insurance
- Tuition Assistance
- Trade Publication/News Subscription Reimbursement
- Exercise Room
- Paid Holidays
- Vacation and Sick Leave
- Parental Leave
Tax Analysts is an Equal Employment Opportunity Employer.