Lead Site Reliability Engineer

Company: Intellum, Inc.
Location: Atlanta
Posted on: February 16, 2026

Job Description:

About us

Intellum is the leader in corporate education technology and powers the largest, most successful customer, partner, and employee learning programs in the world. Large brands and fast-moving companies like Google, Meta, Amazon, Walmart, Xero, Atlassian, Mailchimp, Airbnb, Stripe, and TikTok rely on Intellum to engage and educate the audiences they touch.

We have always been a "remote first" company and are proud to have team members located all over the world. We value Curiosity, Creativity, Perseverance, and Kindness and strive to demonstrate these core values every day. Our culture is very important to us. We invest in our people in fun and exciting ways, including personal development budgets and an annual all-company retreat that is focused less on work and more on human connections. We are in growth mode, and our "smart growth" approach ensures that we will continue to scale our company effectively.

Summary

We are seeking a Lead Site Reliability Engineer to spearhead our SRE team. You are not just an operator; you are an experienced software engineer who excels at architecture, code optimization, and deep troubleshooting. In this role, you will drive operational maturity by defining our reliability standards (SLOs), hardening our security posture (WAF/InfraSec), and scaling the Intellum platform.

Our stack

Core: Applications written in Ruby on Rails and Node.js, PostgreSQL, MongoDB, Redis, Memcached, Sidekiq, ActiveJob, Elasticsearch, WebSockets
Infrastructure: 100% Linux-based cloud infrastructure (AWS, Google Cloud, MongoDB Atlas) and services (ECS/EC2/Kubernetes, ElastiCache, Memorystore, RDS, Cloud SQL, BigQuery, etc.)
Infrastructure as Code (IaC): GitHub, Terragrunt, Terraform, Ansible
CI/CD: Spinnaker, Jenkins
Observability & Alerting: New Relic, AWS CloudWatch, Google Cloud Stackdriver, Squadcast
Agile/Scrum practices utilizing Jira

Responsibilities

SRE Leadership & Strategy: Set clear goals for the SRE team and partner with Engineering leadership to align platform initiatives with business objectives.
Reliability & Observability (SLA/SLO): Lead the definition and enforcement of SLAs, SLIs, and SLOs. Architect observability frameworks to translate telemetry data into actionable roadmaps that reduce toil and enhance resilience.
Core Engineering & Performance: Take ownership of critical code components (e.g., Queues, Enrollments) and lead efforts to identify bottlenecks, optimize performance, and improve code quality across the engineering department.
Security by Design: Champion infrastructure security. Partner with InfoSec to define hardening standards, manage perimeter defense (WAF/DDoS), and automate vulnerability remediation within the CI/CD pipeline.
Incident Command: Participate in the 24x7 on-call rotation and lead post-incident reviews (RCAs), ensuring action items are implemented to improve MTTR and prevent recurrence.
Mentorship: Empower developers with better tooling and guidance on performant coding practices, fostering a culture of collaboration, reliability, and "you build it, you run it".

Required Skills

Experience & Engineering

10 years of engineering experience, with 5 years specifically developing Ruby on Rails applications.
Expertise in Cloud Computing (AWS/GCP) and Infrastructure as Code (Terraform/Ansible).
Strong proficiency with SQL databases (PostgreSQL) and the ability to quickly navigate and optimize complex, unfamiliar codebases.

SRE & Operations

Deep Observability: Proven experience designing monitoring solutions (Datadog, New Relic, Prometheus) based on the "Golden Signals".
SLO Governance: Demonstrated ability to define SLIs/SLOs from scratch, negotiate Error Budgets, and use data to balance feature velocity with reliability.
Security Focus: Experience securing cloud environments and container platforms (Kubernetes), including hands-on management of WAF rules and edge security.
Incident Management: Experience leading post-incident reviews (RCAs) and implementing action items that directly improve MTTR (Mean Time to Recovery) and MTTD (Mean Time to Detection).

Leadership

Proven experience leading technical teams, mentoring engineers, and working in a team-oriented, collaborative environment with strong communication skills.
Documentation & Training: Skilled in documenting solutions and training operational teams on how to effectively support and maintain systems.
Proactive Problem-Solving: Demonstrated ability to communicate clearly, seek help proactively, and take ownership of tasks, leading them to completion.

Bonus Skills

Automation Tools: Experience in developing solutions using server automation tools such as Terraform and Ansible.
CI/CD Expertise: Experience in writing and maintaining CI/CD pipelines and services.
Kubernetes: Experience in building, deploying, and optimizing Kubernetes-based infrastructure.
Perimeter Defense: Experience configuring and managing Web Application Firewalls (WAF) (e.g., Cloudflare, AWS WAF, Akamai) and DDoS protection mechanisms.

Education

Bachelor's degree in Computer Science or related technical field

Benefits

Medical - 100% of employee premiums for selected individual plans
Dental - 100% of employee premiums covered
Vision - 100% of employee premiums covered
LinkedIn Learning
401(k) plus matching (US Based Only)
Unlimited PTO
Calm subscription
Annual Company Retreat

Intellum is an equal-opportunity employer.
We're committed to building an inclusive team that celebrates diversity in people, perspectives, and backgrounds regardless of race, color, national origin, gender, sexual orientation, age, religion, disability, citizenship, veteran status, or any other protected status. We encourage you to apply for an open position. If you have questions about whether your job experience and skill set meet the requirements for a specific role, reach out to us directly at careers@intellum.com. If you are applying from CA, NY, CO, CT, MD, NV, or RI, please reach out to careers@intellum.com to inquire about specific pay ranges.

Keywords: Intellum, Inc., Alpharetta, Lead Site Reliability Engineer, IT / Software / Systems, Atlanta, Georgia
