Lead Site Reliability Engineer
Company: Intellum, Inc.
Location: Atlanta
Posted on: February 16, 2026
Job Description:
About us
Intellum is the leader
in corporate education technology and powers the largest, most
successful customer, partner, and employee learning programs in the
world. Large brands and fast-moving companies like Google, Meta,
Amazon, Walmart, Xero, Atlassian, Mailchimp, Airbnb, Stripe, and
TikTok rely on Intellum to engage and educate the audiences they
touch. We have always been a "remote first" company and are proud
to have team members located all over the world. We value
Curiosity, Creativity, Perseverance, and Kindness and strive to
demonstrate these core values every day. Our culture is very
important to us. We invest in our people in fun and exciting ways,
including personal development budgets and an annual all-company
retreat that is focused less on work and more on human connections.
We are in growth mode, and our "smart growth" approach ensures that
we will continue to scale our company effectively.

Summary
We are
seeking a Lead Site Reliability Engineer to spearhead our SRE team.
You are not just an operator; you are an experienced software
engineer who excels at architecture, code optimization, and deep
troubleshooting. In this role, you will drive operational maturity
by defining our reliability standards (SLOs), hardening our
security posture (WAF/InfraSec), and scaling the Intellum platform.

Our stack
Core: Applications written in Ruby on Rails and Node.js, PostgreSQL,
MongoDB, Redis, Memcached, Sidekiq, ActiveJob, Elasticsearch,
WebSockets
Infrastructure: 100% Linux-based cloud infrastructure (AWS, Google
Cloud, MongoDB Atlas) and services (ECS/EC2/Kubernetes, ElastiCache,
Memorystore, RDS, Cloud SQL, BigQuery, etc.)
Infrastructure as Code (IaC): GitHub, Terragrunt, Terraform, Ansible
CI/CD: Spinnaker, Jenkins
Observability & Alerting: New Relic, AWS CloudWatch, Google Cloud
Stackdriver, Squadcast
Agile/Scrum practices utilizing JIRA

Responsibilities
SRE Leadership & Strategy: Set clear goals for the SRE team and partner
with Engineering leadership to align platform initiatives with
business objectives.
Reliability & Observability (SLA/SLO): Lead the definition and
enforcement of SLAs, SLIs, and SLOs. Architect observability
frameworks to translate telemetry data into actionable roadmaps that
reduce toil and enhance resilience.
Core Engineering & Performance: Take ownership of critical code
components (e.g., Queues, Enrollments) and lead efforts to identify
bottlenecks, optimize performance, and improve code quality across
the engineering department.
Security by Design: Champion infrastructure security. Partner with
InfoSec to define hardening standards, manage perimeter defense
(WAF/DDoS), and automate vulnerability remediation within the CI/CD
pipeline.
Incident Command: Participate in the 24x7 on-call rotation and lead
post-incident reviews (RCAs), ensuring action items are implemented
to improve MTTR and prevent recurrence.
Mentorship: Empower developers with better tooling and guidance on
performant coding practices, fostering a culture of collaboration,
reliability, and "you build it, you run it".

Required Skills

Experience & Engineering
10 years of engineering experience, with 5 years
specifically developing Ruby on Rails applications.
Expertise in Cloud Computing (AWS/GCP) and Infrastructure as Code
(Terraform/Ansible).
Strong proficiency with SQL databases (PostgreSQL) and the ability to
quickly navigate and optimize complex, unfamiliar codebases.

SRE & Operations
Deep Observability: Proven experience designing monitoring solutions
(Datadog, New Relic, Prometheus) based on the "Golden Signals".
SLO Governance: Demonstrated ability to define SLIs/SLOs from
scratch, negotiate Error Budgets, and use data to balance feature
velocity with reliability.
Security Focus: Experience securing cloud environments and container
platforms (Kubernetes), including hands-on management of WAF rules
and edge security.
Incident Management: Experience leading post-incident reviews (RCAs)
and implementing action items that directly improve MTTR (Mean Time
to Recovery) and MTTD (Mean Time to Detection).

Leadership
Proven experience leading technical
teams, mentoring engineers, and working in a team-oriented,
collaborative environment with strong communication skills.
Documentation & Training: Skilled in documenting solutions and
training operational teams on how to effectively support and
maintain systems.
Proactive Problem-Solving: Demonstrated ability to communicate
clearly, seek help proactively, and take ownership of tasks, leading
them to completion.

Bonus Skills
Automation Tools: Experience in developing solutions using server
automation tools such as Terraform and Ansible.
CI/CD Expertise: Experience in writing and maintaining CI/CD
pipelines and services.
Kubernetes: Experience in building, deploying, and optimizing
Kubernetes-based infrastructure.
Perimeter Defense: Experience configuring and managing Web
Application Firewalls (WAF) (e.g., Cloudflare, AWS WAF, Akamai) and
DDoS protection mechanisms.

Education
Bachelor's degree in Computer Science or related technical field

BENEFITS
Medical - 100% of employee premiums for selected individual plans
Dental - 100% of employee premiums covered
Vision - 100% of employee premiums covered
LinkedIn Learning
401(k) plus matching (US Based Only)
Unlimited PTO
Calm subscription
Annual Company Retreat

Intellum is an equal-opportunity employer. We're committed
to building an inclusive team that celebrates diversity in people,
perspectives, and backgrounds regardless of race, color, national
origin, gender, sexual orientation, age, religion, disability,
citizenship, veteran status, or any other protected status. We
encourage you to apply for an open position, and if you have
questions about whether or not your job experience and skill set
meet the requirements for a specific role, reach out to us directly
at careers@intellum.com. If you are an individual applying from CA,
NY, CO, CT, MD, NV, or RI, please reach out to careers@intellum.com
to inquire about specific pay ranges.
Keywords: Intellum, Inc., Alpharetta, Lead Site Reliability Engineer, IT / Software / Systems, Atlanta, Georgia