[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Attain Finance is a leading consumer credit lender with over 50 years of expertise in providing credit solutions across the U.S. They are seeking a Senior Site Reliability Engineer to enhance the reliability and operational excellence of their software delivery systems. The role involves hands-on work across various technologies to ensure the stability and efficiency of their applications in production.
Responsibilities
- Build and operate the delivery platform. Work across AWS, EKS, ArgoCD, Helm, GitHub Actions, Azure DevOps, Terraform, and Python
- Fix the problems you own. Find root cause across the AWS and Kubernetes stack, fix it, and harden it so it stays fixed
- Respond to incidents. Help stabilize during outages, drive root-cause analysis, and ship corrective actions for your systems
- Standardize how we build and ship. Define reproducible container builds and GitOps paths on ArgoCD and Helm that replace manual deployment
- Help consolidate the CI estate. Standardize pipelines across GitHub Actions and Azure DevOps for your services: remove brittle steps and silent failures, and improve visibility
- Support platform adoption. Build golden-path templates and tooling and help teams move services onto the platform
- Use progressive delivery. Canary and blue-green deploys (Argo Rollouts) and automated rollback for the services you operate
- Build observability in. Wire golden-signal metrics, logs, and traces (Prometheus/Mimir, Loki, Tempo, OpenTelemetry) into your services, surfaced in Grafana with SLOs for your domain
- Operate production systems. Troubleshoot failed deployments, respond to alerts, and improve behavior from real incidents
- Help meet SLOs and carry on-call. Track reliability metrics for the services you operate and share the rotation
- Build across environments. Design dev, test, and prod for safe promotion, recovery from failed deployments, and zero-downtime upgrades
- Help set the standard. Build reference implementations for build, deploy, GitOps, promotion gates, and observability
- Uphold compliance with the pipeline. Support deployment traceability, approval trails, and segregation of duties for PCI DSS, SOC 2, SOX, and GLBA
- Cut toil and cost. Automate repetitive ops work and help tune EKS compute, CI runners, and observability cardinality
- Unblock across teams. Get hands-on with Cloud, Security, Application Engineering, Data, and Product to keep delivery moving
- Kill knowledge silos. Write docs, runbooks, and incident learnings so engineers can operate independently
Skills
- Kubernetes, ArgoCD, Helm, Terraform, Python. Deep hands-on production experience
- Hands-on AWS. Operate and debug EKS, ECS, EC2, ECR, IAM/IRSA, VPC networking, ALB/NLB, CloudWatch, Secrets Manager, and KMS
- GitHub Actions and/or Azure DevOps. Build and operate CI/CD at scale
- Grafana and the observability stack. Hands-on with Grafana dashboards and alerting, and the metrics, logs, and traces stack (Prometheus/Mimir, Loki, Tempo, OpenTelemetry)
- Strong scripting. Python and Bash, with the ability to grow into systems-level coding
- Production troubleshooting. Comfortable getting into a system under load, finding root cause, and fixing it
- Production ownership. Uptime and reliability accountability
- Incident response. You respond and help drive postmortems that yield real improvements
- Standards contribution. You contribute to engineering standards and best practices
- Compliance awareness. Experience in regulated or high-rigor environments, or implementing audit and access controls in pipelines
- Mentorship. Through code review, examples, and pairing
- 5+ years in site reliability, platform, DevOps, or software engineering, with production ownership of systems or pipelines
- Advanced GitOps. ArgoCD (or Flux), reusable Helm patterns, Argo Rollouts
- CI consolidation or migration. Moving between CI systems, such as Azure DevOps to GitHub Actions
- Self-hosted observability at scale. Running Grafana, Mimir, Loki, and Tempo in production
- Supply chain security. SBOMs, artifact signing (Sigstore/cosign), SLSA provenance
- Platform migrations. Contributing to modernization with minimal disruption
- .NET / C#. Enough to containerize and reason about application workloads
- Low-level Kubernetes. Cilium/eBPF, Karpenter, or self-hosted networking and autoscaling
- Resilience testing. Chaos/failure injection or disaster recovery drills
- AI-assisted tooling. Responsible use with output validation
- Certification. AWS Solutions Architect, AWS DevOps Engineer, or CKA/CKAD
- Degree in computer science or equivalent practical experience
Benefits
- Flexible Paid Time Off Program
- Medical
- Dental
- Vision
- Life Insurance
- Disability
- Other voluntary coverages
- 401(k) program with a company match, starting on the first of the month following 30 days of employment
Company Overview