We’re looking for a Staff Cloud Infrastructure Engineer to design, build, and operate scalable, secure, and highly available cloud platforms on AWS. You’ll own Infrastructure-as-Code (Terraform), container orchestration (EKS), GitHub-based workflows (Actions + source control), GitOps (Argo/Argo CD), and end-to-end CI/CD strategies with strong observability.
What You’ll Do
- Architect and deliver AWS infrastructure using Terraform (modular, reusable, versioned IaC) across multiple accounts and environments
- Design, deploy, and harden EKS clusters (networking, autoscaling, security, ingress, and optionally a service mesh) and standardize app deployment patterns
- Implement GitOps with Argo/Argo CD (ApplicationSets, sync policies, RBAC, SSO, secrets management, progressive delivery)
- Build CI/CD pipelines with GitHub Actions (reusable workflows, artifact strategy, rollouts/rollbacks, automated quality gates)
- Define SLIs and SLOs, and enable observability through metrics, logs, traces, alerting, and runbooks
- Drive platform reliability and security: patching, upgrades, backups, disaster recovery, IAM least privilege, secrets, and compliance guardrails
- Optimize performance and cost: autoscaling, right-sizing, spot/on-demand strategies, efficiency dashboards
- Partner with developers to create golden paths, templates, and documentation that accelerate safe delivery
- Participate in on-call, lead incident RCAs, and drive improvements through code and process changes
Required Experience
- 5–8+ years building and operating production cloud infrastructure (preferably AWS)
- Strong Terraform expertise: modules, workspaces, state management, CI validation (fmt/validate/tflint), policy as code (Sentinel/OPA)
- GitHub & GitHub Actions: branch protection, environments, reusable workflows, OIDC to AWS, secrets/variables, caching, matrices
- AWS services: VPC, IAM, EKS, ALB/NLB, EC2/EKS node groups, ECR, RDS/ElastiCache, S3, CloudWatch/CloudTrail, KMS, Secrets Manager
- EKS & container orchestration: Helm/Kustomize, rollout strategies, Cluster Autoscaler, networking (CNI/ingress), HPA/PDBs
- Argo & Argo CD: Applications & ApplicationSets, sync/health checks, drift detection, RBAC, SSO
- Observability tools: metrics/logs/traces (Prometheus/Grafana, OpenTelemetry, Datadog, New Relic, Honeycomb, CloudWatch, ELK/OpenSearch)
- CI/CD strategies: trunk-based, quality gates (tests, SAST/DAST, IaC scans), artifact/versioning, environment promotions, canary/blue-green
- Strong scripting (Bash/Python), configuration tooling (Helm/Kustomize), solid Linux and networking fundamentals
- Clear communicator with a bias for automation, documentation, and collaborative problem-solving
Nice to Have
- Service mesh (Istio/Linkerd), progressive delivery (Argo Rollouts/Flagger)
- Security tooling (Snyk, Trivy, AWS Security Hub, IAM Access Analyzer)
- Data plane experience (Kafka/MSK, Redis/ElastiCache, RDS/Aurora operations)
- Cost management/FinOps exposure for Kubernetes and AWS