Cloud engineering has a credentialing problem. AWS alone has issued more than 1.3 million certifications as of 2024. The market is full of certified candidates who can describe a VPC, draw a reference architecture, and name the right services for a given pattern — and who have never operated a production system under real failure conditions, debugged a VPC routing issue causing a live outage, or optimized a workload that was running at 5x the necessary cost.
Certifications measure structured exposure. They do not measure the operational judgment that makes a cloud engineer valuable at 2 AM when an availability zone is down and the auto-scaling policy is behaving unexpectedly. Hiring cloud engineers well requires evaluation criteria that filter for operational depth, not paper credentials.
Cloud Engineer Profiles: What You Are Actually Hiring
Cloud engineering spans a wide range of specializations. Before posting the role, identify which profile you actually need:
| Profile | Primary Focus | Key Differentiator |
|---|---|---|
| **Cloud Infrastructure Engineer** | VPC design, compute provisioning, networking, security groups, IAM policy design, cost optimization | Designs and owns the foundational cloud environment |
| **Cloud Platform / DevOps Engineer** | CI/CD pipelines, container orchestration, deployment automation, developer tooling | Builds the platform that application engineers deploy onto |
| **Cloud Architect** | Multi-service solution design, cloud-native architecture patterns, migration strategy, cost modeling | Designs how multiple services interact to solve business requirements |
| **FinOps / Cloud Cost Engineer** | Cost allocation, spend visibility, rightsizing, reserved capacity management | Owns the cloud bill and drives cost-efficiency at organizational scale |
| **Cloud Security Engineer** | IAM hardening, compliance automation, security posture management, threat detection | Ensures cloud environments meet security and compliance requirements |
Cloud engineering overlaps significantly with DevOps; for the related role, see hiring DevOps engineers.
For most mid-sized engineering teams (20–200 engineers), the first cloud hire is a Cloud Infrastructure or Cloud Platform Engineer: a generalist who can own both infrastructure provisioning and deployment automation. Specialist roles (FinOps, cloud security) become necessary at scale.
Cloud Engineer Skills by Provider and Seniority
Cross-Provider Foundational Skills
Before evaluating provider-specific depth, all cloud engineers should be proficient in concepts that translate across AWS, Azure, and GCP:
- Networking: VPC/VNet design, subnetting (CIDR ranges), routing tables, security groups / NSGs / firewall rules, load balancing (L4 and L7), DNS management
- IAM: Role-based access control, least-privilege policy design, service accounts vs. human identities, cross-account access patterns
- Container orchestration: Kubernetes fundamentals (pods, deployments, services, ingress), and at least one managed Kubernetes implementation (EKS, AKS, GKE)
- Infrastructure-as-Code: Terraform is the primary requirement; provider-specific tools (CloudFormation, Azure Bicep) are supplementary
- Observability: Metrics, logs, and traces — how to instrument workloads and investigate production issues using cloud-native monitoring tools
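The subnetting item in the list above is easy to probe concretely. The following is a minimal sketch using Python's standard ipaddress module; the CIDR ranges are illustrative assumptions, not values from any real environment. It checks whether two address ranges overlap, which matters whenever two networks need to be connected, and counts the addresses in a block.

```python
import ipaddress


def ranges_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks share at least one address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


def address_count(cidr: str) -> int:
    """Total addresses in the block, before any provider-reserved addresses."""
    return ipaddress.ip_network(cidr).num_addresses


if __name__ == "__main__":
    # Hypothetical ranges chosen only for illustration.
    print(ranges_overlap("10.0.0.0/16", "10.0.128.0/20"))  # True: the /20 sits inside the /16
    print(ranges_overlap("10.0.0.0/16", "10.1.0.0/16"))    # False: adjacent, disjoint blocks
    print(address_count("10.0.0.0/24"))                    # 256
    print(address_count("10.0.0.0/20"))                    # 4096
```

A candidate who is comfortable with CIDR notation should be able to predict these results without running anything. Note that each provider reserves a few addresses per subnet, so the usable count is slightly lower than the total shown here.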
For context on the backend services these cloud engineers provision infrastructure for, see how to hire a backend developer.
AWS-Specific Depth
| Level | Expected AWS Knowledge |
|---|---|
| **Junior** | EC2, S3, RDS, IAM basics, VPC fundamentals, basic Lambda, CloudWatch logs |
| **Mid-level** | VPC design (multi-AZ, private/public subnet patterns), EKS, ALB/NLB, Auto Scaling Groups, Secrets Manager, cost monitoring, Terraform for AWS resources |
| **Senior** | Multi-account architecture (AWS Organizations, Control Tower), service control policies, Transit Gateway, ECS/EKS at scale, FinOps (Reserved Instances, Savings Plans, Spot Fleet), WAF and Shield, custom CloudWatch metrics and alarms, disaster recovery design |
Azure-Specific Depth
| Level | Expected Azure Knowledge |
|---|---|
| **Junior** | Azure VMs, Blob Storage, Azure SQL, Microsoft Entra ID (formerly Azure AD) basics, NSGs, Azure Monitor |
| **Mid-level** | VNet design, AKS, Azure DevOps or GitHub Actions with Azure, Managed Identities, Key Vault, Azure Cost Management |
| **Senior** | Management Groups and Policy, Azure Landing Zone design, Microsoft Entra ID (formerly Azure AD) B2B/B2C, Private Link and Private Endpoints, Microsoft Sentinel (formerly Azure Sentinel), multi-subscription governance |
GCP-Specific Depth
| Level | Expected GCP Knowledge |
|---|---|
| **Junior** | GCE, GCS, Cloud SQL, IAM basics, VPC fundamentals, Cloud Logging |
| **Mid-level** | GKE, Cloud Run, BigQuery basics, Cloud Build, VPC Service Controls, IAM policy design |
| **Senior** | Organization policy design, Shared VPC, BigQuery optimization, Vertex AI integration, multi-project governance, GCP cost modeling |
Interview Questions That Reveal Cloud Engineering Depth
Architecture Design
"Design the network architecture for a multi-tier web application on AWS that needs to handle 100K RPS with high availability across two regions. Walk me through the VPC design, load balancing, and database tier."
Strong answers include: a multi-AZ VPC with separate public (load balancers, NAT gateway), private (application layer), and isolated (database) subnets; ALB for L7 load balancing with appropriate health checks; Auto Scaling Groups for the application tier; RDS Multi-AZ with read replicas; CloudFront for global CDN; Route 53 with health-based routing between regions; and a discussion of the cost implications of the multi-region design.
Weak answers: candidates who design everything in the default VPC, who don't mention private subnets for the application and database tiers, or who describe services without explaining why they chose one over alternatives.
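The three-tier subnet layout described in the strong answer can be sketched as a simple allocation plan. This is an illustration under stated assumptions, not a reference design: the VPC CIDR, the availability zone names, the /20 subnet size, and the function name plan_subnets are all hypothetical. It uses only Python's standard ipaddress module and assigns one subnet per tier per availability zone.

```python
import ipaddress

# Tier names follow the public / private / isolated split described above.
TIERS = ("public", "private", "isolated")


def plan_subnets(vpc_cidr: str, azs: list[str], prefix: int = 20) -> dict:
    """Assign one /prefix subnet to every (tier, AZ) pair inside vpc_cidr."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=prefix)
    plan = {}
    for tier in TIERS:
        for az in azs:
            try:
                plan[(tier, az)] = str(next(blocks))
            except StopIteration:
                raise ValueError("VPC CIDR is too small for this layout") from None
    return plan


if __name__ == "__main__":
    # Hypothetical VPC range and AZ names, for illustration only.
    layout = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
    for (tier, az), cidr in layout.items():
        print(f"{tier:9} {az}  {cidr}")
```

In an interview, the interesting follow-up is why each tier gets its own subnets: the split lets route tables and security rules differ per tier, so the database tier has no route to the internet at all.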
Incident Response
"Your application in AWS suddenly has a 500% increase in latency. CloudWatch shows normal CPU and memory on EC2 instances. What are your first five investigation steps?"
This question tests operational thinking under pressure. Strong answers follow a systematic diagnostic path: check RDS and ElastiCache metrics (database latency is the most common web app latency source), check ALB target group health checks and connection errors, check VPC flow logs for unusual traffic patterns or connection drops, check recent deployments that might have introduced a regression, and check third-party API dependencies if the application makes external calls. Candidates who jump to a conclusion without systematic diagnosis have not operated production systems under real incident pressure.
Infrastructure-as-Code
"You have a Terraform configuration for an EKS cluster that works in development but has been drifting from the actual state in production. How do you handle this situation safely?"
Drift management in production is a real operational problem. Strong answers: run terraform plan to see the full scope of drift before taking any action; investigate why drift happened (manual changes, automation outside Terraform, provider updates); import manually created resources if they should be managed; discuss the tradeoff of running terraform apply to reconcile vs. accepting current production state and updating Terraform to match; and implement drift detection alerts going forward. A candidate who says "just run terraform apply" without assessing the impact has not managed production infrastructure.
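The "implement drift detection alerts" step can be sketched with a scheduled, read-only plan. Terraform's plan command accepts a -detailed-exitcode flag, under which exit code 0 means no changes, 1 means an error, and 2 means the plan contains changes. The wrapper below is a minimal sketch: the function and enum names are hypothetical, it assumes the terraform binary is on the PATH and the working directory is already initialized, and it does not apply anything.

```python
import subprocess
from enum import Enum


class PlanStatus(Enum):
    IN_SYNC = "in_sync"                  # exit code 0: plan succeeded, no changes
    CHANGES_PENDING = "changes_pending"  # exit code 2: plan succeeded, diff present
    ERROR = "error"                      # exit code 1 (or anything unexpected)


def classify_plan_exit_code(code: int) -> PlanStatus:
    """Map a `terraform plan -detailed-exitcode` exit code to a status."""
    if code == 0:
        return PlanStatus.IN_SYNC
    if code == 2:
        return PlanStatus.CHANGES_PENDING
    return PlanStatus.ERROR


def check_drift(workdir: str) -> PlanStatus:
    """Run a read-only plan in workdir and classify the result.

    Assumes `terraform init` has already been run in workdir.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A non-empty plan does not by itself prove drift: it can also mean the configuration has changes that were never applied. A scheduled job that alerts on CHANGES_PENDING still needs a human to read the plan output and decide which side is correct.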
Cost Optimization
"Your company's AWS bill is $200K/month. The CFO wants a 30% reduction without compromising performance or reliability. Where do you start?"
This tests FinOps thinking. Strong answers prioritize by impact: analyze the Cost Explorer breakdown by service to find the largest cost centers; look at EC2 spending for On-Demand instances that could be converted to Reserved Instances or Savings Plans (roughly a 30–40% discount for a 1-year commitment); identify oversized instances by analyzing CloudWatch CPU and memory metrics over the past 90 days; review S3 lifecycle policies and data tier usage (S3 Intelligent-Tiering for infrequently accessed data); and examine data transfer costs (often underestimated). A $200K bill with a 30% reduction target implies about $60K/month in savings, and strong candidates will estimate which category gets them there fastest.
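The estimate a strong candidate makes out loud can be written down as simple arithmetic. In the sketch below, only the $200K bill and the 30% target come from the question; the per-category spend figures and savings percentages are hypothetical numbers chosen for illustration, and the spend slices are assumed not to overlap. Integer percentages keep the arithmetic exact.

```python
MONTHLY_BILL = 200_000   # from the interview question
TARGET_PCT = 30          # from the interview question

# Hypothetical inputs: (category, monthly spend in USD, assumed savings percent).
CANDIDATES = [
    ("EC2 On-Demand moved to Savings Plans", 100_000, 35),
    ("Rightsizing oversized instances", 40_000, 40),
    ("S3 lifecycle and tiering", 25_000, 25),
    ("Data transfer", 20_000, 20),
]


def rank_savings(candidates, bill, target_pct):
    """Rank categories by estimated savings and accumulate until the target is met."""
    target = bill * target_pct // 100
    estimates = [(name, spend * pct // 100) for name, spend, pct in candidates]
    estimates.sort(key=lambda item: item[1], reverse=True)
    steps, running = [], 0
    for name, saving in estimates:
        running += saving
        steps.append((name, saving, running))
        if running >= target:
            break
    return target, steps


if __name__ == "__main__":
    target, steps = rank_savings(CANDIDATES, MONTHLY_BILL, TARGET_PCT)
    print(f"target: ${target:,}/month")
    for name, saving, running in steps:
        print(f"{name}: ${saving:,} (running total ${running:,})")
```

With these assumed inputs the commitment discount alone covers a little over half of the $60,000 target, and all four categories are needed to clear it. The numbers matter less than the habit: rank by expected savings, then check whether the total actually reaches the goal.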
Red Flags in Cloud Engineer Candidates
- Lists cloud services without using them: Ask follow-up questions about any cloud service on the resume. "You mention CloudFront — describe the last time you configured a distribution. What was the caching behavior, and did you handle HTTPS certificate management?" Candidates who listed services based on familiarity without implementation experience will lose specificity immediately.
- No incident experience: Any cloud engineer with 2+ years of production experience has been part of a significant incident. Inability to describe a production incident they participated in suggests either very limited scope of work or lack of accountability. Ask specifically: "Tell me about a production incident you were part of that was cloud infrastructure-related. What caused it and what was your role in the response?"
- Terraform used only for learning: Terraform is the standard IaC tool, and many candidates have done Terraform tutorials without applying it to production infrastructure. Distinguish by asking about their remote state configuration, how they handle Terraform state locking in a team environment, and how they've managed breaking changes between Terraform provider versions.
- No cost awareness: Cloud engineers who have never looked at their company's cloud bill are not operating as infrastructure owners — they're treating cloud as an unlimited resource. Ask: "What was the approximate monthly cloud spend at your last company, and what were the two or three largest cost categories?" Candidates who genuinely owned cloud infrastructure will have this number.
- Certification without depth: A candidate with AWS Solutions Architect Professional who can't explain when to use a Transit Gateway vs. VPC Peering, or who doesn't know the difference between a service control policy (SCP) and an IAM policy in an AWS Organizations context, has a paper credential without the operational depth it implies.
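The SCP versus IAM policy distinction in the last item comes down to one idea: an SCP sets the maximum permissions available in an account but grants nothing by itself, so an action must be permitted by both layers and explicitly denied by neither. The sketch below is a deliberately simplified model of that idea, not AWS's actual evaluation logic: it ignores resources, conditions, permission boundaries, and resource-based policies, and the function names and policy lists are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(action: str, patterns: list[str]) -> bool:
    """Case-insensitive wildcard match of an action against policy patterns."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def is_allowed(action, scp_allow, scp_deny, iam_allow, iam_deny) -> bool:
    """Simplified model: any explicit deny wins; otherwise both layers must allow."""
    if matches(action, scp_deny) or matches(action, iam_deny):
        return False
    return matches(action, scp_allow) and matches(action, iam_allow)


if __name__ == "__main__":
    # Hypothetical setup: the SCP only permits S3, the IAM policy is admin-like.
    scp_allow, scp_deny = ["s3:*"], []
    iam_allow, iam_deny = ["*"], []
    print(is_allowed("s3:GetObject", scp_allow, scp_deny, iam_allow, iam_deny))      # True
    print(is_allowed("ec2:RunInstances", scp_allow, scp_deny, iam_allow, iam_deny))  # False
```

The second result is the point a credentialed but inexperienced candidate often misses: an identity with an admin-style IAM policy still cannot launch instances when the SCP does not permit EC2.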
How to Structure the Cloud Engineering Hiring Process
Cloud engineering hiring should evaluate both design ability and operational judgment — two distinct skills that separate candidates who can architect from candidates who can operate.
- Job description specificity: Name the primary cloud provider, required IaC tool (Terraform), and whether the role is infrastructure-focused, platform-focused, or both. Include the approximate scale (how many workloads, what traffic level, what monthly cloud spend if relevant).
- Resume screen (7–10 min): Look for specific cloud services with operational context (not just listed), infrastructure-as-code mentioned with specifics (Terraform module authorship, state management), and evidence of production incident experience.
- Architecture design exercise (60 min live or take-home): A cloud architecture design question for a realistic scenario (multi-tier application, data pipeline, global CDN configuration). Evaluate: appropriate service selection, security and networking design, and cost awareness.
- Operational depth interview (45 min): An incident scenario question (latency spike, availability issue, cost spike) and an IaC-specific question. These reveal whether the candidate has operated systems under pressure.
- Technical depth questions (30 min): Provider-specific depth questions matched to the target platform and role level.
For the broader software engineering hiring context these roles fit within, see the end-to-end software engineer hiring guide.
| Stage | Primary Signal | Target Pass Rate |
|---|---|---|
| Resume screen | Service specificity, IaC evidence, incident experience | 10–15% |
| Architecture design | Service selection quality, security and cost awareness | 30–40% |
| Operational depth | Incident response reasoning, IaC production usage | 35–50% |
| Technical depth | Provider-specific knowledge matched to role requirements | 40–55% |
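Multiplying the stage pass rates in the table shows how narrow the overall funnel is. The sketch below is plain arithmetic on the table's ranges; the function names are hypothetical, and it assumes each stage's pass rate applies to the candidates who cleared the previous stage.

```python
import math

# (stage, low pass rate, high pass rate), taken from the table above.
STAGES = [
    ("Resume screen", 0.10, 0.15),
    ("Architecture design", 0.30, 0.40),
    ("Operational depth", 0.35, 0.50),
    ("Technical depth", 0.40, 0.55),
]


def cumulative_pass_rate(stages):
    """Multiply per-stage rates to get the share of applicants clearing every stage."""
    low = high = 1.0
    for _name, stage_low, stage_high in stages:
        low *= stage_low
        high *= stage_high
    return low, high


def applicants_needed(pass_rate: float, finalists: int = 1) -> int:
    """Applicants required, on average, to yield the given number of finalists."""
    return math.ceil(finalists / pass_rate)


if __name__ == "__main__":
    low, high = cumulative_pass_rate(STAGES)
    print(f"overall pass rate: {low:.2%} to {high:.2%}")
    print(f"applicants per finalist: {applicants_needed(high)} to {applicants_needed(low)}")
```

Under these targets, roughly 0.4% to 1.7% of applicants clear all four stages, which works out to somewhere between about 60 and 240 applicants per finalist. That is a useful planning number when sizing the top of the funnel.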
How Nextmantra AI Approaches This
Cloud engineering first-round screens are expensive to run well because the evaluator needs production-level cloud knowledge to distinguish between architecture diagram fluency and operational depth. A hiring manager without cloud operations experience can't probe whether a candidate's incident response answer reflects real production experience or a well-rehearsed textbook answer.
Nextmantra AI conducts first-round technical screens for cloud engineer roles with adaptive questioning on architecture design, incident response, IaC production usage, and cost management — probing past the rehearsed answers to find where actual operational knowledge stops. For candidates claiming AWS certification plus production experience, the AI determines whether those are redundant (certification confirms what production experience already shows) or contradictory (certification without the operational depth the credential implies). Your cloud architects and platform leads spend their evaluation time on candidates who have demonstrated real production ownership.
Frequently Asked Questions
What is the difference between a cloud engineer and a DevOps engineer?
Cloud engineers specialize in cloud infrastructure design, provisioning, cost optimization, and cloud-native service architecture. DevOps engineers focus on CI/CD pipelines, deployment automation, container orchestration, and development-operations feedback loops. Many organizations use the titles interchangeably, and the roles overlap significantly in practice.
Do cloud engineers need AWS certifications?
Certifications demonstrate structured exposure to cloud concepts but validate knowledge recall, not operational competence. Use certifications as a rough triage signal, not a qualification bar. A candidate with production incident experience and no certifications typically has more practical value than a certified candidate who has only deployed demo environments. The interview should do the actual filtering.
What skills should a cloud engineer have?
Core cloud engineer skills: Terraform (IaC), networking fundamentals (VPC, subnetting, security groups, load balancers, DNS), container orchestration (Kubernetes, typically via EKS, AKS, or GKE), CI/CD pipeline design, IAM and security basics, and cost monitoring and optimization (FinOps fundamentals).
What is a realistic salary range for a cloud engineer?
In the US, mid-level cloud engineers earn $120K–$160K and senior cloud engineers earn $160K–$240K (Levels.fyi, 2024). Cloud architects reach $250K–$350K+. In India, mid-level cloud engineers earn 18–35 LPA, senior roles 35–70 LPA. FinOps, cloud security, and multi-cloud specialists command 15–25% premiums.
What is FinOps and should cloud engineers know it?
FinOps is the practice of managing and optimizing cloud spend in collaboration with finance stakeholders. Mid-level cloud engineers should understand right-sizing, Reserved Instances and Savings Plans, identifying waste, and storage tier cost implications. Senior cloud engineers should design cost allocation tagging strategies and model the cost impact of architectural decisions.
What is the difference between AWS, Azure, and GCP for cloud engineering?
AWS has the largest market share (about 32% in 2024, per Synergy Research Group), the broadest service catalog (200+ services), and the deepest talent pool. Azure is strongest in enterprise environments and Microsoft-stack workloads. GCP has the strongest data and AI/ML services (BigQuery, Vertex AI). Most cloud infrastructure concepts transfer across providers once an engineer learns how the tools map to one another.
How do you evaluate cloud engineers who claim multi-cloud experience?
Genuine multi-cloud engineers can explain the conceptual mapping between providers, describe an architectural decision choosing services across providers with specific tradeoffs, and articulate the operational complexity of managing multi-cloud workloads. Candidates who say "I've worked with AWS and Azure" without specifics likely have sequential single-cloud experience, not concurrent multi-cloud operational depth.
Should you require Terraform for cloud engineering roles?
Terraform has become the near-universal IaC standard across cloud providers. Requiring Terraform proficiency is reasonable for mid-level and senior cloud engineers. Junior cloud engineers should at minimum understand the IaC concept and be learning Terraform. Provider-specific tools are relevant in some environments, but Terraform fluency is broader and more valuable long-term.
Conclusion
Cloud engineering is a domain where credentials accumulate faster than production competence. The evaluation process that reliably finds strong cloud engineers probes three things that certifications don't measure: how candidates design architecture under realistic constraints (multi-tier, multi-region, budget-aware), how they diagnose and respond to production incidents, and whether their infrastructure-as-code experience is tutorial-level or production-owned. A candidate who can answer the production incident question systematically, describe real Terraform state management challenges they've faced, and estimate the cost implications of their architectural choices has the depth that makes the difference between infrastructure that works and infrastructure that works reliably.
Ready to screen cloud engineer candidates on operational depth before your platform engineers spend evaluation time on them? [See Nextmantra AI in practice](https://nextmantra.ai/platform)
Sources: AWS certification statistics 2024; Synergy Research Group cloud market share Q4 2024; Levels.fyi compensation data 2024; FinOps Foundation State of FinOps 2024; Stack Overflow Developer Survey 2023; HashiCorp State of Cloud Strategy 2023.
