About the Team
We are building a next-generation Platform Operations team that reimagines the traditional CloudOps/SRE model. Instead of working as siloed specialists, every engineer on this team is expected to operate across Cloud Infrastructure, Databases, Networking, Operating Systems, and Data Pipelines, using Generative AI as a force multiplier.
While you will develop deeper expertise in chosen domains, the baseline expectation is full-stack operational capability, AI-first problem solving, and the ability to mentor junior engineers joining the team.
Role Summary
As a Platform Operations Engineer II, you will be a key contributor responsible for the reliability, performance, and security of customer cloud infrastructure. You bring hands-on experience across multiple technology domains and are comfortable leading incident response, driving automation, and integrating AI into operational workflows. You will help define processes, build runbooks, and mentor entry-level engineers as we scale the team.
Key Responsibilities
- Own end-to-end incident management for customer cloud environments across GCP, Azure, and/or AWS, from detection through resolution and post-mortem.
- Lead troubleshooting across the full stack: cloud services, compute/OS, networking, databases, and application-layer dependencies.
- Design and implement automation for recurring operational tasks using scripting (Python, Bash, Go), configuration languages (YAML), and Infrastructure-as-Code (Terraform, Pulumi, or CloudFormation).
- Perform advanced database operations: performance tuning, replication management, migration planning, and disaster recovery testing across RDBMS and NoSQL systems.
- Configure and troubleshoot complex networking setups: hybrid connectivity, VPC peering, transit gateways, WAFs, and DDoS mitigation.
- Manage OS hardening, patch management strategies, and security compliance across Linux and Windows fleets.
- Monitor and troubleshoot data pipelines and ETL/ELT workflows on platforms such as Databricks and Snowflake; collaborate with data engineering teams on reliability and performance improvements.
- Integrate Generative AI into daily operations using platforms such as Google Gemini, Anthropic Claude, OpenAI, and open-source LLMs: build AI-assisted runbooks, use LLM-powered diagnostics, and evaluate new AI tools for the team.
- Define and track SLIs/SLOs for customer environments; produce capacity plans and reliability reports.
- Mentor Platform Operations Engineer I team members; conduct knowledge-sharing sessions and contribute to team training programs.
- Participate in architecture reviews and provide operational readiness assessments for new customer deployments.
- Drive continuous improvement by identifying patterns in incidents and proposing systemic fixes.
Required Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 2–5 years of hands-on experience in Cloud Operations, SRE, DevOps, or Infrastructure Engineering roles.
- Solid working knowledge of at least one major cloud platform (GCP, Azure, or AWS) with practical experience in compute, storage, networking, IAM, and managed services.
- Proficiency in Linux system administration; working knowledge of Windows Server is a plus.
- Strong networking skills: TCP/IP, DNS, load balancing, VPNs, firewall rules, and network troubleshooting using tools like tcpdump, traceroute, and Wireshark.
- Hands-on experience with relational databases (PostgreSQL, MySQL, or SQL Server) including query optimisation, index management, and backup/restore workflows.
- Familiarity with NoSQL databases (MongoDB, DynamoDB, Redis, or Cassandra) in production environments.
- Working knowledge of data engineering fundamentals: data pipeline architectures, data platforms (Databricks, Snowflake), scheduling tools (Airflow, Step Functions), and common data formats and stores.
- Practical understanding of AI/ML concepts and demonstrated ability to use Generative AI tools (Google Gemini, Anthropic Claude, OpenAI, open-source LLMs) to accelerate operational work.
- Proficiency in at least one scripting/programming language (Python, Bash, or Go) and configuration languages (YAML), with experience writing automation and tooling.
- Experience with DevOps practices, Infrastructure-as-Code (Terraform preferred), and CI/CD pipelines.
- Excellent communication skills: ability to lead incident bridges, write clear post-mortems, and present operational reports to stakeholders.
Preferred Qualifications
- Associate-level cloud certification (e.g., Google Cloud Associate Cloud Engineer, Microsoft Azure Administrator Associate (AZ-104), or AWS Certified Solutions Architect Associate).
- Experience with container orchestration (Kubernetes, ECS/EKS) and service mesh technologies.
- Familiarity with observability platforms (Datadog, New Relic, Grafana + Prometheus stack) and log management (ELK, Splunk).
- Exposure to building or consuming APIs for operational automation (REST, GraphQL).
- Experience with security and compliance frameworks (SOC 2, ISO 27001, CIS Benchmarks).
- Track record of building internal tools, ChatOps integrations, or AI-powered automation workflows.
- Prior experience mentoring junior engineers or leading small project teams.
What We Offer
- An opportunity to build a team from the ground up and shape its culture, tooling, and operational philosophy.
- Clear growth path toward specialisation (Cloud Architect, DBA Lead, Network Lead, or AI Operations Lead) or people management.
- Hands-on work with AI-augmented tooling for infrastructure operations.
- Exposure to diverse, complex customer environments across industries.
- Investment in certifications, conference attendance, and continuous learning.