Location: Remote
Type: Full-time
About RacerX.AI
At RacerX.AI, we are passionate about optimizing website performance and providing cutting-edge solutions that make websites faster, more secure, and more efficient. Our mission is to help businesses around the world deliver exceptional user experiences by focusing on the reliability and speed of their web platforms. As a remote-first company, we thrive on collaboration, flexibility, and innovation.
Job Summary
We are looking for an experienced Site Reliability Engineer (SRE) to join our team. In this role, you will be responsible for ensuring the reliability, scalability, and performance of our infrastructure and services. You will work closely with our development, DevOps, and SysAdmin teams to build systems that are efficient, resilient, and highly available. You will help us continue to deliver optimal website performance for our clients by ensuring our systems are robust and ready to scale.
Key Responsibilities
- Monitor and maintain the stability, availability, and scalability of our cloud infrastructure (AWS, Google Cloud, or Azure).
- Develop and implement strategies to optimize the reliability and performance of critical systems, minimizing downtime and failures.
- Automate operational tasks, monitoring, and alerting using tools like Prometheus, Grafana, and CloudWatch.
- Collaborate with DevOps and SysAdmin teams to design and implement CI/CD pipelines and infrastructure as code (IaC).
- Troubleshoot and resolve issues related to system performance, outages, or slowdowns in real time.
- Conduct system performance reviews and load testing to identify bottlenecks and scalability issues.
- Build disaster recovery strategies, including backups and failover mechanisms, to ensure system resilience.
- Optimize cost efficiency in cloud environments while maintaining high standards for system performance.
- Ensure proper logging, monitoring, and alerting across all critical systems and services.
- Develop playbooks and runbooks for incident management and response to streamline operations.
- Work closely with the development teams to implement best practices for building reliable systems.
Required Skills & Qualifications
- Proven experience as a Site Reliability Engineer (SRE), DevOps Engineer, or in a similar role.
- Deep understanding of cloud infrastructure (AWS, Google Cloud, or Azure) and scalability.
- Experience with containerization and orchestration tools such as Docker and Kubernetes.
- Proficiency in Infrastructure as Code (IaC) tools like Terraform, Ansible, or CloudFormation.
- Strong scripting skills in Python, Bash, or similar languages for automation.
- Experience with monitoring and observability tools such as Prometheus, Grafana, ELK Stack, or CloudWatch.
- Familiarity with CI/CD pipelines and automated deployment strategies.
- Strong understanding of networking, security best practices, and load balancing.
- Proven experience troubleshooting complex systems in real time and identifying root causes of outages.
- Knowledge of disaster recovery strategies and system resilience techniques.
- Excellent communication skills and the ability to work in a remote-first team.
Preferred Skills
- Experience with multi-cloud or hybrid cloud environments.
- Familiarity with microservices architecture and distributed systems.
- Experience with incident management and automated recovery systems.
- Understanding of performance optimization techniques for high-traffic websites.
- Knowledge of DNS management, CDN optimization, and web performance best practices.
Why Join RacerX.AI?
- Remote-first culture – work from anywhere and collaborate with a global team.
- Innovative projects – work on cutting-edge web performance optimization and infrastructure solutions.
- Professional growth – continuous learning and opportunities for career development in a fast-growing industry.
- Flexible working hours – work when you’re most productive and maintain a healthy work-life balance.
- Collaborative environment – work with a passionate team that’s committed to innovation and excellence.
If you’re passionate about building resilient, high-performing systems and ensuring the reliability of cutting-edge web platforms, apply now to join the RacerX.AI team!