Site Reliability Engineer

Prodigy Resources • Seattle, WA, US

About Us:

Prodigy is seeking a Site Reliability Engineer (SRE) to join a client organization that is leading the charge in fintech innovation, providing state-of-the-art solutions that drive financial success and empower its customers. As the client embarks on an exciting greenfield project, this role will be critical to ensuring the reliability, scalability, and performance of its systems as new technologies are built and deployed.

Role Overview:

We are looking for a talented Site Reliability Engineer with a strong background in Python, Django, Flask, and AWS. In this role, you will be pivotal in maintaining the uptime, performance, and reliability of our systems. You'll work closely with development teams to integrate reliability engineering practices into the software development lifecycle, ensuring our services meet high standards for reliability and performance.

Key Responsibilities:

System Reliability: Ensure the availability, performance, and scalability of backend services and APIs. Develop and implement reliability engineering practices and tools.
Incident Management: Respond to and resolve incidents, perform root cause analysis, and implement preventive measures to avoid future issues.
Monitoring & Metrics: Set up and manage monitoring, logging, and alerting systems using tools such as AWS CloudWatch, and ensure comprehensive visibility into system performance.
Automation: Automate operational tasks and processes to improve efficiency and reduce manual intervention. Develop and maintain CI/CD pipelines.
Capacity Planning: Work on capacity planning and performance tuning to handle increasing loads and ensure system resilience.
Collaboration: Collaborate with development teams to design, deploy, and manage infrastructure and applications. Provide guidance on reliability best practices and performance optimizations.
Documentation: Create and maintain documentation for systems, processes, and incident response procedures.
Continuous Improvement: Stay updated with industry trends and emerging technologies to continuously improve our reliability and performance practices.

Qualifications:

Experience: Minimum of 5 years of experience in Site Reliability Engineering or a related field, with a solid background in Python, Django, Flask, and AWS.
Technical Skills: Proficiency in Python and experience with Django and Flask frameworks. Hands-on experience with AWS services (EC2, S3, RDS, Lambda, etc.).
Reliability Practices: Strong understanding of SRE principles, including SLAs, SLOs, and error budgets. Experience with incident management and disaster recovery.
Monitoring Tools: Experience with monitoring and observability tools (e.g., AWS CloudWatch, Prometheus, Grafana).
Automation: Proven experience in automating tasks and managing CI/CD pipelines.
Problem-Solving: Excellent analytical and troubleshooting skills, with the ability to resolve complex technical issues.
Communication: Strong verbal and written communication skills, with the ability to convey technical concepts to both technical and non-technical audiences.
Fintech Experience: Experience in the fintech industry or similar regulated environments is highly desirable.

Why Join Us?

Innovative Projects: Contribute to a transformative greenfield project that will shape the future of fintech.
Dynamic Environment: Engage in a fast-paced, collaborative environment focused on continuous improvement and innovation.
Growth Opportunities: Access to ongoing learning and career development opportunities.
Competitive Compensation: Enjoy a competitive salary and comprehensive benefits package.