NOTE: This position is hybrid, with 3 days a week onsite at our Plano, Texas location. (This is NOT remote.)
This is a very senior Principal Site Reliability Engineer role.
Join AT&T and reimagine the communications and technologies that connect the world. Our Consumer Technology experience team is delivering innovative and reliable technology solutions to power differentiated, simplified customer experiences. Bring your bold ideas and fearless risk-taking to redefine connectivity and transform how the world shares stories and experiences that matter. When you step into a career with AT&T, you won’t just imagine the future, you’ll create it.
The Principal System Engineer, Operations Tier 1, is responsible for helping lead a team dedicated to proactively ensuring high availability, reliability and resiliency of AT&T's customer- and agent-facing experiences and shared omnichannel platforms.
Responsibilities
- Provide 24x7 Tier 1 support for customer- and agent-facing applications operating across eCommerce, Care, & Retail platforms built on microservices-based architecture, on-prem and in the cloud, including SaaS: Salesforce, Salesforce Marketing Cloud, MuleSoft, etc.
- Manage escalated issues, incidents and outages; triage them and drive prompt resolution.
- Provide prompt visibility and status of escalated issues, incidents and outages to leadership, business partners and other key stakeholders.
- Own Site Reliability Engineering responsibilities such as developing a functional and technical knowledge base for the application, creating runbooks, building observability for the application (alerts, monitoring and dashboards) that enables proactive incident and problem detection, triaging incidents, and helping Tier 2 conduct blameless post-mortems (after-action reviews).
- Oversee daily T1 operations of on-premises and hosted applications and experiences, including data centers, compute, storage, data networks, monitoring and NOC.
- Work with Release Management on upcoming production changes to identify and mitigate risks.
- Work closely with Product Development & Tier 2 SRE teams to ensure knowledge transfer on system changes happens well in advance of those changes being operationalized.
- Optimize the overall T1 on-call process and incident response workflow, including managing the team’s on-call rotation, alert rules, communication methods and incident response plans.
- Provide metrics and status reports and review them with leadership and stakeholder communities; establish processes for metrics gathering, reporting and communication.
- Stay current on feature development and how it could affect the system’s overall reliability.
- Assist in developing, publishing and continually updating technology operations and support Standard Operating Procedures and detailed T1 documentation based on industry best practices.
- Provide technical leadership with strong communication skills and the ability to build and organize a self-motivated team.
- Conduct rigorous due diligence on all plans.
- Drive team engagement. Motivate individuals and teams beyond current scope of influence.
- Champion and facilitate breakthrough solutions. Take appropriate, intelligent risks.
- Create, enable and cultivate a culture of responsibility and accountability.
- Lead by example and operate with transparency, integrity and respect.
Qualifications:
A suitable candidate for this position must possess the following applicable knowledge, skills and abilities. In addition, the candidate must be able to demonstrate these competencies and provide applicable examples to support them.
- Bachelor's degree in Computer Science or Engineering, or a related field
- 10+ years of demonstrated leadership experience building cross-organizational consensus
- 10+ years of demonstrated experience building and managing high-performing teams
- 10+ years of demonstrated experience with incident management, incident response and site reliability, including managing a Tier 1 Production Operations team
- 10+ years of experience supporting large-scale eCommerce, Care, & Retail POS platforms and supporting applications in production in a leadership capacity.
- Solid understanding of and experience with Application Performance Monitoring tools like Dynatrace, AppDynamics, Introscope, etc.
- Hands-on experience with Customer Experience Analytics & Session Based tools like Quantum Metric or Tealeaf
- Hands-on experience with Synthetic Monitoring tools like Catchpoint
- Experience working within a scaled agile development team.
- Experience developing and implementing customer journey dashboards to enable proactive monitoring of customer experience availability.
- Experience designing and managing a world-class technical operations organization including 24x7 support and outage/incident management.
- Solid knowledge of Operations practices and demonstrated experience increasing Operational capability maturity within an organization.
- Excellent communication and presentation skills; the ability to present complex technical information in a clear and concise manner.
- Proficient at analyzing and interpreting large amounts of data with the capacity to synthesize information and translate into effective and actionable insights.
- Exceptional organization and planning skills, strong analytical abilities, and process-driven orientation
- Unrelenting sense of customer-focus, urgency and accuracy with an execution mindset
- Self-starter, creative, enthusiastic, innovative and collaborative outlook
Primary technical skills should include:
- Java, Spring, WebLogic, AKS, and CI/CD tools, PL/SQL
- Microservices-based architecture using Java, J2EE, Jenkins, Maven, Linux, K8s, both on-prem and in the cloud.
- Docker, Kubernetes and Microsoft Azure Cloud, Unix
- Relational & NoSQL databases like Oracle & Cassandra
- Experience with visualization tools like Kibana and Grafana. EFK stack experience preferred. (Hands-on experience is a must.)
- Creation of dashboards on Dynatrace, ELK and Grafana. (Hands-on experience is a must.)
Secondary technical skills (optional, yet highly desirable):
- Salesforce Development (Apex, Visualforce, Lightning), Salesforce Sales Cloud & Service Cloud, MuleSoft, Dynatrace and ELK (Elasticsearch, Logstash, Kibana) for monitoring and logging
- Hands-on experience supporting Salesforce applications
- Sales & Service Cloud
- Experience with Marketing Cloud
- Experience within high tech, software and/or wireless/telecom industry highly desired
- Understanding of integration technologies and API Gateway, Mobile and iOS technology stack; Experience with MuleSoft desired
- Solid technical background with understanding of and/or experience in software development, web technologies and customer communications such as email, SMS and push notifications.
Our Principal System Engineers earn between $158,200 and $237,400, not to mention all the other amazing rewards that working at AT&T offers. Individual starting salary within this range may depend on geography, experience, expertise, and education/training.