Principal Site Reliability Engineer

AT&T • Plano, TX, US

NOTE: This position is hybrid, with 3 days a week onsite at our Plano, Texas location. This is NOT a remote position.

This is a very senior-level Principal Site Reliability Engineer role.

Join AT&T and reimagine the communications and technologies that connect the world. Our Consumer Technology experience team is delivering innovative and reliable technology solutions to power differentiated, simplified customer experiences. Bring your bold ideas and fearless risk-taking to redefine connectivity and transform how the world shares stories and experiences that matter. When you step into a career with AT&T, you won’t just imagine the future; you’ll create it.

The Principal System Engineer, Operations Tier 1, is responsible for helping lead a team dedicated to proactively ensuring the high availability, reliability and resiliency of AT&T's customer- and agent-facing experiences and shared omnichannel platforms.

Responsibilities

Provide 24x7 Tier 1 support for customer- and agent-facing applications operating across eCommerce, Care, and Retail platforms built on a microservices-based architecture, on-prem and in the cloud, including SaaS platforms such as Salesforce, Salesforce Marketing Cloud, and MuleSoft.
Manage escalated issues, incidents and outages, including triage and driving prompt resolution.
Provide prompt visibility and status of escalated issues, incidents and outages to leadership, business partners and other key stakeholders.
Own Site Reliability Engineering aspects such as developing a functional and technical knowledge base for the application, creating runbooks, building observability for the application (alerts, monitoring and dashboards that enable proactive incident and problem detection), triaging incidents, and helping Tier 2 conduct blameless post-mortems (after-action reviews).
Oversee daily T1 operations of on-premises and hosted applications and experiences, including data centers, compute, storage, data networks, monitoring and NOC.
Work with Release Management on upcoming production changes to identify and mitigate risks.
Work closely with Product Development and Tier 2 SRE teams to ensure knowledge transfer on system changes well in advance of those changes being operationalized.
Optimize the overall T1 on-call process and incident response workflow, including managing the team’s on-call rotation, alert rules, communication methods and incident response plans.
Provide metrics and status reports and review them with leadership and stakeholder communities; establish processes for metrics gathering, reporting and communication.
Stay current on feature development and how it could affect the system’s overall reliability.
Assist in developing, publishing and continually updating technology operations and support Standard Operating Procedures and detailed T1 documentation based on industry best practices.
Provide technical leadership with strong communication skills and the ability to create and organize a self-motivated team.
Conduct rigorous due diligence on all plans.
Drive team engagement. Motivate individuals and teams beyond current scope of influence.
Champion and facilitate breakthrough solutions. Take appropriate, intelligent risks.
Create, enable and cultivate a culture of responsibility and accountability.
Lead by example and operate with transparency, integrity and respect.

Qualifications:

A suitable candidate for this position must possess the following applicable knowledge, skills and abilities. In addition, the candidate must be able to demonstrate these competencies and provide applicable examples to support them.

Bachelor's degree in Computer Science or Engineering, or a related field
10+ years of demonstrated leadership experience building cross-organizational consensus
10+ years of demonstrated experience building and managing high-performing teams
10+ years of demonstrated experience with incident management, incident response, and site reliability, including managing a Tier 1 Production Operations team
10+ years of experience supporting large-scale eCommerce, Care, and Retail POS platforms and supporting applications in production in a leadership capacity
Solid understanding of and experience with Application Performance Monitoring tools like Dynatrace, AppDynamics, Introscope, etc.
Hands-on experience with Customer Experience Analytics & Session Based tools like Quantum Metric or Tealeaf
Hands-on experience with Synthetic Monitoring tools like Catchpoint
Experience working within a scaled agile development team
Experience developing and implementing customer journey dashboards to enable proactive monitoring of customer experience availability.
Experience designing and managing a world-class technical operations organization including 24x7 support and outage/incident management.
Solid knowledge of Operations practices and demonstrated experience increasing Operational capability maturity within an organization.
Excellent communication and presentation skills; the ability to present complex technical information in a clear and concise manner.
Proficient at analyzing and interpreting large amounts of data, with the capacity to synthesize information and translate it into effective and actionable insights
Exceptional organization and planning skills, strong analytical abilities, and process-driven orientation
Unrelenting sense of customer-focus, urgency and accuracy with an execution mindset
Self-starter, creative, enthusiastic, innovative and collaborative outlook

Primary technical skills should include:

Java, Spring, WebLogic, AKS, and CI/CD tools, PL/SQL
Microservices-based architecture using Java, J2EE, Jenkins, Maven, Linux, K8s, both on-prem and in the cloud
Docker, Kubernetes and Microsoft Azure Cloud, Unix
Relational & NoSQL databases like Oracle & Cassandra
Experience with visualization tools like Kibana and Grafana; EFK stack experience preferred (hands-on experience is a must)
Creation of dashboards on Dynatrace, ELK and Grafana (hands-on experience is a must)

Secondary technical skills (optional, yet highly desirable):

Salesforce Development (Apex, Visualforce, Lightning), Salesforce Sales Cloud & Service Cloud, MuleSoft, Dynatrace and ELK (Elasticsearch, Logstash, Kibana) for monitoring and logging
Hands-on experience supporting Salesforce applications
Sales & Service Cloud
Experience with Marketing Cloud
Experience within the high tech, software and/or wireless/telecom industry highly desired
Understanding of integration technologies and API gateways, and of the mobile and iOS technology stack; experience with MuleSoft desired
Solid technical background with understanding of and/or experience in software development, web technologies and customer communications such as email, SMS and push notifications

Our Principal System Engineers earn between $158,200 and $237,400, not to mention all the other amazing rewards that working at AT&T offers. Individual starting salary within this range may depend on geography, experience, expertise, and education/training.
