Sr. Software Engineer (Data Center Automation)

xAI | Memphis, Shelby County, West Tennessee, Tennessee, United States | Full-time | Featured

Published 3 days ago

About the company

xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. The team is small, highly motivated, and operates with a flat organizational structure where all employees contribute directly and hands-on to the mission.

About the role

This role focuses on automating processes, building observability solutions, and ensuring seamless operations for mission-critical AI infrastructure across a multi-data center environment. The ideal candidate bridges software engineering principles with physical data center realities to reduce MTTR, minimize downtime, and drive infrastructure resilience and scalability.

Responsibilities

- Design, develop, and deploy scalable services in Python and Rust to automate reliability workflows: monitoring, alerting, incident response, and infrastructure provisioning
- Implement and maintain observability tools (metrics collection, logging, tracing, and dashboards) for real-time system health insights across multiple data centers
- Collaborate with cross-functional teams (software, network, site/facility operations, mechanical/electrical) to automate fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation
- Troubleshoot complex issues including hardware failures, environmental anomalies, software bugs, and network problems while applying error budgets and SLAs
- Optimize Linux-based systems for performance, security, and reliability including kernel tuning, container orchestration (Kubernetes), and automation scripting
- Participate in on-call rotations, blameless postmortems, and joint physical failover exercises with facility teams
- Mentor junior engineers and document processes to foster a culture of automation and knowledge sharing

Requirements

- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience
- 3+ years in SRE, infrastructure engineering, DevOps, or systems engineering in large-scale or distributed environments
- Strong Python skills (required); solid fundamentals in at least one systems-level language (Go, C++, or Rust)
- Solid Linux systems administration experience: performance tuning, kernel-level understanding, and production scripting
- Practical experience with containerization and orchestration (Docker, Kubernetes or similar)
- Experience implementing observability solutions: Prometheus, Grafana, or alternatives
- Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in multi-site environments

Nice to have

- 5+ years in SRE or infrastructure roles in hyperscale, cloud, or AI/ML training environments
- Proficiency in Rust for systems programming and performance-critical components
- Direct experience integrating software reliability tools with physical data center infrastructure (power, cooling, environmental monitoring)
- Background optimizing Linux systems for AI workloads, GPU clusters, or high-throughput compute
- Experience with bare-metal provisioning, data center interconnects, or hybrid/multi-site failover mechanisms
- Prior work building automated remediation, fault tolerance, or predictive failure detection systems

Benefits & perks

- Not specified

Compensation

Salary not specified. xAI is an equal opportunity employer; no equity, bonus, or benefits details provided in the posting.

