As a Site Reliability Engineer, you will be responsible for designing XCHG’s core infrastructure. We code our way out of operational problems. The SRE team is responsible for reliability, scalability, and automation while keeping an eye on latency, security, performance, and capacity. You’ll also be responsible for mentoring, sharing knowledge, and guiding the technical decisions of the team to set us up for long-term success, both as a product and as a team. As a key hire on a growing engineering team, you’ll have a deep impact on engineering practices and culture.
As an SRE, you will:
- Deploy and maintain our infrastructure.
- Document every action so your findings turn into repeatable procedures, and then into automation.
- Improve the deployment process to make it as boring as possible.
- Debug production issues across services and levels of the stack.
- Own, maintain, and continuously improve all systems provided as a service, such as monitoring and data stores.
- Research and implement changes to increase site reliability and help us operate more efficiently.
Who we're looking for:
- You have an intense intellectual curiosity, both within and outside of technology, and a deep humility about how much there is to learn.
- You're focused and driven, and you can get challenging projects across the finish line.
- You're empathetic, patient and love to help your teammates grow.
- You have a rare ability to collaborate with and influence diverse engineering teams to improve the reliability, scalability and durability of their services.
- You have experience running apps in production and take engineering best practices seriously.
- You understand the value of great logging, proper monitoring, and error tracking.
- Fluent in Python.
- Minimum of 3 years of industry software engineering or IT automation experience.
- Familiar with Ansible, Terraform, Docker, and AWS, or related tools.
- Solid grasp of Linux systems and networking concepts.
- You're the get-stuff-done type and are excited to be part of different projects within a growing startup.
- Previous operational responsibility for business-critical production systems in a collaborative environment.
- Strong testing background: experience building unit, integration, performance, and load tests.
- You are familiar with running systems with a microservice-based architecture.
- You have interacted with data persistence technologies such as Elasticsearch, MongoDB, Cassandra, Kafka, or Redis.
- You have written software.