Distributed Systems Engineer – zCLOUD

Build and operate scalable, fault-tolerant distributed systems powering the zCLOUD platform.

What You’ll Do

  • Design and build the core distributed services behind the zCLOUD platform
  • Develop control planes for GPU and resource orchestration
  • Build scalable, fault-tolerant services for multi-tenant environments
  • Optimize system reliability, latency, and throughput at scale
  • Collaborate with GPU systems and research teams on integration
  • Implement observability, monitoring, and debugging capabilities
  • Own services from design through production operation
  • Document system architecture and operational best practices

You’ll Thrive Here if You

  • Have 5+ years of experience building distributed systems
  • Possess a strong understanding of concurrency, consistency, and fault tolerance
  • Have experience designing APIs and service-oriented architectures
  • Are comfortable operating and debugging systems in production
  • Demonstrate a strong ownership mindset and engineering rigor

Bonus Qualifications

  • Experience with cloud platforms or large-scale infrastructure
  • Familiarity with scheduling or resource management systems
  • Experience supporting AI or data-intensive workloads
  • Background in platform or control-plane engineering

Why This Role Is Unique

You will help build the backbone of a new cloud paradigm designed specifically for GPU-driven AI workloads.

Details

  • Competitive salary and equity based on experience and skill set
  • Flexible work environment
  • Applicants must be authorized to work in the location where they will be based

Interested in This Position?