HPC Platform Engineer
Millennium Management
The firm is developing a cutting-edge high-performance computing (HPC) platform to support our portfolio managers, developers, quantitative analysts, and data scientists, enabling seamless scaling of compute capabilities both on-premises and in the cloud. We seek a senior, hands-on engineer who is customer-focused and an advocate for customer-driven solutions. The ideal candidate will have a strong understanding of physical and cloud-based infrastructure, experience automating infrastructure, and proficiency in service and infrastructure lifecycle management. They will engage with teams to understand their requirements, drive development of our HPC platforms, and collaborate with other teams on integration. The candidate should also have expertise in Linux systems administration, container orchestration, networking, security, and infrastructure-as-code. Experience integrating HPC with storage and data platforms, and testing and optimizing that integration, is also essential.
Principal Responsibilities
- Collaborate within a customer-focused team to design, develop, test, and deploy HPC infrastructure in alignment with business needs.
- Foster strong relationships with quantitative, software engineering, and data science teams to ensure the HPC platforms effectively meet their requirements.
- Engage with business units to promote understanding and drive adoption of our HPC offerings.
Qualifications/Desired Skills
- Deep understanding of Linux operating systems, with substantial practical experience in performance tuning, specifically related to HPC workloads.
- Experience consulting with business units on the execution of HPC workloads.
- Experience with HPC cluster schedulers, such as Slurm, Grid Engine, MOAB, or PBS.
- Experience with dynamic scaling, partitioning, and resource management within HPC environments.
- Experience with, and a strong understanding of, containers and container orchestration, including Kubernetes and container runtimes.
- Experience contributing to a shared code base, including infrastructure as code.
- Experience with configuration management and automation tools, such as Chef, Ansible, Salt, or Packer.
- Experience building monitoring and alerting on logs and metrics.
- Excellent written and verbal communication skills.
- Excellent troubleshooting and analytical skills.
- Self-starter able to execute independently, on a deadline, and under pressure.