This page is still a work in progress because the notes and inputs/outputs are not well explained yet.
These instructions show the most basic way to create a cluster that's capable of running MPI across the general-purpose network in Azure. This is meant to be a simple illustration of what the process looks like, using the most basic steps. This is not how you would create a production HPC cluster; CycleCloud and similar tools automate and simplify most of this, but my goal here is to show what is happening underneath those automations so you can play around.
We will use the Azure CLI to do everything and assume you have already set that up with your Azure account and subscription.
Create compute nodes
On your local machine (the same place you'll run your az commands), create a file called cloud-init.txt that contains this:
#cloud-config
package_upgrade: true
packages:
- clustershell
- openmpi-bin
- libopenmpi-dev
We'll pass this cloud-init file into Azure's VM provisioning process to preinstall clush (which lets us run the same command across all our nodes) and OpenMPI (which we need to build and run MPI applications). If you want to get fancy, you can also jam an entire bash script into this text file and have it run verbatim when each VM boots.
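If you did want to run commands at boot, a minimal sketch of extending the file above with cloud-init's runcmd directive would look like this (the two runcmd entries are made-up placeholders, not something this walkthrough needs):

#cloud-config
package_upgrade: true
packages:
- clustershell
- openmpi-bin
- libopenmpi-dev
runcmd:
# runcmd entries execute once, at the end of the first boot
- mkdir -p /scratch
- echo "provisioned by cloud-init" > /etc/motd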
The first Azure resource we create is a resource group. This is just a logical container that will group all our cluster parts.
$ az group create --name glocktestrg --location eastus
Once this resource group is created, we can start creating compute nodes inside it. There are two ways to do this: the cheap/easy way and the proper/expensive way.
Cheap (Ethernet) nodes
Create a proximity placement group (ppg). Every VM we put in here will be within the same low-latency network bubble.
$ az ppg create --name glocktestppg \
      --resource-group glocktestrg \
      --intent-vm-sizes Standard_DS1_v2
Now we create a group of four compute nodes in that ppg. The Azure CLI makes this pretty easy nowadays.
$ az vm create --name glocluster \
      --resource-group glocktestrg \
      --image UbuntuLTS \
      --ppg glocktestppg \
      --generate-ssh-keys \
      --size Standard_DS1_v2 \
      --accelerated-networking true \
      --custom-data cloud-init.txt \
      --count 4
What this means:
- The nodes will be named glocluster0, glocluster1, glocluster2, and glocluster3.
- --image UbuntuLTS installs Ubuntu on each VM. This aliases to an old (18.04) version; if you want something else, use az vm image list -p canonical -o table --all to find all Ubuntu (canonical) images. For example, Canonical:0001-com-ubuntu-server-jammy:22_04-lts-gen2:latest will use Ubuntu 22.04 instead.
- --generate-ssh-keys puts your local ssh key (~/.ssh/id_rsa.pub) into the authorized_keys file on all the nodes you're creating.
- --size Standard_DS1_v2 says to use DS1_v2 VM types.
- --accelerated-networking true says to pass the NIC through to your VM using SR-IOV. It doesn't cost anything, so I don't really know why you would not want this.
- --custom-data cloud-init.txt passes our cloud-init.txt file to the provisioning process.
- --count 4 says to make four VMs total.
This command will block until all VMs are up and running. Then, we can see what all was created using:
$ az resource list --resource-group glocktestrg -o table
Name                   ResourceGroup    Location    Type                                          Status
---------------------  ---------------  ----------  --------------------------------------------  --------
glocktestppg           glocktestrg      eastus      Microsoft.Compute/proximityPlacementGroups
gloclusterPublicIP3    glocktestrg      eastus      Microsoft.Network/publicIPAddresses
gloclusterNSG          glocktestrg      eastus      Microsoft.Network/networkSecurityGroups
gloclusterPublicIP2    glocktestrg      eastus      Microsoft.Network/publicIPAddresses
...
The VMs created all have public IPs and share a common subnet on a common VNet within close physical proximity thanks to our proximity placement group.
Proper (InfiniBand) nodes
InfiniBand is only available on HPC (read: expensive) VM types, and provisioning nodes on the same InfiniBand fabric requires creating a VM Scale Set (VMSS) instead of individual VMs. To create a VMSS of HBv2 nodes with InfiniBand:
$ az vmss create --name glocluster \
      --resource-group glocktestrg \
      --image Microsoft-dsvm:ubuntu-hpc:2204:latest \
      --generate-ssh-keys \
      --vm-sku Standard_HB120rs_v2 \
      --accelerated-networking true \
      --public-ip-per-vm \
      --custom-data cloud-init.txt \
      --instance-count 4
What this means:
- The nodes' names will be prefixed with glocluster followed by a hexadecimal number that uniquely identifies each one.
- --image Microsoft-dsvm:ubuntu-hpc:2204:latest installs the special Azure HPC flavor of Ubuntu on each VM. This VM image includes all the InfiniBand drivers.
- --generate-ssh-keys puts your local ssh key (~/.ssh/id_rsa.pub) into the authorized_keys file on all the nodes you're creating.
- --vm-sku Standard_HB120rs_v2 says to use HBv2 VM types.
- --accelerated-networking true says to pass the NIC through to your VM using SR-IOV. It doesn't cost anything, so I don't really know why you would not want this.
- --public-ip-per-vm assigns a public IP address to each node. This makes it easier for you to connect to them in this example, but each public IP does cost money.
- --custom-data cloud-init.txt passes our cloud-init.txt file to the provisioning process.
- --instance-count 4 says to make four VMs total to start. You can manually scale this up or down with a single command (az vmss scale) later.
VMSSes are nice because they essentially establish a VM template from which VMs can be instantiated and torn down with a single command. In combination with a well-formed cloud-init script, you can get a VMSS set up just the way you like it, then spin up a whole cluster (using az vmss scale) and have nodes just come online and be ready to use. At the end of the day, you can scale the VMSS down to zero nodes to stop racking up costs without losing your VMSS configuration. When it's time to spin up again, you just use az vmss scale to go from zero nodes to however many you want to play with.
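For example, using the same resource group and VMSS names as above:

# scale down to zero instances at the end of the day
$ az vmss scale --resource-group glocktestrg --name glocluster --new-capacity 0

# scale back up when you want to play again
$ az vmss scale --resource-group glocktestrg --name glocluster --new-capacity 4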
Log into nodes
At this point you should have a cluster of running VMs.
SSH to your cluster
First, list the public and private IP addresses of your cluster. We'll need to SSH to one of the nodes using its public IP address:
$ az vm list-ip-addresses --resource-group glocktestrg -o table
VirtualMachine    PublicIPAddresses    PrivateIPAddresses
----------------  -------------------  --------------------
glocluster0       20.25.28.201         10.0.0.7
glocluster1       20.25.25.225         10.0.0.6
glocluster2       20.25.24.166         10.0.0.4
glocluster3       20.169.149.181       10.0.0.5
The az vm create command created a user for you on your VMs with the same name as your local user (the one who ran the az command; in my case, glock, but you can override this using --admin-username). It also put your local ssh key (~/.ssh/id_rsa.pub) in each VM's authorized_keys file. So at this point, you should be able to log into your head node:
$ ssh 20.25.28.201
$ exit
Copy your private ssh key from your local laptop to the remote node so you can then ssh from your login node to the other nodes.
$ scp ~/.ssh/id_rsa 20.25.28.201:.ssh/
Note that this is very bad security practice that you shouldn't follow unless you're just messing around; you shouldn't share private keys between your home computer and this cluster (think: if someone breaks into your cloud cluster, they now have the same key you use at home).
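A slightly safer sketch, if you'd rather mint a throwaway keypair just for this cluster (the key filename is arbitrary, and the IPs are the ones from the listing above):

# generate a cluster-only keypair on your laptop
$ ssh-keygen -t ed25519 -f ~/.ssh/glocluster_key -N ''

# authorize it on every node (we can still reach them with our normal key)
$ for ip in 20.25.28.201 20.25.25.225 20.25.24.166 20.169.149.181; do
      ssh-copy-id -i ~/.ssh/glocluster_key.pub ${ip}
  done

# put only the throwaway private key on the head node, named id_rsa so the
# later steps that copy ~/.ssh/id_rsa around still work unchanged
$ scp ~/.ssh/glocluster_key 20.25.28.201:.ssh/id_rsa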
Configure inter-host connectivity
These VMs can already resolve each others' hostnames thanks to Azure magic, so there's no need to add glocluster0, glocluster1, etc. to /etc/hosts.
To get passwordless ssh up and working (so we can noninteractively run commands on other nodes through SSH), first add all the compute nodes' host keys to your SSH known_hosts file:
$ for i in {0..3}; do ssh-keyscan glocluster${i}; done > ~/.ssh/known_hosts
The ssh-keyscan command just connects to another host and retrieves its host keys, which we then store in ~/.ssh/known_hosts on our main glocluster0 node (note that the > clobbers any existing known_hosts file there).
Now the clush command (which we installed via cloud-init) will work, and we don't have to keep using bash for loops to do something on all our nodes.
Make sure clush works across all nodes:
$ clush -w glocluster[0-3] uname -n
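The output should look something like this (the order of hosts may vary):

glocluster0: glocluster0
glocluster1: glocluster1
glocluster2: glocluster2
glocluster3: glocluster3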
Then copy your known_hosts and private key to all cluster nodes using clush -c:
$ clush -w glocluster[0-3] -c ~/.ssh/id_rsa ~/.ssh/known_hosts
Now we have passwordless ssh working between all our nodes, and we are ready to run an MPI application!
Run MPI hello world
Create a file called hello.c and stick the following code in it:
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int mpi_rank, mpi_size;
    char host[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank); /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size); /* total number of ranks */

    gethostname(host, sizeof(host));
    printf("Hello from rank %d of %d running on %s\n", mpi_rank, mpi_size, host);

    MPI_Finalize();
    return 0;
}
Compile the above using mpicc hello.c -o hello.
Then copy it to all your nodes:
$ clush -w glocluster[0-3] -c hello
Now make sure MPI works:
$ mpirun --host glocluster0,glocluster1,glocluster2,glocluster3 ./hello
You might get some stupid warnings about openib like this:
--------------------------------------------------------------------------
[[50939,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: glocluster3
It's harmless, but you can make these go away by explicitly telling Open MPI to only use the TCP, shared-memory (vader), and self (same-process loopback) transports:
$ mpirun --host glocluster0,glocluster1,glocluster2,glocluster3 --mca btl tcp,vader,self ./hello
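If you don't want to type the --mca flag every time, Open MPI also reads MCA parameters from the environment:

# equivalent to passing --mca btl tcp,vader,self on every mpirun
$ export OMPI_MCA_btl=tcp,vader,self
$ mpirun --host glocluster0,glocluster1,glocluster2,glocluster3 ./hello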
If you want to do some basic performance testing, try the OSU Micro-Benchmarks. First build them:
$ sudo apt -y install make g++
$ wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-6.1.tar.gz
$ tar zxf osu-micro-benchmarks-6.1.tar.gz
$ cd osu-micro-benchmarks-6.1/
$ ./configure --prefix $PWD/install CC=mpicc CXX=mpicxx
$ make && make install
Copy the compiled app everywhere:
$ cd
$ clush -w glocluster[0-3] -c osu-micro-benchmarks-6.1
Then try measuring latency!
$ mpirun -np 2 --host glocluster1,glocluster2 ./osu-micro-benchmarks-6.1/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
# OSU MPI Latency Test v6.1
# Size          Latency (us)
0                      31.28
1                      27.95
2                      26.76
4                      28.75
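The bandwidth test lives in the same directory if you want to measure that too:

$ mpirun -np 2 --host glocluster1,glocluster2 ./osu-micro-benchmarks-6.1/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw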
Create a shared file system
Using clush to copy your binaries and files to your home directory on every compute node is really tedious. An easier way to manage data is to put this shared data in a shared network file system. Here we show how to do this the easy way (just export a directory from your head node) and the cool way (mount an object container using NFS).
In both cases, first install the NFS client and server packages and create the directory where we'll mount this NFS volume (/scratch):
# install the nfs client and server on all nodes
$ clush -w glocluster[0-3] sudo apt -y install nfs-common nfs-kernel-server

# create the shared mountpoint on all nodes
$ clush -w glocluster[0-3] sudo mkdir /scratch
Exporting from a node
The simplest way to share data via NFS is by exporting a directory from one of your nodes (like the main node) and mounting it across all nodes. This is not scalable, but it's good enough for messing around.
First, create a local directory on our primary node called /shared, with the goal of mounting it everywhere as the /scratch directory:
# make the directory we're going to share and make sure we can write to it
$ sudo mkdir /shared
$ sudo chown glock:glock /shared

# share the directory - limit it to our subnet (10.0.*), since all our nodes have public IP addresses
$ sudo bash -c 'echo "/shared 10.0.*(rw,no_root_squash,no_subtree_check)" >> /etc/exports'
$ sudo exportfs -a
Remember that our main node's IP address is 10.0.0.7 (if you forget, just check using ip addr list). Now mount it everywhere:
$ clush -w glocluster[0-3] sudo mount -t nfs \
      -o vers=3,rsize=1048576,wsize=1048576,nolock,proto=tcp,nconnect=8 \
      10.0.0.7:/shared /scratch
Make sure it works:
$ touch /scratch/hello
$ clush -w glocluster[0-3] ls -l /scratch
glocluster0: total 0
glocluster0: -rw-rw-r-- 1 glock glock 0 Sep 25 23:24 hello
glocluster3: total 0
...
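If you want the mount to come back after a reboot, here's a sketch of the /etc/fstab line you'd push to every node (it assumes the head node keeps the 10.0.0.7 address):

10.0.0.7:/shared  /scratch  nfs  vers=3,rsize=1048576,wsize=1048576,nolock,proto=tcp,nconnect=8  0  0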
Using Blob NFS
Create a slow storage account:
$ az storage account create --name glockteststorage \
--resource-group glocktestrg \
--sku Standard_LRS \
--enable-hierarchical-namespace true \
--enable-nfs-v3 true \
--default-action deny
Or an SSD-backed storage account:
$ az storage account create --name glockteststorage \
--resource-group glocktestrg \
--kind BlockBlobStorage \
--sku Premium_LRS \
--enable-hierarchical-namespace true \
--enable-nfs-v3 true \
--default-action deny
Blob NFS would be super insecure if left open: it uses NFSv3, which relies on network-level access control. That's decidedly un-cloudlike, since storage accounts are otherwise meant to be accessible from the public Internet. This is why we set --default-action deny, which disallows all traffic by default so we can then allow traffic only from specific subnets. Let's add ours.
You will need to get the name of the subnet on the vnet that was created alongside your VMs:
$ az network vnet subnet list --resource-group glocktestrg --vnet-name gloclusterVNET -o table
AddressPrefix    Name              PrivateEndpointNetworkPolicies    PrivateLinkServiceNetworkPolicies    ProvisioningState    ResourceGroup
---------------  ----------------  --------------------------------  -----------------------------------  -------------------  ---------------
10.0.0.0/24      gloclusterSubnet  Disabled                          Enabled                              Succeeded            glocktestrg
# enable the service endpoint for Azure Storage on our subnet
$ az network vnet subnet update --resource-group glocktestrg \
      --vnet-name gloclusterVNET \
      --name gloclusterSubnet \
      --service-endpoints Microsoft.Storage

# add a network rule to the storage account allowing access from our vnet
$ az storage account network-rule add --resource-group glocktestrg \
      --account-name glockteststorage \
      --vnet-name gloclusterVNET \
      --subnet gloclusterSubnet

# confirm that access is now allowed
$ az storage account network-rule list --resource-group glocktestrg \
      --account-name glockteststorage \
      --query virtualNetworkRules
You will probably also have to add your home IP address (or whatever IP address Azure will see when you try to connect directly to it; try using the address that shows up when you run who on your cluster) as a rule:
$ az storage account network-rule add --resource-group glocktestrg \
--account-name glockteststorage \
--ip-address 136.24.220.92
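If you aren't sure what that address is, one quick way to check from your laptop (ifconfig.me is just one of several such echo services):

$ curl -s https://ifconfig.me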
Then you can create a storage container which will also act as your NFS export:
$ az storage fs create --name glocktestscratch \
      --account-name glockteststorage \
      --auth-mode login
That --auth-mode bit uses the same credentials the rest of your az commands are using. It's there because, unlike the other az commands we've been running, az storage fs create is actually talking to the Azure Storage data plane and using the same interface you'd use if you were reading or writing data.
Anyway, with this container now created, get our service endpoint:
$ az storage account show --resource-group glocktestrg \
--name glockteststorage \
--query primaryEndpoints
{
"blob": "https://glockteststorage.blob.core.windows.net/",
"dfs": "https://glockteststorage.dfs.core.windows.net/",
"file": "https://glockteststorage.file.core.windows.net/",
"internetEndpoints": null,
"microsoftEndpoints": null,
"queue": "https://glockteststorage.queue.core.windows.net/",
"table": "https://glockteststorage.table.core.windows.net/",
"web": "https://glockteststorage.z13.web.core.windows.net/"
}
Then from our cluster, we can mount it:
$ clush -w glocluster[0-3] sudo mount -t nfs \
      -o vers=3,rsize=1048576,wsize=1048576,hard,nolock,proto=tcp,nconnect=8 \
      glockteststorage.blob.core.windows.net:/glockteststorage/glocktestscratch /scratch
The container gets mounted as owned by nobody and cannot be accessed by regular users, but we can open it up as root and create a user directory in there:
$ sudo chmod o+rx /scratch
$ sudo mkdir /scratch/glock
$ sudo chown glock:glock /scratch/glock
Now we can work in it as a regular user:
$ touch /scratch/glock/hello
$ clush -w glocluster[0-3] ls -l /scratch/glock/hello
glocluster0: -rw-rw-r-- 1 glock glock 0 Sep 26 00:14 /scratch/glock/hello
glocluster1: -rw-rw-r-- 1 glock glock 0 Sep 26 00:14 /scratch/glock/hello
Deleting public IPs of non-primary nodes
Since each public IP costs money, you can delete the ones you don't need once everything is set up. Note we do NOT start at node 0; we need that public IP to connect in!
$ for i in {1..3}; do
      az network nic ip-config update --resource-group glocktestrg \
          --name ipconfigglocluster${i} \
          --nic-name gloclusterVMNic${i} \
          --remove PublicIpAddress
      az network public-ip delete --resource-group glocktestrg \
          --name gloclusterPublicIP${i}
  done
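Afterwards, re-run the IP listing to confirm that only glocluster0 still has a public address:

$ az vm list-ip-addresses --resource-group glocktestrg -o table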