How do I know when my VM is ready to connect?

Previously, we looked at how Windows VM initialization works on Google Cloud and why it takes around two minutes before a Windows VM is ready. Because VM initialization takes its time, gcloud and the Compute Engine APIs do not block until VM initialization is complete – instead, they return (almost) immediately.

This asynchronous behavior can create a challenge if you try to automate VM creation. In an automation script, you might need to know when the initialization has completed so that you can connect to the VM or initiate the next deployment steps. There are a few ways to determine when a VM is ready, so let us explore what these are.

Liveness probe

The first way to test if a VM is ready is to not worry about the VM itself, but to wait for the deployed application to become ready.

If the purpose of the VM is to run a certain application or service, then a common practice is to use a specialize script or a startup script to automatically install the application and to start it on boot. If the application happens to be a TCP server, then one way to test if the application is ready is to simply try to connect to the respective TCP port. If a connection can be established, the application (and thus, the VM) must be ready; if the connection fails, you wait a few seconds and try again.
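The connect-and-retry loop described above can be sketched roughly as follows. This is only an illustration: the function names, the attempt count, and the delay are made up for this example, and the probe assumes that bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available on the machine running the script.

```shell
# Sketch of a TCP readiness probe. Function names and defaults are
# hypothetical; assumes bash and coreutils "timeout" are installed.

# Succeeds if a TCP connection to host:port can be opened within 3 seconds.
tcp_port_open() {
    local host="$1" port="$2"
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2> /dev/null
}

# Retries the probe up to <attempts> times, sleeping <delay_s> seconds
# between attempts. Returns 0 once the port accepts a connection, else 1.
wait_for_tcp_port() {
    local host="$1" port="$2" attempts="${3:-60}" delay_s="${4:-5}"
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        if tcp_port_open "$host" "$port"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay_s"
    done
    return 1
}

# Example usage (the IP address and port are placeholders):
#   wait_for_tcp_port 10.128.0.5 8080 || echo "application did not come up" >&2
```

Capping the number of attempts keeps a deployment script from hanging forever if the application never starts.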

If the application happens to be an HTTP server, you can increase the reliability and accuracy of the approach by invoking the server’s health check endpoint instead of merely probing the TCP port.
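For an HTTP server, a similar loop could call the health check endpoint with curl. Again, this is a sketch under assumptions: the function name, the defaults, and the /healthz path in the usage example are hypothetical, and your application might expose its health check elsewhere.

```shell
# Sketch of an HTTP health check probe. The function name and defaults
# are hypothetical; assumes curl is installed.

# Polls <url> until it answers with a successful HTTP status.
# Returns 0 on success, or 1 after <attempts> failed tries.
wait_for_http_ok() {
    local url="$1" attempts="${2:-60}" delay_s="${3:-5}"
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        # --fail makes curl exit non-zero for HTTP status codes >= 400,
        # --max-time bounds each request so a hung server cannot block us.
        if curl --fail --silent --output /dev/null --max-time 3 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay_s"
    done
    return 1
}

# Example usage (address and path are placeholders):
#   wait_for_http_ok "http://10.128.0.5:8080/healthz"
```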

A key advantage of this approach is that it is rather straightforward to implement and that it measures what we are actually interested in, i.e. that the application is ready. It also happens to mirror what Kubernetes does with its liveness probes.

Unfortunately, there are situations where this approach either does not work at all or requires you to compromise on security. For a deployment script to perform a liveness probe, it must be able to connect directly to the respective TCP port of the VM, and that is often not possible. If the service is exposed over the internet and you perform the check from outside the VPC, then odds are that there is a load balancer between your deployment script and the VM. With a load balancer in between, you can no longer be sure that it is really the right VM that is receiving your requests.

If you perform the checks from inside the VPC, then load balancers are not a problem. But within the VPC, probes are subject to firewall rules. First, that means that you might have to create or relax certain firewall rules to allow a deployment script to communicate directly with the respective VMs. Second, if a probe fails, you cannot be quite sure whether the failure occurred because the application is not ready or whether it is because there is something wrong with your firewall rules.

Probing the RDP port

Not every VM runs an HTTP server or application that could be probed using the liveness probe approach. But every Windows VM (by default, anyway) runs Remote Desktop – so as a variation of the previous approach, you can simply probe port 3389 and use that as an indicator for whether the VM is ready.

Again, a key advantage of this approach is that it is rather straightforward to implement and does not require any special configuration on the VM itself. But there are a few extra complications that apply to RDP.

First, there is the security concern of exposing RDP to the internet. RDP does not have a particularly strong security record, so the best practice is to allow RDP connections only from within your VPC. To connect from outside, you can either use Cloud IAP TCP tunneling or you can connect your on-premises network to the VPC by using Cloud VPN or Interconnect.

Probing the RDP port over Cloud IAP TCP tunneling is possible, but not easy, because it requires extra tooling to establish a tunnel. In practice, the only straightforward option is to probe the RDP port from within your VPC.

Second, Remote Desktop starts accepting connections before startup scripts are guaranteed to have been completed. If you rely on startup scripts, then this behavior might cause you to assume that a VM instance is ready before it is actually ready.

Observing the serial port

The limitations of using liveness probes or probing the RDP port are ultimately caused by the fact that these approaches rely on the data plane to determine the state of the VM instance.

One way to observe the state of a VM instance by using the control plane is to read its serial port output. All boot messages and output generated by the guest environment are emitted to serial port 1, which you can read by running the following command:

gcloud compute instances get-serial-port-output my-instance

When a Windows VM is ready, it writes the following message to the log:

2020/02/28 12:50:14 GCEInstanceSetup: ------------------------------------------------------------
2020/02/28 12:50:14 GCEInstanceSetup: Instance setup finished. my-instance is ready to use.
2020/02/28 12:50:14 GCEInstanceSetup: ------------------------------------------------------------

You can therefore also determine whether a Windows VM is ready by repeatedly reading the serial port output (using the instances.getSerialPortOutput API) until you encounter the message above.
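Using gcloud instead of calling the API directly, such a polling loop might look like the sketch below. The function names and the 5-second delay are hypothetical, and the loop matches on the message text shown above, which is exactly the implementation detail this approach depends on.

```shell
# Sketch of polling the serial port output for the ready message.
# Function names are hypothetical; assumes gcloud is installed and
# authenticated.

# Succeeds if the text on stdin contains the guest environment's
# "Instance setup finished" message.
has_ready_marker() {
    grep -q "GCEInstanceSetup: Instance setup finished"
}

# Re-reads serial port 1 every 5 seconds until the message appears.
# Note that this loop has no upper bound on the waiting time.
wait_for_serial_marker() {
    local instance="$1" zone="$2"
    until gcloud compute instances get-serial-port-output "$instance" \
            --zone="$zone" --port=1 2> /dev/null | has_ready_marker
    do
        sleep 5
    done
}

# Example usage (instance name and zone are placeholders):
#   wait_for_serial_marker my-instance us-central1-a
```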

A key advantage of this approach is that it is more accurate than the previous two approaches. Once the message appears, all startup scripts have completed, so the VM is definitely ready.

However, a key disadvantage of relying on the "Instance setup finished" message is that it is an implementation detail of the Windows guest environment that could change at any time.

Guest attributes

Arguably the best and most reliable way to determine when a VM instance is ready is to use guest attributes. You can think of guest attributes as being a complement to instance metadata: Instance metadata is essentially input to the VM and is commonly used to pass parameters to applications deployed on a VM. Guest attributes are more like output – code running on the VM can publish guest attributes that code running outside the VM (such as an automation script) can read and consume.

Using a guest attribute to determine when a VM is ready is a three-step process:

  1. Start the VM with guest attribute support enabled.
  2. Run a startup script that sets a certain guest attribute. On Linux, you can use curl for that:

    curl -X PUT \
        --data "true" http://metadata.google.internal/computeMetadata/v1/instance/guest-attributes/vm/ready \
        -H "Metadata-Flavor: Google"
    

    On Windows, use PowerShell:

    Invoke-RestMethod `
        -Headers @{'Metadata-Flavor'='Google'} `
        -Method PUT `
        -Uri "http://metadata.google.internal/computeMetadata/v1/instance/guest-attributes/vm/ready" `
        -Body "true"
    
  3. In the automation script, poll until the guest attribute shows up:

    until gcloud compute instances get-guest-attributes instance-1 \
        --zone=$(gcloud config get-value compute/zone) \
        --query-path=vm/ready > /dev/null 2>&1
    do
        sleep 5 && echo waiting for VM to boot...
    done
    
    

Running user-provided startup scripts is the very last thing that happens during VM initialization, so you can be sure that VM initialization has completed when the guest attribute is visible.
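For step 1, guest attributes are enabled by setting the enable-guest-attributes metadata key to TRUE when you create the VM. The sketch below shows that in the usage comment, together with a variation of the polling loop from step 3 that gives up after a deadline. The helper names, the instance name, the zone, and the timeout values are placeholders, and the image and startup script flags are left out.

```shell
# Sketch: poll for the vm/ready guest attribute with an upper time limit.
# Helper names are hypothetical; assumes gcloud is installed and
# authenticated.

# Runs the given command every <interval_s> seconds until it succeeds.
# Returns 1 if it still fails once <timeout_s> seconds have passed.
wait_until() {
    local timeout_s="$1" interval_s="$2"
    shift 2
    local deadline=$(( $(date +%s) + $timeout_s ))
    until "$@" > /dev/null 2>&1; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval_s"
    done
    return 0
}

# Succeeds once the vm/ready guest attribute can be read.
vm_ready() {
    gcloud compute instances get-guest-attributes "$1" \
        --zone="$2" \
        --query-path=vm/ready
}

# Example usage (names and zone are placeholders):
#   gcloud compute instances create my-instance \
#       --zone=us-central1-a \
#       --metadata=enable-guest-attributes=TRUE
#   wait_until 600 5 vm_ready my-instance us-central1-a \
#       || echo "VM did not become ready in time" >&2
```

Failing after a deadline lets the automation script report a broken deployment instead of waiting indefinitely.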

Using guest attributes has all of the advantages of observing the serial port – it only relies on control plane features and is very accurate. But unlike the previous approach, you are not relying on undocumented behavior.
