How to Recover Canary Workflows During a Cadence® Version Upgrade

The Cadence® Canary tool regularly runs health checks against Cadence services, including the frontend, matching, history, and worker services. However, while testing an upgrade of a Cadence cluster from version 0.22.4 to 0.24.0, we found that Canary workflows failed after the upgrade.

This article describes what we found and explains how to recover Canary workflows when upgrading Cadence to versions up to 0.24.0.

Step 1: Update Database Schema 

According to Cadence’s official guidelines, the first step in upgrading Cadence servers is to update the database schemas. Our managed Cadence clusters depend on managed Cassandra clusters, so this guide focuses on the schema update procedure for Cassandra. The updates are performed with the Cadence Cassandra tool, and the steps below cover schema updates up to Cadence version 0.24.0.

To update the schema of the default keyspace, first, install the Cadence Cassandra tool. Then execute the command below: 
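As a rough illustration of this step, the sketch below shows one way the schema update might be invoked. The endpoint `127.0.0.1`, the keyspace name `cadence`, and the schema path `./schema/cassandra/cadence/versioned` are assumptions based on Cadence's default layout, and the schema directory is assumed to come from a checkout matching the target release. The `run` helper and the `DRY_RUN` switch are illustrative additions: the command is only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Assumed values; replace with your own Cassandra endpoint and schema checkout.
CASSANDRA_ENDPOINT="${CASSANDRA_ENDPOINT:-127.0.0.1}"
KEYSPACE="${KEYSPACE:-cadence}"
SCHEMA_DIR="${SCHEMA_DIR:-./schema/cassandra/cadence/versioned}"

# Illustrative helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Apply the versioned schema changes found in SCHEMA_DIR to the default keyspace.
update_default_schema() {
  run cadence-cassandra-tool --ep "$CASSANDRA_ENDPOINT" -k "$KEYSPACE" \
    update-schema -d "$SCHEMA_DIR"
}

update_default_schema
```

If your Cassandra cluster requires authentication or TLS, the tool needs additional connection options that are not shown here.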

For upgrading the schema associated with the visibility keyspace, use the following command: 
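The visibility keyspace can be handled the same way. The following is a sketch only, assuming the keyspace is named `cadence_visibility` and the schema files live under `./schema/cassandra/visibility/versioned`; both are defaults that may differ in your deployment. As before, the `run` helper is an illustrative addition that prints the command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Assumed values; replace with your own Cassandra endpoint and schema checkout.
CASSANDRA_ENDPOINT="${CASSANDRA_ENDPOINT:-127.0.0.1}"
VISIBILITY_KEYSPACE="${VISIBILITY_KEYSPACE:-cadence_visibility}"
VISIBILITY_SCHEMA_DIR="${VISIBILITY_SCHEMA_DIR:-./schema/cassandra/visibility/versioned}"

# Illustrative helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Apply the versioned schema changes to the visibility keyspace.
update_visibility_schema() {
  run cadence-cassandra-tool --ep "$CASSANDRA_ENDPOINT" -k "$VISIBILITY_KEYSPACE" \
    update-schema -d "$VISIBILITY_SCHEMA_DIR"
}

update_visibility_schema
```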

Step 2: Upgrade Cadence Server Version 

After the schema update, the next step is to deploy the new version of the Cadence server to all nodes in the cluster. On our managed platform, this is done through an automated rolling restart, which uses Docker to fetch the latest Cadence image and restarts the Cadence services on each node.
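Our platform automation is not reproduced here, but the per-node part of a rolling restart can be sketched roughly as follows. The image name `ubercadence/server:0.24.0` and the systemd unit name `cadence` are assumptions for illustration; substitute whatever image registry and service manager your deployment actually uses. The `run` helper prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Assumed image reference and hypothetical service unit name.
CADENCE_IMAGE="${CADENCE_IMAGE:-ubercadence/server:0.24.0}"
CADENCE_UNIT="${CADENCE_UNIT:-cadence}"

# Illustrative helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Fetch the new image, then restart the Cadence service on this node.
upgrade_node() {
  run docker pull "$CADENCE_IMAGE"
  run systemctl restart "$CADENCE_UNIT"
}

upgrade_node
```

In a rolling restart, these steps run on one node at a time, and the next node is only touched once the previous one is healthy again.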

Once the Cadence server version rollout is complete, you may notice that Cadence Canary metrics are no longer reported. The following graph illustrates the count per second for the metric:

Looking at the logs of the Canary service, you may find that it is failing with the error of

Step 3: Cancel/Terminate Open Canary Workflows 

To address this error, first stop the Canary service on each node within the Cadence cluster. Depending on your setup, you may want to use

or your selected service manager to stop the Canary service. 
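For example, on a node where the Canary runs under systemd, stopping it might look like the sketch below. The unit name `cadence-canary` is hypothetical; use the name your own service manager knows the Canary by. The `run` helper prints the command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Hypothetical unit name for the Canary service.
CANARY_UNIT="${CANARY_UNIT:-cadence-canary}"

# Illustrative helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Stop the Canary so it cannot start new workflows during cleanup.
stop_canary() {
  run systemctl stop "$CANARY_UNIT"
}

stop_canary
```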

Next, cancel or terminate all open Canary workflows using the Cadence CLI tool.

To list all open Canary workflows, use the command below. It will produce a list of workflow IDs prefixed with

as depicted in the subsequent screenshot. 
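A minimal sketch of the listing step with the Cadence CLI, assuming the Canary runs in a domain named `cadence-canary` (a common default, but check your Canary configuration) and that the CLI can reach the frontend at its default address. The `run` helper prints the command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Assumed Canary domain name.
CADENCE_DOMAIN="${CADENCE_DOMAIN:-cadence-canary}"

# Illustrative helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# List open (still running) workflow executions in the Canary domain.
list_open_canary_workflows() {
  run cadence --domain "$CADENCE_DOMAIN" workflow list --open
}

list_open_canary_workflows
```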

Proceed to cancel each Canary workflow individually using: 
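When there are many open workflows, cancelling them one by one can be wrapped in a small loop. The sketch below assumes the same `cadence-canary` domain and reads workflow IDs from standard input, one per line; the IDs shown are placeholders, not real Canary workflow IDs. The `run` helper prints each command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Assumed Canary domain name.
CADENCE_DOMAIN="${CADENCE_DOMAIN:-cadence-canary}"

# Illustrative helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Read workflow IDs from stdin, one per line, and request cancellation for each.
cancel_workflows() {
  while IFS= read -r wid; do
    [ -n "$wid" ] || continue   # skip blank lines
    # Detach stdin so the CLI cannot consume the remaining IDs.
    run cadence --domain "$CADENCE_DOMAIN" workflow cancel -w "$wid" </dev/null
  done
}

# Placeholder IDs; in practice, feed in the IDs returned by the list command.
printf '%s\n' example-workflow-id-1 example-workflow-id-2 | cancel_workflows
```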

Should there be any persistent workflows of the type

they may require termination:
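Termination with the Cadence CLI can be sketched along the same lines. The domain name `cadence-canary`, the placeholder workflow ID, and the default reason text are all assumptions for illustration. The `run` helper prints the command unless `DRY_RUN=0` is set; note that the printed form does not show the quoting around the reason.

```shell
#!/usr/bin/env bash
# Assumed Canary domain name.
CADENCE_DOMAIN="${CADENCE_DOMAIN:-cadence-canary}"

# Illustrative helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Forcefully terminate one workflow; the second argument is an optional reason.
terminate_workflow() {
  wid="$1"
  reason="${2:-canary recovery after upgrade}"
  run cadence --domain "$CADENCE_DOMAIN" workflow terminate -w "$wid" --reason "$reason"
}

# Placeholder ID; use an ID returned by the list command.
terminate_workflow example-workflow-id
```

Unlike cancellation, termination does not give the workflow a chance to clean up, so it is best reserved for workflows that do not respond to a cancel request.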

Step 4: Restart Canary Workflows

Once all open Canary workflows have been canceled or terminated, which you can confirm when the list command returns no results, it is safe to restart the Canary service on each node.
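One way to make this check explicit is to guard the restart, as in the sketch below. It assumes the operator passes in the number of workflows still shown by the list command, and it uses a hypothetical systemd unit named `cadence-canary`. The `run` helper prints the command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Hypothetical unit name for the Canary service.
CANARY_UNIT="${CANARY_UNIT:-cadence-canary}"

# Illustrative helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Start the Canary only when the supplied count of open workflows is zero.
start_canary_if_clear() {
  open_count="$1"
  if [ "$open_count" -ne 0 ]; then
    echo "refusing to start: $open_count Canary workflow(s) still open" >&2
    return 1
  fi
  run systemctl start "$CANARY_UNIT"
}

# Example: the list command returned no open workflows.
start_canary_if_clear 0
```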

After this restart, the Canary service should recover from its failed state and the Canary metrics should resume, as demonstrated in the graph below:

Conclusion 

Upgrading Cadence requires careful attention to detail, especially when it comes to maintaining the integrity of essential workflows. By following the steps outlined above, you can ensure that your Canary workflows are recovered and continue to provide valuable insights into the health and performance of your Cadence cluster after the version upgrade from 0.22.4 to 0.24.0. 

If you prefer to bypass the complexity of managing these processes yourself, consider Instaclustr’s Managed Cadence service. Sign up today and experience it yourself with a free trial!