To RTOS or Not To RTOS?

To use an RTOS for your embedded project, or Not! That is the question poor Yorick! I digress from my usual focus on Z-Wave to discuss the general topic of using a Real-Time Operating System (RTOS) for simple embedded IoT devices. The question is moot for Z-Wave since the protocol has FreeRTOS built-in starting with the release of the 700 series. For the moment at least, the choice is To RTOS!

What is an RTOS?

My focus in this post is on small IoT devices, from sensors, dimmers and window shades to more complex devices like thermostats and door locks. Using an RTOS for simple devices like these brings different requirements than, say, a full Operating System like Windows or Linux. The purpose of an Operating System (OS) is to provide common resources to an application – things like memory management and insulating the application from the hardware. The term “Real-Time” comes from the basic concept of dividing up the resources of an embedded system so that tasks are completed within a certain timeframe. A hard-real-time system is often used in demanding applications like Engine Control. The precise management of firing the spark plugs at exactly the proper microsecond is critical to the efficient operation of an internal combustion engine. But simple IoT devices place much lower demands on the RTOS and are instead attracted to its coding efficiency and standardization – this is often called a soft-RTOS. All this comes at a cost in CPU and memory resources so the question remains – is an RTOS worth it for simple IoT devices?

  • FreeRTOS Features:
    • Trusted Reliable Kernel
    • MultiTasking/MultiThreaded
    • Mailboxes, Mutexes, Queues
    • Modular Libraries
    • Broad Eco-System support – 40+ MCU architectures
    • Small Scalable kernel size with power saving modes
    • Complete online documentation
    • Long Term Stable Support – Active support Community
    • Completely Free Open Source project

Z-Wave History with FreeRTOS

In the beginning Z-Wave ran on an 8-bit MCU with limited FLASH and RAM which meant life without an RTOS due to CPU performance and memory limitations. The Z-Wave protocol was built on “Bare Metal” and thus interrupt driven with a tick-timer and drivers to provide basic services. The 700 series opened the world of a 32-bit RISC MCU and significantly more memory which enabled the use of an RTOS as the foundation of the Z-Wave protocol.

I was a Field Applications Engineer for Silicon Labs for several years and in that time I would guess easily half the bugs I came across were caused by the complexity of the RTOS. I don’t have any hard statistics but it certainly seemed that way to me! The Z-Wave protocol code was ported from a Bare-Metal implementation on an 8-bit CPU to a 32-bit ARM running FreeRTOS – a challenging port to say the least! The developers treated FreeRTOS like a black-box (which is the whole point of an RTOS) and often made small mistakes that turned into really difficult to debug problems. Things like: not checking when a queue is full, not using the *FromISR() version of various calls inside interrupt service routines, hidden stack overflows by not enabling overflow checking, incomplete configuration of the many, many, many options just to name a few. An RTOS adds a LOT of complexity but you get a lot of features. The developers have to be fully trained and understand the best practices for using the complexity of the RTOS to achieve a robust system.

My primary complaint with the current implementation is that it continues to be pre-compiled into the Z-Wave library. More and more of the configuration files and various parts of FreeRTOS have been moved out of the library and into source code with each SDK release. Moving the entire RTOS into source form is not exposing any proprietary code – after all, it’s open source! It would allow developers to more quickly move to newer releases of the RTOS and related libraries. Perhaps this will come as part of the Open Source Work Group (OSWG) in the Z-Wave Alliance. We’ll have to wait and see…

The Case FOR an RTOS – Pros

I want to again note that I am talking about using an RTOS for small IoT devices. There are many other applications and environments for an RTOS which have different Pros/Cons. A few of the main features of an RTOS for IoT are:

  • MultiTasking, MultiThreading, Mutexes, Mailboxes, Queues
  • Priority Based Resource Scheduling
  • Standardized Resource Drivers
  • Modularity & Code Reuse
  • Security Features

The Case AGAINST an RTOS – Cons

The complexity and bug rate introduced by an RTOS unfortunately can’t be measured quantitatively. I contend that in the case of Z-Wave the complexity has outweighed the benefits. The “features” of an RTOS lead to its complexity. For one task to communicate with another, you need to set up queues in both directions. That’s a lot of code and RAM where a simple handshake would most likely do the job as was done in the Bare Metal days.

  • Complexity
  • Resource Usage – CPU, FLASH, RAM
  • Development Tools
  • Training of developers

Final Thoughts

Simple devices like light switches, sensors, window shades, and the like barely need an RTOS. These simple devices rarely need multiple tasks or the other features compared to the complexity added. More complex devices like thermostats and door locks often have a high performance application CPU where even more resources are available for things like OLED screen drivers and fingerprint readers. In this case, the Z-Wave chip is relegated to a minor role of just providing wireless connectivity which again does not need an RTOS. All that being said, the current Z-Wave protocol is fundamentally based on FreeRTOS so the To RTOS or Not To RTOS question has already been settled – To RTOS we go!

One final point on code reuse – I find Code Reuse to be a double-edged sword. On the one hand, the name sounds very attractive – code once, use many times. The reality is that most code is not reusable and in the effort to make it modular, more bugs are introduced than are saved. In many cases I can write a function in a fraction of the lines of code compared to the “driver” that does it all for every flavor of chip. There are many research papers reporting that the number of bugs per line of code is fairly constant. So the fewer lines of code, the fewer bugs. The fewer lines of code, the easier it is to read and to test. That is not to say that all reusable code is bad, and certainly code that has been extensively tested in many ways is super valuable, but every engineer needs to make that judgement for their specific application. That’s why you get paid the big bucks!

Detecting RF Jamming

Cheap RF Transmitter

All wireless protocols can be jammed, often using an inexpensive battery powered transmitter. The protocol doesn’t even have to be radio frequency (RF) based as Infra-Red (IR) and any other communication medium that travels thru the air can be jammed by blasting out noise in the same spectrum as the protocol. Think of a busy street corner where you and a friend are having a conversation and a firetruck with its sirens blaring goes by. Your conversation stops because your friend simply can’t hear you above all the noise. The same thing can happen in Z-Wave where a “bad actor” brings a small battery powered transmitter and blasts out RF in the same frequency bands that Z-Wave uses. In this post I’ll explain how to jam Z-Wave and also how to detect and inform the user that jamming has occurred.

Security System Requirements

Jamming applies primarily to security systems. After all, if someone wants to jam your house to keep the kitchen lights from turning on at night, what’s the point other than to get a laugh when you bang your knee into the table? Z-Wave has enjoyed a great deal of success in the security system market. Z-Wave is interoperable, easy to use and low-power, and the mesh networking protocol means users or installers don’t have to be concerned with getting everything to talk to everything else as the protocol automatically handles (mostly) everything. Security systems however are very concerned about jamming, to the point that Underwriters Laboratories has a specification for it. UL1023 is the US safety standard for Household Burglar-Alarm Systems.

The reality of the situation for a security system is that it is unlikely a burglar will try to bypass your security system by jamming it. Burglars are simply not that tech savvy. The FBI doesn’t even track the numbers of burglaries via jamming – one would assume because the number is essentially zero. A burglar will simply bash in a window or door or more often simply walk in an unlocked door. However, if it’s easy enough and cheap enough, a burglar might just try! CNET demonstrated just how easy it is to bypass a popular security system using a $3 RF transmitter. Regardless of the reality of the situation, the bad press of having an easy to jam security system can crater a company.

Anti-Jamming Techniques in Z-Wave

Z-Wave was designed from day one to be robust and reliable. The very first requirement for robustness is to acknowledge that the device receiving the message did in fact receive it. Every Z-Wave message is acknowledged (ACK), otherwise the sender will try again using different mesh routes or other RF frequencies. After several retries, the protocol will give up and the application can then decide if it wants to try even more ways to deliver the message. If the message is not very important (like a battery level report), the application can just drop it. But if a sensor detects smoke, the application will continue trying to get this life-safety message thru in every way possible for as long as possible.

Z-Wave requires two-way communication – all messages are acknowledged

Here’s a list of the techniques Z-Wave uses for robustly delivering messages:

  • Z-Wave
    • All frames are Acknowledged
    • Multiple mesh routes
    • Frequency Hopping – Two frequencies – 3 different baud rates (in US)
    • RSSI Measurements indicating jamming
    • Supervision CC confirms decryption & data integrity
  • Z-Wave Long Range
    • All frames are Acknowledged
    • Dynamic TX Power
    • Frequency hopping to alternate channel
    • RSSI Measurements indicating jamming
    • Supervision CC confirms decryption & data integrity

Even with all these different measures in place, it is still possible to jam Z-Wave. But it’s not cheap nor is it easy. But let’s give it a try for fun!

Jamming Z-Wave

Jamming Z-Wave starts with a Silicon Labs Z-Wave Developers Kit and Simplicity Studio. However, these kits are not cheap costing at least $150 for just one. It may be possible to find a cheap 900MHz transmitter but you will need two of them and they must have the ability to tune them to the specific Z-Wave frequencies of 908.4MHz and 916MHz in the US. These are not going to be $3 battery powered transmitters and they require a significant amount of technical knowledge. Neither cheap nor easy so I think we’re pretty safe from your typical burglar.

Z-Wave uses two channels (frequencies) in the US: 908.4MHz for 9.6 and 40Kbps and 916MHz for 100Kbps. Z-Wave Long Range (ZWLR) also has two channels but uses spread-spectrum encoding which spreads the signal out across a band of frequencies centered at 912MHz and 920MHz. By using two channels Z-Wave is frequency agile which makes it harder to jam since you need two transmitters instead of just one. The spectrum analyzer plot below shows four DevKits blasting all 4 channels at once.

Z-Wave jamming all four frequencies – 912 & 920 are Z-Wave Long Range

Creating the jammer firmware utilizes the RailTest utility in Simplicity Studio V5. Select the DevKit in the Debug Adapters window, click on the Example Projects & Demos tab then check the Proprietary button. The only example project should be the “Flex (RAIL) – RAILtest application”. Click on Create and use the defaults. The default frequency will state it is 868 but ignore that as the Z-Wave modes are all built into RailTest and do not need to be configured. Once the project is created, click on Build and then download to a devkit. Right click on the devkit in the Debug Adapters window and click on Launch Console. Click on the Serial 1 tab then click in the command box at the bottom and press ENTER. You should get a RailTest prompt of >.

Once you're at the RailTest prompt, enter the following commands:

rx 0                 -- disables the radio which must be done before changing the configuration
setzwavemode 1 3     -- Puts the radio into Z-Wave mode
setpower 24 raw      -- 24=0dBm radio transmit power - valid range is 1 to 155 but is non-linear
setchannel 0         -- ch0=916 ch1=908.4 ch2=908.42 - ZWLR ch0=912 ch1=920
setzwaveregion 1     -- EU=0, 1=US, 13=US Long Range
Do one of the following 2 commands:
SetTxTone 1          -- narrow band Carrier Wave - unmodulated
SetTxStream 1        -- Pseudo-Random data - modulated and in ZWLR uses Spread Spectrum (DSSS) 
Use the same command with a 0 to turn the radio off
Remember to "rx 0" before changing any other configuration values

RAILtest is a powerful utility and can do all sorts of things beyond just Z-Wave. The radios in the Silicon Labs chips are Software Defined Radios, so they can be customized to many common frequency bands. It is easy to create customized versions of RAILtest that will transmit a carrier wave (CW) or a modulated signal at just about any frequency band, not just Z-Wave. But that’s more complex than I have time to discuss here.

Now that we know how to jam, how do we detect it and inform the user that jamming is taking place? Detecting jamming takes place at both ends of the Z-Wave network, the Controller and the End Device. Let’s first look into the End Device which in a security system is typically a motion sensor or a door/window sensor.

End Device Jamming Detection

Most end devices are battery powered so they spend most of their time sleeping and are completely unaware of any RF jamming that might be taking place. Only when motion is detected or a door is opened will the sensor wake up and find the radio waves being jammed. The best way to check for RF jamming is to first try to send a message. When the message fails to be acknowledged, then start looking to see if jamming is occurring.

The Z-Wave Application Framework (ZAF) handles sending the message and eventually calls a callback to report status. The callback comes through EventHandlerZwCommandStatus() which will be called several seconds after sending the message. The protocol tries various mesh routes, power levels and baud rates which takes time so be sure to stay awake long enough to receive the callback. The callback returns the TxStatus variable which is typically TRANSMIT_COMPLETE_OK (0x00) which means the message was delivered. But if jamming is taking place and the radio was unable to go through it, you’ll get a TRANSMIT_COMPLETE_FAIL (0x02). This status is different than the TRANSMIT_COMPLETE_NO_ACK (0x01) which means the message was not acknowledged which is usually because the destination is offline but could also be due to jamming.

The next step is to verify that jamming is taking place by getting the current Received Signal Strength Indicator (RSSI) level by queuing the EZWAVECOMMANDTYPE_GET_BACKGROUND_RSSI event. The RSSI is a simple value in dBm of the signal strength at the radio receiver when it’s not actively receiving a frame. In normal operation, this value should be around -100dBm. Every environment is different so the threshold for the radio being jammed needs to be a value that is significantly higher than the average value. This is particularly tough in dense housing like apartments where perhaps every unit has a Z-Wave network. This results in a relatively high RSSI average. The key here is you can’t use a simple hard-coded threshold for jamming detection based on RSSI. Instead you must average the RSSI values across a long time-span (typically hours).
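
As a rough illustration of the long-term averaging described above, here is a minimal sketch in C. The names (RssiTracker, rssi_tracker_update, rssi_is_jammed), the 1/16 smoothing factor and the 20 dB margin are all hypothetical choices for illustration and are not part of the Z-Wave SDK. The right values depend on how often you sample and on your environment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker: keeps a slow-moving average of the background RSSI. */
typedef struct {
    int32_t avg_x16;  /* running average in dBm, scaled by 16 for precision */
    bool    primed;   /* false until the first sample has been folded in */
} RssiTracker;

/* Assumed margin: how far above the average counts as jamming. */
#define RSSI_JAM_MARGIN_DB 20

/* Fold one measured background RSSI sample (in dBm) into the average.
 * The caller should skip the special codes 125..127 (not real measurements). */
void rssi_tracker_update(RssiTracker *t, int8_t sample_dbm)
{
    int32_t sample_x16 = (int32_t)sample_dbm * 16;
    if (!t->primed) {
        t->avg_x16 = sample_x16;
        t->primed = true;
        return;
    }
    /* Exponential moving average: each sample contributes 1/16 of its value. */
    t->avg_x16 += (sample_x16 - t->avg_x16) / 16;
}

/* Current average in whole dBm. */
int8_t rssi_tracker_average(const RssiTracker *t)
{
    return (int8_t)(t->avg_x16 / 16);
}

/* True when the current reading is well above the learned background level. */
bool rssi_is_jammed(const RssiTracker *t, int8_t current_dbm)
{
    return t->primed &&
           current_dbm > rssi_tracker_average(t) + RSSI_JAM_MARGIN_DB;
}
```

With a learned average of -100dBm, a reading of -95dBm would be treated as normal variation while -60dBm would be flagged. A moving average like this needs only a few bytes of RAM, which suits a battery powered sensor, but the tracker state has to survive deep sleep for the average to span hours.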

Z-Wave Notification of Jamming

The next step after detecting jamming has occurred is to notify the hub. But if the jamming is still in progress, how can the notification get thru? Naturally you can’t get thru while the jamming is still happening. The trick is to keep trying and hope that the jamming is short term. The problem is that a battery powered sensor can’t keep trying constantly as it will run out of battery power perhaps in just a few minutes. You must manage battery power and at the same time keep trying with a longer and longer timeout between attempts. At some point the jamming should end, perhaps hours after the initial break-in but the jammer will eventually run out of battery power.
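
The “longer and longer timeout” idea can be sketched as a capped exponential backoff. This is only an illustration: the function name and the 10 second and one hour limits are assumptions, and a real sensor would size them from its battery budget.

```c
#include <stdint.h>

#define RETRY_FIRST_DELAY_S 10u    /* assumed delay before the first retry */
#define RETRY_MAX_DELAY_S   3600u  /* assumed cap: never wait more than an hour */

/* Seconds to sleep before retry number `attempt` (0-based).
 * The delay doubles on each attempt until it reaches the cap. */
uint32_t jam_retry_delay_s(uint32_t attempt)
{
    uint32_t delay = RETRY_FIRST_DELAY_S;
    while (attempt > 0u && delay < RETRY_MAX_DELAY_S) {
        delay *= 2u;
        attempt--;
    }
    return (delay > RETRY_MAX_DELAY_S) ? RETRY_MAX_DELAY_S : delay;
}
```

The device would sleep for the returned number of seconds, wake, attempt delivery, and increment the attempt counter on each failure. Capping the delay keeps the sensor checking in at least once an hour even during a very long jamming event.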

The Z-Wave Notification Command Class has a pre-defined value for RF Jamming – Notification Type of Home Security (0x07) with an Event of RF Jamming (0x0C) and the current average RSSI level. This notification is a critical notification so it should be wrapped in Supervision Command Class to guarantee it has been delivered and understood by the controller.
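
To make the values above concrete, here is a sketch of how the Notification Report payload might be laid out in bytes. The field order follows my reading of the Notification CC (version 3 and later) report format and should be checked against the command class specification. The function name and the choice to carry the RSSI as a single signed byte in the event parameter are assumptions. In a real application the ZAF builds this frame for you, and the Supervision encapsulation is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COMMAND_CLASS_NOTIFICATION       0x71
#define NOTIFICATION_REPORT              0x05
#define NOTIFICATION_TYPE_HOME_SECURITY  0x07  /* from the text above */
#define NOTIFICATION_EVENT_RF_JAMMING    0x0C  /* from the text above */

/* Hypothetical helper: writes the report into buf.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t build_rf_jam_report(uint8_t *buf, size_t buf_len, int8_t avg_rssi_dbm)
{
    const uint8_t frame[] = {
        COMMAND_CLASS_NOTIFICATION,
        NOTIFICATION_REPORT,
        0x00,                    /* V1 Alarm Type (unused) */
        0x00,                    /* V1 Alarm Level (unused) */
        0x00,                    /* reserved */
        0xFF,                    /* Notification Status: enabled */
        NOTIFICATION_TYPE_HOME_SECURITY,
        NOTIFICATION_EVENT_RF_JAMMING,
        0x01,                    /* event parameter length: 1 byte */
        (uint8_t)avg_rssi_dbm    /* assumed encoding: RSSI as a signed byte */
    };

    if (buf_len < sizeof frame) {
        return 0;
    }
    memcpy(buf, frame, sizeof frame);
    return sizeof frame;
}
```

An average of -100dBm goes on the air as the byte 0x9C, which matches the two's complement coding the controller uses for its own RSSI readings.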

Sample Code

The code below first checks the TxStatus; if it is not OK, the RSSI level is checked by queuing the GET_BACKGROUND_RSSI event. Once the RSSI is sampled, the function will be called again with the switch going thru the GET_BACKGROUND_RSSI case below. This section of code then compares the current RSSI level with a background RSSI threshold and if the current level is above it then the SendRfJamNotificationPending global variable is set. When a frame is able to get thru then the pending RF Jam notification is sent since it appears the jamming has ended. This ensures the Hub is informed that there was jamming so the Hub can then decide if it needs to inform the user. The basics of the algorithm are coded here:

... 
static void EventHandlerZwCommandStatus(void)
...
switch (Status.eStatusType)    
...
    case EZWAVECOMMANDSTATUS_TX:  // callback from attempted message delivery
...
            
        if (pTxStatus->TxStatus != TRANSMIT_COMPLETE_OK) { // failed to deliver - check RSSI
            EZwaveCommandType event = EZWAVECOMMANDTYPE_GET_BACKGROUND_RSSI;
            QueueNotifyingSendToBack(g_pAppHandles->pZwCommandQueue, &event, 0); // Queue GET_RSSI
        } else { // message delivered OK
            // more cleanup happens here...
            if (SendRfJamNotificationPending) { // Is there a pending Jam Notification?
                SendRfJamNotificationPending = false;   // Send it!
                void *pData = PrepareNotifyJamReport(&zaf_tse_local_actuation);
                ZAF_TSE_Trigger((void *)CC_NotifyJam_report_stx, pData, true);
            }
        }
...
    case EZWAVECOMMANDSTATUS_GET_BACKGROUND_RSSI:  // only called if failed to deliver a message
        if (Status.Content.GetBackgroundRssiStatus.rssi > BackgroundRSSIThreshold) {
            // Set a global to send an RF Jamming Notification which will be sent when jamming ends
            SendRfJamNotificationPending = true;
            SendRfJamNotifRSSI = Status.Content.GetBackgroundRssiStatus.rssi;
        }
... // Not shown are application level retries and various other checking

Now that we have jamming detection enabled on the end-device side, let’s look at the controller end of the communication.

Controller Jamming Detection

Obviously the main thing the controller needs to do is react to a jamming notification from an End Device. The ultimate action the controller performs is left to the controller developer but clearly the end user should be notified that jamming has been detected. But that notification needs to be qualified with enough information about the average RSSI noise level to avoid false jamming detection notifications.

If the jammer is way out at 200+ meters, the RSSI level may not jump up significantly as measured by the controller. Thus, it is important to react to the End Device notification of jamming. However, the controller must also poll the RSSI level at regular intervals to determine if jamming is taking place nearby. The question is how often should it poll and when to react to a sudden change in the RSSI level? There is no definite answer to this question other than “it depends” and it depends on a lot of different factors. Typically, the RSSI should be sampled a few times per minute – perhaps every 30 seconds. If a value seems unusually high, perhaps sample several more times at a much faster rate to confirm that the RSSI has jumped and it’s not a glitch. Like the End Device case, the average RSSI value needs to be calculated across a fairly long time frame (minutes to perhaps an hour) and when there is a significant change from the average value then the user should be notified.

ZW_GetBackgroundRSSI

The SerialAPI function ZW_GetBackgroundRSSI() (0x3B) will return three or four bytes of RSSI values for the various channels supported by the controller. This function can be sent to the Z-Wave controller frequently as it does not cause any delays in the radio. It does use UART bandwidth so it can’t be called too frequently or it may interfere with normal Z-Wave traffic. The polling function should be coded with a low priority so it is only sent when the UART has been idle for a few seconds to avoid collisions with Z-Wave radio traffic. The one-byte RSSI values are coded as shown in the table below.

RSSI values returned by the ZW_GetBackgroundRSSI():

Hex       | Decimal (2s Comp) | Description
0x80-0xFF | -128 – -1         | Measured RSSI in dBm
0x00-0x7C | 0 – 124           | Measured RSSI in dBm
0x7D      | 125               | RSSI is below sensitivity and cannot be measured
0x7E      | 126               | Radio saturated and could not be measured as it is too high
0x7F      | 127               | RSSI is not available
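
A host application might decode each returned byte along the lines of the sketch below. The type and function names are hypothetical. Only the byte values come from the table above.

```c
#include <stdint.h>

/* Hypothetical classification of one RSSI byte from ZW_GetBackgroundRSSI(). */
typedef enum {
    RSSI_MEASURED,           /* value_dbm holds a real measurement */
    RSSI_BELOW_SENSITIVITY,  /* 0x7D */
    RSSI_SATURATED,          /* 0x7E: signal too strong to measure */
    RSSI_NOT_AVAILABLE       /* 0x7F */
} RssiKind;

typedef struct {
    RssiKind kind;
    int      value_dbm;      /* only meaningful when kind == RSSI_MEASURED */
} RssiReading;

RssiReading decode_background_rssi(uint8_t raw)
{
    RssiReading r = { RSSI_MEASURED, 0 };

    switch (raw) {
    case 0x7D: r.kind = RSSI_BELOW_SENSITIVITY; break;
    case 0x7E: r.kind = RSSI_SATURATED;         break;
    case 0x7F: r.kind = RSSI_NOT_AVAILABLE;     break;
    default:
        /* Two's complement: 0x80..0xFF map to -128..-1 dBm. */
        r.value_dbm = (raw >= 0x80) ? (int)raw - 256 : (int)raw;
        break;
    }
    return r;
}
```

For example, 0x9C decodes to -100dBm. Note that a saturated reading (0x7E) is itself a strong hint of a nearby transmitter and should not simply be discarded.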

Typically a 700 series Z-Wave controller will measure about -100dBm when the airwaves are fairly quiet. During a transmission the RSSI is often about -30dBm when the node is within a few meters of the controller.

TxStatusReport

The TxStatusReport is returned after a frame has been transmitted and includes several fields with a variety of RSSI measurements. There is a Noise Floor of the sender as well as a Noise Floor of the receiver. The RSSI values can be monitored during normal Z-Wave traffic without polling. It is best to use these values while Z-Wave traffic is taking place and to temporarily pause the polling while the Z-Wave UART is busy. Once the UART is idle, resume RSSI polling.

Missing Heartbeats

Another aspect of jamming is that battery powered devices typically send a “heartbeat” message every hour so the controller knows for sure the device is online and working (mostly that the battery isn’t dead). The controller should be keeping track of how long it has been since the last time a battery powered node has checked in and if it has missed two or at most three heartbeats, the controller should inform the user (or the installer) that the device is offline and unable to communicate. If the battery was already low, then the battery is probably dead. If the battery was fine, then there is a possibility that the device is being jammed.
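
The heartbeat bookkeeping on the controller could look something like the following sketch. The names are hypothetical, the hourly period and the limit of two missed heartbeats come from the paragraph above, and timestamps are assumed to be in seconds from any monotonic clock.

```c
#include <stdbool.h>
#include <stdint.h>

#define HEARTBEAT_PERIOD_S    3600u  /* hourly heartbeat, per the text */
#define MAX_MISSED_HEARTBEATS 2u     /* inform the user after two misses */

typedef enum {
    NODE_OK,
    NODE_BATTERY_LIKELY_DEAD,  /* overdue, and the battery was already low */
    NODE_POSSIBLY_JAMMED       /* overdue, but the battery was fine */
} NodeHealth;

/* Number of whole heartbeat periods elapsed since the node last checked in.
 * Unsigned subtraction keeps this correct across timer wraparound. */
uint32_t missed_heartbeats(uint32_t now_s, uint32_t last_seen_s)
{
    return (now_s - last_seen_s) / HEARTBEAT_PERIOD_S;
}

NodeHealth classify_node(uint32_t now_s, uint32_t last_seen_s,
                         bool battery_was_low)
{
    if (missed_heartbeats(now_s, last_seen_s) < MAX_MISSED_HEARTBEATS) {
        return NODE_OK;
    }
    return battery_was_low ? NODE_BATTERY_LIKELY_DEAD : NODE_POSSIBLY_JAMMED;
}
```

The controller would run classify_node() for each battery powered node on a slow timer and raise a user notification the first time a node leaves NODE_OK.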

Time vs. Clock vs. Time Parameters Command Classes Explained

Introduction

Door locks, thermostats and other Z-Wave devices often need to know at least the time and day of the week. In many cases they need to know the full date and time, for example to enable a lock User Code only while a renter’s code is valid or to set the thermostat into energy save mode. These devices need a way to determine the current date and time to within a few seconds of accuracy.

Z-Wave provides three different command classes (CC) for getting various parts of the date/time. Time Command Class is mandatory for all Gateways. Unfortunately, not all gateways support it yet, so most devices need to support one of the other command classes for use with older hubs. The question then is how is a device supposed to get the current date/time so the schedule can operate properly?

Time CC – Recommended

Time command class is described in SDS13782 (Z-Wave Management Command Class Specification). Time CC is mandatory for all Z-Wave Plus Gateways and thus is the recommended method for a device to set its clock to the current local date and time. Time CC Version 2 adds time zone and daylight saving time support if desired; however, V1 provides the necessary data in most cases.

The Z-Wave specification recommends having an association group to identify the time server node; however, the Gateway is expected to have an accurate time reference, so using the Lifeline is acceptable.

The Time CC does NOT have a date/time SET command. Thus, the hub cannot set the date/time and instead must wait for the device to GET it. When a device is included in a network, it must send a Time GET command within the first few minutes to accurately set its internal clock. The device should then periodically send a Time GET to ensure the internal clock remains accurate to the local time. Note that for certification purposes a device CONTROLs Time CC, it does not SUPPORT it. The Hub is required to SUPPORT Time CC.

Time Parameters CC – Optional

The Time Parameters command can SET/GET/REPORT the year, month, day, hour, minute & second of the UTC time. However, it does not set the time zone which must be done via the Time CC V2. Thus, Time Parameters CC relies on the hub to send the current UTC time but the device can also send a GET and adjust its internal clock to match the one from the hub. However, this requires support on the hub software which is not mandatory so not all hubs will be able to provide the current date/time.

Clock CC – NOT Recommended

Clock command class is sent by a Hub and can set the local weekday and time. Thus, it only supports a 7-day schedule since it cannot set the date, just the day of the week. Typically, the Hub would send a Clock Set as part of inclusion in the network. Since the clock on the device will drift, the device must periodically send a Clock Get to the Hub to maintain accurate time. This method is NOT recommended. However, on some old hubs this is the only method available.

Recommended Time Setting Algorithm

  1. Wait for Inclusion into a Z-Wave Network
  2. Wait for Security negotiation to complete
  3. Send a Time CC DATE GET
  4. Wait for a Time CC DATE REPORT for ~30s
  5. If DATE REPORT arrives, Send a Time CC TIME GET and wait for ~30s
    1. If the TIME REPORT arrives then the date/time is now set; use Time CC for future clock adjustments
    2. Exit the search for the local time
  6. If Time CC DATE REPORT times out:
    1. Retry 2 more times with random delay of a few minutes between each retry
  7. During steps 3-6, If a Time Parameters CC SET or a Clock CC REPORT is received, use those to update the date/time but if a Time CC report arrives use Time CC
  8. Send a Clock CC GET
    1. If a REPORT arrives within ~30s then use Clock CC GET for future date/time updates
  9. If Clock CC fails:
    1. Send Time Parameters CC GET to get the current date/time
  10. If those fail, there is no source for the current date/time, disable all scheduling features
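
The fallback order in the steps above boils down to a simple priority choice once the device knows which command classes the hub answered. The sketch below captures only that decision. The enum and function names are hypothetical, and the timeouts and retries from the list are left out.

```c
#include <stdbool.h>

/* Hypothetical result of probing the hub, in priority order. */
typedef enum {
    TIME_SOURCE_NONE,           /* nothing answered: disable scheduling */
    TIME_SOURCE_TIME_CC,        /* preferred */
    TIME_SOURCE_CLOCK_CC,       /* weekday and time only */
    TIME_SOURCE_TIME_PARAMS_CC  /* UTC date/time, no time zone */
} TimeSource;

/* Pick the source to use for future clock updates, following the order
 * of the steps above: Time CC, then Clock CC, then Time Parameters CC. */
TimeSource choose_time_source(bool time_cc_answered,
                              bool clock_cc_answered,
                              bool time_params_answered)
{
    if (time_cc_answered)     return TIME_SOURCE_TIME_CC;
    if (clock_cc_answered)    return TIME_SOURCE_CLOCK_CC;
    if (time_params_answered) return TIME_SOURCE_TIME_PARAMS_CC;
    return TIME_SOURCE_NONE;
}

/* Scheduling features are only usable when some time source exists. */
bool scheduling_enabled(TimeSource source)
{
    return source != TIME_SOURCE_NONE;
}
```

Storing the chosen source in non-volatile memory would let the device skip the probing sequence on later updates and go straight to the command class that worked.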

Depending on the accuracy of the local clock circuitry, the functioning time setting command class should be used to update the local clock at a sufficient rate to match the desired accuracy. Typically, this would be once per day assuming a 100ppm or better 32kHz crystal is used for the 700 series low frequency external crystal oscillator (LFXCO).
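
The arithmetic behind the update rate is simple enough to sketch. The function names below are hypothetical. The tolerance is the crystal’s worst-case error in parts per million.

```c
/* Worst-case clock drift, in seconds, accumulated over elapsed_seconds
 * by a crystal with the given tolerance in parts per million. */
double clock_drift_seconds(double tolerance_ppm, double elapsed_seconds)
{
    return tolerance_ppm * 1e-6 * elapsed_seconds;
}

/* Longest interval, in seconds, between clock updates that keeps the
 * worst-case drift within max_error_seconds. */
double max_update_interval_seconds(double tolerance_ppm,
                                   double max_error_seconds)
{
    return max_error_seconds / (tolerance_ppm * 1e-6);
}
```

For a 100ppm crystal and a 24 hour interval (86400 seconds) the worst-case drift works out to 100 × 10^-6 × 86400 ≈ 8.6 seconds, so a daily update keeps the clock within roughly ten seconds. A tighter crystal or a more frequent update shrinks that proportionally.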

Conclusion

End Devices should send a Time CC Date/Time GET shortly after inclusion in a Z-Wave network and then periodically send Date/Time GETs based on the accuracy of the real-time clock circuitry. Updating once per day at 3:10am, shortly after the hour at which daylight saving time changes typically take effect, keeps the clock correct across those changes and should be sufficient for a low-cost 32kHz crystal. The algorithm above works for just about any hub that has at least minimal support for time keeping.

How to OTA a co-processor via Z-Wave

You have a second MCU or other data files you want to update using Over-The-Air (OTA) via Z-Wave. How can you reuse the Bootloader firmware to verify the signature and decrypt the data?

The code to verify and decrypt the file already exists in the bootloader and is known good. Reusing the existing bootloader code is smaller and safer than re-inventing the wheel – or in this case encryption.

The attached project is a modified Z-Wave Door Lock Key Pad sample application that demonstrates how to OTA code/data other than the Z-Wave firmware. OTA of the Z-Wave firmware works in the sample application already – but first the encryption keys MUST be generated. See https://www.silabs.com/community/wireless/z-wave/knowledge-base.entry.html/2019/04/09/z-wave_700_ota_ofe-i00M on how to generate the keys. See the two .BAT files in the comments section which will run all the necessary commands for you. They are also included in this .sls file in the KEYS directory. You MUST create your own project keys to OTA either the Z-Wave Firmware or any other data.

To OTA other types of files you need to start with a binary file. Most microprocessor development environments will output a binary file so use that instead of a HEX file. If you have an Intel hex or Motorola S record file, use a utility like SREC_CAT to convert it to a binary file. SREC_CAT can convert just about any file type into any other file type. If the file is more than 200K bytes, you will need to break the file into 200K or smaller files and OTA each, one at a time. Doing that is beyond the scope of this project. Note there is no need to encrypt the file yourself. We will be using Commander to sign and encrypt it using the keys generated here.

Theory of Operation:

Changes to the SSv4 DoorlockKeyPad sample project are indicated with the comment “AKER” – search for these to find what changed. You can also diff the files with a fresh copy of the DoorLockKeyPad sample app from SSv4. Most of the code to support OTA of an external processor is in this file. A few changes have been made to ota_util.c in ZAF_CommandClasses_FirmwareUpdate but these are expected to be included in a future release of the SDK (currently tested on 7.13). 

Commander is used to generate the signing key pair and the encryption key. The public signing key and the encryption key are then programmed into every device to be OTAed, while the private signing key stays with the developer. Commander then encrypts and signs the binary file and wraps it with bootloader tokens. The .gbl file is downloaded, the signature is checked, and the decrypted data is then passed to a callback function up to 64 bytes at a time. You then have to store the data or pass it to the external MCU. This example simply prints the data out a UART.

Procedure:

Step 1: Generate the keys

There are two .BAT files in the KEYS directory for this project. These are Windows script files. For other platforms you can easily convert them to the platform specific commands. See the comments in the files for more details. In a Windows shell type:
   GenGblToken.bat
This will use Commander to generate a project set of keys in the files vendor_*.*. Only execute this command ONCE. The same keys are used for the duration of the project. If you change the keys then you cannot OTA the devices as the keys no longer match.

Step 2: Program the key into a devkit and every DUT

Each device manufactured must have the keys programmed into FLASH: the public signing key, which is used to verify the signature, and the encryption key, which is used to decrypt the image. Use PgmToken.bat to program the keys into a target device connected via USB. Note that EVERY unit manufactured must have these keys programmed into it.

Step 3: Generate the .gbl file

Create the .gbl file from the binary file using the following command:
   commander gbl create <OTA_FileName>.gbl --metadata <BinaryFile> --sign vendor_sign.key --encrypt vendor_encrypt.key
The --metadata option will wrap the binary data with the necessary tokens for the bootloader to parse the data. Do not use the --compress option. If the data needs to be compressed, use your own algorithm for that. There are 3 sample binary files in the KEYS directory – a small .WAV audio file, a large .M4A audio file and a PNG image file. Use the command above to wrap the file with the necessary tokens for OTA.

Step 4: OTA the .gbl file

Use the PC Controller or other application to send the gbl file over Z-Wave. Once the entire file has been sent and the CRC checked to be good, the FinishFwUpdate function is called to begin processing the image. Note that in the PCC you have to first GET the Current Firmware, then select the Target: 1 to download the metadata. Then click on UPDATE and the OTA will begin. Connect a terminal to the VCOM port of the WSTK to view the data streaming down during the OTA. Once all the data is sent down, the signature is checked and the decrypted data is sent out the UART. This is where you would need to change the code to store the data instead of printing it out the UART.

Step 5: Verify the Signature and pass in the callback function

The bootloader_verifyImage() function is called and the metadataCallback function is passed in. bootloader_verifyImage() returns zero if the signature matches. If the signature check fails, an error value is returned giving some details on why it failed. The time to verify the signature can be fairly long depending on the size of the image, so the watchdog timer is disabled during the processing.

Step 6: MetadataCallback passes blocks of 64 bytes of the decrypted data

The function passed in to bootloader_verifyImage is called with a pointer to the data and the number of bytes in each block. The size of the block can vary up to 64 bytes. In this example the data is simply printed out the UART. In your application you would replace this function with code to store the data as needed on the other MCU or external NVM.
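As a rough illustration of replacing the UART print with storage, here is a minimal host-side sketch of a callback that appends each block to a staging buffer. Treat it as a sketch under assumptions: the callback shape (address, data pointer, length, context) is modeled on the Gecko Bootloader parser callback and should be checked against the typedef in your SDK, and OtaSink_t, StoreMetadataBlock, FeedPayload and the buffer size are hypothetical names that do not come from the sample project. If your SDK passes NULL as the context, a file-scope sink would be needed instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: callback shape modeled on the Gecko Bootloader parser callback.
 * Verify the real typedef in your SDK before reusing this. */
typedef void (*MetadataCallback_t)(uint32_t address, uint8_t *data,
                                   size_t length, void *context);

#define OTA_MAX_BLOCK  64u    /* blocks arrive in chunks of up to 64 bytes */
#define OTA_STORE_SIZE 1024u  /* hypothetical staging buffer size */

/* Hypothetical sink standing in for the external MCU or external NVM. */
typedef struct {
  uint8_t  store[OTA_STORE_SIZE];
  size_t   bytesStored;
  uint32_t blocksSeen;
  int      overflow;          /* set if the payload did not fit */
} OtaSink_t;

/* Stand-in for the demo's "print to UART" callback: append each block. */
void StoreMetadataBlock(uint32_t address, uint8_t *data,
                        size_t length, void *context)
{
  OtaSink_t *sink = (OtaSink_t *)context;
  (void)address;              /* unused: blocks are assumed to arrive in order */

  if (length > OTA_STORE_SIZE - sink->bytesStored) {
    sink->overflow = 1;       /* never write past the staging buffer */
    return;
  }
  memcpy(&sink->store[sink->bytesStored], data, length);
  sink->bytesStored += length;
  sink->blocksSeen++;
}

/* Host-side stand-in for the bootloader: feeds a payload in <=64-byte blocks. */
void FeedPayload(MetadataCallback_t cb, uint8_t *payload,
                 size_t total, void *context)
{
  size_t offset = 0;
  while (offset < total) {
    size_t n = total - offset;
    if (n > OTA_MAX_BLOCK) {
      n = OTA_MAX_BLOCK;
    }
    cb((uint32_t)offset, &payload[offset], n, context);
    offset += n;
  }
}
```

Feeding a 150-byte payload through FeedPayload() results in three callbacks (64, 64 and 22 bytes), which mirrors the "up to 64 bytes" behavior described above. On a real device the memcpy would be replaced by a write to the other MCU or to external NVM.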

Step 7: Reboot

It is recommended to reboot after the image data has been stored to ensure the FLASH is cleaned up properly. The current demo however does not reboot.

Note: This is an SSv4 SDK 7.13 sample but the same concepts should work in SSv5. The changes to ota_util.c will be folded into the SDK in a future release but for now those changes are necessary.

The code example can be downloaded from the Silicon Labs web site at: https://www.silabs.com/community/wireless/z-wave/knowledge-base.entry.html/2020/09/23/ota_a_co-processororotherdataviaz-wave-GDap

Z-Wave Works With Amazon, Google, Samsung, Apple, Comcast Virtual Conference

Silicon Labs is hosting what was intended to be an in-person conference in Austin, Texas but is now a virtual online conference on IoT ecosystems – the Works With Smart Home Developer Event on September 9-10. The best part is it is now FREE to attend any of the in-depth technical sessions and you don’t have to wear a mask. The downside is that we don’t get to experience all that great music down in Austin – well, there’s always next year!

Virtual IoT Works With EcoSystems from Google, Amazon, Apple for Z-Wave development engineers
https://workswith.silabs.com/

I am hosting the Z-Wave track and will be making several presentations including a detailed look at Silicon Labs’ latest release of Simplicity Studio V5 which just came out yesterday. We’ll also have presentations on developing Z-Wave Smart Hubs and Z-Wave Certification. I’ll also be describing some IoT failures – you learn more from your failures than your successes. We have speakers and engineers from all of the ecosystem partners, not just Silicon Labs folks. Learn from the experts from across the industry!

What is Works With 2020? The smart home developer’s virtual event where you will have the opportunity to interact with our ecosystem partners from Amazon, Google, Samsung, and Z-Wave to connect devices, platforms and protocols and be able to immerse yourself in keynotes, a panel discussion on Project CHIP, hands-on, and technical sessions led by smart home engineers who are building the latest advanced IoT devices. The Works With event is live, all-online, free of charge, and you can join from anywhere around the world.

Click here to Register Today and feel free to forward to the rest of your team.

Here’s an overview of what you won’t want to miss:

Specialized Engineer-Led Tracks – Educational sessions and technical training designed for engineers, executives, developers, business development and product managers.

Hands-On Workshops – More than 12 workshops and hands-on sessions to give you experience, knowledge and confidence to develop and accelerate smart home development.

One-on-One Developer Meetings – Schedule a meeting with Silicon Labs or an ecosystem partner to get 1:1 technical guidance.

Join me in September and learn how to smoothly get your IoT device plugged into any and all of the ecosystem partners. Register today, it’s totally free and you can join from anywhere in the world. See you in September!

How Much FLASH/RAM Am I Using?

One of the most common questions in embedded programming is “How much FLASH/RAM am I using?” or more precisely, “How much do I have left before I run out?” or even “How much do I have to squeeze my code to fit in the available space?” Yikes! Very often the code size quickly fills to fit the available space and then you start struggling to fit all the features in your product. This problem afflicts the Z-Wave 700 series just as much as any other IoT development. I’ll give you a few hints on tools to measure the code size and figure out where the bloat is and options to squeeze a little more code in.

ZGM130S Resources

The first step is to understand how much FLASH/RAM we have in the Z-Wave ZGM130S. Open the datasheet and we see there is 512K FLASH and 64K RAM. Seems like a TON! But wait, a closer look at the datasheet and there is a note that only 64KB FLASH is available for the application and 8KB RAM. That’s not a lot for a complex IoT device like a thermostat with an OLED screen but is plenty for a simple on/off light switch. Like any engineering trade off, the chip balances the available resources to match the most common use cases.

The Z-Wave stack isn’t huge so fortunately there is sufficient space available for most applications. However, the stack developers have reserved most of the FLASH and RAM space for future upgrades. There is no easy-to-use tool that precisely measures how much code space is being used for the stack versus the application. In this post I’ll give you some tools to see how close you are to the total and then subtract a typical sample application size to find the amount your application is using. INS14259 section 5.1 gives the typical FLASH usage for the Z-Wave sample applications.

Half of FLASH (256K) is reserved for the Over-The-Air (OTA) firmware image. This block of flash is used when the firmware is updated and the data is stored here temporarily until the signature is checked and the code can be decrypted. Once that test has passed then the code is copied down into the normal FLASH space and the chip reboots into the new firmware version. If you need a lot more than 64K of FLASH you can consider moving the OTA storage from the upper half of the ZGM130S to an external serial FLASH. This is supported in the Silicon Labs Gecko Bootloader but requires some coding to free up all that space. This also requires hardware support for the external FLASH chip. So if you think you’re going to be short on code space, I highly recommend adding a serial FLASH chip even if you don’t use it right away. I plan to describe the OTA to external FLASH process in a future blog posting so stay tuned.

ARM Tools

Before starting with code size analysis be sure you are working with the “release” build and not the debug build. Click on Project->Build Configurations->Set Active and select the Release build. Then build the project. The debug build uses minimal optimization and has tons of ASSERT and PRINTF code in it which invalidates the code size analysis.

ARM eabi-size

When you compile a Z-Wave project it will run the arm-none-eabi-size -A <project.axf> command which prints out an obscure listing of the sizes of various FLASH segments. The DoorLockKeyPad sample application produces the following:

DoorLockKeyPad.axf  :
section             size        addr
.nvm3App           12288      475136
.simee             36864      487424
.text             168760           0
_cc_handlers         120      168760
.ARM.exidx             8      168880
.data               1132   536870916
.bss               28956   536872048
.heap               3072   536901008
.stack_dummy        1024   536901008
.ARM.attributes       46           0
.comment             126           0
.reset_info            4   536870912
.debug_frame        1120           0
.stabstr             333           0
Total             253853
  • What does all this mean?
  • FLASH = .text + .data
    • .text = code which lives and runs out of on-chip FLASH
    • .data = initialized variables
      • e.g. int myvar=12345; results in 12345 being stored in FLASH and then copied to RAM on power up
      • Thus .data uses both FLASH and RAM
    • The other 2 segments are in FLASH space but subtract from the total available
    • .nvm3App = Application non-volatile memory
    • .simee = SDK non-volatile memory
  • RAM = .bss + .data
    • .bss = Variables not explicitly initialized
      • gcc normally zeroes on power up
    • .data = initialized variables
    • .heap = heap used for dynamic memory allocation
    • .stack_dummy = the stack for pushing return addresses, function parameters and other things
  • The other segments can be largely ignored
  • The available FLASH is 256K minus the .simee and .nvm3App = 256K-12K-36K = 208K
  • The available RAM is 64K minus the heap/stack = 64K-3K-1K = 60K
  • Thus:
  • FLASH=168760+1132 = 169,892 bytes = 80% utilized
  • RAM=28956+1132 = 30,088 bytes = 49% utilized
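The arithmetic in the list above is easy to get wrong by hand, so here is a small sketch that reproduces it. The section sizes are copied from the arm-none-eabi-size listing for DoorLockKeyPad; the function and constant names are my own and purely illustrative. For your own project you would paste in your own section sizes.

```c
#include <stdint.h>

/* Section sizes copied from the arm-none-eabi-size listing above. */
enum {
  SZ_TEXT    = 168760,
  SZ_DATA    = 1132,
  SZ_BSS     = 28956,
  SZ_NVM3APP = 12288,
  SZ_SIMEE   = 36864,
  SZ_HEAP    = 3072,
  SZ_STACK   = 1024
};

/* FLASH = .text + .data (initial values of .data live in FLASH) */
uint32_t FlashUsed(void)      { return SZ_TEXT + SZ_DATA; }

/* 256K minus the two non-volatile storage areas */
uint32_t FlashAvailable(void) { return 256u * 1024u - SZ_NVM3APP - SZ_SIMEE; }

/* RAM = .bss + .data */
uint32_t RamUsed(void)        { return SZ_BSS + SZ_DATA; }

/* 64K minus the heap and stack */
uint32_t RamAvailable(void)   { return 64u * 1024u - SZ_HEAP - SZ_STACK; }

/* Integer percentage, rounded to the nearest whole percent. */
uint32_t PercentUsed(uint32_t used, uint32_t available)
{
  return (used * 100u + available / 2u) / available;
}
```

With the numbers above, PercentUsed(FlashUsed(), FlashAvailable()) gives 80 and PercentUsed(RamUsed(), RamAvailable()) gives 49, matching the hand calculation.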

You can see that the SDK code and the application are all mashed together without a way to identify how much the application is using. But at least you know when you are running out. Note that each release of the SDK will change the amount of flash used by the SDK code and possibly the ZAF. Note that the ZAF is considered part of the Application code.

Commander Flash Map

Another easy way to check how much FLASH is being utilized is to use Commander to display a map of FLASH. Start commander and connect to the DUT then use Device Info->Flash Map to get a chart like this one:

ARM eabi-nm

If you want to know which functions and variables are the biggest chunks of FLASH/RAM usage use the nm command: arm-none-eabi-nm <project.axf> --print-size --size-sort -l | tail -30

Address  Size   Type Symbol
00018c84 00000444 t process_event
0001c760 00000454 T IsMyExploreFrame
000172a4 00000454 T TransportService_ApplicationCommandHandler
000185aa 000004d2 T S2_application_command_handler
0001de00 000004e4 T crypto_scalarmult_curve25519
0001098c 0000054c T IsMyFrame
00017ee4 00000590 t S2_fsm_post_event
00010318 00000674 T IsMyFrame3ch
20006c14 00000708 B channelHoppingBuffer
000138a0 000007e8 T CommandHandler
00021960 00000888 T FRC_IRQHandler
00011790 00000890 T ReceiveHandler
2000628c 000008ac B the_context
20007590 00000c00 N __HeapBase
00019788 00000e04 T mbedtls_internal_sha1_process
00026f68 000019cc T RAILINT_0cdb976df793f6799e20dfa42e2be4c6
00074000 00003000 b nvm3AppStorage
00077000 00009000 B __nvm3Base
00077000 00009000 B nvm3Storage

The third column needs a little decoding: T/t = .text (FLASH), B/b = .bss (RAM), D/d = .data (both FLASH and RAM).

You can also tell if it’s FLASH or RAM by the address – FLASH starts at 0 and RAM starts at 0x20000000. Starting from the bottom of the list above you can see that the NVM3Storage is 36K which is naturally the largest block of FLASH. Followed by the 12K of NVM3 Application storage. From there the sizes drop fairly quickly but you can guess the function based on the name. RAILINT is a bunch of Hardware Abstraction Layer (HAL) code. mbedtls is the Security S2 encryption functions. The HEAP is the largest single block of RAM followed by “the_context” which is a fairly large structure the ZAF and the SDK use to store the security and routing information.
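If you want to post-process the nm listing rather than read it by eye, a minimal sketch along these lines parses one line and classifies the symbol by address. The names ParseNmLine, RegionOf and NmSymbol_t are hypothetical, and the region bounds assume the 512K FLASH and 64K RAM of the ZGM130S mentioned earlier; adjust them for other parts.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { MEM_FLASH, MEM_RAM, MEM_OTHER } MemRegion_t;

typedef struct {
  uint32_t address;
  uint32_t size;
  char     type;      /* the nm type letter, e.g. 'T', 'b', 'D' */
  char     name[64];
} NmSymbol_t;

/* Parse one "address size type name" line; returns 1 on success, 0 otherwise. */
int ParseNmLine(const char *line, NmSymbol_t *sym)
{
  unsigned int addr = 0;
  unsigned int size = 0;

  if (sscanf(line, "%x %x %c %63s", &addr, &size, &sym->type, sym->name) != 4) {
    return 0;
  }
  sym->address = addr;
  sym->size = size;
  return 1;
}

/* Classify by address: FLASH starts at 0, RAM starts at 0x20000000. */
MemRegion_t RegionOf(const NmSymbol_t *sym)
{
  if (sym->address >= 0x20000000u && sym->address < 0x20010000u) {
    return MEM_RAM;                /* assumed 64K RAM window */
  }
  if (sym->address < 0x00080000u) {
    return MEM_FLASH;              /* assumed 512K FLASH window */
  }
  return MEM_OTHER;
}
```

Classifying by address is the more dependable of the two checks in this listing: nvm3Storage is reported with type B, yet its address 0x00077000 places it in FLASH.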

Now that you can see the heavy users you can see if there is something amiss. Perhaps a buffer can be reused instead of using unique buffers for various functions. Look carefully for any unused functions in your source code. GCC often will leave “dead” code in place because it can’t tell if you’re using it as a dynamic callback function so to be safe it leaves the code in there. Thus, review your code and make sure you don’t have dead functions or variables or entire buffers that are never used.

The most common method to squeeze more code in is to try various options in the GCC compiler. The more recent versions of GCC have added Link Time Optimization (LTO) which can significantly reduce the code size (claims are up to 20%!). Simplicity Studio is moving to newer versions of GCC later this year so more of these options will be available. Worst case is to refactor your code to make it more efficient or drop features.

Other Tools

There are other tools like Puncover and Bloaty which can help with managing code size growth. I haven’t personally tried these but they seem like they would help. If you use a tool that helps manage code/RAM let me know in the comments below. We all need help in squeezing into the available space which is never enough!

Z-Wave Virtual Academy

Z-Wave Virtual Webinar Wednesdays at Noon Eastern US time

Doctor Z-Wave will be giving a hands-on live demo of getting started using Z-Wave with Simplicity Studio on Wednesday June 17. This is a live demo with just a couple of slides so you don’t want to miss it. The session is short, roughly 30 minutes, with time for Q&A afterward. I will show you some simple things on setting up Simplicity to make your life easier when getting started. If you can’t make it, it will be recorded and available via the Alliance web site.

There are lots of other topics for Webinar Wednesdays:

Webinar Wednesday Schedule* (*This schedule will be updated regularly on the Z-Wave Alliance website as the series progresses):
May 27, 2020: Manufacturing During a Global Pandemic: Insight & Strategy from Companies Who Are Coping. Hosted by: Avi Rosenthal – Bluesalve Partners
June 3, 2020: Social Distance Sales for Uncertain Times: Tips & Insight for Integrators. Hosted by: Jeremy McLerran – Qolsys
June 10, 2020: Residential Smart Lock Market: Trends, Use-Cases & Opportunities. Hosted by: Colin DePree – Salto Systems
June 17, 2020: Z-Wave 700 Series: Getting Started. Hosted by: Eric Ryherd – Silicon Labs
June 24, 2020: Feature of Leedarson Z-Wave 700 Series Security Products. Hosted by: Vincent Zhu & Michael Bailey Smith – Leedarson

Fast GPIO Wake Up in Z-Wave 700 Series

The Silicon Labs EFR32 family of IoT microcontrollers is very flexible and can do a ton of cool stuff. However, along with all that flexibility comes a lot of complexity. With that complexity are default settings that work fine for many applications but in some cases you want to dig into the details to come up with an optimal solution. In this post I’ll show how to speed up the wake up time for the Z-Wave ZGM130S chip from a GPIO.

But first – a caveat: This post applies to Z-Wave SDK 7.13.x. Future releases of the SDK may have different methods for sleep/wake and thus may require a different solution.

The Problem

Frequently Listening Routing Slave (FLiRS) devices like door locks and many thermostats spend most of their time in Energy Mode 2 (EM2) to conserve battery power. Once per second they wake up briefly and listen for a Beam from an always-on device. If there is a beam, the FLiRS device will wake up and receive the Z-Wave command. This allows battery powered devices to use very little power but still be able to respond to a Z-Wave command within one second. FLiRS devices use more battery power than fully sleeping devices like most sensors which use Hibernate Sleep mode (EM4). To wake every second the ZGM130 has to wake quickly and go right back to sleep to minimize power. The problem with EM4 is that it takes a few tens of milliseconds to wake up as the entire CPU and RAM have to be initialized as they were powered down to save power. For a FLiRS device, it’s more efficient to keep RAM powered but in a low-power state and resume quickly to go right back to sleep if there is no beam. Typically the ZGM130 can wake up in about 500 microseconds from EM2. But in many cases this is still too long of a time to stay awake if there are other interrupts such as UARTs or other sensors.

The scope shot above shows the processing that takes place by default on the ZGM130S. In this case I am using a WSTK to drive the SPI pins of another WSTK running the DoorLockKeyPad sample application. The chip is in EM2 at the start of the trace. When SPISEL signal goes low, the chip wakes up. But it is running on the HFRCO oscillator which is not accurate enough to run the radio but it is stable and usable in just a few microseconds. Thus, the SPI clock and data is captured in the USART using this clock. However, by default the Interrupt Service Routine is blocked waiting for the HFXO to stabilize. The 39MHz HFXO crystal oscillator has the accuracy required for the radio.

The question is what’s going on during this 500usec? The answer is the CPU is just waiting for the HFXO to stabilize. Can we use this time to do some other work? Fortunately, the answer is YES! The challenge is that it takes some understanding and some code which I’ll describe below.

The Solution

There are three functions that do the majority of the sleep processing. These are provided in source code so you can read the code but you should not change it. Instead you’ll provide a callback function to do your processing while the chip is waking up.

Simplified Sleep Processing Code:

  1. SLEEP_Sleep in sleep.c: The main function called to enter sleep
    1. CORE_ENTER_CRITICAL – PRIMASK=1 mask interrupts
    2. DO-WHILE loop
      1. Call enterEMx() – this is where the chip sleeps
      2. Call restoreCallback (return 0 to wake, 1 to sleep)
    3. Call EMU_Restore – waits for HFXO to be ready ~500us
    4. CORE_EXIT_CRITICAL – ISRs will now run
  2. enterEMx() in sleep.c:
    1. sleepCallback called
    2. Call EMU_EnterEM[1-4]
    3. wakeupCallback after returning from EMU_EnterEMx
  3. EMU_EnterEM2 in em_emu.c:
    1. Scales voltage down
    2. Call EMU_EM23PresleepHook()
    3. __WFI – Wait-For-Interrupt instruction – ZGM130 sleeps here
    4. Call EMU_EM23PostsleepHook() ~ 17usec after wakeup
    5. Voltage Scale restored which takes ~20us

The code is in sleep.c in the SDK which has a lot more detail but at a high level this is what you need to know. The important part to understand here is where the “hooks” are and how to use them.
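To see how the restoreCallback return value steers the loop in SLEEP_Sleep, here is a minimal host-side model of the simplified flow listed above. It is a sketch under assumptions, not the code in sleep.c: ModelSleep, SleepStats_t, RestoreCb_t, ExampleRestore and idleWakeupsLeft are hypothetical names, the callback takes no energy-mode argument, and the return convention (0 = wake, 1 = sleep again) simply follows the description in this post.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: 0 = wake fully, 1 = go back to sleep. */
typedef uint32_t (*RestoreCb_t)(void);

typedef struct {
  uint32_t sleepCount;   /* times the model "entered EM2" */
  uint32_t hfxoWaits;    /* times the model paid the ~500us HFXO wait */
} SleepStats_t;

/* Model of the simplified SLEEP_Sleep flow: loop until told to wake. */
void ModelSleep(RestoreCb_t restoreCallback, SleepStats_t *stats)
{
  uint32_t sleepAgain;

  /* the real code masks interrupts here (CORE_ENTER_CRITICAL) */
  do {
    stats->sleepCount++;                 /* stands in for enterEMx() */
    sleepAgain = (restoreCallback != NULL) ? restoreCallback() : 0u;
  } while (sleepAgain != 0u);

  stats->hfxoWaits++;                    /* stands in for EMU_Restore */
  /* the real code unmasks interrupts here (CORE_EXIT_CRITICAL) */
}

/* Example callback: pretend the next three wakeups find everything idle. */
uint32_t idleWakeupsLeft = 3;

uint32_t ExampleRestore(void)
{
  if (idleWakeupsLeft > 0u) {
    idleWakeupsLeft--;
    return 1u;                           /* idle: sleep again */
  }
  return 0u;                             /* real work: wake fully */
}
```

In this model, calling ModelSleep(ExampleRestore, &stats) enters sleep four times but pays the HFXO wait only once, which is the whole point of doing the quick idle check in the restoreCallback.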

  • Use SLEEP_InitEx() to assign:
    • sleepCallback – called just before sleeping
    • restoreCallback – Return 0 to wake, 1 to sleep
    • wakeupCallback – called after waking
    • SLEEP_InitEx() input is a pointer to a structure with the three callbacks (set any unused callback to NULL)
  • Define the function:
    • EMU_EM23PresleepHook()
    • EMU_EM23PostsleepHook()
    • These are both WEAK functions with nothing in them so if you define them then the compiler will install them

The two EMU_EM23* weak functions are run immediately before/after the Wait-For-Interrupt (WFI) instruction which is where the CPU sleeps. These are very low level functions and while you can use them I recommend using the callbacks from SLEEP_InitEx().

The SLEEP_InitEx() function is the one we want to use and in particular the restoreCallback. The comments around the restoreCallback function talk about restoring the clocks but if the function returns a 0 the chip will wake up and if it returns a 1 then it will immediately go back to sleep which is what we want! You can use the other two hooks if you want but the restoreCallback is the key one since it will immediately put the chip back to sleep if everything is idle.

The key to using ANY of these functions is that you CANNOT call ANY FreeRTOS functions! You cannot send any Z-Wave frames or call any Z-Wave function as they all require the RTOS. At this point in the wakeup processing the RTOS is not running! All you can do in these routines is to capture data and quickly decide if everything is idle and to go back to sleep. If there is more processing needed, then return 0 and wait for the event in the RTOS and process the data there. You also don’t want to spend too much time in these routines as it may interfere with the timing of the RTOS. A hundred microseconds is probably fine but for anything longer you should wait for the HFXO.

In ApplicationInit() you will call SLEEP_InitEx() like this:

uint32_t CheckSPI(SLEEP_EnergyMode_t emode); // prototype so it can be used in the initializer

const SLEEP_Init_t sleepinit = {NULL, NULL, CheckSPI}; // only the restoreCallback is used
...
ZW_APPLICATION_STATUS ApplicationInit(EResetReason_t eResetReason) {
...
SLEEP_InitEx(&sleepinit); // call CheckSPI() upon wakeup from EM2.
...
}
...
uint32_t CheckSPI(SLEEP_EnergyMode_t emode) { 
	uint32_t retval=0; // wake up by default
	if (GPIO_IntGetEnabled() & 0x0000AAAA) { // Check SPI (odd-numbered GPIO interrupt flags)
		GPIO_ODD_IRQHandler(); // service the GPIO interrupt
		// wait for all the bytes to come in and compute checksum 
		NVIC->ICPR[0] = NVIC->ICPR[0]; //clear NVIC pending interrupts
		if (!SPIDataError && !IsWakeupCausedByRtccTimeout())	{
			 retval=1; // go back to sleep!
		}
	}
	return(retval); // 0=wakeup, 1=sleep
}

Recall that every second the FLiRS device has to check for a Z-Wave beam which is triggered by the RTCC timer. Thus the check for IsWakeupCausedByRtccTimeout() ensures that the beaming still works.

This scope shot shows the wake up processing of the ZGM130S:

  1. SPISEL_N SPI chip select signal goes low triggering a GPIO_ODD interrupt
    1. The chip wakes up, the HFRCO begins oscillating
  2. HFRCO begins oscillating in a few microseconds
    1. Once HFRCO is running, the peripherals are functional
    2. SPI data can begin shifting once the HFRCO is running
    3. The default HFRCO frequency is 19MHz but can be increased
    4. Higher frequencies for HFRCO also may need more wait states for the CPU and will use more power
  3. The WFI instruction that put the CPU to sleep is exited here
    1. EMU_EM23PostSleepHook function is called if defined
    2. After returning from PostSleepHook, the VSCALE is returned to full power which takes about 10usec
    3. It is best to wait for the voltage to be powered up to ensure all logic is running at optimal speeds
  4. EMU_EnterEM2 is exited and restoreCallback is called if initialized
    1. This is the function where the ISR should be called to process data
    2. If the data says things are idle and want to go back to sleep, return 1
    3. If more analysis is needed, then return 0
    4. Carefully clear the interrupt bits
      1. First clear the peripheral Interrupt Flags
      2. Then clear the NVIC Interrupt pending register
        1. NVIC->ICPR[n]=NVIC->ICPR[n] where n is 0-1 depending on your interrupt
    5. Make sure there aren’t other reasons to wake up fully
      1. !IsWakeupCausedByRtccTimeout() is the 1s FLiRS interrupt
      2. There may be other reasons to wake up which is application dependent
  5. In this example the SPI data is being fetched from the USART at each toggle of the GPIO
    1. The final toggle shows that the checksum was computed and the data is idle so go back to sleep
  6. The chip returns back to sleep in a few more microseconds
    1. Total processing time of this interrupt is less than 200usec which is a fraction of the time just waiting for the HFXO to stabilize
    2. Much of that time is receiving and processing the SPI data
    3. It is possible to sleep in under 50usec if the check for idle is quicker

If your peripheral processing will take significantly less than 500usec, then it may be more efficient to process the data using the HFRCO and not wait for the HFXO to power up. But if your application needs more processing, then you are probably better off waiting. Each application must make its own calculations to determine the most efficient path.

What About Sleeping Devices?

Fully sleeping devices (EM4 also known as RSS – Routing Sleeping Slaves) have entirely different wake/sleep processing. For sleeping slaves the processor and RAM have to be re-initialized and the chip essentially boots out of reset. All that initialization takes quite a bit of time – a few tens of milliseconds. If your device needs to do a lot of frequent checking of a sensor, then it might make more sense to force it to stay in EM2 by setting a Power Lock to PM_TYPE_PERIPHERAL. For more details on power locks see INS14259 section 7.6. Deciding which way to go is application specific so you have to make the calculations or measurements to find the right balance for your project.

This is a complex posting but I hope I’ve made it clear enough to enable you to optimize your application firmware. Let me know what you think by leaving a comment below.

How to Upgrade Your 700 Series Project from SDK 7.12 to 7.13

This is a very specific posting for Z-Wave developers and specifically for those developing with the new 700 series chips. If you’re not a 700 series developer you can probably stop reading…

I have posted details on upgrading from the 7.12 to the 7.13 Software Developers Kit in this Knowledge Base article on the Silicon Labs web site: https://www.silabs.com/community/wireless/z-wave/knowledge-base.entry.html/2020/03/30/upgrading_700_seriesprojectfrom7122to7133-VZrM

Z-Wave SDK 7.13.3 released last week with a number of important stability improvements – you want to upgrade your 700 series project to this release!

  • Several stability improvements to prevent lockups in certain corner cases
  • RSSI reporting corrections (both 500 and 700)
  • Improved timing for routed acks and fixed sticky Last Working Routes
  • OTA Firmware Activate support delaying rebooting into the new firmware until all units have been downloaded
  • Details are found in SRN14629.pdf which is included in the Simplicity Studio release: SDK Documentation->End Device->SRN14629 Z-Wave 700 SDK 7.13.x

Z-Wave Watchdog Timer Best Practices

Virtually all embedded systems must run 24 x 7 x 365 x many many years without ever being rebooted. Since there is no one there to “press the reset button” if the device fails, the watchdog timer is there to do just that. The 500 series Z-Wave chips from Silicon Labs have a watchdog timer and the example code provides a very minimal use of the watchdog timer. However, the minimal use in the example code is not sufficient to provide a robust watchdog for embedded Z-Wave devices. This post explains some rules and methods to code a robust watchdog timer.

Long time embedded expert Jack Ganssle has a great article on Watchdog timers. He describes the use of a watchdog timer on the Clementine spacecraft where a fault in the system caused the spacecraft to dump virtually all of its fuel resulting in the loss of the mission. The lead software engineer had wanted a watchdog but the designers decided not to include it. Jack’s example shows how important it is to spend at least some time coding a robust watchdog for our IoT devices. While our devices aren’t controlling multi-million dollar spacecraft, we are coding light switches that are hardwired into the wall and cannot be easily rebooted. Try telling the customer to go into the basement and toggle the power to his entire house to reboot the light switches!

What is a Watchdog?

A watchdog timer is a timer that runs constantly. Typically a complex combination of events resets (or “kicks”) the watchdog timer every now and then, usually every few milliseconds. If the combination of events ever gets stuck, the timer will continue to run. If the watchdog timer “times out”, the system is reset – basically the reset button is pushed! Your embedded system reboots and keeps on running. Generally no one even realizes it has rebooted (I’ll discuss that problem in more detail shortly).

This diagram shows the Watchdog timer’s value which is constantly counting up. Every time the Watchdog is “kicked”, the counter is reset to zero. Somewhere in your code the ZW_WatchDogKick() routine is called which resets the watchdog timer. Sometimes this reset condition happens on a nice regular basis, sometimes it happens at varying times as shown by the level of the timer. The key is the timeout threshold has to be longer than any normal operating condition. If a fault condition occurs, the timer keeps on counting up until the threshold is reached and then the system is reset. When the watchdog timer fires, the Z-Wave chip goes thru a full reset just as if power had been removed and reapplied. Your embedded system is back up and running as if nothing had happened.
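The count-up, kick and timeout behavior described above can be modeled in a few lines. This is a host-side sketch for building intuition only: WdogModel_t, WdogModelKick and WdogModelTick are hypothetical names, the tick unit is arbitrary, and the real watchdog is a hardware timer that you only reach through ZW_WatchDogEnable() and ZW_WatchDogKick().

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  uint32_t count;      /* counts up on every tick */
  uint32_t threshold;  /* timeout threshold in ticks */
  uint32_t resets;     /* how many times the model "rebooted" */
} WdogModel_t;

/* Kicking the dog resets the counter to zero. */
void WdogModelKick(WdogModel_t *w)
{
  w->count = 0;
}

/* Advance one tick; returns true if the watchdog fired on this tick. */
bool WdogModelTick(WdogModel_t *w)
{
  w->count++;
  if (w->count >= w->threshold) {
    w->resets++;
    w->count = 0;      /* a reset restarts the timer along with everything else */
    return true;
  }
  return false;
}
```

With a threshold of 5 ticks, a kick every third tick never lets the model fire. Stop kicking and it fires on the fifth tick after the last kick, which is the fault case shown by the timer climbing to the threshold.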

SiLabs Sample Code = Minimal Watchdog

The SiLabs sample code has the following implementation of the watchdog:

BYTE ApplicationInitSW(ZW_NVM_STATUS nvmStatus) {
...
#ifdef WATCHDOG_ENABLED
 ZW_WatchDogEnable();
#endif
} 

void ApplicationPoll(void){
#ifdef WATCHDOG_ENABLED
 ZW_WatchDogKick();
#endif
}

The sample code has the good implementation practice of putting the Watchdog code inside #ifdef guards so it can be easily enabled/disabled. Unfortunately it blindly kicks the dog every ApplicationPoll without checking any other conditions. ApplicationPoll is called roughly every few hundred microseconds and a lot of fault conditions can exist and ApplicationPoll will still be called. With this implementation the only way the watchdog is going to fire is if there is a catastrophic failure and ApplicationPoll is no longer being called. While this implementation is better than nothing, it won’t reset the system in many cases where the device has become unresponsive. This is where you come in, you have to add more code to the watchdog algorithm. It may be easy to just use what SiLabs provides, but for a robust product you really need to spend some time adding your own conditions to the watchdog algorithm.

A Better Watch Dog Example

Writing good watchdog code requires some significant thought and testing. The possible sources of failure need to be discussed with members of the team and with other Z-Wave developers who are fighting the same fight (thus the need for this blog). I can provide a few guidelines to include in your analysis but this is not a complete solution. Only you know all the possible failure modes of your product and that requires some serious thought and analysis.

Mutex Gets Stuck

The most common failure I have seen is the fact that the SiLabs provided Application Framework (AF) mutex can get stuck. When the mutex is stuck, it most often results in the device still able to receive Z-Wave traffic but often can’t respond. If the device is power cycled, then it returns to full operation. So often this failure goes unnoticed both in testing and in actual use.

What is the mutex you ask? The mutex is a simple flag in the AF that prevents the code from overwriting the Send Buffer while a message is currently being sent over the radio. When a GET command comes in, the AF will call a command class handler to handle the GET and build a REPORT frame in memory. When ready to send the frame, the AF will call pTxBuf=GetResponseBuffer() to get a buffer for the radio to send. There is only one buffer so if the buffer is already in use, you get a NULL pointer back and will have to wait and send the frame later.  This in general works fine as long as frames don’t come in too fast. But in a large network with lots of repeated and re-routed frames you will occasionally get a bunch of GETs quickly and it is possible for the REPORTs to get cross wired and end up locking up the mutex for a frame that will never be sent. If the code then doesn’t properly release the buffer, the mutex is stuck. The Application Framework code is known to lock the mutex occasionally so you must code around this problem. The easiest solution to this rare event is to ensure the watchdog is watching the mutex and simply reboot if it gets stuck for too long.

My solution is to have a counter that counts up once per second in ApplicationPoll anytime ActiveJobs() is true (in SDK 6.81.xx it’s now called ZAF_mutex_isActive()). ActiveJobs is true anytime a buffer is in use and false when all the buffers are free. There are actually two buffers, one for response frames (REPORTs sent as a result of a GET) and a second buffer for request frames (unsolicited notifications).

Application Specific Reasons

Beyond the mutex you must think long and hard about application specific failure conditions. The most obvious is that the device has not received or sent a frame in 25 hours. Most hubs will poll a device at least a couple of times per day to make sure it is still alive. So if there has been no traffic in a day, maybe something is stuck and a reboot is in order. Plus if nothing has happened in a day then probably no one will notice the reboot (which only takes 1.5 seconds). You do have to be careful that some other part of the application isn’t impacted as a result of the reboot. For example, if you are a light switch and by default you turn the light off on a reboot, then people will be really annoyed if the light randomly turns off because your hub hasn’t polled it in a day. There are lots of potential checks you can make here but every application will have different requirements so you will have to think hard about all the possible conditions for your specific case.
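One way to track the "no traffic in 25 hours" condition is a small hours counter that is cleared whenever a frame is sent or received. The sketch below is a host-runnable illustration only. The function names are hypothetical, and it assumes you already have a once-per-second tick available somewhere in your application.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECONDS_PER_HOUR   3600u
#define COMMS_TIMEOUT_HRS  25u     /* threshold taken from the text above */

static uint16_t comms_seconds    = 0;  /* seconds within the current hour */
static uint8_t  last_comms_hours = 0;  /* whole hours since the last frame */

/* Call whenever a frame is received or successfully sent. */
void comms_note_frame(void)
{
    comms_seconds    = 0;
    last_comms_hours = 0;
}

/* Call once per second from the application's timer. */
void comms_one_second_tick(void)
{
    if (++comms_seconds >= SECONDS_PER_HOUR) {
        comms_seconds = 0;
        if (last_comms_hours < 255u) {   /* saturate instead of wrapping */
            last_comms_hours++;
        }
    }
}

/* True while traffic has been seen within the timeout window. */
bool comms_is_alive(void)
{
    return last_comms_hours < COMMS_TIMEOUT_HRS;
}
```

Saturating the counter at 255 matters: if it wrapped back to zero, a device that had been silent for days would suddenly look healthy again.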

Sample good watchdog:

E_APPLICATION_STATE ApplicationPoll( E_PROTOCOL_STATE bProtocolState ) {
...
if (ActiveJobs()) {              // Mutex buffer is busy
    if (OneSecondTimer) ActiveJobsCounter++;  // Once/sec increment
} else {
    ActiveJobsCounter=0;         // When buffer is free clear counter
}
...
if ((ActiveJobsCounter<30) &&       // Mutex isn't stuck
    (LastCommsHours<25) &&          // Got a frame in the last 25 hrs
    ApplicationSpecificReasons) {   // Other reasons
    ZW_WatchDogKick();              // Everything is OK so reset WDOG
}

The example code above does have a major weakness: if the counters stop counting for some reason, the watchdog will never fire! That’s easy to check for in ApplicationPoll, though, and if ApplicationPoll itself isn’t running then the watchdog is no longer being kicked, so the chip will reset.
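One possible way to check that the counters are still counting is to count passes through the poll loop between one-second ticks and stop kicking the watchdog if the tick goes missing. This is only a sketch under assumptions: the function name is hypothetical, and MAX_POLLS_PER_TICK is a made-up bound that you would have to measure for your own poll rate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed upper bound on poll-loop passes per second. This value is
 * an illustration only; measure your own loop rate and add margin. */
#define MAX_POLLS_PER_TICK 50000u

static uint32_t polls_since_tick = 0;

/* Call on every pass through the poll loop.
 * one_second_tick is true on the pass where the 1 s timer fired.
 * Returns false when the timer appears to have stalled. */
bool tick_source_alive(bool one_second_tick)
{
    if (one_second_tick) {
        polls_since_tick = 0;
    } else if (polls_since_tick < UINT32_MAX) {
        polls_since_tick++;
    }
    return polls_since_tick <= MAX_POLLS_PER_TICK;
}
```

The result would become one more term in the condition that guards the watchdog kick, so a dead timer causes a reset instead of silently disabling the other checks.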

Doesn’t Work If Not Tested

The old coding adage (proven totally true by me many many times) goes “If the code hasn’t been tested, it doesn’t work”. Same thing applies to your Watchdog code. So how do you test the watchdog? The first thing to do is to log the number of times the watchdog has triggered. This has to be stored in NVM since RAM will be lost when you reboot. Fortunately ApplicationInitHW is called with the bWakeupReason parameter which lets you know the watchdog fired when equal to ZW_WAKEUP_WATCHDOG. Note that usually ApplicationInitHW just stores the bWakeupReason and later in ApplicationInitSW we check it as the NVM isn’t available in InitHW.

ApplicationInitSW(...) {
...
if (wakeupReason==ZW_WAKEUP_WATCHDOG) { // Increment WDOG counter with max 255
    i=MemoryGetByte((WORD)&EEOFFSET_NumberWatchDogResets_far);
    if (i<255) MemoryPutByte((WORD)&EEOFFSET_NumberWatchDogResets_far, i+1);
}

Use a Configuration Command Class parameter to read or update this value for testing purposes. I also like to put in a small block of code wrapped in #ifdef WATCHDOG_TESTING_ENABLED that upon receiving a BASIC_SET with a value of 0xDE (not a valid value) calls GetResponseBuffer() which locks up the mutex and in 30 seconds the chip should reboot. If not, then you have a bug in the watchdog code! You can test all the branches in your watchdog code with various values of a BASIC_SET.
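The test hook might look something like the sketch below. It is written so it can be compiled on a PC, so the buffer call is replaced by a stub. The handler name, the stub, and the WDOG_TEST_STICK_MUTEX constant are all hypothetical; in the real firmware the stub would be the GetResponseBuffer() call described above.

```c
#include <stdbool.h>
#include <stdint.h>

#define WATCHDOG_TESTING_ENABLED          /* leave undefined in production */
#define WDOG_TEST_STICK_MUTEX 0xDEu       /* BASIC_SET value used as trigger */

static bool stub_buffer_claimed = false;

/* Stand-in for claiming the response buffer and never releasing it. */
void stub_claim_response_buffer(void)
{
    stub_buffer_claimed = true;
}

bool stub_buffer_is_claimed(void)
{
    return stub_buffer_claimed;
}

/* Call from the BASIC_SET handler before normal processing.
 * Returns true if the value was consumed as a test command. */
bool handle_basic_set_test_hook(uint8_t value)
{
#ifdef WATCHDOG_TESTING_ENABLED
    if (value == WDOG_TEST_STICK_MUTEX) {
        stub_claim_response_buffer();     /* mutex is now stuck on purpose */
        return true;
    }
#endif
    (void)value;
    return false;
}
```

Additional magic values can be added in the same way, one per branch of the watchdog condition, so each reason for withholding the kick gets exercised.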

When to Enable Watchdog

Perhaps a better question is when NOT to enable the watchdog since ALL production builds absolutely must have the watchdog enabled! My recommendation is to disable the watchdog during development. You want the chip to lock up if you have a bug. The watchdog is really good at masking major bugs since things just keep on working. If the device locks up, then you know something is wrong and you need to chase it down. If you power cycle and the device is fine again, IT IS NOT FINE! You have a bug in your code! During production testing I usually turn the watchdog back on but I also have the testing scripts check the watchdog counter and if it increments then the test fails.
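A simple way to switch the watchdog between development and production is a single build-time #define. The sketch below uses stubs so it can run on a PC. PRODUCTION_BUILD, WATCHDOG_ENABLED, and all the function names are hypothetical; in real firmware the stubs would be replaced by the SDK's own watchdog enable and kick calls.

```c
#include <stdbool.h>

/* Define PRODUCTION_BUILD (hypothetical name) in the release build
 * settings. Development builds leave it undefined. */
#ifdef PRODUCTION_BUILD
#define WATCHDOG_ENABLED 1
#else
#define WATCHDOG_ENABLED 0    /* development: let lockups stay visible */
#endif

static bool     stub_wdog_running = false;
static unsigned stub_wdog_kicks   = 0;

/* Stubs standing in for the SDK watchdog calls. */
void stub_watchdog_enable(void) { stub_wdog_running = true; }
void stub_watchdog_kick(void)   { stub_wdog_kicks++; }

bool     watchdog_is_running(void) { return stub_wdog_running; }
unsigned watchdog_kick_count(void) { return stub_wdog_kicks; }

/* Call once at startup. */
void watchdog_init(void)
{
#if WATCHDOG_ENABLED
    stub_watchdog_enable();
#endif
}

/* Call from the poll loop with the result of all health checks. */
void watchdog_kick_if_healthy(bool healthy)
{
#if WATCHDOG_ENABLED
    if (healthy) {
        stub_watchdog_kick();
    }
#else
    (void)healthy;            /* watchdog compiled out in development */
#endif
}
```

Keeping the switch in one place makes it harder to ship a build where the watchdog was disabled in one file and forgotten.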

Watchdog Best Practices for Z-Wave Developers

  1. Disable Watchdog during development using #defines
  2. Only kick the watchdog when everything is idle
    1. Kicking every ApplicationPoll is INSUFFICIENT
    2. Check for ActiveJobs() being stuck (aka the mutex)
    3. Check other conditions within your product
  3. Check that the RF has received something every X minutes or hours
  4. Have a way to test the Watchdog during development
  5. Store the number of Watchdog resets in NVM and retrieve them via a configuration parameter