Breakthrough made in improving Z-Wave reliability - major issue resolved

EDIT: The following post created a flurry of inaccurate information on some blogs and had the opposite of its intended effect. All software has bugs. All software gets updates to fix those bugs. This is true of everything from your iPhone to your Mi Casa Verde Vera, and even the firmware used in the Z-Wave chips. The bug discussed below is NOT a hardware problem with the Z-Wave chip; it’s a problem with the firmware (software) in the chip, and it is solved by updating that firmware. It’s not a big deal. It’s not the first time. And the issue only affects the controller (i.e., Vera); none of the other nodes require any updates. As with any firmware, there have been bugs before, there will be bugs after, and there will be more firmware updates in the future. Software is always a work in progress. This post was intended to calm some users in our forums who were pushing for answers about some routing anomalies. The goal was to relay the great news that Sigma took those reports seriously, investigated them, found the problem, and that a fix was forthcoming. Some blogs blew this completely out of proportion and are asserting that there’s something fundamentally wrong with Z-Wave chips. I’m very sorry if this post led them to that wrong conclusion. We stand behind Z-Wave 100% and feel that it’s by far the best home automation networking system on the market, and that with this newest fix, it will only get better.

SHORT FORM: Last week a bug was discovered in the firmware of the Z-Wave chip itself that is likely the cause of some users’ complaints that their Z-Wave networks aren’t reliable. Sigma, which makes the Z-Wave chips, is publishing an upgrade. We held up the release we promised last week to get this fix included. The Z-Wave chip in the Vera 2 will be upgraded automatically with our firmware release, which we will be pushing out to our beta testers in a couple of days. For the Vera 1, it will take a while before we can publish a Windows utility that is able to upgrade the Z-Wave chip in the dongle.

LONG-WINDED DETAILS:

There was some chatter in the forums about our recent face-to-face meeting with Sigma’s Z-Wave engineers in Denmark, so we wanted to fill everyone in on the developments and explain why the next release hasn’t been pushed out yet.

Every Z-Wave device, including Vera, your door locks, light switches, etc., has a Z-Wave chip in it, made by Sigma Designs. Z-Wave is a “mesh network”, meaning messages are relayed from one module to another until they reach their destination. All the logic for routing messages from one node to another is handled by the Z-Wave chip itself. This is Sigma’s “secret sauce”, and the source code for it is not available to Mi Casa Verde or anybody else in the Z-Wave alliance. Only Sigma has access to it.
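To make the mesh idea concrete, here is a minimal Python sketch that treats relaying as a path search over a neighbor table. It is only an illustration under stated assumptions: the five-node topology in NEIGHBORS and the find_route helper are hypothetical, and a plain breadth-first search stands in for Sigma’s routing logic, which is closed source and very likely works differently.

```python
from collections import deque

# Hypothetical neighbor table: which nodes each node can hear directly.
# Node 1 is the controller; node 5 is a light switch out of its direct range.
NEIGHBORS = {
    1: {2, 3},
    2: {1, 4},
    3: {1},
    4: {2, 5},
    5: {4},
}


def find_route(neighbors, source, dest):
    """Return the shortest list of hops from source to dest, or None.

    A plain breadth-first search used only to illustrate relaying;
    it is not the routing algorithm inside the Z-Wave chip.
    """
    previous = {source: None}  # node -> the node we reached it from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == dest:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # no chain of relays reaches dest


print(find_route(NEIGHBORS, 1, 5))  # [1, 2, 4, 5]
```

In this toy network the controller cannot reach node 5 directly, so the message is relayed by nodes 2 and 4, which is the “hopping” the rest of this post refers to.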

So when you tell Vera to turn on your light, Vera sends a message to the Z-Wave chip inside Vera saying “Here’s a message for the light, node #5”. The chip handles all the routing and transmission and simply returns ‘success’ or ‘fail’ to Vera. If it fails, Vera has no information about the reason. We are not able to ask what route was used for the message or get any other debug information. The only way to get debug information is to go on-site with a special tool called a “Zniffer”, a Z-Wave monitor that captures the traffic. Sigma does not allow Zniffers to be given to anybody but alliance members.
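From the host’s side, the interface described above boils down to a function that takes a node number and a message and returns a yes or no. The Python sketch below models that contract. SendFn, send_with_retries and the fake chip are hypothetical stand-ins, not Sigma’s actual serial API, and the retry loop is just one thing a host could do with so little information.

```python
from typing import Callable, Iterator, Optional

# Hypothetical shape of the chip's "send to node N" entry point:
# the host passes a node id and a payload and gets back only True/False.
SendFn = Callable[[int, bytes], bool]


def send_with_retries(send: SendFn, node_id: int, payload: bytes,
                      retries: int = 3) -> Optional[int]:
    """Return the attempt number that succeeded, or None if all failed.

    The chip reports nothing but success or failure, so on a failure the
    host cannot tell why it happened; in this sketch its only option is
    to try again.
    """
    for attempt in range(1, retries + 1):
        if send(node_id, payload):
            return attempt
    return None


# A stand-in chip that fails twice, then succeeds. Whatever route it
# tried on each attempt is invisible to the caller.
outcomes: Iterator[bool] = iter([False, False, True])
fake_chip: SendFn = lambda node_id, payload: next(outcomes)

print(send_with_retries(fake_chip, 5, b"on"))  # 3
```

The payload b"on" is a placeholder, not a real Z-Wave command. The point is only that nothing in the return value says which route was used or why an attempt failed, which is why a Zniffer is needed to see that.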

Over the past couple of years we’ve had various complaints about reliability. While most issues turned out to be in our software and we could fix them, when it came to the reliability of Z-Wave itself, there was nothing we could do other than send one of our engineers on-site with a Zniffer to capture data. On many occasions, while in customers’ homes with a Zniffer, we’ve observed the Z-Wave chip using dead-end routes, so that messages never reached their destination. As a result, Vera may be unable to control a device, and an event reported by a device, such as a door lock being opened, may never make it back to Vera. We forwarded these Zniffer logs to the local U.S. Sigma reps, but nothing ever came of it. Until last week we were never able to get a face-to-face meeting with Sigma’s core Z-Wave coders in Denmark, because we were the only member of the Z-Wave alliance complaining that routing wasn’t always working. We’ve long suspected that’s simply because we have many more power users with large installations, and we do more low-level debugging than our competitors.

As previously mentioned, from August to November we shifted much of our attention to some large OEM deals, which involve much greater quantities. This is why firmware updates weren’t being published during this period. Fortunately, this higher volume gave us enough leverage that Sigma took the unusual step of giving us face-to-face time with their engineers in Denmark. During these meetings, a problem was uncovered.

When you pair a Z-Wave device, such as a light switch, the Z-Wave chips in Vera and the light switch assume they are in direct range and can communicate directly, without relaying messages through other nodes. Once you’ve finished pairing your devices and put them in their final places, you’re supposed to do a ‘heal network’, or ‘repair network’, which causes every Z-Wave node to discover which other Z-Wave nodes are nearby, so it can figure out which nodes to use as relays. This information is called the routing table. What was discovered is that during this discovery process the Z-Wave chip was not clearing out the entries in the routing table that recorded which nodes were in direct range. It only added to the table, without first clearing it. The net result is that the Z-Wave chip never got a clean routing table and always assumed every node was in direct range. So when a node wasn’t actually in direct range, the chip would try “hopping” or “routing” on the mesh network, but since it thought every node was in direct range, the routing was more random than logical, and it could choose routes that were dead ends.
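The effect of the missing clear can be shown with a toy model in Python of the controller’s “who is in direct range” list. This is a hypothetical sketch, not Sigma’s firmware: the node numbers, the sets after_pairing and actually_in_range, and the heal helper are invented for illustration, and the model assumes, as described above, that pairing records every device as being in direct range.

```python
# Hypothetical network: node 1 is the controller (Vera), nodes 2-5 are
# devices. Each device was paired next to the controller, so pairing
# recorded all of them as in direct range.
after_pairing = {2, 3, 4, 5}

# Once the devices are installed, only these are really in direct range.
actually_in_range = {2, 3}


def heal(direct_range, discovered, clear_first):
    """Return the controller's direct-range set after a heal.

    clear_first=False models the reported bug: newly discovered
    neighbors are added, but stale entries from pairing are never removed.
    """
    table = set() if clear_first else set(direct_range)  # copy, don't mutate
    table |= discovered
    return table


buggy = heal(after_pairing, actually_in_range, clear_first=False)
fixed = heal(after_pairing, actually_in_range, clear_first=True)

print(sorted(buggy))                      # [2, 3, 4, 5]
print(sorted(fixed))                      # [2, 3]
print(sorted(buggy - actually_in_range))  # [4, 5] wrongly marked as direct
```

In this model, any routing decision based on the buggy table starts from the false premise that nodes 4 and 5 can be reached in one hop, which is consistent with the symptoms described above. Clearing the set before adding the discovered neighbors is what the workaround below is meant to achieve.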

This issue is now confirmed as issue “3066”. It is not unique to Vera. It’s a universal issue that affects every Z-Wave controller running the 5.02sp3 firmware. Sigma will be releasing a new Z-Wave firmware that fixes it. The Z-Wave chip in the Vera 2 is field upgradeable, so once the new firmware is available, your Vera will automatically update its Z-Wave firmware the next time you update Vera’s firmware. To work around the problem until then, Sigma revealed to us some undocumented “secrets” about how to directly manipulate the memory in the Z-Wave chip to manually clear out the routing table.

We have finished implementing this workaround and will be pushing out new firmware to the beta testers within a couple of days, and to the public shortly thereafter. When you get this new firmware you will want to do a ‘heal network’ again. The first time you do, it will take an extra-long time, because Vera actually runs the stress test twice: once before applying the workaround and once after, and it sends us a report with the before and after results. We will post a follow-up after the firmware has been out for a couple of months, letting everyone know its real-world impact on reliability and speed. With this new firmware you may also notice a speed improvement, because during our testing Sigma uncovered a bug in the Linux kernel drivers for the CPU in Vera, which may have been causing some delays. Since we don’t have that expertise in-house, we hired a Linux kernel developer as a consultant, and he has fixed this problem too.

There are a lot of fixes in this upcoming release. We’ll try to get it out to everyone as soon as we can.