Vera Z-Wave Command Queue Limit

So I am down to, I believe, my last grievance with my Vera, and I have the firm intention to address it one way or another: the command queue.

I have now eliminated almost every source of Luup reloads and reboots and am down to two:

  1. The monthly reload on the 1st of the month at midnight, which I guess I can live with.
  2. The swamping of the command queue.

Given the number of devices I have, I have some scenes which actuate a very large number of devices. They never seem to cause a reload by themselves, but if the scene runs and lags or struggles to complete, and I add a few manual commands either from my phone or the UI, I am guaranteed a loss of responsiveness and a Luup reload. Now my question is this (Sorin, if you could please find out): what is the command queue limit of the Vera?
I know it may not be a straightforward answer, but I want to know how many commands I can send it at once (within 100 ms) and then within 1 s. I intend to use openLuup to stagger the actions in my scenes to prevent a crash. I don't think this is documented anywhere.
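Since I brought up staggering: here is a minimal sketch of what I have in mind, using the standard `luup.call_delay` and `luup.call_action` calls. The device numbers, batch size and spacing are made up, and the small stub at the top exists only so the snippet runs outside a Vera/openLuup box:

```lua
-- Sketch: send scene commands in small batches with a delay between batches,
-- instead of dumping everything on the Z-Wave queue at once.
sent = {}                                -- records what was "sent" (stub only)
luup = luup or {                         -- stub so this runs outside Vera/openLuup
  call_delay  = function(fname, secs, arg) _G[fname](arg) end,
  call_action = function(sid, action, args, dev) sent[#sent + 1] = dev end,
}

local SID_SWITCH = "urn:upnp-org:serviceId:SwitchPower1"
local devices = { 12, 15, 18, 21 }       -- hypothetical device numbers
local BATCH   = 2                        -- commands per batch
local SPACING = 2                        -- seconds between batches

function send_batch(start_index)         -- global: call_delay calls it by name
  local i = tonumber(start_index)
  for j = i, math.min(i + BATCH - 1, #devices) do
    luup.call_action(SID_SWITCH, "SetTarget", { newTargetValue = "1" }, devices[j])
  end
  if i + BATCH <= #devices then
    luup.call_delay("send_batch", SPACING, tostring(i + BATCH))
  end
end

send_batch("1")
```

On a real box the stub branch is never taken and `luup.call_delay` schedules the next batch a couple of seconds later, so the radio only ever sees `BATCH` commands at a time.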

Hi Rafale,

Had a quick discussion with the guys in the office. So the thing is, if you actuate any number of devices through a scene, the Z-Wave chip sends a singlecast command per device, which, depending on the number of devices, might turn out to be slow if you have mesh bottlenecks or secure Z-Wave devices. On the other hand, the Turn All ON/OFF command from your Dashboard sends a multicast command to nearby nodes (nodes directly linked to the Vera, not the routed ones), which should be much faster and not build up a queue. So if you can somehow leverage this function through some kind of scene, you might partially resolve your issues. Otherwise, try to figure out whether you have any bottlenecks in your location (search for the hourglass effect on Z-Wave), or try to trigger these scenes gradually with delays between them, or offload some of the devices onto a secondary controller.

The Z-Wave job queue is handled by the Z-Wave chip itself, not the Vera unit, so I can't tell you about any “limits”; but if there are limits, I think they are a function of the available bandwidth and the particularities of each Z-Wave device (i.e. 9.6 to 100 kbps for Z-Wave Plus). You would need to know how many bits a command takes, multiplied by the number of devices; if that approaches the 100 kbps bandwidth of a Z-Wave Plus chip, you can work out your headroom from there.
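To put rough numbers on that headroom argument, here is a back-of-envelope calculation. The frame size and data rate below are assumptions for illustration only; real Z-Wave frames vary with command class, routing and security:

```lua
-- Raw airtime of a burst of commands, ignoring ACKs, retries and routing.
-- kbps is bits per millisecond, so the result comes out in milliseconds.
function burst_airtime_ms(num_commands, frame_bytes, rate_kbps)
  local bits = num_commands * frame_bytes * 8
  return bits / rate_kbps
end

-- e.g. 30 commands of ~20 bytes each at 40 kbps: about 120 ms of raw airtime
print(burst_airtime_ms(30, 20, 40))
</code/>
```

Even a generous burst is only a fraction of a second of raw airtime, which suggests the trouble comes from queue handling, ACK waits and retries rather than bandwidth alone.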

Hi Sorin. Is there a way to send a multicast “switch off” command only to a group of switches? I can't send it to all of them, but maybe it's possible to include only selected device IDs. Thanks.

Sorin, thank you. I had indeed forgotten about that multicast All On/All Off call in the Z-Wave SDK, which may be useful for a couple of my scenes.
As for therealdb's question: for most of my problematic scenes I actually need a partial set, not all on or all off.
The multicast is actually a single call transmitted throughout the network, and each Z-Wave device is configured to respond to it or not. You would have to manually program the device (if it supports it), so the decision to obey that command lives in the device, not in the Vera.

I beg to differ on the handling of the command queue. I have read the Z-Wave SDK documentation, and the queue is clearly handled by the host controller; the Z-Wave chip should not handle it. As proof, I have controllers like OpenZWave, the Silicon Labs PC Controller and Z-Way, all of which have a UI page showing the queue of commands, including the wait-for-reply state and how long each command has waited before being sent. I believe there is some very short queuing in the network itself, as I have observed from delayed sensor reporting, but that is not where the bulk of the queuing should happen.
See these documents:
https://www.silabs.com/documents/login/user-guides/INS13114-Z-Wave-PC-Based-Controller-v5-User-Guide.pdf
https://www.silabs.com/documents/login/user-guides/INS13954-Instruction-Z-Wave-500-Series-Appl-Programmers-Guide-v6_81_0x.pdf
https://www.silabs.com/documents/login/user-guides/INS12350-Serial-API-Host-Appl.-Prg.-Guide.pdf

From the Silicon Labs Serial API host application programming guide:

"The host SHOULD queue up requests for processing."

If the Vera is not managing the job queue and is only a direct passthrough, this might be the problem! The Vera may be completely swamping the network. I did not see this problem when I tested Z-Way, on which I had over 1,000 commands in the queue; it was processing, I believe, 6-10 at a time to the Z-Wave chip, waiting for each ACK before sending the next one. I also understand that my question is not easy to answer, as you have device polling enabled by default, which is not recommended in large networks. I have mine disabled.
You are correct about the bottleneck: for example, if a device is slow to respond, I have observed the Z-Way queue hold that wait-for-ACK slot, which occupies a spot in the queue and prevents the next command from being sent. If the Vera does not queue commands, then I may have actually found my answer, and as a rule of thumb I may need to limit the number of Z-Wave commands per scene to ~10 and go from there.
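What I observed in Z-Way can be modeled as a host-side queue that accepts any number of commands but keeps only a handful in flight at once. This is a toy sketch of that idea, not Vera's (or Z-Way's) actual implementation, and the in-flight limit of 6 is an assumption based on my observation above:

```lua
-- Toy host-side command queue: unlimited pending commands, but at most
-- MAX_INFLIGHT handed to the radio at once; each ACK (or timeout) frees a slot.
HostQueue = { pending = {}, inflight = 0, MAX_INFLIGHT = 6 }

function HostQueue:submit(cmd)
  table.insert(self.pending, cmd)
  self:pump()
end

function HostQueue:pump()
  while self.inflight < self.MAX_INFLIGHT and #self.pending > 0 do
    local cmd = table.remove(self.pending, 1)
    self.inflight = self.inflight + 1
    -- a real controller would write cmd as a Serial API frame here
  end
end

function HostQueue:on_ack()              -- a timeout would land here as well
  self.inflight = self.inflight - 1
  self:pump()
end

-- queue ten commands: only six reach the radio; the rest wait their turn
for i = 1, 10 do HostQueue:submit("cmd" .. i) end
```

The point is that a slow or dead node only ties up one of the six slots; the backlog sits safely on the host instead of overflowing the chip.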

Note that even if the Luup engine does not crash, the Vera completely misses sensor signals while the queue is overrun, so my sensor states are not updated.

An alternative explanation:

A single device with poor connectivity can block the entire queue until the connectivity issue is resolved. This might mean waiting tens of seconds for timeout or a few seconds for a retry on that device to succeed.

This is easily reproducible as I detail here.

http://forum.micasaverde.com/index.php/topic,62343.msg354955.html

The blocking depends on which command is attempted. In that thread I show setTarget blocking for a binary device, while setTarget does not block for a dimmable device. I'm speculating that setLoadLevel is correctly implemented with retry and error handling, that setTarget has a bug which causes it to block, and that on dimmable devices setTarget uses the setLoadLevel logic instead.

Anyway, the gist is that not every command implementation has robust handling of connectivity failures and that these failures unnecessarily block the command queue, potentially for a considerable amount of time.
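Given that observation, one hedged workaround is to drive dimmable loads through Dimming1/SetLoadLevelTarget instead of SwitchPower1/SetTarget. The service and action names below are the standard Luup ones; the device numbers are made up, and the stub exists only so the snippet runs outside a Vera:

```lua
calls = {}                               -- records actions issued (stub only)
luup = luup or {                         -- stub so this runs outside Vera/openLuup
  call_action = function(sid, action, args, dev) calls[#calls + 1] = action end,
}

-- Prefer SetLoadLevelTarget on dimmers, since the thread suggests its retry
-- and error handling cope better with an unreachable node than SetTarget's.
function switch_on(dev, dimmable)
  if dimmable then
    luup.call_action("urn:upnp-org:serviceId:Dimming1",
                     "SetLoadLevelTarget", { newLoadlevelTarget = "100" }, dev)
  else
    luup.call_action("urn:upnp-org:serviceId:SwitchPower1",
                     "SetTarget", { newTargetValue = "1" }, dev)
  end
end

switch_on(42, true)    -- hypothetical dimmer: goes through Dimming1
```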

I’m not 100% sure that it is even related to connectivity issues. I have a couple of scenes that actuate large numbers of devices, and they would frequently cause a luup reload even though all devices were functioning well. After breaking up these scenes into parts with short time delays between them, the reloads have completely stopped.

So is this best practice? To disable polling? I have about 60 Z-Wave devices (plus another 12 Z-Wave child devices), plus another 20 EnOcean devices, plus 20 DSC alarm devices, PLEGs, etc.

This is the way I think about this: what is the purpose of polling? To check whether the device is still in the network.
There are three types of Z-Wave devices:

  1. AC powered (light switches, bulbs).
  2. FLiRS: battery powered but frequently listening, because they need to be ready to receive a command (examples: locks, vents).
  3. Battery powered and normally sleeping (usually battery-powered sensors).

The Vera by default wants to poll all three. Polling type 3 is moot, since those devices can't respond until they wake up; I think the Vera just ignores these polls.
Polling type 2 is useful only to check battery condition, though there could be automation based on the presence of this type of device if it is mobile (I don't know of many). Essentially, polling these devices just consumes their battery for little added value. Locks usually get polled right after a command is sent anyway.
Type 1: I'm not sure why you would want to poll these, since they are fixed and constantly powered. Maybe to detect their failure, but the Vera is more likely to fail than they are, so why bother?

My conclusion has been that I want to disable automatic polling on all types of devices, since polling consumes battery on FLiRS, has no value for AC-powered devices, and does nothing for sleeping battery-powered devices. There are rare exceptions: FLiRS and AC-powered devices which do not update status on their own and require the Vera to poll them (example: light switches without instant status), but they are rare and usually older devices. I believe Vera implemented auto polling for these specific devices but generalized the behavior to all devices, which is one of the many mistakes plaguing UI7.


I still have some of the early Jasco/GE light switches that were non-instant status, although I am slowly replacing them with the latest Jasco/GE light switches. So I don’t want to eliminate polling, but you got me thinking about dramatically scaling back the polling frequency.

Back to your original question, and I am going out on a limb here, but I have some anecdotal evidence that the swamping of the Z-Wave command queue is partly a factor of which devices are sending the commands. Meaning: if a single device is sending the requests, then you have a hard limit of three. One can be queued, then a second, but if a third is sent while the other two are still in the queue: reload. I tested this, somewhat unscientifically, with a single giant PLEG, and it seemed to hold true. Then I split the single PLEG into multiple PLEGs and sent the same commands at the same frequency, and had a fraction of the reloads. Then I structured each of the small PLEGs to be what I call “confirmative”: they don't send the next Z-Wave command until they essentially get confirmation that the previous one completed (by adding to each condition that it cannot be true while another condition is currently true), and the reloads all but stopped. I am taking something RTS said out of context and adding my own experiences, so YMMV.

http://forum.micasaverde.com/index.php/topic,31345.0.html
http://forum.micasaverde.com/index.php/topic,33445.0.html
http://forum.micasaverde.com/index.php/topic,36757.msg273838.html#msg273838
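That “confirmative” structure translates to plain Lua as a sequencer that releases the next command only when the previous one has been confirmed (e.g. by a status variable update). This is a stand-in for the PLEG wiring, not PLEG syntax:

```lua
-- Each step is a function that sends one Z-Wave command; confirmed() is
-- called when the previous command's status change has been observed.
Sequencer = { steps = {}, index = 0 }

function Sequencer:add(step) table.insert(self.steps, step) end

function Sequencer:confirmed()
  self.index = self.index + 1
  local step = self.steps[self.index]
  if step then step() end
end

-- usage: queue three actions, release them one confirmation at a time
log = {}
for _, name in ipairs({ "lamp", "fan", "heater" }) do
  Sequencer:add(function() log[#log + 1] = name end)
end
Sequencer:confirmed()   -- "lamp" goes out
Sequencer:confirmed()   -- "fan" goes out only after lamp is confirmed
```

On a Vera the confirmation callback would typically come from `luup.variable_watch` on the device's status variable, which is what the PLEG conditions above are effectively doing.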

Thanks for this.

This somewhat confirms my reading of the Z-Wave documentation. The controller host is supposed to be the one queuing the commands, not the Z-Wave radio. The chip itself has a very short queue, and polling is a command stream the Vera inevitably abuses if you let it configure devices with the defaults.

For your devices, you can set up specific devices to be polled and others not. The setting on the Z-Wave menu screen actually does nothing for devices which have already been configured. The funky thing is, the Vera wants you to reconfigure the device when changing its polling frequency, which is another bizarre requirement.

When disabling polling from the Z-Wave settings, I found the device poll setting of 60 seconds was ignored.

I suspect the overall responsiveness of my system and the Luup reloads are due precisely to polling.

Should I enable polling but zero out the settings on the Z-Wave settings page, to retain individual device polling settings?

It's amazing, this many years later, to still have to muck around with this particular setting.

You can try both. It sounds from your comment like there are actually two polling mechanisms, which is quite possible, but disabling polling in the Z-Wave menu alone does not eliminate all polling. I don't know what you mean by retaining individual device polling settings; in my experience, if they are set to anything but 0, the devices will poll. I personally recommend disabling all of them except for the specific devices which require it.

Interestingly, my Luup was solid for a while on the newest firmware, then seemed to degrade over time. I had a Luup uptime of almost 11 days, then the restart at the end of the month, and it went gradually downhill.

When I disabled polling globally, I had a strobe set for a 60-second poll and it never polled.

Before the end of the day I turned it back on because devices weren’t updating if I didn’t use the actual switch.

A strobe connected to a motion sensor went off; it flashes for a bit, but the Vera registered it as on for several hours. With polling enabled, it shows off after the usual wait time.

When I globally turned polling back on, I set the four options below it to 10 seconds greater than the default. It didn't make things any better.

So, from your response: do I need to reconfigure the device or update neighbor nodes if I turn off global polling and set polling individually?

It does seem to me that not all polling follows the global values. When I set the strobe to 0 for polling while global polling was on, the strobe was still polled every 9 minutes according to the logs.

Frustrated since 2015.

I wish this was better documented too.
In my setup, I only have a couple of in-wall switches which do not support instant status and therefore benefit from continuous polling. I disabled global polling but noticed that individual devices were still being polled, so I disabled polling via Lua code (it is a device variable) for all non-battery-powered devices and FLiRS, followed by a Luup reload. Upon reload, all these devices needed to be reconfigured, which is easy because they are all powered and do not need a manual wakeup. This streamlined the traffic on my network a lot.
You do not need to update neighbor nodes; I believe the Vera does that automatically when you reconfigure the device.
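For reference, this is roughly what that Lua step looks like. `PollSettings` on the ZWaveDevice1 service is the per-device poll interval; treating a missing `BatteryLevel` variable as “mains powered” is a heuristic of mine, and the stub at the top exists only so the snippet runs outside a Vera:

```lua
polled_off = {}                       -- records affected devices (stub only)
luup = luup or {                      -- stub for running outside Vera/openLuup
  devices = { [12] = {}, [15] = {} },
  variable_get = function() return nil end,
  variable_set = function(sid, var, val, dev) polled_off[#polled_off + 1] = dev end,
}

local SID_ZW = "urn:micasaverde-com:serviceId:ZWaveDevice1"
local SID_HA = "urn:micasaverde-com:serviceId:HaDevice1"

-- Disable polling (PollSettings = 0) on every device that reports no battery
-- level, i.e. the mains-powered ones; battery devices are left alone here.
for devid in pairs(luup.devices) do
  local battery = luup.variable_get(SID_HA, "BatteryLevel", devid)
  if battery == nil then
    luup.variable_set(SID_ZW, "PollSettings", "0", devid)
  end
end
```

After running something like this, a `luup.reload()` and a reconfigure of each affected device makes the change stick.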

Now, on your system going downhill: have you looked for a potentially dying device somewhere? Logs would be helpful here, to see which device could be causing the reloads.