Zwave Tardiness... And Luup reloads

Trying to push the limit on the interval between luup reloads and to address the seeming randomness with which these reloads occur, I am now focused on the zwave-specific behavior. A number of us have over the years abandoned the vera, I believe, specifically because of the lag/luup reloads on zwave.

I am starting this new thread about an old problem to see if we can understand this behavior better and how to mitigate it. I am fully aware that it is a fundamental design problem which hopefully will be addressed by the new eZLO platform but in the meantime…

Note that there are a number of reasons why luup reloads, and I have addressed/eliminated practically all of them except for the zwave timeout, from what I can tell in my logs:
-IP network (isolated the vera, disabled UPnP discovery, disabled NetworkMonitor)
-cloud servers (eliminated all cloud server bonding to mios)
-plugins (moved them all to openLuup)
-DRAM (monitoring at the OS level)
-data corruption/storage (extroot)
-CPU (testing on an emulator with a much faster CPU)

My zwave network has 143 nodes.
For this hammer test, I have set my HEM to report wattage every 10s.
I have a number of zwave outlets which also report energy every few seconds.
All the device actions are sent from openLuup.
1 Thermostat, 12 zwave vents and 6 zigbee vents are taking an action every minute due to my HVAC automation.

The test runs code that turns every light and dimmer on and off, 40 of them, one by one from openLuup with 5 lines of code (not using the all on/all off zwave multicast); a sketch of the loop is below. I am attempting to flood my zwave network to see where it breaks while tailing my LuaUPnP.log file and grepping for the keyword "tardy". I noticed that I get a Luup reload with exit code 137 every time the tardiness in the log exceeds about 320s. That exit code is LuaUPnP causing itself to reload because it had a "timejump", detecting, from what I can tell, that a callback hung for more than 5 minutes.
The two exit codes 245 and 137 are the only Luup reload cases I have seen in my logs for the past several months so I believe that zwave is the main source of instability.
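
For reference, the stress loop from openLuup is roughly along these lines (a minimal sketch; the device number list is illustrative and the SwitchPower1 service is assumed for the lights, but luup.call_action is the standard Luup call):

  -- Sketch of the stress test: toggle each switch/dimmer one by one,
  -- deliberately skipping the all on/all off multicast.
  -- The device number list is illustrative; enumerate your own lights.
  local SID = "urn:upnp-org:serviceId:SwitchPower1"
  local lights = { 12, 14, 17, 23 } -- ...all 40 devices under test
  for _, dev in ipairs(lights) do
    luup.call_action(SID, "SetTarget", { newTargetValue = "1" }, dev)
    luup.call_action(SID, "SetTarget", { newTargetValue = "0" }, dev)
  end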

The result is quite disturbing. The test definitely stresses the zwave network and causes some delay and hangups of the zwave dongle, but… it still appears random.
I can see the tardy log report go up and gradually grow from low to high numbers. As each device reports back that its action completed, I can also see that tardy number come down some and then continue to go back up until all devices complete their actions.

What is strange is the range of behavior I am observing.
Best case I saw was all lights turning on and off within 5s with no "tardy" in my logs. This often happens just after a luup reload.
Worst case is the zwave network appearing to hang, causing "tardy" to go above 300, after which I get a luup reload. This occurs more frequently than the best case.
I have seen every case in between and these are the most frequent results:
-Sometimes I see it hang halfway: after successfully activating some devices within 1s, it hangs for a few seconds with the "tardy" number going up to 10-20s or even >100s, and then either gradually comes back down as devices report back one by one, or the "tardy" reports suddenly drop and disappear as all remaining devices report back within 1-2s after having hung for some time.
-On the UI, the typical message I see for hanging devices is "waiting to send with ack" if the device has not responded at all, or "waiting for response after x retry" (oftentimes the device action occurred but the device's report was not received or processed).

Just looking at "idle" behavior, I am still seeing "tardy" occasionally going up to 4-20s and rarely to >100s, but once every few days it exceeds 300s and causes a luup reload. I have also seen the vera go a few hours without any "tardy" report.
This makes me suspect that the issue is not just related to how busy my zwave network is but more to how the vera is communicating with the zwave dongle.

Any ideas or input even for testing is appreciated.


Try changing the 10s up to like 30s-60s just to see if that would affect the uptime.

10s was just for my test. I am keeping it there for now. My normal setting was 30s… I will see if it makes things worse, but if anything, during my stress test 10s somehow felt like it would reload less often. That remains to be confirmed by data though.

Oh ok, I did something like that and it caused devices to stop working and luup reloads, but that was on the old firmware.

Yeah, so far so good. I have my wattage reporting every 10s and my accumulated power reporting every 60s. We will see how long this lasts, but it seems more difficult to get it to hang and reload since the change. The correlation to zwave network traffic is not linear, which is why I don't think it is the main factor. I think the main bottleneck is somewhere between the zwave serial API and the LuaUPnP program.

Well, I just had another reload due to a scene turning off some lights, and digging through the logs… I am starting to really fume:

I can see now that while the tardy callback count is building up to very high numbers, the zwave chip seems to be functioning correctly and is reporting sensor trips from other devices, and power and energy status updates even from the same device the vera is trying to turn off. Then at about 300s the watchdog kills the luup engine with the "TimeJump" without even attempting a retry. So it appears that there is a condition under which the vera does not retry but instead lets its command time out. Getting to the root cause:

  1. LuaUPnP randomly drops a message to or from the zwave serial API. It appears random to me because I have no visibility into what is going on. The drops go both ways: sometimes the network does not see the command from the vera at all; sometimes the command went out (the device actually executes) but the ack is missed by the vera.
  2. In response to this miss, most of the time the vera will retry. After 5 failed retries the vera flags the device as "cannot contact the device". But why, oh why, is there a case where no retry is executed and instead the vera allows the command to time out and trigger a luup reload??? This is a massive bug. (A sketch of what bounded retry handling could look like follows this list.)
  3. The luup reload takes way too long. Most controllers come back up within a second or two. The vera takes 40-45s, an interval during which many reports and commands are missed. To @rigpapa's point, you should not be reloading the entire engine on purpose.
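
To illustrate item 2, here is a hedged sketch of what host-side bounded retry handling could look like, so a command either succeeds, is retried, or is flagged as failed instead of hanging until the watchdog fires. send_frame and wait_for_ack are hypothetical placeholders, not actual Luup or zwave serial API calls:

  -- Hypothetical bounded-retry send: never let a command hang forever.
  local MAX_RETRIES = 5    -- matches the "5 failed retries" behavior above
  local ACK_TIMEOUT_S = 2  -- illustrative per-attempt timeout

  local function send_with_retry(frame)
    for attempt = 1, MAX_RETRIES do
      send_frame(frame)                   -- hypothetical serial write
      if wait_for_ack(ACK_TIMEOUT_S) then -- hypothetical bounded wait
        return true
      end
      -- ack missed: fall through and retry instead of blocking the callback
    end
    return false -- flag "cannot contact the device"; never reload the engine
  end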

@Sorin, let me know if you need my logs, and I would really appreciate it if you could draw the devs' attention to this post. There were some improvements in zwaveserial 1.0-2 but the problem remains. At one point I thought I had a hardware issue with the zwave dongle or a network issue, but given that it is still reporting all the devices (including the one to which the command was sent), it is really a problem within the luup engine either not sending the command or missing the ack or response from the dongle.

I found it interesting that the tardy phenomenon appears to be much worse on the newer 6.81.5 SDK. I just ran the same test on the 6.81.3 zwave firmware and it does not appear to show the same issues…


I can confirm with certainty now that the issue came from the new zwave version. The difference is night and day: though the problem still exists on the older SDK and remains the main source of luup reloads, the frequency of occurrence is much lower, a reload every ~10-12hrs on the new SDK vs. randomly every 10-14 days on the old one.

I am speculating that it could be due to the lack of a command queue in the luup engine, which dumps commands to the zwave dongle firmware as they come. The zwave SDK documentation clearly states that the command queue must be in the host and that the zwave dongle cannot be relied upon for command queuing. The new zwave firmware could possibly have gotten rid of the very short command queue it had in order to accommodate new features.
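
To make the idea concrete, here is a minimal sketch (my own illustration, not vera code) of host-side command queuing: frames are queued in the host and handed to the dongle strictly one at a time, so the dongle's buffering is never relied upon. dispatch_to_dongle and on_ack are hypothetical hooks into the serial layer:

  -- Minimal host-side command queue: serialize frames to the dongle.
  local queue = {}
  local busy = false

  local function try_send()
    if busy or #queue == 0 then return end
    busy = true
    dispatch_to_dongle(table.remove(queue, 1)) -- hypothetical serial write
  end

  function enqueue_command(frame)
    table.insert(queue, frame) -- queuing lives in the host, per the SDK docs
    try_send()
  end

  function on_ack() -- hypothetical: called when the dongle acks a frame
    busy = false
    try_send()      -- release the next queued command
  end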

I believe I have solved the tardiness issue, and it might have been my own wrongdoing.
At one point I had added a secondary controller, a zway, thinking I would eventually switch over. That controller created additional associations to itself when it configured these devices. Then I recently decided to exclude it, as I was doing nothing with it. Just yesterday I re-added it, rather for testing and diagnostic purposes, and saw all these associations to an excluded controller id. I deleted all these associations (very easy to do with zway) and have not seen any of these tardy alarms in my logs since; they were occasionally causing luup reloads.


I believe I have now closed the tardiness saga… After redoing the lifeline associations as discussed above, I had a significant decrease in the "tardy" alarms in my logs, but they were still not completely eliminated. Every few hours I would get some 4-60s reports, during which the vera zwave devices would freeze while secondary controllers worked fine on the same network. Very rarely, they would cause a luup reload.

Analyzing more logs led me to find that the calls triggering these tardy reports were communications to my ZTS500 thermostat. Digging deeper, I found that there were situations where too many calls were being sent to the thermostat, causing it to ignore frames for some period of time. The calls came from my custom HVAC automation plugin. I changed some code to prevent a command telling the fan to turn off when it is already off (a sketch of the idea is below), and I have not seen any tardy alarms in my logs since.

Bottom line: some devices in your network can intermittently cause luup reloads if you overwhelm them with commands and they don't respond the way the vera expects. Reducing unnecessary traffic on your zwave network is paramount for stability.
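
The fix amounts to checking the reported state before sending a command. A minimal sketch of the idea, assuming the standard HVAC_FanOperatingMode1 service (the thermostat device number is illustrative):

  -- Only send the fan command if it would actually change something;
  -- the redundant "off" commands were flooding the thermostat.
  local FAN_SID = "urn:upnp-org:serviceId:HVAC_FanOperatingMode1"
  local THERMOSTAT = 42 -- illustrative device number

  local function set_fan_mode(mode)
    local current = luup.variable_get(FAN_SID, "Mode", THERMOSTAT)
    if current ~= mode then -- skip the call if the device is already there
      luup.call_action(FAN_SID, "SetMode", { NewMode = mode }, THERMOSTAT)
    end
  end

  set_fan_mode("Auto") -- no-op if the fan is already back on auto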

Hi Rafale77

How do I narrow down this kind of investigation? The best I have is a macro-enabled Elvira Excel file for better viewing and filtering of the log.

But many times I look at the lead-up to the luup reload and find nothing of significance to point me in any direction.

Should I run debug on the log settings for some time?

Excel is a bit too tedious.

I use two tools. On a Mac, whenever I get a reload (I know because I get a notification on my phone through code in my startup lua, sketched below) and I happen to be home, I SCP into the vera and pull the latest log. Then I use Atom (a text editor), which lets me very easily search and jump to the point where the reload occurred.
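
The notification bit is simple. A hedged sketch of the startup lua idea: on every engine start (i.e., after each reload), ping a push notification endpoint. The URL is a hypothetical placeholder for whatever service you use; luup.inet.wget is the standard Luup HTTP call:

  -- In the startup lua: this runs on every engine (re)start, so hitting a
  -- notification endpoint here effectively reports each luup reload.
  luup.inet.wget("https://example.com/notify?msg=luup+reloaded", 5)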

Then in extreme cases… like it has been this past week, I have a desktop on all the time with an SSH tunnel into the vera and I run a grep on a tail of the log. This breaks every time the log rotates (tail -F would re-open the file after rotation, if your build supports it), but I let my log go 12hrs before rotating (as you know, I am either extrooted or on my emulator) so I don't have much in the way of space limitations.
As an example:

tail -f /tmp/log/cmh/LuaUPnP.log | grep -E "tardy|got CAN"

This lets me see log entries live whenever one of the terms I am looking for shows up. In the example above those are "got CAN" and "tardy", which is what told me that the tardy events had no correlation with the got CAN events.

Back to the original problem. I still think this is an error-handling issue in the vera: the command queue is not very robust. I have eliminated the main source of command collisions to the thermostat and I am doing a victory dance about it. :crazy_face::stuck_out_tongue_closed_eyes::star_struck: But I still see a 4s or 6s tardy once every few hours. It is nowhere near as bad as it used to be, and I don't think it can cause a reload, but still… It may be error handling in itself, but I still get nervous over a long time period.

I found it interesting that I had a lot of tardy reports with relatively low numbers (all below 30s and mostly below 10s) overnight during the nightly heal. They started at 2am and stopped at 4:30am, so it is a sign that there is a delayed response from the zwave dongle.

It wasn't doing a backup of the Z-wave network at the time, or anything like that?

Nope. The backup only takes a couple of seconds, not enough to show in the tardy report, which only triggers when a callback is delayed by 4s or more, and I get my backup done at 5am… The zwave chip does its own heal and keeps its own time, which it gets from the host. As a side note, I can see from zway that there is even a zwave call to sync time.

Is there a log entry for the end of the nightly heal? I didn’t see anything that looked likely.

Because the heal is not triggered by the vera, technically no. It is the zwave chip doing it on its own. What it really appears to be is a signal to each node on the network to update its neighbor nodes. So the zwave chip queues and tags nodes for healing and then untags them when it gets a message that the node is done. In the log you should see the "heal" tag, which can be nil or non-nil.

example:

02 07/30/19 13:40:00.038 ZWaveNode::Wakeup did a poll for 205 1564519199 seconds interval 0 existing (nil) heal (nil) <0x7ef80520>

I see the start of the heal with the "NightlyMaintenance" log entry. I'll take a look at tonight's logs.

You are correct

02 07/30/19 2:00:00.101 ZWave::NightlyMaintenance <0x7ef80520>

You can look up "update nnu" and you will see the commands from the heal process complete ("nnu" = neighbor node update). You will notice that they are sent in increasing node id order.

I am also getting a ton of dropped frames due to collisions during that time, but the job finally completes.

Hmm, and the fact that they have job ids in the vera makes me think that the maintenance is actually triggered by the luup engine after all.