I am trying to push out the interval between Luup reloads and to address the seeming randomness with which these reloads occur. I am now focused on Z-Wave-specific behavior. A number of us have abandoned the Vera over the years, I believe, specifically because of the lag/Luup reloads tied to Z-Wave.
I am starting this new thread about an old problem to see if we can understand this behavior better and how to mitigate it. I am fully aware that it is a fundamental design problem which hopefully will be addressed by the new eZLO platform, but in the meantime…
Note that there are a number of reasons why Luup reloads, and I have addressed/eliminated practically all of them except for the Z-Wave timeout, from what I can tell in my logs:
-IP network (isolated the Vera, disabled UPnP discovery, disabled NetworkMonitor)
-cloud servers (eliminated all cloud server bonding to mios)
-plugins (moved them all to openLuup)
-DRAM (monitoring at the OS level)
-data corruption/storage (extroot)
-CPU (testing on an emulator with a much faster CPU)
My Z-Wave network has 143 nodes.
For this hammer test, I have set my HEM (home energy monitor) to report wattage every 10s.
I have a number of Z-Wave outlets which also report energy every few seconds.
All the device actions are sent from openLuup.
1 thermostat, 12 Z-Wave vents, and 6 Zigbee vents take an action every minute due to my HVAC automation.
The test runs a script that turns on and off every light and dimmer (40 of them), one by one, from openLuup, in about 5 lines of code (not using the Z-Wave all-on/all-off multicast). In other words, I am deliberately attempting to flood my Z-Wave network to see where it breaks, while tailing my LuaUPnP.log file and grepping for the keyword “tardy”. I noticed that I get a Luup reload with exit code 137 every time the tardiness in the log exceeds about 320. That exit code is LuaUPnP causing itself to reload because it had a “timejump”, detecting, from what I can tell, that a callback hung for more than 5 minutes.
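For reference, the flood loop is essentially the following sketch. The device numbers here are made up, and I am assuming the standard SwitchPower1/SetTarget Luup action for switches and dimmers; on openLuup these calls go to the bridged Vera devices:

```lua
-- Sketch of the flood test, run from openLuup (device numbers are hypothetical).
local lights = {101, 102, 103}   -- bridged device numbers of the 40 lights/dimmers
for _, dev in ipairs (lights) do
  luup.call_action ("urn:upnp-org:serviceId:SwitchPower1", "SetTarget",
                    {newTargetValue = "1"}, dev)   -- turn on
  luup.call_action ("urn:upnp-org:serviceId:SwitchPower1", "SetTarget",
                    {newTargetValue = "0"}, dev)   -- turn off
end
```

Since each call_action is queued rather than waited on, the whole batch lands on the Z-Wave dongle nearly at once, which is the point of the test.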
These two exit codes, 245 and 137, are the only Luup reload causes I have seen in my logs for the past several months, so I believe Z-Wave is the main source of instability.
The result is quite disturbing. The test definitely stresses the Z-Wave network and causes some delays and hangups of the Z-Wave dongle, but… it still appears random.
I can see the tardy number in the log gradually grow from low to high. Each time a device reports back that its action completed, I can see the tardy number come down somewhat, then continue back up until all devices complete their actions.
What is strange is the range of behavior I am observing.
The best case I saw was all lights turning on and off within 5s, with no “tardy” in my logs at all. This often happens just after a Luup reload.
The worst case is the Z-Wave network appearing to hang, causing “tardy” to go above 300, after which I get a Luup reload. This occurs more frequently than the best case.
I have seen every case in between; these are the most frequent results:
-Sometimes it hangs halfway: after successfully activating some devices within 1s, it hangs for a few seconds, with the “tardy” number going up to 10-20 or even >100s. Then either the number gradually comes back down as devices report back one by one, or the “tardy” reports suddenly drop and disappear as all remaining devices report back within 1-2s after having hung for some time.
-On the UI, the typical message I see for hanging devices is “waiting to send with ack” if the device has not responded at all, or “waiting for response after x retry” (oftentimes the device action occurred but the device’s report was not received or processed).
Just looking at “idle” behavior, I still see “tardy” occasionally going up to 4-20s, rarely to >100s, but once every few days it exceeds 300s and causes a Luup reload. I have also seen the Vera go a few hours without any “tardy” report at all.
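For anyone who wants to watch for this on their own unit, here is a minimal sketch of the log watch I am doing. It assumes the stock log path /var/log/cmh/LuaUPnP.log and that the value appears in the log as “tardy N”; check your own log for the exact wording and adjust the pattern:

```shell
#!/bin/sh
# Read log lines on stdin, pull out the number after "tardy", and flag
# values at/above a threshold (around 300s is where my reloads happen).
watch_tardy() {
  threshold=${1:-300}
  grep -o 'tardy [0-9][0-9]*' | while read -r _ value; do
    if [ "$value" -ge "$threshold" ]; then
      echo "ALERT tardy $value"
    else
      echo "tardy $value"
    fi
  done
}

# Live use on the Vera:
#   tail -F /var/log/cmh/LuaUPnP.log | watch_tardy 300
```

The function reads stdin so it can be fed either by tail -F for live monitoring or by cat for a post-mortem pass over an old log.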
This makes me suspect that the issue is not just related to how busy my Z-Wave network is, but rather to how the Vera communicates with the Z-Wave dongle.
Any ideas or input, even just for testing, are appreciated.