Restarts on just about any scene or Reactor

Space looks OK

What have you got set to change when house mode changes?

C

Not much - a handful of sensors change arm/disarm state, but there are no actions like lights on/off/dim/etc.

I’ve got 206 devices so think it might get hairy doing that through mode changes

We know that Vera generally degrades as your Z-Wave network grows, and, ironically, it has been shown that stacking many jobs in a row will cause her to behave erratically.

May I suggest porting some of those Arm/Disarm toggles over to Reactor (maybe a dozen at a time, until the problem abates), since there you can introduce a [Delay] action between batches.

This advice comes on the heels of a separate discussion in which we all essentially agree that such delays are key to keeping Vera from puking on long chains of actions (i.e. Mode change panel).
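
To make the "batches with a delay in between" idea concrete, here is a rough Python sketch. It is not Reactor or Vera code: `run_in_batches`, `chunked`, the `send` callback, the batch size of 12 and the 5 second pause are all made up for illustration. In Reactor itself you would build the same shape in the activity editor with a [Delay] action between groups.

```python
import time
from typing import Callable, List, Sequence


def chunked(items: Sequence[int], size: int) -> List[List[int]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def run_in_batches(
    device_ids: Sequence[int],
    send: Callable[[int], None],          # hypothetical "arm/disarm this device" callback
    batch_size: int = 12,                 # "maybe a dozen at a time"
    delay_s: float = 5.0,                 # assumed pause; tune to taste
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send one command per device, pausing between batches. Returns the batch count."""
    batches = chunked(device_ids, batch_size)
    for index, batch in enumerate(batches):
        for device_id in batch:
            send(device_id)
        # Pause between batches only, not after the last one.
        if index < len(batches) - 1:
            sleep(delay_s)
    return len(batches)
```

The point of the pause is simply to let the job queue drain before the next dozen commands are stacked on top of it.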

If you’ve confirmed it’s house mode changes, try stopping the sensor changes?

Or stop the house mode changes? See if that narrows it down?

C

@LibraSun - yes, and agreed - but my head scratching is that everything was fine a couple weeks ago with the same configs. I had moved a ton of stuff to Reactor months back and it has been noticeably more stable throughout the day. I dare say that I would even go a few days in a row with no restarts. But something must be hosed to create such a dramatic and recent change in stability…

@Catman Yes, that was my next step as well - just delete the house mode change and see if it executes ok.

@rigpapa One of the many things that is so incredible about Reactor is the ability to continue activity executions even factoring in restarts - But I’ve noticed with this recent issue that even Reactor is not completing some (not all) tasks that follow a delay when a restart occurs. Does that shed any light on it?

I’m going to try removing the house mode changes next and report back.

Thanks to all of you guys (and this community in general) for the pro-bono help we all provide each other!

Ryan

FYI, this is why I wrote my Replacing HouseModes Plug-In with Reactor treatise last month!

LOL

treatise

Ironically - I just did this as my first troubleshooting step. I had the housemodes plugin, and had ‘clicking the button’ as an action in the scene. I removed that and used Reactor’s built-in “change house mode” as a replacement.

I would also study your LuaUPnP.log file carefully. If you find a lot of red and yellow in there, and the infamous “got CAN” messages, that’s an indicator that ZWave commands are being dropped or ignored by the engine. There’s no way for Reactor to know that. It speaks to the health of your mesh and devices. Depending on the structure of the mesh, a single device changing neighbors or dropping out can cause big problems that result in delays and other issues, and very often lead to “got CANs”.
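
If eyeballing the log gets tedious, a small script can do the counting on a copy of the file. This is only a sketch: `summarize_log` is a made-up helper, and it assumes each entry starts with a two-digit level followed by a date and time, as the entries quoted in this thread do. Treating "01" as the red lines and "02" as the yellow ones is also an assumption, so check it against what your own log viewer colors.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumed entry shape: "<2-digit level> <mm/dd/yy> <hh:mm:ss.mmm> <message>"
LINE_RE = re.compile(r"^(\d{2})\s+\d{2}/\d{2}/\d{2}\s+\d{1,2}:\d{2}:\d{2}\.\d{3}\s")


def summarize_log(lines: Iterable[str]) -> Tuple[Counter, int]:
    """Count entries per level prefix and lines that mention "got CAN"."""
    levels: Counter = Counter()
    got_can = 0
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            levels[match.group(1)] += 1
        if "got CAN" in line:
            got_can += 1
    return levels, got_can


sample = [
    # First two entries are quoted from this thread.
    "02 04/24/20 16:42:17.624 Finished rotate logs <0x7713d320>",
    "01 04/24/20 16:16:30.307 JobHandler_LuaUPnP::GetPluginVersionOnline iPlugin: 4086 buffer empty <0x7706c520>",
    # Made-up line, only here to exercise the "got CAN" counter.
    "02 04/24/20 16:43:01.000 made-up sample line containing got CAN <0x00000000>",
]
levels, got_can = summarize_log(sample)
```

To run it on a real log, pass the open file object instead of `sample`; a steadily climbing "got CAN" count between runs is the thing to watch for.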

The only ‘red’ lines I’m seeing (and not sure how far back to go, this is all within the last 5 minutes or so)
I don’t know what device 277 is, can’t find it. Device 134 is a motion detector.

LuaInterface::CallFunction_Timer device 277 refreshCache took 10 seconds (multiple instances of this entry)

ZWaveNode::HandlePollUpdate_Alarm node 59 device 134 v1type: 0 v1level: 0 source: 0 status: 255 type: 7 event: 0 parms_len: 0 parms: 0 code: (null)

Update that I just ‘caught’ another restart in the log:
01 04/24/20 16:16:30.306 FileUtils::ReadURL 28/resp:0 user: pass: size 1 http://apps.mios.com/get_plugin_version2.php?plugin=4086&accesspoint=50061111&platform=mt7621_Luup &firmware=*1.7.4970*&oem=1 response: <0x7706c520>
01 04/24/20 16:16:30.307 JobHandler_LuaUPnP::GetPluginVersionOnline iPlugin: 4086 buffer empty <0x7706c520>

This restart was triggered by running a scene which contained only one plugin (which is also in the others that have been triggering it) - Harmony.

Edit 2: I have tried the same Harmony activity manually and it all worked fine. Ugh, this is driving me nuts.

I’m considering just taking a solid backup and factory resetting everything. I have never seen it this unstable - restarting on small automations and/or not completing them. It would be understandable if I had added a new plugin, reconfigured something in some way but I haven’t.

So you’re saying that you’re commanding Harmony at the mode change? I have seen plugins that access an external server’s API cause problems when you command them to do something at a mode change. In my cases, I found the cause to be an inappropriate (or missing) timeout value for network operations in the plugin, or just some incomplete error handling. I’ve seen that cause a reload in one of two ways:

  1. it causes a deadlock situation because Vera winds up with two things waiting for each other, which after 60 seconds leads to a reload

  2. it causes the scene to run too long (several seconds), and Vera reloads to fix the problem

Did anything change in your home networking recently?

So - the Harmony does change on mode changes, but the restart is happening even if there is no mode change. And then the restart doesn’t happen if I just manipulate the Harmony ‘device’ manually. It’s literally like my ‘scene engine’, for lack of a better term, is broken. I can alter all plugins manually - meaning I can change thermostat modes, run Harmony activities, etc. One of the restarts even triggered from turning on a virtual switch which does 2 things: unlocking deadbolt and disarming the alarm.

Side Note:
I think I have run some command at some point that recycles logs faster than normal, because every time I pull them up, they only go back to about 10 minutes ago, with the first entry being:
02 04/24/20 16:42:17.624 Finished rotate logs <0x7713d320>

Now, to your question about home networking -
Yes. We had left the house vacant for a few weeks and internet had crapped the bed, and I had to reboot even the Netgear switches to get them working again. I wonder if I got a surge or crazy power spike. I have vera, router, NAS, etc on a UPS so didn’t see any issues there, but I had to power cycle like every single thing. Ubiquiti APs, you name it.
I have lots of chromecast audios and in doing all this ‘restoring’ I decided to create a new WLAN and put them on that to get them a little separate. I also gave my 5 GHz network a different name from the 2.4 GHz one. Vera is on LAN directly into the router - and the IPs of other devices, like the Harmony and Alarm panel for example, are accurate and never changed.

What ideas do you have on the networking change?

Interesting … I wonder if instead of a power glitch you had a hacking attempt. If you haven’t already rebooted your router, I’d suggest that. Some hacking attempts leave malicious code in the router’s RAM that gets cleared out with a reboot.

One hacking method is to change the name servers in your router to something else, so that all your network accesses get diverted to the hacker, who intercepts the data and passes it on. This tends to slow down access to everything. It doesn’t slow down your connection speed, but it adds a delay to the start of every access. Do you know how to check the nameservers in your router?

With regards to your faster log recycling – were you saving logs to an external USB stick before? Maybe the “Store Logs on USB Device” got unchecked somehow, or the USB stick failed/got corrupted? Vera can’t store as much internally so the logs recycle faster without a stick.

When you look at your log file at the time of one of the reloads, can you find the line that has “LuaUPnP::Reload” on it? What else is on that line?

If you’re only seeing the problem when you execute the commands in a scene, then maybe that’s the “scene ran too long” problem. You can look in your log file and it will tell you when the scene started executing (search for the scene name in the log). How long between the scene start and the “LuaUPnP::Reload” line?
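
If you'd rather not do the subtraction by hand, something like this works on a copy of the log. It's a sketch under assumptions: `seconds_from_scene_to_reload` and `parse_stamp` are made-up helper names, the timestamp layout is taken from the entries quoted in this thread, and the first sample line below is invented (your real "scene started" line will look different, the script only needs it to contain the scene name).

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Assumed entry shape: "<2-digit level> <mm/dd/yy> <hh:mm:ss.mmm> <message>"
STAMP_RE = re.compile(r"^\d{2}\s+(\d{2}/\d{2}/\d{2})\s+(\d{1,2}:\d{2}:\d{2}\.\d{3})")


def parse_stamp(line: str) -> Optional[datetime]:
    """Return the timestamp at the start of a log entry, or None if there is none."""
    match = STAMP_RE.match(line)
    if not match:
        return None
    return datetime.strptime(f"{match.group(1)} {match.group(2)}", "%m/%d/%y %H:%M:%S.%f")


def seconds_from_scene_to_reload(lines: Iterable[str], scene_name: str) -> Optional[float]:
    """Seconds between the first line naming the scene and the next LuaUPnP::Reload line."""
    scene_time = None
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is None:
            continue
        if scene_time is None:
            if scene_name in line:
                scene_time = stamp
        elif "LuaUPnP::Reload" in line:
            return (stamp - scene_time).total_seconds()
    return None


sample = [
    "08 04/24/20 16:16:25.100 made-up line: scene Morning started <0x00000000>",
    "01 04/24/20 16:16:30.350 made-up line: LuaUPnP::Reload <0x00000000>",
]
elapsed = seconds_from_scene_to_reload(sample, "Morning")
```

A gap of several seconds between the two would point toward the "scene ran too long" case; a gap near 60 seconds would look more like the deadlock case.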

Interesting idea -
Yeah I rebooted the router, vera, switches, APs, just about everything connected. Things were in this down state for about 2 weeks so everything had festered for a while. I will say something interesting was that the time was way off for a Windows PC that is hardwired (and set to automatically update). Trying to confirm in my router now and it’s off a few seconds, looks like the last response from the NTP server was April 4th and it just says waiting for response.

Regarding the USB logging, I have an RFX transceiver (window shades and temp sensors) so can’t use a jump drive for them.

Might try swapping out some of your Ethernet cables in case one or more took an electrical strike. Same logic applies to one or more router/switch ports in network.

Next ‘upgrade’ in home automation is to replace router and net switches… and this very well might be the last straw to do it. I power cycled the router this morning, the only thing that would force it to refresh its NTP server connection, and it updated its time by a few seconds. I also took one of the switches out of the equation for the APs’ connection. Pulled the plug on vera and let it sit for an hour in timeout. I had to partial reset it twice (3x reset pushes in 6 seconds) to get it to establish connection again on the network. Of course I had to synchronize a few things like the ecobee token. That among other things accounted for the 3-5 restarts off the bat, but it’s been stable since, best I could tell.

I wonder if there was an issue with the timers in scenes and the router being off a bit on time? I know time being off can cause all kinds of goofy things… so that might be it. I’ll know for sure if it makes it 24 hours.

Side note of ‘smart home hell’ or ‘WAF’, etc. - I have a condition that starts chromecasting to 6 speakers throughout the house, pretty fun - even changes the Pandora station based on weather temps, etc. Basically the condition is an ‘on or off’ thing, which begins playing when we all wake up, come home, etc. Turns off at goodnight, leave house, etc. At exactly midnight, music began playing like a freaking scene out of poltergeist! Hence vera got put in timeout when I got up and ran my morning scene - which of course triggered several restarts…

You have just lived my nightmare. #death_by_waf

Ok, so if I run this group activity (denoted by the red arrow), it triggers a restart. My only ideas that would hose it up would be the variable setting or thermostat mode (which uses the ecobee plugin). I can run both of those manually (denoted by the green arrows) and they both succeed without issue.

Is it possible the delay for 10 seconds is causing the tripping? @LibraSun you mentioned the scene timers issue, but I did run that code…

The question is not whether you can run the commands individually, but whether removing them makes the restart go away. My suspicion is that the Ecobee plugin is communicating with its target devices and something in that interaction is causing a deadlock. In this instance, it may simply be because you are running the activity via the test function, which makes an HTTP request to the Vera to run the activity; if the Ecobee plugin also uses HTTP in its communication, that would be the makings of a potential deadlock.

An alternate way to test this activity would be to toggle the “NOT” group setting on the “Everyone Awake” group. This would cause a reversal of the condition group’s result that will make the group activity run (or not–toggle it back and forth until it does). This is a more “natural” way of running the activity, because it is driven by the normal flow of logic and not by an outside HTTP request.