Reactor luup restart every minute?

jdesai61 · August 2, 2020, 2:40am

Something seems to have happened to my Vera and I am seeing frequent Luup restarts (after six-plus months of stability). I am seeing the following in the Lua log files when searching for Reactor. I have checked all the Reactor-created devices and expressions/conditions and nothing seems out of order.

Any hints on best way to debug this issue?
Thanks,
… selected logs from
grep Reactor /var/log/cmh/LuaUPnP.log | grep luup

50	08/01/20 22:18:34.509	luup_log:120: Reactor START-UP! <0x76964520>
50	08/01/20 22:18:34.816	luup_log:120: Reactor: Plugin version “3.7-20190” starting on #120 (“Reactor”) <0x76964520>
…
50	08/01/20 22:19:13.531	luup_log:120: Reactor START-UP! <0x76cb2520>
50	08/01/20 22:19:13.766	luup_log:120: Reactor: Plugin version “3.7-20190” starting on #120 (“Reactor”) <0x76cb2520>
…
50	08/01/20 22:19:44.529	luup_log:120: Reactor START-UP! <0x76b42520>
50	08/01/20 22:19:45.038	luup_log:120: Reactor: Plugin version “3.7-20190” starting on #120 (“Reactor”) <0x76b42520>
…
50	08/01/20 22:20:51.705	luup_log:120: Reactor START-UP! <0x76de4520>
50	08/01/20 22:20:52.147	luup_log:120: Reactor: Plugin version “3.7-20190” starting on #120 (“Reactor”) <0x76de4520>
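
To put numbers on how often Luup is reloading, the START-UP lines can be turned into gaps between restarts. This is only a minimal Python sketch: the regular expression and function names are my own, and it assumes the tab-separated timestamp layout shown in the excerpt above.

```python
import re
from datetime import datetime

# Matches the timestamp on lines such as:
# 50<TAB>08/01/20 22:18:34.509<TAB>luup_log:120: Reactor START-UP! <0x76964520>
STARTUP_RE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+luup_log:\d+: Reactor START-UP!"
)

def startup_times(lines):
    """Return a datetime for every Reactor START-UP line found."""
    times = []
    for line in lines:
        match = STARTUP_RE.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S.%f"))
    return times

def restart_gaps(times):
    """Seconds elapsed between consecutive start-ups."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

# The four START-UP lines quoted in the post above.
SAMPLE = [
    "50\t08/01/20 22:18:34.509\tluup_log:120: Reactor START-UP! <0x76964520>",
    "50\t08/01/20 22:19:13.531\tluup_log:120: Reactor START-UP! <0x76cb2520>",
    "50\t08/01/20 22:19:44.529\tluup_log:120: Reactor START-UP! <0x76b42520>",
    "50\t08/01/20 22:20:51.705\tluup_log:120: Reactor START-UP! <0x76de4520>",
]

if __name__ == "__main__":
    for gap in restart_gaps(startup_times(SAMPLE)):
        print(f"{gap:.1f} s since previous start-up")
```

On the four lines quoted here, the gaps come out at roughly 39, 31, and 67 seconds, which matches the "every minute" impression in the title.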

rigpapa · August 2, 2020, 2:42am

Hmmm… just because you see Reactor starting itself in the logs, doesn’t mean Reactor is causing the restart.

Reactor will disable itself after 10 Luup/system restarts within 15 minutes, unless you have disabled that feature. I will be very surprised if Reactor is causing the restarts, but it will get itself out of the way so you can figure out what is actually doing it.
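
For readers wondering what a "10 restarts within 15 minutes" rule looks like in practice, here is a rough Python sketch of a sliding-window counter. It is not Reactor's actual code (Reactor is written in Lua and its internals are not shown in this thread); the class name and defaults are hypothetical and only mirror the numbers quoted above.

```python
from collections import deque

class RestartGuard:
    """Illustrative sliding-window restart counter (hypothetical, not Reactor's implementation)."""

    def __init__(self, max_restarts=10, window_seconds=15 * 60):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._times = deque()

    def record(self, now):
        """Record a restart at time `now` (in seconds).

        Returns True when the number of restarts inside the window
        has reached the limit, i.e. when the plugin should stand down.
        """
        self._times.append(now)
        # Drop restarts that have aged out of the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.max_restarts

if __name__ == "__main__":
    guard = RestartGuard()
    # One restart per minute, as in the original report.
    for i in range(10):
        tripped = guard.record(i * 60)
    print("tripped after 10 restarts at 1/min:", tripped)
```

With one restart per minute the limit is reached on the tenth restart; with restarts two minutes apart, no more than eight ever fall inside a 15-minute window, so the guard never trips.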

jdesai61 · August 2, 2020, 3:22pm

Thanks - I didn’t realize that. For some reason I thought Reactor was causing it. I had to stop Reactor from going into its safety lockdown mode (by setting high thresholds). I will keep looking to see what else is causing it.

ruster34 · August 7, 2020, 3:29pm

So interesting - I am in the same shoes. Sure, maybe 6 days ago sounds close enough. My goodnight and morning scenes were taking upwards of 10 restarts to complete. I am losing my mind trying to identify the issue. I have Reactor driving all conditions and executing all activities - but I also have several plugins that I’d blame first (and have been trying to rule in or out) but no answers yet. I wish there was some level of observation where I could trigger the condition and watch to see what’s hitting the wall.

Last night I thought I uncovered a pin storage issue with the Alarm Panel, so I edited the scenes to include the pin (vs retrieved from said failed storage method) and got my “leave” and “home” scenes to actually complete without restart and thought I found it. But apparently I made it worse with other troubleshooting efforts because overnight, it escalated to include multiple full reboots…

If it helps, here are the plugins that I have in case we have the same one(s) and can narrow it down:

Logitech Harmony
EVL3 / Alarm Panel
Ecobee
House Modes Plugin
Vera Alerts
VeraConnect WWN (Nest)
AutoVera (Tasker plugin)
Darksky Weather

rigpapa’s suite lol:

Reactor
Site Sensor
Rachio
Switchboard
Yeelight Interface

rigpapa · August 7, 2020, 3:30pm

There’s a “run activity” button on every ReactorSensor activity.

ruster34 · August 7, 2020, 3:33pm

Oh right - sorry I mean I wish I could run the activity and watch live log data show what’s unraveling. I can “run activity” and then all I see is the blue notification bar showing the restarts over and over. It’s crazy - like I’ll see only a couple devices change state before another restart, then the next few, restart, etc. Something is killing it

rigpapa · August 7, 2020, 3:34pm

Are you looking at the LuaUPnP log?

ruster34 · August 7, 2020, 3:36pm

Yes, but I can’t find anything that indicates a critical failure before the restart, when I get the 500 error

I will say there are a TON of entries for “AlarmManager” whatever that is

Your ‘logic summary’ shows events way better and cleaner than the log. But still can’t see anything indicative

rigpapa · August 7, 2020, 3:52pm

If you have an activity that reliably causes a problem, I would start removing actions from it, one at a time, and seeing if you can get a positive change in behavior. Of the things you listed, my first go-to would be any actions for the Nest WWN plugin, which I banished from my systems two years ago for being… problematic. Since you’ve got a lot of “AlarmManager” messages logged, the EVL3 actions would be a close second (consider first, since my WWN experience is negative but out of date).

Also, if you put comments in your activity with a “*” as the first character, it will emit the remainder of the string to the event log as a message. This is useful for calling out each device action before you run it (progress breadcrumbs), for example.

The event log, though, gets cleared every time Reactor starts (which it would at reload), so I’m wondering what you’re actually seeing there. My guess is you’re furiously refreshing the logic summary page trying to catch something before it reloads/clears, but that’s got a Vegas chance of you hitting anything useful. There is a way to log the events to a file, which will roll and survive the reloads: set LogEventsToFile on your ReactorSensor to 1, then reload Luup to make the setting change take effect. You’ll then find a file in /etc/cmh-ludl called ReactorSensornnn-events.log, where nnn is the device number of your ReactorSensor.
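
If editing the variable through the UI is awkward, the same change can probably be made with Luup's `variableset` and `reload` data requests. The Python sketch below only builds the URLs and does not send anything. The IP address and device number 306 are placeholders, the helper names are hypothetical, and the ReactorSensor service ID is an assumption on my part, so check it against the device's Advanced > Variables tab before using it.

```python
from urllib.parse import urlencode

# ASSUMPTION: service ID of a ReactorSensor device; verify on your own system.
RS_SERVICE = "urn:toggledbits-com:serviceId:ReactorSensor"

def luup_request_url(host, **params):
    """Build a Luup data_request URL in the port_3480 form used in this thread."""
    return f"http://{host}/port_3480/data_request?{urlencode(params)}"

def enable_event_file_urls(host, device_num):
    """URLs to set LogEventsToFile=1 on one ReactorSensor, then reload Luup."""
    set_var = luup_request_url(
        host,
        id="variableset",
        DeviceNum=device_num,
        serviceId=RS_SERVICE,
        Variable="LogEventsToFile",
        Value=1,
    )
    reload_luup = luup_request_url(host, id="reload")
    return [set_var, reload_luup]

def events_log_path(device_num):
    """Where the rolling event log should appear, per the description above."""
    return f"/etc/cmh-ludl/ReactorSensor{device_num}-events.log"

if __name__ == "__main__":
    for url in enable_event_file_urls("192.168.0.20", 306):
        print(url)
    print(events_log_path(306))
```

Paste each printed URL into a browser on the same network, in order; the second one triggers the Luup reload that makes the setting take effect.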

ruster34 · August 7, 2020, 5:43pm

I love how you built in the ability to log through restarts…
So I enabled that, restarted, set the test time to ‘evening’ (for other aspects of the conditions) and triggered the condition. When it finally completed the scene, I changed the time to ‘morning’ and the morning condition was triggered. Those 2 scenes finally completed 21 minutes and 8 restarts later. I looked through ~1500 lines of event log and nothing stands out. I see Reactor restarting, picking up where it left off, lots of mentions of the test time, but nothing showing an issue. The Harmony plugin did its thing, the alarm plugin did its thing, the ecobee plugin did its thing. Regarding WWN, I don’t have any activity to those devices, even when switching house modes (arm/disarm). I only have them to trigger a ‘fire’ scene if they ever alarm out. I can certainly remove that plugin since it’s not used ‘often’, but I’d love to have some ability to react if they did.

I’ve attached the event log file created in case you were bored
ReactorSensor306-events.log.pdf (124.8 KB)

Like you mentioned, I can start removing things one at a time to try to isolate but this is taking hours I don’t have at the moment. Might be faster to remove all plugin activities and then add them back in one at a time.

Also, only because it seems peculiar - I should mention that when I’m looking at the UI during the ‘morning’ scene I notice one device (a kitchen cabinet light) that seems to be turned on but is stuck at “wait for acknowledgement” when the first of the restarts happens. It’s a very loose measure of order, because I know there are many devices being set to a different state, but I’ve noticed it multiple times during all this.

rigpapa · August 7, 2020, 6:09pm

That sounds like a good approach as well. You can also try inserting delays between device types. Your last comment regarding “wait for acknowledgement” suggests that it may be having trouble communicating with the device, and you are perhaps getting the dreaded “got CAN/dongle is in a bad state” and from there the Z stack is getting confused and eventually deadlocking (which is a frequent cause of restarts). Your LuaUPnP log will hold the clues here, and maybe that’s the next thing to start studying.

ruster34 · August 7, 2020, 6:30pm

Ah - I did a quick find on that and found an entry just 2 minutes ago…

1	08/07/20 14:26:28.464	ZWaveJobHandler::ReceivedFrame NONCE_GET flood node 67 <0x761c0520>
01	08/07/20 14:26:28.465	got CAN <0x761c0520>
24	08/07/20 14:26:28.465	ZWaveSerial::Send woke up: ulTime 3585435 ulTime_end 3585464 TimeToWaitInMs 2000 m_pres (nil) m_preq (nil) status 0 <0x76bc0520>
24	08/07/20 14:26:28.465	ZWaveSerial::Send m_iFrameID 1171 type 0x0 command 0x13 got failure 0x18 iNumFailedResponse 1 m_iSendsWithoutReceive 0 numretriesforack 3 <0x76bc0520>
02	08/07/20 14:26:28.465	ZWaveSerial::Send m_iFrameID 1171 got a CAN – Dongle is in a bad state. Wait 1 second before continuing to let it try to recover. <0x76bc0520>

Is this referring to the main zwave dongle on the vera plus or a specific device? I don’t see a device ID or anything…?

ruster34 · August 7, 2020, 8:27pm

I must have a critical conflict in my 10 reactor sensors. I took all the plugin activity out of the mix, disabled all RS. I ran the goodnight scene (only ‘devices’ like lights, windows, etc.) and it ran fine. I ran morning scene same way and all ran fine. I added all plugin activity (media, ecobee, alarm, etc) for both goodnight and morning and they all ran fine again.

Any tips to check conflicts across all 10 sensors?! haaha… The logic summary and status tab is super helpful for each one, but might need to just sit down with a whiteboard lol

The search continues!

rigpapa · August 7, 2020, 8:32pm

You’ll note the first message makes reference to node 67. That’s the ZWave node ID, which is not the Vera device number.

An easy way to find the Vera device is to use a user_data request. Request the following URL in a browser, substituting your Vera’s local IP address where indicated:

http://192.168.0.20/port_3480/data_request?id=user_data&output_format=xml

Use the in-page search feature of your browser to search for the string altid="67". Because of overlapping use of numbers in this data, there may be multiple hits to the search. The one you are looking for will typically also have on the same line the term id_parent="1". If you read the other attributes on that line, it should be pretty easy to find the device name.
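
The same lookup can be scripted instead of searched by eye. This is a small Python sketch under a few assumptions: it expects the XML text of the user_data response to be passed in as a string, it relies only on the `altid` and `id_parent` attributes described above, and the sample document (device numbers and names included) is made up for illustration.

```python
import xml.etree.ElementTree as ET

def find_zwave_device(user_data_xml, node_id, parent="1"):
    """Return the attributes of every element whose altid matches the
    Z-Wave node ID and whose id_parent matches the Z-Wave controller device."""
    root = ET.fromstring(user_data_xml)
    return [
        dict(el.attrib)
        for el in root.iter()
        if el.get("altid") == str(node_id) and el.get("id_parent") == parent
    ]

# Hypothetical, heavily trimmed stand-in for a real user_data response.
SAMPLE_XML = """
<root>
  <devices>
    <device id="1" name="ZWave" altid="" id_parent="0"/>
    <device id="212" name="Example Plug-in Module" altid="67" id_parent="1"/>
    <device id="340" name="Unrelated child device" altid="67" id_parent="118"/>
  </devices>
</root>
"""

if __name__ == "__main__":
    for dev in find_zwave_device(SAMPLE_XML, 67):
        print(dev["id"], dev["name"])
```

Filtering on `id_parent` as well as `altid` is what screens out the overlapping hits mentioned above, such as child devices of plugins that happen to reuse the same number.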

Now, another thing to keep in mind is that this device may be fine, it was the device before it that was being acted upon that started the problem. So be sure to look above this in the log and identify that device as well.
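
One way to do that "look above" step systematically is to scan the log for `got CAN` and collect the node IDs mentioned in the lines just before each hit. The Python sketch below is only an illustration: the function name and the 20-line look-back are arbitrary choices of mine, and it assumes node references appear as `node NN`, as in the excerpt posted earlier.

```python
import re
from collections import deque

NODE_RE = re.compile(r"\bnode (\d+)\b", re.IGNORECASE)

def can_events(lines, context=20):
    """For each 'got CAN' line, list the Z-Wave node IDs seen in the
    preceding `context` lines, oldest first and without duplicates."""
    recent = deque(maxlen=context)
    events = []
    for line in lines:
        if "got CAN" in line:
            nodes = []
            for previous in recent:
                for node in NODE_RE.findall(previous):
                    if node not in nodes:
                        nodes.append(node)
            events.append((line.strip(), nodes))
        recent.append(line)
    return events

# The first two lines of the excerpt posted earlier in the thread.
SAMPLE = [
    "1\t08/07/20 14:26:28.464\tZWaveJobHandler::ReceivedFrame NONCE_GET flood node 67 <0x761c0520>",
    "01\t08/07/20 14:26:28.465\tgot CAN <0x761c0520>",
]

if __name__ == "__main__":
    for line, nodes in can_events(SAMPLE):
        print(nodes, "before:", line)
```

Run against the full /var/log/cmh/LuaUPnP.log, this gives a short list of candidate nodes per CAN event; the last node in each list is the one closest to the failure, and the earlier ones are the devices that were being acted upon just before it.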

Edit: also, if you go to the Reactor master device, Backup and Restore tab, you’ll find a “Logic Summary” link there, too. If you request the logic summary from the Reactor master device, it will give you summaries for all ReactorSensors together.

ruster34 · August 7, 2020, 8:37pm

You’re like an oracle - you know that? This traced it back to an Inovelli device (all of which I have had issues with) so it doesn’t surprise me that it is acting up again. Usually a power cycle (it’s a plug-in module) fixes it.

Now that we’ve sorted that - any ideas on how to cross compare multiple RS logic? I must have a conflict in there somewhere…

rigpapa · August 7, 2020, 8:43pm

Did you see the edit I added to my previous reply? I was trying to grease it in… does that get you started?

ruster34 · August 7, 2020, 8:45pm

LOL I did not, and of course you thought of this. Seriously - it’s odd how well you build things
And we are all grateful you have provided keys, color coding, even flashing, to help us figure out our mistakes haha!

rigpapa · August 7, 2020, 8:47pm

If I could add blinky lights, I would. I’m really old school.

ruster34 · August 7, 2020, 8:55pm

lol I love it! I went ahead and sent the logic summary - I’m looking through this over the weekend, but based on the formatting requests, you probably have a macro that can scan through it, so figured I’d send it if you were willing to check it out.

Thanks as always for everything. Hopefully I’ll find something sooner rather than later

rigpapa · August 7, 2020, 10:16pm

I poked through the Logic Summaries, and at first glance nothing really jumps out, but there are a couple of things you may want to investigate.

First, the big scenes like Goodnight… I have a similar function, but I’ve taken a different approach, not by design, but more by accident because that’s how I was thinking about my use at the time that I did it, which is before Reactor existed… what I do is I have a master off scene for each level of my house. I also have an off scene for each room, unless the room has a really small number of devices (e.g. one). Each “level master off” Vera scene does a RunScene (using scene Lua) of the individual room off scenes on that level, and only the single-device exceptions are done directly by the level master scene as needed. I then have an “indoor off” and “outdoor off” scene; the indoor off scene runs the per-level off scenes (again using RunScene in scene Lua). So in execution, there’s a spread of multiple jobs because multiple scenes are being run. I think this has a big effect on Vera’s timing… there’s something undefined but palpable in the way Vera runs scenes. I think having the additional jobs, which are put through Vera’s scheduler, better spreads things out and slows them down, which gives the system pauses where it can breathe, as opposed to slamming it with one giant scene.

Before going down that road, I would try this simple experiment: there’s a switch on the “Run Scene” action in Reactor activities where you can have Reactor run the scene, or hand it off to Vera/Luup to run it. Having Reactor run it is usually faster, but for a large scene like this, the pace of Reactor’s instructions may just be overwhelming for Vera Luup’s scheduling and Z-wave stack. I think it will pace differently if you let Vera run it itself; see if you notice a difference. If letting Vera run the scene is more stable, then I would just go that route — it may come with a little performance penalty, but for an “all off” scene, that shouldn’t be much of a concern.

If that doesn’t smooth out those larger scenes, then I would try splitting them up as I have done, or by some similar approach.
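
As a rough illustration of the split-scene idea, the Python sketch below builds one Luup `RunScene` request per room scene and spaces them out, rather than firing one giant scene. It is not the scene Lua described above, just an HTTP-side equivalent for experimenting. The scene numbers, IP address, pause length, and helper names are all placeholders, and the code only builds the plan without sending anything.

```python
from urllib.parse import urlencode

HAG_SERVICE = "urn:micasaverde-com:serviceId:HomeAutomationGateway1"

def run_scene_url(host, scene_num):
    """Luup HTTP request that asks Vera to run one scene."""
    query = urlencode({
        "id": "action",
        "serviceId": HAG_SERVICE,
        "action": "RunScene",
        "SceneNum": scene_num,
    })
    return f"http://{host}/port_3480/data_request?{query}"

def staggered_plan(host, scene_nums, pause_seconds=5):
    """Pair each room scene with a start offset so the requests are spread out."""
    return [
        (index * pause_seconds, run_scene_url(host, scene_num))
        for index, scene_num in enumerate(scene_nums)
    ]

if __name__ == "__main__":
    # Hypothetical per-room "off" scene numbers for one level of the house.
    for offset, url in staggered_plan("192.168.0.20", [11, 12, 13]):
        print(f"t+{offset:>2}s  {url}")
```

The point of the offsets is the same as the delays suggested earlier in the thread: giving the Z-Wave stack room to finish one batch of commands before the next arrives.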