VeraBridge command queuing idea

Hi AK,

I posted a question in the general section about command queueing, assuming that the vera was doing some, and after thinking further about it I came to realize that it probably isn’t managing any command queue, unlike most other controllers. I could work around it using scenes, but I have come to wonder whether this could be done within VeraBridge, which would be more elegant and could allow multiple scenes to run without causing a zwave overload.
My idea would be to allow only a fixed number of VeraBridge device commands in flight, and to wait for each device to give some feedback before sending the next one: essentially keeping, say, 8 or 10 commands to the vera active and awaiting feedback at any one time. Thoughts?

Ah, you’re trying to get openLuup to do the ZWave queuing that Vera should be doing, but isn’t?

...wait for that device to give any feedback before sending the next one

So therein lies the problem. I’m not sure what could be used as ‘reliable feedback’, and so then we get into our own issues with queuing and timeouts and repeated commands, … ?

I know it is not easy, and I have been thinking about it. Maybe one thing we could do is insert a minimum delay of, say, 200-500 ms between two commands, so we don’t have to deal with the feedback? Maybe it would only concern devices with a device id > 10000 and a parent device number of 1? Maybe the command could be parsed so the delay does not apply to arm/disarm calls?
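Just to make it concrete, something along these lines is what I have in mind. This is purely a sketch: throttle, MIN_DELAY and last_sent are names I have made up, and the arm/disarm test is only indicative of the sort of filtering I mean.

local socket = require "socket"    -- for sub-second wall-clock time

local MIN_DELAY = 300              -- ms, somewhere in the 200-500 ms range
local last_sent = 0                -- when the previous command went out

-- sketch: pause before forwarding a command to a bridged zwave device
local function throttle (devNo, action)
  local bridged = devNo > 10000                         -- bridged device numbers are offset
                  and luup.devices[devNo]
                  and luup.devices[devNo].device_num_parent == 1
  local urgent = action == "SetArmed"                   -- let arm/disarm through untouched
  if bridged and not urgent then
    local gap = (socket.gettime () - last_sent) * 1000  -- ms since the last command
    if gap < MIN_DELAY then luup.sleep (MIN_DELAY - gap) end
  end
  last_sent = socket.gettime ()
end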

Yes, essentially I am trying to create the command queue which the vera isn’t providing. See also this post, http://forum.micasaverde.com/index.php/topic,118898.0.html, which is interesting. I am not sure whether it is truly a zwave problem or whether it is the vera Luup API layer crashing from taking on too many requests at a time.

The idea of inserting delays into a ‘real-time’ system in order to get it to work is a bit of an anathema; however, I have to say that it wouldn’t be the first time I’d done this (as a last resort).

Implemented in VeraBridge, it would, by definition, only apply to remote devices, and I do like the idea of doing it only for native Zwave devices. Do you have a reliable test for the problem it’s trying to solve?

Yes I do. I have a particular scene which sends a total of 6 zigbee and 20 zwave commands (the scene, obviously, is on openLuup). When I run this scene, there is a high probability of causing a Luup reload, and I get an error 137 in my vera log. This probability rises to 100% if, while the scene is still running, I try to poll the vera using the HomeWave mobile app and send another command (e.g. unlock a deadbolt). For now I have added a delay of 20 s so that the scene sends the commands in two batches of 10 and 16, but the batch of 16 still occasionally gives me trouble.

The easiest way to test this at first would simply be to add a delay to any action request sent by the bridge.

In the following routine, make this adjustment…

local function generic_action (serviceId, name)
    ...

    local url = table.concat (request, '&')
    luup.sleep (200)    -- ADD THIS LINE
    wget (url)
    return 4,0
end

…can’t believe I’m actually recommending you try this !!

It turns out that there’s a much more subtle way to let the scheduler queue the requests (and later, if necessary, delay them.)

You might try, as an alternative, this change:

local function generic_action (serviceId, name)
  ...
 
  return {job = job}    -- was {run = job}, now {job = job}, queues for execution rather than runs immediately.
end

I would be very interested in the result from both of these tests. If this latter one is not as successful, then it can be finessed. The huge advantage here is that the processor is free to do other things whilst the request is queued, and there is total control over the minimum delay between requests (or, indeed, whether they are delayed at all.)
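
For instance (and this is only a sketch of the sort of thing I mean, not code that exists in VeraBridge), the queued job could pace itself by rescheduling whenever the previous request went out too recently. Here url and wget are the same as in the generic_action snippet above; GAP and last_request are made-up names:

local socket = require "socket"   -- sub-second timestamps

local GAP = 0.25                  -- minimum seconds between forwarded requests
local last_request = 0

local function job (lul_device, lul_settings)
  local now = socket.gettime ()
  if now - last_request < GAP then
    return 0, 1                   -- WaitingToStart: the scheduler runs the job again in 1 second
  end
  last_request = now
  wget (url)                      -- forward the queued request, as before
  return 4, 0                     -- Done
end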

Thank you. I will test this after work. I don’t really understand the second alternative though. I honestly never really looked into the job functions.

Edit: I tested the luup.sleep with 100 ms and it seems to work. It seems like any delay I add before sending actions to the vera helps. I now suspect it is less of a zwave problem and more of a CPU speed/thread limitation, or a vera engine limitation, preventing the luup engine from handling and queueing so many commands at a time. I tested unlocking my door while the rest of the scene was still running and, in spite of a short delay, my lock did unlock, which is unusual. The failure rate for that was quite high previously; I would often have to repeat the command several times and would eventually see a luup reload. Let me test for a few more days, and try the second alternative too.
I think a big part of the problem, if my theory holds, is that my openLuup runs on a much faster machine than the vera… so my commands come in bursts.

Hi raf,

The Vera Luup engine has two ways of running activities: with run or with job. Run is only to be used for (very) short actions that need to return a value, like getting a state; these are executed immediately (sort of). A job is to be used in all other cases and is executed when possible. A properly written plugin does this control in the I…xml file, and, used correctly, it will avoid a lot of reloads. This is an example of the two:

	<action>
		<serviceId>urn:rboer-com:serviceId:Harmony1</serviceId>
		<name>StartActivity</name>
		<job>
			Harmony_StartActivity(lul_settings.newActivityID or "")
			return 4,nil
		</job>
		<jobname>StartActivity</jobname>
    </action>
	<action>
		<serviceId>urn:rboer-com:serviceId:Harmony1</serviceId>
		<name>GetCurrentActivityID</name>
		<run>
			local status, actID = Harmony_GetCurrentActivtyID()
			return actID
		</run>
    </action>

This is what akbooer is referring to. The change to generic_action will run actions as jobs on the Vera, i.e. scheduled, rather than immediately. It could be the correct solution. I’d be interested too in your test result.

Cheers Rene

This is true, but jobs offer so much more than this. It’s possible, via cooperative scheduling, to suspend and reenter jobs according to various events such as incoming I/O or specified delay times, or a call which explicitly reschedules the job.

A job can be in, or set itself into, a number of states:


local state =  {
    NoJob=-1,
    WaitingToStart=0,         --  If you return this value, 'job' runs again in 'timeout' seconds 
    InProgress=1,
    Error=2,
    Aborted=3,
    Done=4,
    WaitingForCallback=5,     -- This means the job is running and you're waiting for return data
    Requeue=6,
    InProgressPendingData=7,
 }

The above states are the first return code. The second is a (safe) delay parameter which can be associated with a number of the states. This is all part of the normal task action mechanism in Vera (and openLuup). I’ve used techniques like this for long-running tasks like transferring files from Vera or downloading updates in AltAppStore.
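
As a made-up illustration (transfer_done and transfer_failed don’t exist anywhere, they just stand for whatever completion test is appropriate), a long-running job might use those two return values like this:

-- illustration only: a long-running job that reschedules itself until finished
local function job (lul_device, lul_settings, lul_job)
  if transfer_failed () then
    return 2, 0        -- Error: give up
  elseif not transfer_done () then
    return 0, 30       -- WaitingToStart: run this job again in 30 seconds
  end
  return 4, 0          -- Done, nothing more to do
end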

However, even with a zero delay return, as I suggested you try initially, the openLuup scheduler has an anti-race condition mechanism which means that after a number of immediate reschedules, the job will be suspended for a short while to allow other tasks to run. This may be sufficient for Vera to keep up.

Thank you both akbooer and reneboer. I understand better that piece of code now. I have switched to the job implementation and will test it. It seems much more elegant than the luup.sleep. I will report back later today.

Hi akbooer,

I know the jobs can do much more, but I guess that is only for the happy few that truly grasp all this like you. I am still in the simple model ;D

Cheers Rene

Rene, it wasn’t a criticism of your excellent post, just an elaboration. :smiley:

Simple is good - it has more chance of working first time, every time.

So far the job code has been working OK. I have not been able to crash the luup engine. However, the zwave queue remains a problem in the sense that the vera misses sensor status updates while the commands are running. This, of course, depends heavily on how quickly the zwave device/network is able to respond to these actions.

Glad that works to a point, anyway. Some Vera problems simply can’t be fixed, or at least accommodated, remotely!

Indeed, I am actually working both angles. I have managed to get the Vera to run faster by taking out unused services and upgrading a bunch of packages…

I may just make that change permanent, then. It’s something I’ve thought about for a long time, witness the comment that’s been in the code from almost day one, and also that the function name is ‘job’ !

  return {run = job}    -- TODO: job or run ?

Is there less perceived delay in operation compared to the luup.sleep(100) experiment?

Yes, I was going to comment that you apparently had some hesitation on that question and that it was on your to-do list. I don’t see any problem going with the job version at this point, so you can have one of your TODOs checked off! :wink:

After a few days and several scenarios tested, I can definitely say the “job” change makes a huge difference. What I initially believed was a problem with zwave command queuing, with zwave being the bottleneck, I am now convinced was the luup API on the vera being unable to handle the number of commands I was sending it. I now recall provoking Luup reloads when one of these busy scenes was running and I tried to get the HomeWave mobile app to update status at the same time, which of course is an http call requesting a full device status update. Given that openLuup also polls the vera regularly, on top of the various mios server calls (remote access especially), this was probably the cause of a number of luup reloads. This is a significant step forward towards my goal of only one reload a month.

PS: These large scenes now run faster also

Well, that’s progress.

I’ve always been able to take down a Vera with too many HTTP requests, but I’ve never had a problem with openLuup crashing any of its bridged Veras, although it’s quite possible that I’ve not tried too hard. Also, I’m running openLuup on an RPi, not anything faster.

The latest development branch already has the change.