[1.57][SMP] Call to native.call() in peripheral.call() does not return

ncmaothvez #1

4 posts

Location Sweden

Posted 04 September 2014 - 01:49 PM

Hi all!

New member here with a rather odd ComputerCraft problem. First my question and then I'll go a bit more into detail about the problem.

Q: Is it at all possible for a call to native.call() to NOT return? (Looks like that's the case)

Background
So, using CC 1.57 I'm coding a packet-routing wireless network on top of rednet that uses a set of fixed "routers" (CC computers with wireless modems) and mobile nodes (CC turtles with wireless modems). The goal is to set up a wireless network that can route data packets over distances much longer than the modem range.

All routers and turtles are located in chunks permanently loaded with ChickenChunks chunkloaders. I've changed the CC configuration so that the wireless modem range is the same regardless of altitude and weather. When a turtle is getting close to the max range for the currently used router, the turtle switches to another, closer router.

The router code uses two main functions, both started with parallel.waitForAll(f1, f2). Both functions use os.pullEventRaw() with no filters to pull events.

The problem
The problem I'm having is that one of the router functions, let's call it f2(), sometimes stops responding to all events, not just "rednet_message" events. f2() is mainly responsible for acting on "rednet_message" events and for sending rednet messages as a result of those events. The other main function, f1(), is still processing its events, so it looks like f2() has yielded. After much frustration and bug hunting with strategically placed trace messages, I finally tracked down the source of the problem: a call to native.call() inside the peripheral API's call() function does not return. Since the call doesn't return, the upstream call to rednet.broadcast() doesn't return and the event handling loop is stalled.

Some observations:

It doesn't matter if the turtle's chunk is loaded by me moving along with the turtle or if the turtle places chunkloaders to keep the chunks loaded.
The problem shows up only when the turtle is moving. This could suggest chunk loading issues, but keep in mind that the routers' chunks overlap neither with each other nor with the turtle's chunks.
The computers and the turtle never reset or terminate unexpectedly.


-- call() in peripheral API:

function call( _sSide, _sMethod, ... )
  if native.isPresent( _sSide ) then
    return native.call( _sSide, _sMethod, ... )  -- <========= Never returns!
  end
  for n,sSide in ipairs( rs.getSides() ) do
    if native.getType( sSide ) == "modem" and not native.call( sSide, "isWireless" ) then
      if native.call( sSide, "isPresentRemote", _sSide ) then
        return native.call( sSide, "callRemote", _sSide, _sMethod, ... )
      end
    end
  end
  return nil
end

The call stack is as follows:


rednet.broadcast(msg)
	rednet.send(CHANNEL_BROADCAST, msg)
		peripheral.call(sModem, "transmit", nRecipient, os.getComputerID(), sMessage)
			return native.call( _sSide, _sMethod, ... )  -- <========= Never returns!

The parameters to native.call() when the call doesn't return are "back", "transmit", 65535, 18 and the string to be transmitted.

Has anyone seen any similar issues with calls to native functions not returning? Were you able to work around the problem and if so how?

Thanks for reading! :)

Bomb Bloke #2

7083 posts

Location Tasmania (AU)

Posted 08 September 2014 - 01:51 AM

I think you might be misinterpreting the symptoms you're seeing.

I say this because ComputerCraft will crash any co-routine that runs for longer than ten seconds without yielding - neither rednet.broadcast() nor the functions it calls do so; if there was a stall while calling it, your entire script should error out.

So what I'm suspecting is that the f2 function itself is returning, or parallel.waitForAll() is finding some other reason not to revive it when f1 yields.

Perhaps try switching to parallel.waitForAny(). This'd allow you to determine whether f2 is returning or not.
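
A minimal sketch of one way to make that check more explicit, assuming a hypothetical helper named watched() (it is not part of ComputerCraft) that logs when the wrapped function returns. Errors are deliberately left alone, since the parallel API already re-raises them:

```lua
-- Hypothetical helper, not part of the ComputerCraft API.
-- Wraps fn so that a normal return is reported through log().
-- Errors are not caught; they propagate to the caller unchanged.
local function watched(name, fn, log)
  return function(...)
    fn(...)
    log(name .. " returned")
  end
end

-- Stand-in for the real f2, used only to show the behaviour.
local messages = {}
local fakeF2 = watched("f2", function() end, function(line)
  messages[#messages + 1] = line
end)
fakeF2()
-- messages[1] now holds the line that would be printed on a router
```

On a router this would be started as parallel.waitForAny(f1, watched("f2", f2, print)). If waitForAny() comes back without the "f2 returned" line showing up, the co-routine ended some other way than a plain return from f2.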

How often, on average, are your turtles sending messages? How long do they tend to last before bombing out?

ncmaothvez #3

4 posts

Location Sweden

Posted 08 September 2014 - 04:52 PM

Failed post of unfinished message removed.

Edited on 08 September 2014 - 04:01 PM

ncmaothvez #4

4 posts

Location Sweden

Posted 08 September 2014 - 06:00 PM

Sorry 'bout the double-posting! Somehow hitting spacebar at the wrong moment submitted a half-finished post rather than inserting a space. :blink:

Thanks for helping me figure out what the problem is.

Bomb Bloke, on 08 September 2014 - 01:51 AM said:
if there was a stall while calling it, your entire script should error out.

True, and I couldn't figure out why it didn't error out.

Tried waitForAny() as you suggested and indeed, the f2 co-routine does terminate unexpectedly inside the call to native.call(). Why am I so certain that the problem is in the call to native.call() from peripheral.call()? Here's why:

At the top of peripheral.call(), there's this code, to which I've added some trace messages:


if trace then print("--opcA") end
if native.isPresent( _sSide ) then
  if trace then print("--opcB "..os.time()*1000) end
  --return native.call( _sSide, _sMethod, ... )
  local ret = native.call( _sSide, _sMethod, ... )
  if trace then print("--opcB2 "..os.time()*1000) end
  return ret
end
if trace then print("--opcC") end

And this is the co-routine selection loop inside the parallel API's runUntilLimit() function:


for n=1,count do
  local r = _routines[n]
  if r then
    if tFilters[r] == nil or tFilters[r] == eventData[1] or eventData[1] == "terminate" then
      local t1 = os.time()*1000
      print("--rulA "..t1)
      local ok, param = coroutine.resume( r, unpack(eventData) )
      local t2 = os.time()*1000
      print("--rulB "..t2.."  (coroutine.resume() took "..t2-t1.." ticks.)")
      if not ok then
        error( param )
      else
        tFilters[r] = param
      end
      if coroutine.status( r ) == "dead" then
        _routines[n] = nil
        living = living - 1
        if living <= _limit then
          print("--rulC "..os.time()*1000)
          return n
        end
      end
    end
  end
end

The "--opcA" and "--opcB" messages get printed but not "--opcB2". If the call to native.call() in peripheral.call() did return, then "--opcB2" would also be printed, but it's not. Instead the execution continues at "--rulB" in the parallel API's runUntilLimit() function, which exits right after printing "--rulC", thus causing waitForAny() to return and the program to terminate.

So, the trace message sequence is:
"--rulA"
…
"--opcA"
"--opcB"
"--rulB"
"--rulC"

That makes me suspect that something happens inside the call to native.call() that changes the co-routine's state to "dead", which in turn terminates the co-routine, causing waitForAny() to return and finally ending the script.

The execution time from "--rulA" to "--rulC" is less than 3 ticks.
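
For reference, a small plain-Lua sketch of how coroutine.resume() reports the three ways a resume can end. The outcome() helper is hypothetical and only there for illustration. Since the loop above calls error(param) whenever ok is false, and "--rulC" was still reached, the resume apparently came back with ok == true and a status of "dead", which in standard Lua is what a normal return from the co-routine's function looks like:

```lua
-- Hypothetical helper for illustration: runs fn as a coroutine for one
-- resume and reports what coroutine.resume() and coroutine.status() say.
local function outcome(fn)
  local co = coroutine.create(fn)
  local ok, value = coroutine.resume(co)
  return { ok = ok, value = value, status = coroutine.status(co) }
end

-- 1. The function yields (what os.pullEventRaw() does while waiting).
local yielded = outcome(function() coroutine.yield("rednet_message") end)

-- 2. The function returns normally.
local returned = outcome(function() return "done" end)

-- 3. The function raises an error (level 0 keeps the message unchanged).
local failed = outcome(function() error("boom", 0) end)

print(yielded.ok, yielded.value, yielded.status)
print(returned.ok, returned.value, returned.status)
print(failed.ok, failed.value, failed.status)
```

Only the second case gives ok == true together with "dead", so whatever happens inside native.call() seems to end the co-routine without raising a Lua error that the parallel API could report.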

Bomb Bloke, on 08 September 2014 - 01:51 AM said:
How often, on average, are your turtles sending messages? How long do they tend to last before bombing out?

So far I've tested with only one turtle and seven "routers". The turtle never bombs out, it's only the stationary routers that are having issues. The turtle and routers use the same "network" code.

The message rate is about two messages per second and a router usually bombs out after several hundred messages, sometimes after more than a thousand messages. What's really weird is that the routers bomb out only if the turtle is moving. If the turtle is stationary, then I can push several messages per second through the network for hours without any issues at all.

Bomb Bloke #5

7083 posts

Location Tasmania (AU)

Posted 09 September 2014 - 03:34 AM

Truth be told, every attempt I've made at networking turtles has also ended in tears. I haven't run into this specific behaviour (no, for me it's always an out-right Minecraft server stall), but I've never sent that many messages to a moving turtle, so I probably didn't have time to encounter it before one of the other bugs triggered…

It's possible it's fixed in CC 1.63, but I've yet to build a script to stress-test it and I'm not sure I will. It's also possible a side-mod (eg OpenPeripheral) is interfering in some way. In any case, you won't be able to outright "fix" this while working within your current modpack, so the best you can hope for is a work-around.

Maybe something like this?:

parallel.waitForAny(f1, function() while true do f2() end end)

Or if that doesn't work, this?:

parallel.waitForAny(f1, function() while true do parallel.waitForAny(f2) end end)

Of course, this presents a problem in that f2 is gonna suddenly lose track of its execution randomly, which might be difficult to account for.

In which case, you could build a dedicated wrapper function for your broadcasts to isolate the fault. Something like this:

local function f3()
  while true do
    local myEvent = {os.pullEvent("broadcast_message")}
    rednet.broadcast(myEvent[2])
  end
end

You'd start all functions:

parallel.waitForAny(f1, f2, function() while true do f3() end end)

Then whenever you want to send a broadcast, you do os.queueEvent("broadcast_message","myMessage"). You'd need to yield once for every message to be sent, so there'd be a bit of timing logic to consider, but I'm sure you get the idea.
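
A minimal sketch of the sending side, assuming a hypothetical helper named sendBroadcast(). The optional queueEvent parameter is only there so the helper can be exercised outside ComputerCraft; on a computer it falls back to os.queueEvent:

```lua
-- Hypothetical helper: asks f3 to broadcast a message instead of
-- calling rednet.broadcast() directly from f1 or f2.
local function sendBroadcast(message, queueEvent)
  -- os.queueEvent only exists inside ComputerCraft.
  queueEvent = queueEvent or os.queueEvent
  queueEvent("broadcast_message", message)
end

-- Stand-in queue, used here only to show what ends up being queued.
local queued = {}
sendBroadcast("myMessage", function(name, payload)
  queued[#queued + 1] = { name = name, payload = payload }
end)
```

Keep in mind that f1 and f2 pull events without a filter, so they will also see the "broadcast_message" events and should simply ignore them.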

ncmaothvez #6

4 posts

Location Sweden

Posted 09 September 2014 - 06:41 PM

Bomb Bloke, on 09 September 2014 - 03:34 AM said:
Truth be told, every attempt I've made at networking turtles has also ended in tears. I haven't run into this specific behaviour (no, for me it's always an out-right Minecraft server stall), but I've never sent that many messages to a moving turtle, so I probably didn't have time to encounter it before one of the other bugs triggered…

Yep, I'm surprised I've gotten it to run as well as it does. The fact that I'm the only one playing on the server and that the server is running in a VirtualBox on the same PC as the client probably helps keep the issues to a minimum.

Bomb Bloke, on 09 September 2014 - 03:34 AM said:
In which case, you could build a dedicated wrapper function for your broadcasts to isolate the fault. Something like this:

Oh! That's neat. Me likes :) Will try that.

Thanks for your help with this. Much appreciated! *bows*