Supervisor/tunneling failure: nothing running on device

adamshapiro0 · August 20, 2020, 10:55pm

Hi,

We reflashed an iMX8MM device this morning with the v2.50.1+rev1 release. It seemed to work OK briefly, but after a little while it stopped working completely. It now shows up in the dashboard with status “Online (VPN only)”, none of the containers appear to be running, and the console log output is not updating.

We are able to balena ssh into the host OS (and have physical console access), but trying to SSH into any of the containers outputs:

BalenaRequestError: Request error: tunneling socket could not be established, statusCode=500

Clicking application restart on the device in the dashboard also gives the same error message. We also tried changing the target release of the device, but it isn’t changing from the previous version.

Rebooting the device didn’t change anything. On the serial console, we’re seeing what look like a bunch of device resets coming from the supervisor every few minutes:

br-340bd66480a1: port 4(vethdf2d783) entered disabled state
vethb00b1ad: renamed from eth0
br-340bd66480a1: port 4(vethdf2d783) entered disabled state
device vethdf2d783 left promiscuous mode
br-340bd66480a1: port 4(vethdf2d783) entered disabled state
supervisor0: port 2(veth4a1fd6f) entered disabled state
veth75e526e: renamed from eth1
supervisor0: port 2(veth4a1fd6f) entered disabled state
device veth4a1fd6f left promiscuous mode
supervisor0: port 2(veth4a1fd6f) entered disabled state
br-340bd66480a1: port 4(vethee4a5f1) entered blocking state
br-340bd66480a1: port 4(vethee4a5f1) entered disabled state
device vethee4a5f1 entered promiscuous mode
IPv6: ADDRCONF(NETDEV_UP): vethee4a5f1: link is not ready
br-340bd66480a1: port 4(vethee4a5f1) entered blocking state
br-340bd66480a1: port 4(vethee4a5f1) entered forwarding state
supervisor0: port 2(vetha772b59) entered blocking state
supervisor0: port 2(vetha772b59) entered disabled state
device vetha772b59 entered promiscuous mode
IPv6: ADDRCONF(NETDEV_UP): vetha772b59: link is not ready
supervisor0: port 2(vetha772b59) entered blocking state
supervisor0: port 2(vetha772b59) entered forwarding state
br-340bd66480a1: port 4(vethee4a5f1) entered disabled state
supervisor0: port 2(vetha772b59) entered disabled state
eth0: renamed from veth1213745
IPv6: ADDRCONF(NETDEV_CHANGE): vethee4a5f1: link becomes ready
br-340bd66480a1: port 4(vethee4a5f1) entered blocking state
br-340bd66480a1: port 4(vethee4a5f1) entered forwarding state
eth1: renamed from veth7c6ae65
IPv6: ADDRCONF(NETDEV_CHANGE): vetha772b59: link becomes ready
supervisor0: port 2(vetha772b59) entered blocking state
supervisor0: port 2(vetha772b59) entered forwarding state
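
For readers triaging a similar log, here is a minimal Python sketch that summarizes bridge port state changes like the ones above. It assumes the kernel messages have already been captured as plain text lines; the function name `summarize_bridge_flaps` and the sample input are illustrative, not part of any balena tooling. Several different veth names appearing on the same bridge within a few minutes is one hint that containers are being torn down and recreated repeatedly.

```python
import re
from collections import Counter

# Matches kernel bridge messages such as:
#   "supervisor0: port 2(veth4a1fd6f) entered disabled state"
PORT_STATE = re.compile(
    r"^(?P<bridge>[\w-]+): port (?P<port>\d+)\((?P<veth>\w+)\) "
    r"entered (?P<state>\w+) state$"
)


def summarize_bridge_flaps(lines):
    """Count transitions per (bridge, state) and collect veth names per bridge."""
    transitions = Counter()
    veths = {}
    for line in lines:
        match = PORT_STATE.match(line.strip())
        if match is None:
            continue  # not a bridge port state message
        transitions[(match["bridge"], match["state"])] += 1
        veths.setdefault(match["bridge"], set()).add(match["veth"])
    return transitions, veths


# A few lines copied from the console output in this post.
sample = [
    "supervisor0: port 2(veth4a1fd6f) entered disabled state",
    "device veth4a1fd6f left promiscuous mode",
    "supervisor0: port 2(vetha772b59) entered blocking state",
    "supervisor0: port 2(vetha772b59) entered forwarding state",
    "br-340bd66480a1: port 4(vethee4a5f1) entered disabled state",
]

transitions, veths = summarize_bridge_flaps(sample)
for bridge, names in sorted(veths.items()):
    print(bridge, "saw", len(names), "distinct veth interface(s)")
```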

We tried restarting the supervisor as described in “Services are in a constant restart loop!”, and as soon as we did, the device showed up as “Online” (i.e., not VPN only) in the dashboard, but nothing else changed. We rebooted the device again and no longer see the supervisor failure messages on the console after the Balena OS banner is printed, but everything is still broken.

We can reflash the device and see if it happens again, but we’re not sure how it got into this state to begin with and we’re concerned that this could happen to a customer, who would not be able to reflash the device. Is there a way to remotely restore/reflash a device using the infrastructure for host OS updates? Is there any sort of backup partition?

We also don’t want to reflash the device and lose any chance of actually debugging the issue, so we’re going to hold off on doing so until we hear back but would appreciate any help we can get right away.

Worth noting that this is our first use of 2.50.1. We’ve been using a custom build of 2.47.1+rev2, the OS version when we started working with Balena, because at the time that version did not have the Variscite BSP updates necessary for iMX8MM. That version has worked fine so far, but we’d like to get on the mainline releases. The BSP was updated in the 2.50.1 release.

Thanks in advance,

Adam

phil-d-wilson · August 21, 2020, 8:39am

Hi - could you please enable support access on this device, and let us know the device UUID?

Thanks,

Phil

phil-d-wilson · August 21, 2020, 8:45am

Could I also just check what model of Variscite iMX8MM your device is?

adamshapiro0 · August 21, 2020, 1:07pm

Hey @phil-d-wilson,

I enabled support on device e2783a3cc8c134241b686b0a62303c3f. We’ve developed a custom board based on the Variscite DART-MX8M-MINI. All of the peripherals are laid out the same as the reference design, so the official BSP works out of the box without customization.

Thanks,

Adam

richbayliss · August 21, 2020, 2:39pm

Something is wrong with the device here; the issue you’re seeing was introduced in a version of the supervisor from a later release than the one reported. Have you cleanly flashed the device, or did you then do something to make it come back with the same UUID/key as a previous one? I am not sure how our backend thinks this device is running a version which cannot have this bug…

adamshapiro0 · August 21, 2020, 4:12pm

@richbayliss, yes we cleanly flashed the device with the latest v2.50.1+rev1 image file we downloaded from the dashboard. Unfortunately the config.json file is injected by the dashboard so I can’t really give you a meaningful MD5 hash or anything, but we downloaded it through the “Add New Device” dialog for “Variscite DART-MX8M Mini”.

richbayliss · August 21, 2020, 4:18pm

The device is offline now, but if you can bring it online then I’ll have a look deeper into it. Your setup path looks sound, so I am just wondering what has happened.

adamshapiro0 · August 21, 2020, 4:48pm

Sorry for the delay. A coworker brought it home from the office so he could reflash it or do anything else you might need. He happened to be grabbing it exactly when you were looking. Should be back online now.

nghiant2710 · August 24, 2020, 7:38am

Hi,

Can you please grant support access again for the e2783a3cc8c134241b686b0a62303c3f device, as we no longer have access to it?
Please let us know when access is granted, and we will take a deeper look.

adamshapiro0 · August 24, 2020, 1:12pm

@nghiant2710 granted.

shaunmulligan · August 24, 2020, 6:22pm

Just an update here: I have been looking at the device. It looks like it’s trying to run a different supervisor version (the newer 11.9.3, which is failing because it’s missing some prerequisites). We are looking into why it seems to be running this wrong version and will look to revert it to the 11.4.10 version once we get to the bottom of it.

shaunmulligan · August 24, 2020, 6:32pm

One follow up question to help us debug what is going on here. Was the UUID or config.json you used for this device from a previous device?

adamshapiro0 · August 24, 2020, 6:32pm

Thanks for the update @shaunmulligan.

I was discussing with @anathan who originally imaged the device. Before he put 2.50.1+rev1 on it, he tried a few experiments to update to the latest Yocto repo version 2.53.12+rev3 to test some things with the BSP, but ran into some supervisor issues. I believe he tried going backward a few versions with varying results. He then reimaged the device with the official 2.50.1+rev1 release (downloaded directly, not built locally with Yocto).

All of these experiments were done with the same device UUID, since this is the same device.

Is the device actually running a different supervisor version than should be in the 2.50.1+rev1 image, or is it saying it is but the actual binary on the device is correct? Is it possible that the back-end registered the device with a newer supervisor version during these experiments, and doesn’t support supervisor versions being downgraded?

shaunmulligan · August 24, 2020, 6:38pm

Ah okay, if this UUID was running a later version, that could be the issue. The device is actually running supervisor v11.9.3, but it should be running v11.4.10. I think what happened is that when it was provisioned with the newer 2.53, it set the “should be running supervisor” pin in our backend to v11.9.3 for this UUID. Then, when it was reimaged with an older version, it somehow upgraded to that supervisor version, which is incompatible with the 2.50.1 OS.
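
The mismatch described above can be illustrated with a small Python sketch. This is only a model of the reasoning, not how the balena backend or supervisor actually works: the `SHIPPED_SUPERVISOR` table, `parse_version`, and `check_supervisor` are hypothetical names, and the only version pair in the table is the one mentioned in this thread. A supervisor newer than the one an OS image shipped with is not necessarily a problem in general; it was in this case.

```python
# Hypothetical illustration only: none of these names are a balena API.
# The OS/supervisor pair below is the one mentioned in this thread.
SHIPPED_SUPERVISOR = {
    "2.50.1+rev1": "11.4.10",
}


def parse_version(version):
    """Turn 'v11.9.3' or '11.9.3' into a comparable tuple (11, 9, 3)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def check_supervisor(os_version, running_supervisor):
    """Compare the running supervisor with the one the OS image shipped with."""
    shipped = SHIPPED_SUPERVISOR.get(os_version)
    if shipped is None:
        return "unknown"
    running = parse_version(running_supervisor)
    expected = parse_version(shipped)
    if running > expected:
        return "newer"  # the situation in this thread
    if running < expected:
        return "older"
    return "match"


print(check_supervisor("2.50.1+rev1", "v11.9.3"))
print(check_supervisor("2.50.1+rev1", "v11.4.10"))
```

Comparing parsed tuples rather than raw strings matters here, because a plain string comparison would rank "11.4.10" after "11.9.3" incorrectly in some cases (for example "11.10.0" versus "11.9.3").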

shaunmulligan · August 24, 2020, 6:39pm

I see 2.54.2+rev1 is now in staging for a few of the DART-MX8M Mini device types, so it might be worth moving to that.

adamshapiro0 · August 24, 2020, 6:44pm

Ah ha! That makes sense. Is there not a way to remotely downgrade a device’s host OS if necessary? Might be useful just in case something goes wrong.

We’ll give 2.54.2+rev1 a shot once it’s downloadable. We originally were using a custom build because in the latest release at the time (2.47), the Variscite BSP didn’t include the latest commits that made the iMX8MM work. Those were added in 2.50 though, so now we can use stock releases.

Not sure if we were just configuring something wrong in the build, but we found that devices with our custom Yocto build had the host OS update option disabled in the dashboard, even if we wanted to update them to an official release. Since we don’t really need any host OS customizations anymore anyway, we just switched over to the official releases.

shaunmulligan · August 24, 2020, 6:49pm

Unfortunately, OS downgrades are not as straightforward because there are on-device migrations that are non-reversible, but I think for this device we can downgrade the supervisor so that it runs the correct version.

With regard to updating a custom OS to mainline, that is something that might be possible, and we will raise it internally.

adamshapiro0 · August 24, 2020, 6:58pm

Ok that sounds good. We were actually planning to reimage this device anyway, so we could just assign a new UUID and that would resolve this issue. Just wanted to leave it as is so it would be helpful for your debugging in case there was a back end issue of some sort. If it would be helpful though, you’re welcome to try downgrading the supervisor to see if it resolves the problem before we do anything.

shaunmulligan · August 24, 2020, 7:19pm

okay, so I rolled back the supervisor and things look to be running correctly. Definitely recommend using new UUIDs for future devices to avoid these types of issues, but I think now that you are on the mainline device type OS versions you won’t have that problem anymore anyways. Thanks for letting us figure this out.
Cheers

adamshapiro0 · August 24, 2020, 7:21pm

Awesome, thanks for investigating.

Yeah this device was a bit of an experimental case. Now that we’re on the mainline we hopefully won’t have issues going forward, at least not like this one.
