IoT2040 running Balena OS 2.12.7+rev1 stop working after 2 days

Hi,

Apologies, I’ve realised this is a slightly older version of the OS than I thought it was. I should be able to carry this out manually on the device, but it will required a reboot after enabling. Would you like me to do this (and is it alright to reboot the device)?

Best regards,

Heds

Hi Heds,

Sure. Please do that.
You can reboot the device.

Best regards,
Burin Sapsiri

Hi Burin,

Thanks very much. I’ll do this shortly and get back to you when it’s complete!

Best regards,

Heds

Hi again Burin,

I’ve enabled persistent logs on the device. These can be found in the /var/log/journal directory in the hostOS of the device. It’d be great to be able to get these should the device go offline again!

Thanks and best regards,

Heds

Hi Heds,

Could you please enable persistent logging on this device as well?
This is the test device that I setup at my office. After anything happen, I can check it immediately.
https://dashboard.balena-cloud.com/devices/d78ea0cc118635286d4e0396847b541b/summary

Best regards,
Burin Sapsiri

Hi Burin,

I just enabled the persistent logging on that device as well.

We will update you the decisions going forward once we hold our internal discussions. Meanwhile, please ping us if this issue happens again so we could take a look at the to-be-persisted logs.

Cheers…

Hi Gelbal,

Thanks very much. Any more information, I will update you.

Best regards,

Hi Heds,

Just for addition information

I just think about and see more on the kernel log that I give you.
The ethernet “enp0s20f6” went down at 162692 seconds (45.19 hours) after startup, but this device went offline around 24 hours after startup. That’s mean it went offline before the ethernet went down.

And for the log “[162692.830999] stmmaceth 0000:00:14.6 enp0s20f6: Link is Down”, I think that one come from the first LAN unplug and re-plug trial that I let my technician do it.

Best regards,

Hi Balena team

The problem just occur on the device on URL below.
https://dashboard.balena-cloud.com/devices/5535031758354b8023db7effb5d4cc1a/summary

The device just reset around half hour ago.

Could you please go to the device and checking?

Best regards

Is this one of the devices that had persistent logging enabled? It also doesn’t have support access enabled so could you please do that for me

Hi Robert,

Sorry. The support access was expired.
I have already re-grant the access.
This device is the one that’s already enabled the persistent logging.

Best regards,

Hi,

Are there anyone remote to the device for checking?
Or anything that I can do for you?

Best regards

Hi @burin.sapsiri,

Unfortunately, it looks like the logs rotated after the reboot, there’s nothing in there I can see that’s helpful.

I also happen to have an IOT2040 on my desk, so I’m going to try a long soak test with a networking application and see if I encounter the same issues as you.

We’re still discussing IOT2040 support internally, but please if you see the issue occur again, if you let us know we’ll try to look at it as soon as it occurs.

Many thanks and best regards,

Heds

Hi,
The problem occur again. And this time I fix it by unplug the LAN cable and plug to another port. Then it come back.
I try to measure the period before it went offline. It’s seem exactly 24 hours after it startup.
I have new log that I grabbed from device 002 and upload here.

The point that device went offline is in the picture below.

journalctl_2019-11-24_1723 0700.log (380.1 KB)

Best regards,

Thank you very much for the journal logs, it’s been extremely useful (I’ve had a look on the device today, and of course all the logs have rotated, so this was vital for us!).

Having looked through, I’ve a couple of comments. The first is that you appear to be running a development device, is this intended? I ask, because dev devices can be SSH accessed without a password. We strongly recommend never running a device in a possibly hostile/non-development environment using a development image, as hostile actors could potentially take control over it. I’ve noticed several accesses, but they appear to be over our VPN which means from the WebTerminal or balena ssh.

When the device disconnects:

Nov 24 07:36:52 5535031 avahi-daemon[721]: Withdrawing address record for 192.168.0.33 on enp0s20f7.
Nov 24 07:36:52 5535031 avahi-daemon[721]: Leaving mDNS multicast group on interface enp0s20f7.IPv4 with address 192.168.0.33.
Nov 24 07:36:52 5535031 avahi-daemon[721]: Interface enp0s20f7.IPv4 no longer relevant for mDNS.
Nov 24 07:36:52 5535031 NetworkManager[651]: <info>  [1574581012.4573] manager: NetworkManager state is now CONNECTED_SITE
Nov 24 07:36:52 5535031 dbus[592]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Nov 24 07:36:52 5535031 systemd[1]: Starting Network Manager Script Dispatcher Service...
Nov 24 07:36:52 5535031 dbus[592]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Nov 24 07:36:52 5535031 systemd[1]: Started Network Manager Script Dispatcher Service.
Nov 24 07:36:52 5535031 nm-dispatcher[9001]: req:1 'connectivity-change': new request (0 scripts)
Nov 24 07:36:52 5535031 nm-dispatcher[9001]: req:1 'connectivity-change': completed: no scripts

The VPN can then no longer connect to the balena backend:

Nov 24 07:37:57 5535031 openvpn[20330]: Attempting to establish TCP connection with [AF_INET]18.232.192.190:443 [nonblock]
Nov 24 07:37:57 5535031 openvpn[20330]: TCP: connect to [AF_INET]18.232.192.190:443 failed: Network is unreachable

After several failures of this type, the VPN falls back to attempting to resolve the VPN hostname, in case the address has changed:

Nov 24 07:38:13 5535031 openvpn[20330]: RESOLVE: Cannot resolve host address: vpn.balena-cloud.com:443 (Name or service not known)
Nov 24 07:38:13 5535031 openvpn[20330]: RESOLVE: Cannot resolve host address: vpn.balena-cloud.com:443 (Name or service not known)

However, what is interesting is that the network interface is still up and active, because when the local engineer disconnects and reconnects the interface we see:

Nov 24 09:43:19 5535031 kernel: stmmaceth 0000:00:14.7 enp0s20f7: Link is Down
Nov 24 09:43:19 5535031 kernel[604]: stmmaceth 0000:00:14.7 enp0s20f7: Link is Down
Nov 24 09:43:21 5535031 balenad[685]: [376B blob data]
Nov 24 09:43:23 5535031 kernel: stmmaceth 0000:00:14.7 enp0s20f7: Link is Up - 100Mbps/Full - flow control rx/tx

It also appears that this didn’t work, and eventually they switched to the secondary port:

Nov 24 09:43:55 5535031 kernel: stmmaceth 0000:00:14.6 enp0s20f6: Link is Up - 100Mbps/Full - flow control rx/tx
Nov 24 09:43:55 5535031 kernel[604]: stmmaceth 0000:00:14.6 enp0s20f6: Link is Up - 100Mbps/Full - flow control rx/tx
Nov 24 09:43:55 5535031 NetworkManager[651]: <info>  [1574588635.4931] device (enp0s20f6): carrier: link connected
Nov 24 09:43:55 5535031 NetworkManager[651]: <info>  [1574588635.5213] device (enp0s20f6): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
...
Nov 24 09:43:56 5535031 avahi-daemon[721]: New relevant interface enp0s20f6.IPv6 for mDNS.
Nov 24 09:43:56 5535031 avahi-daemon[721]: Registering new address record for fe80::2676:4d69:4624:b6b3 on enp0s20f6.*.
Nov 24 09:43:56 5535031 NetworkManager[651]: <info>  [1574588636.9501] dhcp4 (enp0s20f6):   address 192.168.0.35
Nov 24 09:43:56 5535031 NetworkManager[651]: <info>  [1574588636.9555] dhcp4 (enp0s20f6):   plen 24 (255.255.255.0)
Nov 24 09:43:56 5535031 NetworkManager[651]: <info>  [1574588636.9561] dhcp4 (enp0s20f6):   gateway 192.168.0.1
Nov 24 09:43:56 5535031 NetworkManager[651]: <info>  [1574588636.9570] dhcp4 (enp0s20f6):   lease time 86400
Nov 24 09:43:56 5535031 NetworkManager[651]: <info>  [1574588636.9633] dhcp4 (enp0s20f6):   nameserver '192.168.0.1'
Nov 24 09:43:56 5535031 NetworkManager[651]: <info>  [1574588636.9639] dhcp4 (enp0s20f6):   nameserver '8.8.8.8'
Nov 24 09:43:56 5535031 NetworkManager[651]: <info>  [1574588636.9645] dhcp4 (enp0s20f6): state changed unknown -> bound
Nov 24 09:43:57 5535031 dhclient[15507]: bound to 192.168.0.35 -- renewal in 35515 seconds.
...
Nov 24 09:43:58 5535031 NetworkManager[651]: <info>  [1574588638.1190] manager: NetworkManager state is now CONNECTED_GLOBAL

What’s interesting is when the problem initially occurs, it looks like something similar is happening in that it’s attempting to renew a DHCP lease but doesn’t get the new address. It gets into CONNECTED_SITE state, but never CONNECTED_GLOBAL (which is required to ensure it can contact the Internet and hence our VPN backend).

What we think is happening here is that the local DHCP server gives out leases on a 24 or 48 hour basis, and then asks the device to renew the DHCP lease, but it either refuses to renegotiate or for some reason can’t do so with the server.

(Hit Enter too soon).
Because of this, we’d suggest a couple of things:

  1. Ensure that the DHCP server is configured correctly for DHCP leases
  2. In your application, use DBus to restart the NetworkManager the host interface just under ever 24 hours (say 21 hours). This will ensure that the device itself attempts to renegotiate the lease

There’s a good section on how to communicate with NetworkManager from a service via DBus here which should help to get you started.

I’m still soak testing my IOT2040, but as yet have not come across this issue. Please let us know how you get on.

Best regards,

Heds

Hi Hedss,

Your opinion is very very useful. I agree with you. The device that I tested locally, it can run up to 4 days without any problem. But when I setup it on site only 24 hours, it went offline.

The router that IoT2040 connected to is our own. That mean if I static IP address, it should help, right?

I will try both solution, using DBus to reset NM as your suggest. And static IP address as well.

Thanks very very much. I will come back to you again when I get the result.

Best regards,
Burin Sapsiri

Hi Hedss,

How can I increase the ring buffer of the journal logs?
When the problem occur next time, I will let you see on it again with the logs still.

Best regards,

Hi Burin,

A static IP, if possible, should definitely solve the problem as it won’t need to renegotiate the lease. It sounds very likely that this is the cause, if the DHCP lease is shorter than on your local network.

I’m really keen to help you solve this, so please let me know if this works for you! I’ll continue to soak test my device as well.

Best regards,

Heds

Hi Hedss,

After I configured the device to be static IP, now it run for 6 days without any problem.
The problem solved, then I will take a look more on the router problem.

Thanks so much for your support.

Best regards,
Burin Sapsiri