Services are in a constant restart loop!

I also have a Raspberry Pi 3 device running this same service that does not seem to show the errors that the Balena Fin v1.1 is showing. Support is enabled for both.

uuid                              device type     OS                    Supervisor
64b5fccd4dc2a5f42268020c6578c0e4  Balena Fin      balenaOS 2.48.0+rev1  10.8.0
16324b595d0261bf90414a6b0830d9fe  Raspberry Pi 3  balenaOS 2.47.0+rev1  10.6.27

Hey there! We are still investigating this issue in detail. I think there are some further experiments we can do that would get us closer to the root cause, by determining whether the trigger is the base image or the container settings:

  • Can you try to preserve that exact same Docker Compose definition of the vpn service, but pull another image instead (i.e. one that just waits in a loop forever)? Does the issue happen there as well when you restart?

  • Can you keep the dperson/openvpn-client image, but get rid of (most of?) the extra options, and see if the container still exits on a restart? I’m guessing the VPN connection might not work without the options you provided, but at least we can see whether it gets killed or not.
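A minimal “wait forever” stand-in for the first experiment might look like this (a sketch; the busybox image and sleep loop are just placeholders, keep the rest of your original service options as-is):

```yaml
  vpn:
    # Placeholder image/command that idles forever instead of running OpenVPN
    image: busybox
    command: ["sh", "-c", "while true; do sleep 3600; done"]
    restart: unless-stopped
```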

Hi, @jviotti

  1. Preserve the docker-compose definition and use a different image:

```yaml
  vpn:
    image: hypriot/rpi-busybox-httpd
    cap_add:
      - net_admin
    networks:
      - default
    tmpfs:
      - /run
      - /tmp
    security_opt:
      - label=disable
    devices:
      - "/dev/net:/dev/net"
    volumes:
      - ovpn-data:/vpn
    restart: unless-stopped
```

Results:

  • On reboot, I can connect to a host OS terminal session, run balena ps, and see the container run for ~25-30 seconds. Then it gets killed and does not restart. Same behavior we’ve been seeing.
  2. Keep the vpn image and remove most of the extra options:

```yaml
  vpn:
    image: dperson/openvpn-client:armhf
    networks:
      - default
    volumes:
      - ovpn-data:/vpn
    restart: unless-stopped
```

Results:

  • The service hits a fatal error and goes into a restart loop (per unless-stopped), which does actually seem like progress.
  3. Same as 2, adding some options back in so the container at least stays running:

```yaml
  vpn:
    image: dperson/openvpn-client:armhf
    cap_add:
      - net_admin
    networks:
      - default
    volumes:
      - ovpn-data:/vpn
    restart: unless-stopped
```

Results: The service came up and connected to the VPN successfully, but was then killed via SIGTERM from outside the container after a bit. It almost has to be that net_admin capability!?


Something I noticed: the container stays up until the Supervisor starts; that’s when it gets killed, and it does not restart. See these logs:

```
21.04.20 22:46:03 (-0500) Rebooting
21.04.20 22:47:10 (-0500) Supervisor starting
21.04.20 22:45:39 (-0500)  vpn  Wed Apr 22 03:45:39 2020 OpenVPN 2.4.8 armv6-alpine-linux-musleabihf [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on Feb  7 2020
21.04.20 22:45:39 (-0500)  vpn  Wed Apr 22 03:45:39 2020 library versions: OpenSSL 1.1.1d  10 Sep 2019, LZO 2.10
...
...
21.04.20 22:47:03 (-0500)  vpn  Wed Apr 22 03:47:03 2020 Initialization Sequence Completed
21.04.20 22:47:11 (-0500) Killing service 'vpn sha256:c4ff99efa70ec69b5f836ad3fbc65faddf16bb90830762e7a01b4624502cf72a'
21.04.20 22:47:12 (-0500)  vpn  Wed Apr 22 03:47:12 2020 event_wait : Interrupted system call (code=4)
21.04.20 22:47:12 (-0500)  vpn  Wed Apr 22 03:47:12 2020 SIGTERM received, sending exit notification to peer
21.04.20 22:47:13 (-0500)  vpn  Wed Apr 22 03:47:13 2020 Closing TUN/TAP interface
21.04.20 22:47:13 (-0500)  vpn  Wed Apr 22 03:47:13 2020 /sbin/ip addr del dev tun0 172.16.16.128/22
21.04.20 22:47:13 (-0500)  vpn  Wed Apr 22 03:47:13 2020 SIGTERM[soft,exit-with-notification] received, process exiting
21.04.20 22:47:14 (-0500) Service exited 'vpn sha256:c4ff99efa70ec69b5f836ad3fbc65faddf16bb90830762e7a01b4624502cf72a'
21.04.20 22:47:15 (-0500) Killed service 'vpn sha256:c4ff99efa70ec69b5f836ad3fbc65faddf16bb90830762e7a01b4624502cf72a'
```

The timestamps are out of order, but track them: “Supervisor starting” shows up in the log right after the reboot is logged, yet the next timestamps are from almost 90 seconds earlier. If we re-order this log snippet by time we get:

```
21.04.20 22:46:03 (-0500) Rebooting
21.04.20 22:46:57 (-0500)  vpn  Wed Apr 22 03:46:57 2020 OpenVPN 2.4.8 armv6-alpine-linux-musleabihf [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on Feb  7 2020
21.04.20 22:46:57 (-0500)  vpn  Wed Apr 22 03:46:57 2020 library versions: OpenSSL 1.1.1d  10 Sep 2019, LZO 2.10
...
...
21.04.20 22:47:03 (-0500)  vpn  Wed Apr 22 03:47:03 2020 Initialization Sequence Completed
21.04.20 22:47:10 (-0500) Supervisor starting
21.04.20 22:47:11 (-0500) Killing service 'vpn sha256:c4ff99efa70ec69b5f836ad3fbc65faddf16bb90830762e7a01b4624502cf72a'
21.04.20 22:47:12 (-0500)  vpn  Wed Apr 22 03:47:12 2020 event_wait : Interrupted system call (code=4)
21.04.20 22:47:12 (-0500)  vpn  Wed Apr 22 03:47:12 2020 SIGTERM received, sending exit notification to peer
21.04.20 22:47:13 (-0500)  vpn  Wed Apr 22 03:47:13 2020 Closing TUN/TAP interface
21.04.20 22:47:13 (-0500)  vpn  Wed Apr 22 03:47:13 2020 /sbin/ip addr del dev tun0 172.16.16.128/22
21.04.20 22:47:13 (-0500)  vpn  Wed Apr 22 03:47:13 2020 SIGTERM[soft,exit-with-notification] received, process exiting
21.04.20 22:47:14 (-0500) Service exited 'vpn sha256:c4ff99efa70ec69b5f836ad3fbc65faddf16bb90830762e7a01b4624502cf72a'
21.04.20 22:47:15 (-0500) Killed service 'vpn sha256:c4ff99efa70ec69b5f836ad3fbc65faddf16bb90830762e7a01b4624502cf72a'
```

Logs stop there as the device does nothing else but sit there with the service in a “Downloaded” state.
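As an aside, a snippet like this can be re-ordered mechanically: since the balena log prefix is `DD.MM.YY HH:MM:SS`, a plain sort on the time field works within a single day (a sketch with made-up lines; across day boundaries the DD.MM.YY date format would not sort correctly):

```shell
logs='21.04.20 22:47:10 (-0500) Supervisor starting
21.04.20 22:46:03 (-0500) Rebooting'

# Sort on field 2 (HH:MM:SS); safe here because every line shares one date
echo "$logs" | sort -k2,2
```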

Hopefully this is helpful?

One more observation. The Raspberry Pi device I mention above does not exhibit this same behavior on reboot. Everything returns to a stable state as expected on that device. I’ve rebooted that device 3+ times with no issues.

What happens if you remove the restart: unless-stopped line from docker-compose.yml and push?

@dfunckt - removing restart: unless-stopped seems to make the device boot up normally into a stable and running condition. Rebooted 3 times to verify.

I don’t understand this. Should we not be using a restart policy?

Hi @donfmorrison.

The balenaOS supervisor should take care of ensuring that containers are running, so a restart policy is unnecessary.
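In other words, the service definition from earlier in this thread can simply drop that line and let the supervisor manage restarts (a sketch):

```yaml
  vpn:
    image: dperson/openvpn-client:armhf
    cap_add:
      - net_admin
    networks:
      - default
    volumes:
      - ovpn-data:/vpn
    # no restart: policy -- the balena supervisor handles keeping it running
```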

Cheers,
James.


Hi guys,

Did you ever find a solution to this? I am experiencing the exact same issue on my NVIDIA Jetson Xavier running balenaOS 2.58.6. I’m successfully running an IPsec VPN server (https://github.com/hwdsl2/docker-ipsec-vpn-server), along with 5 other containers, on the board (local push from a docker-compose file), and I’m able to start a connection to the VPN server from my client. However, at some point the board reboots without any apparent reason. I’m then stuck in the loop described by @donfmorrison, where the startup logs print “device *** entered promiscuous mode”. After a second manual reboot those logs are no longer printed, but the containers start and stop indefinitely without making any sense. I’m able to recover the normal state by manually restarting the supervisor container multiple times.

The error seems to be related to the VPN (probably something with the network layer).

The thing is, I don’t know which logs to monitor in order to find the cause of those reboots.

I’ll try to put together a simple docker-compose file you can run to reproduce that error. Meanwhile, I wanted to know if you had found a reason for those reboots? Maybe @donfmorrison?

Thanks,

Quentin

Ok, I managed to record the logs just before the reboot of the board (dmesg -w):

[ 686.884237] BUG: scheduling while atomic: irq/67-host_syn/70/0x00000101
[ 686.884450] Modules linked in: authenc esp4 nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_REJECT nf_reject_ipv6 ip6table_filter xt_state ipt_REJECT nf_reject_ipv4 xt_multiport xt_policy ip6_tables dummy ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink br_netfilter xt_owner bnep joydev mousedev input_leds hid_logitech_hidpp mttcan can_dev hid_logitech_dj rtl8822ce rtk_btusb uvcvideo spidev btusb btrtl btbcm btintel cfg80211 nvgpu
[ 686.885901] CPU: 0 PID: 70 Comm: irq/67-host_syn Not tainted 4.9.140-l4t-r32.4 #1
[ 686.886081] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[ 686.886237] Call trace:
[ 686.886325] [] dump_backtrace+0x0/0x1b0
[ 686.886458] [] show_stack+0x24/0x30
[ 686.886590] [] dump_stack+0x98/0xc0
[ 686.886721] [] __schedule_bug+0x70/0x80
[ 686.886858] [] __schedule+0x5c8/0x738
[ 686.887181] [] schedule+0x40/0xa8
[ 686.888679] [] schedule_preempt_disabled+0x28/0x40
[ 686.894299] [] __mutex_lock_slowpath+0xf8/0x1a8
[ 686.900063] [] mutex_lock+0x4c/0x50
[ 686.907301] [] tegra_se_aes_queue_req+0x2c/0x98
[ 686.913227] [] tegra_se_aes_cbc_decrypt+0x34/0x40
[ 686.919429] [] skcipher_decrypt_ablkcipher+0x64/0x70
[ 686.925738] [] crypto_authenc_decrypt_tail.isra.1+0xf0/0x118 [authenc]
[ 686.934069] [] crypto_authenc_decrypt+0x94/0xa8 [authenc]
[ 686.940542] [] echainiv_decrypt+0xac/0xc0
[ 686.946482] [] esp_input+0x16c/0x288 [esp4]
[ 686.951983] [] xfrm_input+0x544/0x5b8
[ 686.957408] [] xfrm4_rcv_encap+0xc0/0xf0
[ 686.962466] [] xfrm4_udp_encap_rcv+0x138/0x1d0
[ 686.968572] [] udp_queue_rcv_skb+0x234/0x548
[ 686.974169] [] udp_unicast_rcv_skb.isra.10+0x60/0xb8
[ 686.980546] [] __udp4_lib_rcv+0x6d0/0xa90
[ 686.986579] [] udp_rcv+0x30/0x40
[ 686.991537] [] ip_local_deliver_finish+0x104/0x248
[ 686.997863] [] ip_local_deliver+0x68/0x118
[ 687.003896] [] ip_rcv_finish+0x1f4/0x390
[ 687.009310] [] ip_rcv+0x1f8/0x3e0
[ 687.013852] [] __netif_receive_skb_core+0x500/0xb00
[ 687.020669] [] __netif_receive_skb+0x28/0x80
[ 687.026449] [] process_backlog+0xa4/0x160
[ 687.032229] [] net_rx_action+0x10c/0x388
[ 687.037716] [] __do_softirq+0x148/0x398
[ 687.043309] [] irq_exit+0xd4/0x110
[ 687.048284] [] __handle_domain_irq+0x70/0xc0
[ 687.053890] [] gic_handle_irq+0x54/0xa8
[ 687.059465] [] el1_irq+0xe8/0x194
[ 687.064288] [] host1x_readl+0x58/0xa0
[ 687.069512] [] syncpt_thresh_cascade_isr+0xb4/0x2a8
[ 687.075977] [] irq_thread_fn+0x30/0x80
[ 687.081381] [] irq_thread+0x150/0x1b8
[ 687.086356] [] kthread+0x100/0x108
[ 687.091494] [] ret_from_fork+0x10/0x30
[ 687.097688] softirq: huh, entered softirq 3 NET_RX ffffff8008d75f08 with preempt_count 00000100, exited with 00000000?

How to reproduce the issue

Run this docker-compose on the balena device (with a local push):

```yaml
version: '2'

services:
  ipsec-vpn:
    image: hwdsl2/ipsec-vpn-server:latest
    hostname: ipsec-vpn-server
    privileged: true
    restart: always
    environment:
      VPN_IPSEC_PSK: '111111111111111111111111'
      VPN_USER: user
      VPN_PASSWORD: 123456789
    ports:
      - "500:500/udp"
      - "4500:4500/udp"
    networks:
      - mynetwork

networks:
  mynetwork:
    driver: bridge
    ipam:
      config:
        - subnet: 10.5.0.0/16
```

Connect an Ubuntu client to the server (you can follow this guide: https://github.com/hwdsl2/setup-ipsec-vpn/blob/master/docs/clients.md#linux).

Then, on your connected client host (Ubuntu), run this very simple command: docker pull ubuntu.

When doing so, my balena device reboots…

Hi, @quentingllmt

I kind of gave up on resolving this 100%. I have a manual workaround that we can perform remotely on the device, which recovers it from this “all services rebooting in a loop” state it sometimes gets into after a device reboot. It’s the same thing you are describing.

The following actions taken from the host OS seem to resolve this state:

  • systemctl stop resin-supervisor to stop the supervisor from trying to restart the containers
  • balena kill $(balena ps -a | grep -v "resin_supervisor" | awk 'NR>1 {print $1}') to kill all containers that are not the supervisor
  • balena rm $(balena ps -a | grep -v "resin_supervisor" | awk 'NR>1 {print $1}') to remove those containers you just killed
  • systemctl start resin-supervisor to start the supervisor back up, which will recreate and start all your other service containers
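The container-ID filter in the kill/rm steps can be sanity-checked locally against fake `balena ps -a` output (the container IDs and image names below are made up for illustration):

```shell
# Fake `balena ps -a` output: a header row, one app container, the supervisor
ps_output='CONTAINER ID   IMAGE                     NAMES
c4ff99efa70e   dperson/openvpn-client    vpn_1
1234567890ab   balena/armv7hf-supervisor resin_supervisor'

# Drop the supervisor line, then skip the remaining header row (NR>1),
# leaving only application-container IDs to pass to `balena kill` / `balena rm`
echo "$ps_output" | grep -v "resin_supervisor" | awk 'NR>1 {print $1}'
# → c4ff99efa70e
```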

I planned on circling back to fixing this eventually, but I do agree with your assessment: there is something the OpenVPN client service does to the network layer that balenaOS or the Supervisor does not like.

You seem to be running a VPN server, whereas I am running a VPN client, but we do appear to be stumbling over the same issue. I’ll be sure to post here and tag you if I discover a real solution.

Cheers,
Don