Bricking the supervisor when switching to local mode

Hi,
This is the second time we “bricked” a device while switching it to local mode. I have yet to find a fully reproducible sequence, but here are the events that we can recall:

  1. Have an application running on the device, with an active update lock on /tmp/balena/updates.lock
  2. On the balenaCloud dashboard, click “Enable local mode” for the device.
  3. Push to the device, build goes smoothly, but starting the services gets stuck on [Live] Waiting for device state to settle...
  4. Disable local mode, device doesn’t react.
  5. Put it back in local mode, still the same issue.
  6. Get frustrated, investigate :grin:

In order to unbrick our device’s supervisor, we connected through SSH (balena ssh <uuid.local>).
balena ps -a showed the following results:

root@uuid:~# balena ps -a
CONTAINER ID        IMAGE                                                                   COMMAND                  CREATED             STATUS                                 PORTS                                            NAMES
d5cf185f8751        fa7ca3bc2ac4                                                            "bash start.sh 'expo…"   2 hours ago         Up 2 minutes                                                                            browser_3508250_1764659
c2af3dc495f8        7fa68b5dac67                                                            "/usr/bin/entry.sh /…"   2 hours ago         Exited (137) 29 minutes ago                                                             frontend_3508249_1764659
58bbf1ba40c2        10cdf088c9b2                                                            "/usr/bin/entry.sh n…"   2 hours ago         Up 2 minutes                           0.0.0.0:8080->8080/tcp, 0.0.0.0:9229->9229/tcp   app_3508248_1764659
1525d2f5e06c        registry2.balena-cloud.com/v2/ee8a630b4962f1e2b4ad682dd9468f7a:latest   "/usr/src/app/entry.…"   12 days ago         Up About a minute (health: starting)                                                    resin_supervisor

Uh oh. The supervisor seems to be restarting. Let’s check the logs (trimmed for readability):

root@79c84ce:~# balena logs resin_supervisor
[info]    Supervisor v12.4.6 starting up...
[info]    Setting host to discoverable
[warn]    Invalid firewall mode: . Reverting to state: off
[info]    🔥 Applying firewall mode: off
[debug]   Starting logging infrastructure
[info]    Starting firewall
[warn]    Ignoring unsupported or unknown compose fields: stdinOpen, envFile
[debug]   Performing database cleanup for container log timestamps
[success] 🔥 Firewall mode applied
[debug]   Starting api binder
(node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
[info]    API Binder bound to: https://api.balena-cloud.com/v6/
[event]   Event: Supervisor start {}
[debug]   Spawning journald with: chroot  /mnt/root journalctl -a -S 2021-04-14 19:07:31 -o json CONTAINER_ID_FULL=d5cf185f87516c052104e587e4da28654b0b9d5912cd9e07416a5cbc11c4e0d1
[debug]   Spawning journald with: chroot  /mnt/root journalctl -a -S 2021-04-14 19:07:37 -o json CONTAINER_ID_FULL=58bbf1ba40c2db1e4528b73b4c46502e98b80b48ba9f86caab359dbdc1379653
[debug]   Connectivity check enabled: true
[debug]   Starting periodic check for IP addresses
[info]    Reporting initial state, supervisor version and API info
[debug]   Skipping preloading
[debug]   VPN status path exists.
[info]    VPN connection is active.
[info]    Waiting for connectivity...
[info]    Starting API server
[info]    Supervisor API successfully started on port 48484
[info]    Applying target state
[debug]   Ensuring device is provisioned
[error]   Scheduling another update attempt in 1000ms due to failure:  Error: (HTTP code 400) unexpected - 2 matches found based on name: network 1_default is ambiguous
[error]         at /usr/src/app/dist/app.js:10:2303379
[error]       at /usr/src/app/dist/app.js:10:2303311
[error]       at Modem.buildPayload (/usr/src/app/dist/app.js:10:2303331)
[error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:10:2302584)
[error]       at IncomingMessage.emit (events.js:322:22)
[error]       at endReadableNT (_stream_readable.js:1187:12)
[error]       at processTicksAndRejections (internal/process/task_queues.js:84:21)
[error]   Device state apply error Error: (HTTP code 400) unexpected - 2 matches found based on name: network 1_default is ambiguous
[error]         at /usr/src/app/dist/app.js:10:2303379
[error]       at /usr/src/app/dist/app.js:10:2303311
[error]       at Modem.buildPayload (/usr/src/app/dist/app.js:10:2303331)
[error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:10:2302584)
[error]       at IncomingMessage.emit (events.js:322:22)
[error]       at endReadableNT (_stream_readable.js:1187:12)
[error]       at processTicksAndRejections (internal/process/task_queues.js:84:21)
[...]
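
The repeated error names the network the engine could not resolve. A minimal sketch for pulling that name out of a long log, assuming the message keeps the "network <name> is ambiguous" wording shown above (the helper name `ambiguous_networks` is made up for illustration):

```shell
# Hypothetical helper: read supervisor log lines on stdin and print each
# network name reported as ambiguous, once.
# Assumes the "network <name> is ambiguous" wording seen in the log above.
ambiguous_networks() {
  sed -n 's/.*network \([^ ]*\) is ambiguous.*/\1/p' | sort -u
}

# On the device, something like:
#   balena logs resin_supervisor 2>&1 | ambiguous_networks
```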

Looks like there are two networks with an identical name:

root@79c84ce:~# balena network ls
NETWORK ID          NAME                DRIVER              SCOPE
4f4a4c1f23ca        1_default           bridge              local
128b8c05392d        1_default           bridge              local
4c8184636574        1790024_default     bridge              local
5374b2c3295f        bridge              bridge              local
f1991144dff9        host                host                local
e9f4bdaa0b25        none                null                local
2f97014f4a38        supervisor0         bridge              local
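
With many networks, a duplicate is easy to miss by eye. A rough sketch that reads the default `balena network ls` table and lists every name that occurs more than once, together with its IDs. `duplicate_networks` is a hypothetical helper name, and the column layout (ID first, name second, one header row) is assumed from the output above:

```shell
# Hypothetical helper: read the `balena network ls` table on stdin and print
# "<name>: <id> <id> ..." for every network name that appears more than once.
# Assumes the layout shown above: one header row, ID in column 1, name in column 2.
duplicate_networks() {
  awk 'NR > 1 {                 # skip the header row
         count[$2]++
         ids[$2] = ids[$2] " " $1
       }
       END {
         for (name in count)
           if (count[name] > 1) print name ":" ids[name]
       }'
}

# On the device, something like:
#   balena network ls | duplicate_networks
```

Fed the listing above, this would print `1_default: 4f4a4c1f23ca 128b8c05392d`, which gives the two IDs to choose between when removing one with `balena network rm`.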

To restore the device to a usable state, we decided to remove one of the duplicate networks:

root@79c84ce:~# balena stop browser_3508250_1764659 frontend_3508249_1764659 app_3508248_1764659
browser_3508250_1764659
frontend_3508249_1764659
app_3508248_1764659
root@79c84ce:~# balena rm browser_3508250_1764659 frontend_3508249_1764659 app_3508248_1764659
browser_3508250_1764659
frontend_3508249_1764659
app_3508248_1764659
root@79c84ce:~# balena network ls
NETWORK ID          NAME                DRIVER              SCOPE
4f4a4c1f23ca        1_default           bridge              local
128b8c05392d        1_default           bridge              local
4c8184636574        1790024_default     bridge              local
5374b2c3295f        bridge              bridge              local
f1991144dff9        host                host                local
e9f4bdaa0b25        none                null                local
2f97014f4a38        supervisor0         bridge              local
root@79c84ce:~# balena network rm 4f4a4c1f23ca
4f4a4c1f23ca
root@79c84ce:~# balena ps
CONTAINER ID        IMAGE                                                                   COMMAND                  CREATED             STATUS                            PORTS               NAMES
1525d2f5e06c        registry2.balena-cloud.com/v2/ee8a630b4962f1e2b4ad682dd9468f7a:latest   "/usr/src/app/entry.…"   12 days ago         Up 3 minutes (health: starting)                       resin_supervisor

After removing the network and waiting a few minutes for the supervisor’s next attempt, everything went back to normal, and we could put the device in local mode and push to it again.

Hi, thanks for the detailed report. This is a known issue, but your report adds extra context to it: “Two networks with the same name can make the supervisor unable to apply the target state (network is ambiguous)”, issue #590 in balena-os/balena-supervisor on GitHub.
I am going to forward your information to our supervisor maintainers as it will be valuable in solving this.
If you manage to reproduce this reliably, that would be even better, and more precise steps would be super helpful!
Thanks a lot,
Zahari

Thanks @majorz. I will post here and on the issue if I get around to it.