I’m getting a ‘no logs found’ error. We have 6 devices running: 5 are in physically inaccessible locations, plus 1 spare that we have with us.
Of these 6 devices, 3 are unable to download/sync our latest image; the common symptom across these 3 is a ‘no logs found’ error in both the CLI and the main UI.
After we power cycled the spare device that was with us, it was able to sync and download the most recent image. But since the other 5 devices are not with us, we’re unable to power cycle them. Is there anything else we can do?
We’ve given you access to one of our devices (e709026bc78974b75fe136241e9692f8).
@prestonlim is my colleague, and rectifying this issue is important to us. We are also in the midst of trying out resin.io (Starter), and so far it has been impressive. However, this issue is the most critical one we have encountered so far.
Fortunately, I have found a way to mitigate this. Below is a post-mortem:
OK, so resinOS runs a special container engine called balena. I used the balena command on the host OS (SSHed in) to see the running containers, and found a supervisor container called resin_supervisor (the command I used is sketched just after the log output below). I believe this supervisor container is the process that communicates with resin.io. Out of curiosity, I went to check its logs:
root@e2779d9:~#
root@e2779d9:~# balena logs --tail=20 resin_supervisor
process '/usr/src/app/run.sh node /usr/src/app/dist/app.js' (pid 255) exited. Scheduling for restart.
starting pid 265, tty '/dev/stdout': '/usr/src/app/run.sh node /usr/src/app/dist/app.js'
[2018-04-13T09:24:27.307Z] Knex:warning - Can't take lock to run migrations: Migration table is already locked
[2018-04-13T09:24:27.314Z] Knex:warning - If you are sure migrations are not running you can release the lock manually by deleting all the rows from migrations lock table: knex_migrations_lock
[2018-04-13T09:24:27.322Z] Unhandled rejection MigrationLocked: Migration table is already locked
process '/usr/src/app/run.sh node /usr/src/app/dist/app.js' (pid 265) exited. Scheduling for restart.
starting pid 275, tty '/dev/stdout': '/usr/src/app/run.sh node /usr/src/app/dist/app.js'
[2018-04-13T09:25:02.556Z] Knex:warning - Can't take lock to run migrations: Migration table is already locked
[2018-04-13T09:25:02.564Z] Knex:warning - If you are sure migrations are not running you can release the lock manually by deleting all the rows from migrations lock table: knex_migrations_lock
[2018-04-13T09:25:02.571Z] Unhandled rejection MigrationLocked: Migration table is already locked
process '/usr/src/app/run.sh node /usr/src/app/dist/app.js' (pid 275) exited. Scheduling for restart.
starting pid 285, tty '/dev/stdout': '/usr/src/app/run.sh node /usr/src/app/dist/app.js'
[2018-04-13T09:25:37.772Z] Knex:warning - Can't take lock to run migrations: Migration table is already locked
[2018-04-13T09:25:37.779Z] Knex:warning - If you are sure migrations are not running you can release the lock manually by deleting all the rows from migrations lock table: knex_migrations_lock
[2018-04-13T09:25:37.786Z] Unhandled rejection MigrationLocked: Migration table is already locked
process '/usr/src/app/run.sh node /usr/src/app/dist/app.js' (pid 285) exited. Scheduling for restart.
starting pid 303, tty '/dev/stdout': '/usr/src/app/run.sh node /usr/src/app/dist/app.js'
[2018-04-13T09:26:13.091Z] Knex:warning - Can't take lock to run migrations: Migration table is already locked
[2018-04-13T09:26:13.099Z] Knex:warning - If you are sure migrations are not running you can release the lock manually by deleting all the rows from migrations lock table: knex_migrations_lock
[2018-04-13T09:26:13.107Z] Unhandled rejection MigrationLocked: Migration table is already locked
root@e2779d9:~#
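For reference, finding the supervisor container in the first place was just a matter of listing the running containers on the host OS. balena mirrors the Docker CLI, so something like this should do it (a sketch; I’m going from memory on the exact command):
# List running containers on the host OS; the supervisor shows up as
# resin_supervisor alongside the application container(s).
balena ps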
It turns out that the supervisor container is unhealthy and unable to start because of a locked SQLite table.
Inspecting the container tells me that the database file is located at /resin-data/resin-supervisor/database.sqlite on the host OS (the inspect command is sketched just after these). So I ran this series of commands:
cd /resin-data/resin-supervisor
mv database.sqlite database.sqlite.bak
balena restart resin_supervisor
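For completeness, this is roughly how I confirmed where the supervisor’s database lives before touching anything; balena mirrors the Docker CLI here, and the exact mount layout may differ between OS versions, so treat it as a sketch:
# Show the supervisor container's bind mounts; one of them should map
# /resin-data/resin-supervisor on the host into the container.
balena inspect --format '{{ json .Mounts }}' resin_supervisor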
Basically, I took a leap of faith, hoping that the supervisor is smart enough to initialise a new database if it can’t find one. It did initialise a new one, and the device is now back in business.
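As an aside, the Knex warning in the log above also suggests a less drastic alternative: clearing the knex_migrations_lock table rather than discarding the whole database. I haven’t tried it, and it assumes a sqlite3 client is available (it may not be on the host OS), so it’s only a sketch:
# Untested sketch: release the stuck migration lock directly, as the Knex
# warning suggests, instead of moving the database aside.
# Requires a sqlite3 binary, which may not be present on the host OS.
sqlite3 /resin-data/resin-supervisor/database.sqlite "DELETE FROM knex_migrations_lock;"
balena restart resin_supervisor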
This is definitely not a long-term solution. I hope the resin.io team finds these details helpful and provides a fix that prevents this issue from occurring in the future.
We’re seeing something similar, where devices are occasionally getting into a state where they don’t respond to most actions on the web management interface.
When the Logs panel says “No logs yet”, we’ve found that it’s impossible to SSH into the application container (we just see an “SSH session disconnected” message followed by multiple failed retries), the restart button doesn’t do anything, and the reboot button produces the following error: “Device not found: 893482”.
We’ve seen these symptoms five or six times in roughly the last three months, and previously we’ve had on-site staff power the Pi off and back on again. In today’s case even that hasn’t worked; we’re still unable to manage the device remotely and are having to image a replacement.
It doesn’t seem like a hardware failure; for example, today’s failing device is still talking to the database app it connects to and still sending data to the network printer it prints to. It’s also registering with the Resin VPN, at least well enough to show up on the device list with an accurate online timer.
This is running supervisor version 6.3.6 and host OS 2.7.5+rev1.
Same problem… Since this morning I have a node that can no longer be controlled from the web application; no action from the web page works. My application seems to be working, but the node doesn’t receive updates! Help!
@chrissng good job finding the workaround! You’re right that the supervisor will recreate the database, and deleting it was the correct fix in this case (though don’t rely on this behavior being supported or staying like this in the future).
Since the bug happens with database migrations, it should not reoccur once you’ve applied the workaround.
@leandro I believe you’re hitting the same issue, as noted on the other thread.
@markweston since you’re on a different OS version, the issue you’re hitting must be a different one. Could you point us to the UUID of an affected device and grant us support access? I believe an OS update is also likely to fix your issue.
The trouble is, we’ve only seen this problem at times when we need the device back up and running right now. I ran into the issue just today, but there wasn’t time to provide support access and wait until you were available to take a look.
What I will do is update all our devices to the latest version of ResinOS and see if the bug recurs after that.