Odd, I wonder why those two in particular.
Just balena push <app-name>, nothing fancy.
Just the MAC? Should we go the whole hog and statically assign all relevant network settings for all containers?
Going one step at a time is best. The containers sharing a MAC address is definitely an issue, so let's start from there.
Ok, that's fair enough.
I took a look through a number of devices yesterday and checked for signs of the issue over the last 3 days of logs.
2.51: 4/4
2.46: 4/12
2.44: 0/10
[2.38 we know we had 2 devices running for nearly a year with no issues]
The devices with 2.51 were so bad yesterday that for a while even reboots etc. weren't fixing the issues. One of them (our in-the-field test) is now actually down and never came back up after issuing the reboot.
The other 3 are dev devices, and two have reflashed onto 2.46 so that they can continue their work.
My 2.51 I left on 2.51, and at some point in the afternoon it began working properly again after reboot #x.
Our current plan until we get to the root of this is to flash anything we plan on putting out there to 2.38 (2.44 isn't available, it jumps from 2.38 to 2.46). Given the Fin recall has pushed our manufacturing and installation schedule back, this still gives a fair window to find a resolution first.
That all said, this morning I am setting up at least one extra device (two if this other one I have works, I don't recall if it does) with a prod image (mine is dev) to increase our testing base and more accurately reflect what would be in the field.
[Edit: just the one extra. Turns out I don't have any other power supplies]
To these devices I will be pushing a version with MACs set for the mqtt and networking containers as requested.
Then it's sit back and wait time.
I shall yell when things have happened.
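For anyone following along, here is a minimal sketch of one way to pick distinct, stable MAC addresses for the mqtt and networking containers mentioned above. It is not what was actually pushed to the devices; the helper name service_mac and the idea of deriving the address from the service name are assumptions made for illustration. The resulting value would then be set per service in the compose file, assuming it accepts a per-service MAC field.

```python
import hashlib


def service_mac(service_name: str) -> str:
    """Derive a stable MAC address from a service name (hypothetical helper).

    The first octet has the locally-administered bit set and the multicast
    bit cleared, so the address is a unicast address that cannot collide
    with a vendor-assigned one.
    """
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    first_octet = (digest[0] | 0x02) & 0xFE
    octets = [first_octet, *digest[1:6]]
    return ":".join(f"{octet:02x}" for octet in octets)


# The two service names come from the thread; the addresses are illustrative.
macs = {name: service_mac(name) for name in ("mqtt", "networking")}
for name, mac in macs.items():
    print(f"{name}: {mac}")
```

Deriving the address from the name keeps it the same across pushes, which makes it easier to spot in logs whether two containers still end up sharing a MAC.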
Sounds like a great plan. Please let us know how the experiments go and we'll be here to help further!
Device hit the same issue with the mqtt and networking MACs set, e.g. at 2020-09-02T20:25:55.5555Z.
Particular device this is running on: cec711c00f6f9f2c866efe9c0dd088f1.
I've granted support access.
Hi,
we have had another look and we can see balena engine crashes on the device. We suspect this could be the cause and could be related to a known bug that we are already working on - https://github.com/balena-os/balena-engine/issues/225
Would it be possible for you to
Hmm, ok, sure. I think, of all the containers, data-send and external-api can be turned off with little to no effect on the general running.
What do you want on the network settings side? Should I leave networking/mqtt with an assigned MAC, or take that out and push it all out sans data-send/external-api?
Hello Aidan, try taking them out completely. If it works without any problems, that makes it more likely that this is related to the issue mentioned by Michal. If it still doesn't work, we need to investigate further.
Hola!
Still dying with those containers removed.
What's the next thing on the list to try?
Hello, the device checks still show the container being very low on memory (> 90% in use), and I wonder if this might be related to the engine crashes. From what I understand you have some other devices experiencing the same issue, and others that are working fine. We think this issue is caused by the engine crashing, which in turn might be caused by the device simply being out of memory. Would you be able to check if any of the other failing devices have a similarly low amount of available memory? (You can check this from the device diagnostic page by running checks.) Could you also check a pair of working and non-working devices to confirm that the DNS issues are correlated with the engine crashing?
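As a rough cross-check outside the dashboard, the same "more than 90% in use" figure can be estimated on a device from the contents of /proc/meminfo. This is only a sketch: the function name memory_used_percent is made up for illustration, and the diagnostic checks may well compute memory usage differently. The sample text is invented, not taken from any of the devices in this thread.

```python
def memory_used_percent(meminfo_text: str) -> float:
    """Estimate memory in use (percent) from /proc/meminfo-style text.

    Assumption: "in use" means MemTotal minus MemAvailable. The device
    diagnostics may use a different definition.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


# Invented sample; on a device, read the text from /proc/meminfo instead.
sample = "MemTotal: 1000000 kB\nMemFree: 20000 kB\nMemAvailable: 80000 kB\n"
used = memory_used_percent(sample)
print(f"{used:.1f}% in use, above 90% threshold: {used > 90.0}")
```

Running this on one failing and one healthy device gives a quick way to compare them against the 90% figure quoted above.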
Our prod device that highlighted all this in the first place by upgrading to 2.51 (online for 2 days and currently suffering with the DNS issue):
My 2.51 test device:
The other test device I have that's running 2.51 with untouched container config as a control:
One of our prod devices running 2.46 and online for a day (30 hours uptime currently):
Another prod 2.46, but with only ~2h uptime:
Our sole prod device left still running 2.38:
Thoughts? Feelings? Direction to go from here?
Hi there, thanks for the additional information, and apologies for missing your earlier response. We'll check with our engineers about possible next steps, and get back to you shortly.
All the best,
Hugh
One other thing I meant to add: support access for your device has expired. Can we please ask you to extend it again? Because a couple of devices have been shared before, can you also let us know the current state of each (test device, test device with untouched config, etc.)?
Thanks again,
Hugh
Good morning,
Expired - it's been a week already? Time flies.
Access granted for another week.
Current state is I've just restarted them; they weren't showing as online. Fairly certain that this is due to my ethernet occasionally being disconnected/reconnected (I need to replace that cable…). The devices don't seem to like switching between ethernet and GSM connections that much and eventually will stay locked as being offline.
Before that, they were both in the locked up DNS state that this thread exists to address. I turned off our watchdog so that the issue didn't get hidden behind auto-restarts.
I suspect we won't have to wait long for them to reach that state again.
Thanks.
Hi there, can you confirm which device you've granted support to? If it is cec711c00f6f9f2c866efe9c0dd088f1, it looks like that's offline currently. If it's another device, can you let us know the UUID?
Thanks,
Hugh
a70bc934ac992a3cc877fc5b4c26d376 (6E8B5B57)
and
cec711c00f6f9f2c866efe9c0dd088f1 (579DFBBE)
You're right, cec was offline for some reason (lights on, but apparently no one home). Didn't notice. Will need to see why that was, but I've restarted it.
Hi,
a70bc934ac992a3cc877fc5b4c26d376 is online, while cec711c00f6f9f2c866efe9c0dd088f1.
We've been discussing this issue internally, and there are some things we want to try on this with your permission - dnsmasq. I'll be happy to do this on your device if we have your permission. Or I can share the details on how to do it.
While interesting for the future to have those instructions for ourselves, it'll be much faster and more effective for you to do what you gots to do.
There's no sensitive data on the devices, and we're happy enough for you to poke these dev devices to your heart's content to get to the bottom of it all.
Go nuts.