DNS failure not caught by supervisor

Odd, I wonder why those two in particular.

Just balena push <app-name>, nothing fancy.

Just MAC? Should we go the whole hog and statically assign all relevant network settings for all containers?

Going one step at a time is best. The containers sharing a MAC address is definitely an issue, so let’s start from there.

Ok, that’s fair enough.

I took a look through a number of devices yesterday and checked for signs of the issue over the last 3 days of logs (devices showing the issue / devices checked):

2.51: 4/4
2.46: 4/12
2.44: 0/10
[2.38 we know we had 2 devices running for nearly a year with no issues]

The devices on 2.51 were so bad yesterday that for a while even reboots etc. weren’t fixing the issue. One of them (our in-the-field test) is now actually down and never came back up after issuing the reboot.
The other 3 are dev devices, and two have been reflashed to 2.46 so that they can continue their work.
My own device I left on 2.51, and at some point in the afternoon it began working properly again after reboot #x.

Our current plan until we get to the root of this is to flash anything we plan on putting out there to 2.38 (2.44 isn’t available, it jumps from 2.38 to 2.46). Given the Fin recall has pushed our manufacturing and installation schedule back, this still gives a fair window to find a resolution first.

That all said, this morning I am setting up at least one extra device (two if this other one I have works, I don’t recall if it does) with a prod image (mine is dev) to increase our testing base and more accurately reflect what would be in the field.
[Edit: just the one extra. Turns out I don’t have any other power supplies]

To these devices I will be pushing a version with MACs set for the mqtt and networking containers as requested.
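For reference, the change amounts to something like this in the docker-compose.yml (a sketch only - the service build contexts and MAC values below are illustrative placeholders rather than our real ones, and it assumes the standard compose mac_address field is the mechanism used):

    version: "2.1"
    services:
      mqtt:
        build: ./mqtt
        # Pin this container's MAC so it no longer shares one with the
        # networking container (illustrative value)
        mac_address: "02:42:ac:11:00:10"
      networking:
        build: ./networking
        # Distinct, fixed MAC for the networking container (illustrative value)
        mac_address: "02:42:ac:11:00:11"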

Then it’s sit back and wait time.

I shall yell when things have happened.

Sounds like a great plan. Please let us know how the experiments go and we’ll be here to help further!

Device hit the same issue with the mqtt and networking MACs set - e.g. at 2020-09-02T20:25:55.5555Z.

Particular device this is running on: cec711c00f6f9f2c866efe9c0dd088f1.
I’ve granted support access.

Hi,
we have had another look and we can see the balena engine crashing on the device. We suspect this could be the cause, and it may be related to a known bug we are already working on - https://github.com/balena-os/balena-engine/issues/225
Would it be possible for you to

  • Reduce the number of containers by 1 or 2, just for the sake of testing? This would help us determine whether this is really related to the bug mentioned above.
  • If you have devices that do not experience this issue, could you look for balena engine crashes on those? I would expect this to correlate - the devices with no issues should not show engine crashes, while the devices where the engine crashes should show the DNS issue. (One way to check from the host OS is sketched below.)
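If it helps, a quick look from the host OS terminal might be something along these lines (a sketch - the exact unit name and log wording can differ between balenaOS versions):

    # Recent engine service logs - look for panics or unexpected restarts
    journalctl -u balena.service --no-pager | grep -iE "panic|fatal|segfault" | tail -n 50

    # Kernel messages about killed processes (e.g. out-of-memory kills)
    dmesg | grep -iE "oom|killed process" | tail -n 20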

Hmm, ok, sure. I think of all the containers data-send and external-api can be turned off with little to no effect on the general running.

What do you want on the network settings side? Should I leave networking/mqtt with an assigned MAC, or take that out and push it all out sans data-send/external-api?

Hello Aidan, try taking them out completely. If it works without any problems, that brings us closer to confirming it’s related to the issue Michal mentioned. If it still doesn’t work, we’ll need to investigate further.

Hola!

Still dying with those containers removed.

What’s the next thing on the list to try?

Hello, the device checks still show memory running very low (> 90% in use), and I wonder if this might be related to the engine crashes. From what I understand you have some other devices experiencing the same issue, and others that are working fine. We think this issue is caused by the engine crashing, which in turn might be caused by the device simply running out of memory. Would you be able to check whether any of the other failing devices have a similarly low amount of available memory? (You can check this from the device diagnostics page by running the checks.) Could you also compare a pair of working and non-working devices to confirm that the DNS issue is correlated with the engine crashing?
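If it’s easier than the diagnostics page, a rough picture is also available from the host OS terminal (a sketch - this assumes the engine CLI is available on the host as `balena`, which it normally is on balenaOS):

    # Overall memory on the device
    free -m

    # Per-container memory usage as reported by the engine
    balena stats --no-stream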

Our prod device that highlighted all this in the first place by upgrading to 2.51 (online for 2 days and currently suffering with the DNS issue):

My 2.51 test device:

The other test device I have that’s running 2.51 with untouched container config as a control:

One of our prod devices running 2.46 and online for a day (30 hours uptime currently)

Another prod 2.46, but with only ~2h uptime:

Our sole prod device left still running 2.38:

Thoughts? Feelings? Direction to go from here?

Hi there – thanks for the additional information, and apologies for missing your earlier response. We’ll check with our engineers about possible next steps, and get back to you shortly.

All the best,
Hugh

One other thing I meant to add: support access for your device has expired. Can we please ask you to extend it again? Since a couple of devices have been shared before, could you also let us know the current state of each (test device, test device with untouched config, etc.)?

Thanks again,
Hugh

Good morning,

Expired - it’s been a week already? Time flies.

Access granted for another week.

Current state is I’ve just restarted them; they weren’t showing as online. Fairly certain that this is due to my ethernet occasionally being disconnected/reconnected (I need to replace that cable…). The devices don’t seem to like switching between ethernet and GSM connections that much and eventually end up stuck showing as offline.

Before that, they were both in the locked up DNS state that this thread exists to address. I turned off our watchdog so that the issue didn’t get hidden behind auto-restarts.

I suspect we won’t have to wait long for them to reach that state again.
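In case it helps pin down the timing, something like this running in one of the containers would log when name resolution starts failing (a rough sketch - the hostname, interval, and log path are placeholders, and getent may need swapping for nslookup depending on the base image):

    #!/bin/sh
    # Append a timestamped line whenever a DNS lookup fails, so we can
    # see exactly when the device enters the locked-up DNS state.
    while true; do
      if ! getent hosts api.balena-cloud.com > /dev/null 2>&1; then
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) DNS lookup failed" >> /data/dns-failures.log
      fi
      sleep 60
    done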

Thanks.

Hi there – can you confirm which device you’ve granted support to? If it is cec711c00f6f9f2c866efe9c0dd088f1, it looks like that’s offline currently. If it’s another device, can you let us know the UUID?

Thanks,
Hugh

a70bc934ac992a3cc877fc5b4c26d376 (6E8B5B57)
and
cec711c00f6f9f2c866efe9c0dd088f1 (579DFBBE)

You’re right, cec was offline for some reason (lights on, but apparently no one home). Didn’t notice. Will need to see why that was, but I’ve restarted it.

Hi

  • Thanks for granting access. I see a70bc934ac992a3cc877fc5b4c26d376 is online, while cec711c00f6f9f2c866efe9c0dd088f1 still appears to be offline.

We’ve been discussing this issue internally, and there are some things we want to try, with your permission -

  • First is increasing the log levels of dnsmasq. I’ll be happy to do this on your device if we have your permission. Or I can share the details on how to do it.
  • Next I’d like to install tshark. This will give us an idea of what network traffic is going where. I am happy to share the instructions for this as well, so that you can run it yourself and share the details. I understand that network traffic is not something you’d want folks to look at, but even redacted logs would help us figure out what is happening with the requests. (Both steps are roughly sketched below.)
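Roughly what we have in mind, as a sketch only (where balenaOS keeps its dnsmasq configuration, the service name, and the capture output path are assumptions here, and tshark would need to be installed first):

    # 1) Per-query logging in dnsmasq: log-queries is a standard dnsmasq
    #    option; the config location and service name vary by balenaOS
    #    version, so treat these as placeholders.
    echo "log-queries" >> /etc/dnsmasq.conf
    systemctl restart dnsmasq

    # 2) Capture DNS traffic for later (possibly redacted) analysis
    tshark -i any -f "udp port 53" -w /mnt/data/dns-capture.pcap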

While it’d be interesting to have those instructions for ourselves in future, it’ll be much faster and more effective for you to do what you gots to do.

There’s no sensitive data on the devices, and we’re happy enough for you to poke these dev devices to your heart’s content to get to the bottom of it all.

Go nuts.