I experienced the strangest thing today. A day or so ago - all of my home automation went offline. At the core is a HomeAssistant, running on a Balena Fin.
It took me a while to figure it out (there are a LOT of moving parts and integrations) - but eventually I noticed the Mobile App was unable to connect to HomeAssistant.
So - off to the Balena Dashboard. The device showed as ‘online’ – but I couldn’t connect to its (enabled) public URL. So…
I decided to reboot it.
It went offline - and never came back. I’ve checked its internal IP - it’s not even pingable on the local network. Just for kicks, I power-cycled the device, to see if that would help: It didn’t.
So – the device is completely offline, and the “Wife Acceptance Factor” on the home automation is taking a serious hit. Thank goodness we still have a few manual switches that can be used.
Short of hooking up a monitor and keyboard – any suggestions?
Could you please share any logs you were able to rescue from your device? Which Fin, OS and supervisor versions were you running? I understand that you can’t see the diagnostics (temperature, undervoltage, etc.)?
Hi @mpous – thanks for following up! This is quite interesting…
I made several attempts, as noted above, to reboot the device - with the situation just going from bad to worse: from online in the Balena dashboard (while the HomeAssistant app wasn’t functional) to completely offline.
Ironically - a week or so later - it seemed the device somehow came back online, and HomeAssistant started working (I noticed this because some of my scheduled automations were firing). I scratched my head in confusion - but was grateful that things seemed to be working, and with everything going on right now - didn’t dig any deeper.
Then - two days ago – it died again. HomeAssistant unresponsive - but, as of right now, the device is still “online” and accessible in the Balena dashboard. I’m a bit apprehensive to reboot it - as this is the action I took last time which triggered it to go offline altogether.
Guess I’ll go dig into the logs while the device is accessible, and see what I can see.
I’ll report back what I find! Maybe I need a new build with a HomeAssistant update…
Without having any insight into your project, @SnoWake - mysteriously dying and coming back to life in IT sounds like some deep-rooted dependency problem. I’ve seen, for example, projects that included JavaScript libraries from the net which went down - killing the project - and applications dying because some library had been pulled from its package registry (npm, PyPI), and so on. Maybe not the same, but some issue there? Or the Fin cannot really connect to the internet. It’s mostly DNS :'D. (Just some food for thought; I find problems like that as fascinating as they are horrible XD)
Well, this continues to be an interesting mystery. Last night I created a new build, and pushed it to the device using the balena CLI. This all seemed to go as planned, with first the build, and then the push – I could see the images downloading to the device, etc. Then – it just kinda “hung”. It looked like the 3 containers (mqtt, hass-configurator, and homeassistant) had all been ‘updated’ – but looking at the device in the console, it still showed “Target Release” as the newly created one - but “Current Release” still showed the previous. I ran out of steam, and ended up going to bed, thinking I’d check on it today.
This evening - it was still in the same state – but I was surprised to see the “online for” time being measured in minutes. That seemed odd. And still with the out-of-sync Current and Target release. So – crossing my fingers, I dared to reboot the device.
When it came back – finally, the ‘Current Release’ reported the newly-created build. After a few minutes of stabilization, lo and behold… things started working. So… maybe it was a need for a new build of HomeAssistant – though I’m not quite sure why that would be causing the Fin to go offline, and behave quite the way it has.
For now, order is restored: Path lights will come on at dusk, Alexa integration is working for controlling various lights and routines, etc. Gonna be an uphill battle to regain “Wife Acceptance Factor” – she had already long-since resorted to turning things on and off with switches. Hard-earned, easily lost.
Here’s hoping this keeps working - so I can pivot back to the XLights / Falcon Player / Holiday LED lighting project. I’ve got all the raw materials (and even 3D-printed cases, etc) for two more instances of the build - so this year hoping for MOAR LEDs and some sync’d to audio sequences.
Not sure, @SnoWake. Maybe you can share some of the logs you get in here before granting support access? I understand that you tried to connect it over Ethernet as well, right?
No - since it’s ‘headless’, and hanging on a wall in my basement – when it was non-functional (but still appearing online in the Balena console) - I just ignored it. Contemplated pulling it down and hooking up locally, but didn’t.
As to its network connectivity - it appears as “online” to my Ubiquiti network gear - and there’s an AP within a few feet of it, so I don’t expect a coverage problem. If I experience more troubles, I can easily drag an ethernet cable over from a switch port and see if that changes anything.
So far, so good: It’s been online and operational for a couple of days: Mobile apps, Alexa integration, automation routines all working as expected. I only wish I had a slightly deeper understanding of the ZigBee / ZWave configurations – as I’ve got a DEV instance running on a Pi4 w/ a second dongle: In the event of future troubles, would be awesome to have the config replicated, just be able to swap dongles and be up and running.
Device shows as online - but only for 5 hours (since 5:50am). Looking at the logs - there’s nothing shown during the timeframe that the device reports ‘coming online’. Off to check some logs on the OS container, and see what I can see. Any tips on where to look, @mpous?
Dang. I do NOT understand what’s happening with this device! Every time I look at it, the “Online for…” duration is something shorter than expected, and quite recent (often measured in hours, and never more than a few days).
I’ll go string a CAT6 cable to it right now, and see if that has any effect.
@mpous Any suggestions / #ProTips? I shelled into the box and poked around looking for logs / clues - and found absolutely nothing in /var/log on the host OS. The good stuff must be somewhere else…
@SnoWake do you have a Pi or similar that you can run on balenaCloud, even on a fleet with no release deployed, using the same cable? Just to check that there is no problem with the connectivity.
Hey @mpous Well, I’ve got other devices here on the same WiFi network - and I was going to say that they’re rock solid and always connected. Then I went and checked, and - found a similar (but different) short duration that the device has been ‘connected’.
So - I thought hard-wiring the Fin might resolve the problem - but alas - it has not. I’ve got a fairly restrictive, and slightly complicated network here, with dedicated IoT SSID, an enterprise-class next-gen firewall, and … I’ll be darned if I can see any traffic being rejected / denied / blocked / dropped.
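For what it’s worth, one way to test this from the device’s side rather than the firewall’s is to check name resolution and the outbound TCP connection as two separate steps. What follows is only a minimal Python sketch under a few assumptions: the function name probe is made up for illustration, and the host api.balena-cloud.com on port 443 is my guess at an endpoint the device needs to reach, so substitute whatever applies. As far as I know the host OS does not ship Python, so this would have to run inside one of the containers or on another machine on the same network segment.

```python
import socket


def probe(host, port=443, timeout=5.0):
    """Return 'dns-failed', 'tcp-failed', or 'ok' for one host/port pair.

    'probe' is a hypothetical helper name, not part of any balena tooling.
    """
    # Step 1: name resolution. A failure here points at DNS, not the firewall.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failed"

    # Step 2: try each resolved address until one TCP connection succeeds.
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Refused, timed out, or unreachable: try the next address.
            continue
    return "tcp-failed"


# Example call (assumed endpoint, adjust as needed):
# print(probe("api.balena-cloud.com", 443))
```

A 'dns-failed' result would point at name resolution on the IoT network, while 'tcp-failed' would suggest something between the device and the endpoint is dropping the connection, even if the firewall log shows nothing.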
Earlier today, I noticed some of the containers weren’t running (and my home automation services weren’t working, earning me glares of disapproval from the wife :)) - so I decided to reboot the device. Now…
If / when it eventually comes back online (which I assume it will - even though I can make no rhyme or reason of why / when / how) - there MUST be some logs, or telemetry I can look at, on the Fin itself, that will shed some light as to the nature of the disconnects?
In parallel: The device was provided my WiFi configuration when the image was burned. Now that I’ve got it hard-wired as well - it’s got two local addresses. In the event that one might be ‘preferred’ (can’t look at any routing tables right now ;)) or some WiFi trouble might be involved - how can I disable WiFi?
Dang - I just need to get this device online and functional long enough to do the Zigbee and ZWave exclusion of all my home automation devices. At that point, I can port them over to what was my “Dev” instance - a RPI4 running ‘vanilla’ HomeAssistant (no Balena) which has been online for more than six months straight. I kept being reluctant to do it when the device would come online and start working briefly - mostly just because the whole ZWave and Zigbee setup is such a pain.
Right now, the device is online - but of the three containers, the HomeAssistant container just shows ‘installed’ - and repeated log entries reading:
27.11.21 12:15:19 (-0800) Starting service 'homeassistant sha256:d3910c541289471904814a653a0bd8d54ce7fb86de2fa307e1cb5c395ed42918'
Pushing a new build, just in case there was some ‘breaking change’ in a recent update to HA – though, that actually doesn’t make sense, since my current build is pinned to a previous, known-working build. VERY strange - I’m definitely scratching my head over this one.
While the device is online - the HomeAssistant container just will not start. I tried pushing a fresh build: All three containers downloaded, and two of them started - but the HA container - still no dice.
Since I can’t connect to the container (not started) - where do I look in the Host OS for details / insights into why it’s not starting?
I stumbled upon this thread - which seemed ‘similar but different’ - and it revealed a command I would have never ‘stumbled upon’ despite decades as a unix geek and limited fluency with Docker and containers:
journalctl -au balena
Tons of stuff here - but the latest, seemingly relevant errors include:
Nov 28 19:38:52 12b11ac ade1d7b10bf1[1544]: [event] Event: Service start {"service":{"appId":1774679,"serviceId":838645,">
Nov 28 19:38:52 12b11ac balenad[1544]: time="2021-11-28T19:38:52.655491131Z" level=warning msg="Failed to allocate and map >
Nov 28 19:38:52 12b11ac balenad[1544]: time="2021-11-28T19:38:52.992727924Z" level=error msg="bb50ff85e132078d7986308b1537d>
Nov 28 19:38:52 12b11ac balenad[1544]: time="2021-11-28T19:38:52.993176669Z" level=error msg="Handler for POST /containers/>
Nov 28 19:38:53 12b11ac ade1d7b10bf1[1544]: [error] Scheduling another update attempt in 900000ms due to failure: Error:>
Nov 28 19:38:53 12b11ac ade1d7b10bf1[1544]: [error] at fn (/usr/src/app/dist/app.js:6:8594)
Nov 28 19:38:53 12b11ac ade1d7b10bf1[1544]: [error] at runMicrotasks ()
Nov 28 19:38:53 12b11ac ade1d7b10bf1[1544]: [error] at processTicksAndRejections (internal/process/task_queues.js:97:>
Nov 28 19:38:53 12b11ac ade1d7b10bf1[1544]: [error] Device state apply error Error: Failed to apply state transition step>
Nov 28 19:38:53 12b11ac ade1d7b10bf1[1544]: [error] at fn (/usr/src/app/dist/app.js:6:8594)
Nov 28 19:38:53 12b11ac ade1d7b10bf1[1544]: [error] at runMicrotasks ()
Nov 28 19:38:53 12b11ac ade1d7b10bf1[1544]: [error] at processTicksAndRejections (internal/process/task_queues.js:97:>
I’d be delighted to grant support access to the device, and the opportunity to learn!
Thanks in advance to any / all who might help me figure this out. Fairly desperate, at this point, just to get the config out of the device - but ideally, to ‘gracefully’ remove all my ZigBee / ZWave devices, so that I don’t have to go through the ‘--force’ mode on each of them.
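A side note on the journal output above: the trailing ‘>’ on each line is the pager cutting off long lines, so the actual error text is hidden. Running journalctl -au balena --no-pager should print the lines in full. To cut the noise down, here is a rough Python sketch that keeps only lines carrying a warning or error marker. The function name problem_lines is hypothetical, the pattern is inferred only from the few lines pasted above, and since the host OS (as far as I know) has no Python, the idea is to save the journal to a file and run this elsewhere.

```python
import re

# Matches the engine's logfmt severity (level=warning / level=error) and the
# supervisor's "[error]" tag. Pattern inferred from the sample lines only.
PROBLEM = re.compile(r"level=(warning|error|fatal)\b|\[error\]")


def problem_lines(lines):
    """Return only the journal lines that carry a warning/error marker."""
    return [line.rstrip("\n") for line in lines if PROBLEM.search(line)]


# Example (file name is an assumption):
# with open("balena.log") as fh:
#     for line in problem_lines(fh):
#         print(line)
```

With the full, untruncated text of the ‘Failed to allocate and map’ warning and the error after it, the reason the container will not start should be much easier to spot.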
I know I have seen that somewhere recently, but can’t remember where! Let me keep thinking.
P.S. - In regards to the time the device has been “Online” - That might be a bit of a red herring, as we have had API outages and service disruptions over the past days and weeks that would reset that timer. Even if the device was truly online and available, we would have tripped that count with outages.
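One way to tell the two apart is to compare the dashboard’s “Online for” value with how long the device itself has actually been up. A minimal Python sketch, assuming it runs in a container where /proc/uptime reflects the host’s boot time (usually the case, but worth double-checking); the function names are made up for illustration:

```python
def format_uptime(proc_uptime_text):
    """Turn the contents of /proc/uptime into a 'Nd Nh Nm' string."""
    # The first field of /proc/uptime is the number of seconds since boot.
    seconds = int(float(proc_uptime_text.split()[0]))
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes = rest // 60
    return f"{days}d {hours}h {minutes}m"


def read_uptime(path="/proc/uptime"):
    """Read and format the uptime of the machine this runs on."""
    with open(path) as fh:
        return format_uptime(fh.read())
```

If read_uptime() reports days while the dashboard shows hours, the device never rebooted and the short “Online for” value only reflects a dropped and re-established connection to the cloud.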