Both my RPi4 2GB units keep resetting their calculations; the 4GB runs fine

I have 3 RPi4’s running Balena’s boinc image, serving rosetta@home for a couple of days now. My 4GB unit runs fine, no issue, but both the 2GB units get stuck with their calculations and seem to be resetting boinc over and over again. If I watch them I see them count up to a certain percentage for their 2 tasks and then the connection is lost; shortly after they are back, starting the same calculations from zero.
From the F9 menu I am not able to throw away the ‘malfunctioning’ tasks, I can suspend them but that’s it.
Does anyone else have the same experience?

Hey @pe0ter welcome to the forums, thanks for trying the project and thanks for reporting this! I just wanted to confirm that I had been witnessing this on my own test devices as well, and we are going to roll out an update to address this. What we’ll be doing is restricting devices with under 2.5GB memory to only run a single task at a time; we did have one method for doing this in place already so I’m surprised that your device managed to get two tasks! Details are here: https://github.com/balenalabs/rosetta-at-home/issues/29

Thanks, @chrisys for the info.
This morning I created 2 new 16GB SD cards for my 2GB RPi4’s and they are now running one task each. I also noticed my 4GB RPi4 had re-initialised itself without uploading any data and started from scratch with 4 tasks and 4 extra tasks on standby. I have stopped him and re-created his SD card as well. After startup he got 4 tasks to run and 4 on standby.
All in all this project using RPi’s looks ok from the outside but I doubt it will be of much use as I think I am not the only one having issues.
As a user I have no options to see its CPU load or its temperature, and I am not able to tune the number of tasks it is running to try and keep it going if it has an issue, so my only option is to re-image a card and throw away many many hours of calculation work the little guy has done so far.
So, I appreciate the team is busy improving stability but would appreciate some user control too to manage load.

During dinner, all my systems ‘crashed’ again. None of them have uploaded calculation results yet as far as I can see from the logs displayed, so I powered them all down and connected them to a network switch to see if that helps. If they still crash I will pack the lot back into the box they came from and only run folding@home on my Intel PCs and Macs.

@pe0ter just to let you (and others) know, we’ve recently deployed the latest version of the rosetta-at-home repo to address the issues that you’re describing (this may have been what you saw during dinner). The 2GB Pis should now only run 1 task concurrently, and the 4GB ones will limit the amount of RAM the client uses to 95% to preserve system stability. This update also included a change to ensure that in future updates, progress towards tasks is retained and stored on a persistent data volume.
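For anyone curious how a memory-based limit like this could be expressed, here is a minimal sketch. It is not the code from the rosetta-at-home repo. It assumes a 4-core Pi, uses the 2.5GB threshold mentioned earlier in the thread, and relies on the standard BOINC preference elements `max_ncpus_pct`, `ram_max_used_busy_pct` and `ram_max_used_idle_pct` in a `global_prefs_override.xml` file. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Threshold taken from the thread: devices under 2.5GB run a single task.
SMALL_MEMORY_LIMIT_GB = 2.5


def total_memory_gb(meminfo_text: str) -> float:
    """Parse the MemTotal line of /proc/meminfo (the value is in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemTotal not found")


def build_override(mem_gb: float, cores: int = 4) -> str:
    """Return hypothetical global_prefs_override.xml content for a device."""
    prefs = ET.Element("global_preferences")
    if mem_gb < SMALL_MEMORY_LIMIT_GB:
        # Allowing 1/cores of the CPUs means one task at a time (assumption:
        # every task is single-threaded).
        ET.SubElement(prefs, "max_ncpus_pct").text = str(100 / cores)
    else:
        # Larger devices keep all cores but cap the RAM the client may use.
        ET.SubElement(prefs, "ram_max_used_busy_pct").text = "95"
        ET.SubElement(prefs, "ram_max_used_idle_pct").text = "95"
    return ET.tostring(prefs, encoding="unicode")
```

On a device you would pass the contents of `/proc/meminfo` to `total_memory_gb` and write the result of `build_override` into the BOINC data directory; where that directory lives inside the balena image is something you would have to check yourself.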

However, if you would like more control, I’d recommend deploying the app from the repo to a balenaCloud account of your own, then you’ll have full control over the code on the device and be able to tune it as much as you like.

To reassure you though, the Fold for Covid team is making a great contribution; we’re ranked (at time of writing) in 47th place in the Rosetta project: https://boinc.bakerlab.org/rosetta/top_teams.php?sort_by=expavg_credit&offset=40 so massive thanks to everyone who has joined and is keeping their devices online! :slight_smile:

Again, thank you @chrisys and team for your effort.
Be careful writing everything to SD card as these cards are not made for continuous access and having RPi’s run stable but in the meantime destroying SD cards is not a good idea…
I will monitor my systems and keep you posted.

Another thought: I have a few 120GB SSDs around with USB 3.0 interfaces. If I connect those to my RPi’s, would your image detect them and use them for swap and storage instead of the SD card?

If you boot the RPi4 from the SSD then certainly (as there should be no need to have an SD card inserted). The instructions to do this are on the RPi website if you haven’t done it before - it is different on RPi4 than on earlier models if I remember correctly.
UPDATE: Sorry, I don’t think you can boot from USB on RPi4 yet without a boot partition on an SD card that then points to the USB device. This might not work for this project, I will give it a go…

@chrisys Sorry for my stupidity, but does this mean I have to download and recut the SD cards and reboot to get the updates?

@vinntec no, if you flashed the image from the Fold for Covid website your device would get the update (as long as it’s online) :+1:


@chrisys ,
My 2GB units are working fine, finish tasks and get new ones, but my 4GB unit is in trouble. It just disappears and when it does, I can ping it but it refuses a browser connection. A power-down/power-up brings it back up again in my browser but only for a few hours. This happened 3 times today.

My Macbook had the last screen still visible:

After reboot, the log in the bottom left corner shows nothing on what has happened, it just shows some data from last reboot, around 14:40 this afternoon but nothing on upload/download, etc.

The 4GB unit has the latest boot-rom-firmware installed and uses the balena image that was put online this morning.

Update 30/04/2020
‘Stuck’ again this morning, no way I could get in. It responded very fast to a ping (via the Fing app) but told my browsers they were not authorised to make a connection. (iMac, Macbook Pro, 2 Windows PCs, iPad, iPhone)
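A ping only shows that the network stack is alive; it says nothing about whether the web interface is still listening. A small probe like the sketch below can tell a refused connection (nothing is listening on the port) apart from a timeout (the device is too busy or stuck to answer). It assumes the web interface is served on port 80, and the address in the usage note is a placeholder.

```python
import socket


def probe_port(host: str, port: int = 80, timeout: float = 3.0) -> str:
    """Try a TCP connection and report 'open', 'refused', 'timeout' or an error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no service is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. no route to host or name resolution failure.
        return f"error: {exc}"
```

Calling `probe_port("192.168.1.50")` (hypothetical address) while the device is in its stuck state would show which of the two cases applies.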

After power-down/up screen looked like below

Seems like issues occur after tasks are finished and are being uploaded or when tasks are remotely deleted by the rosetta@home server.

I have configured my Pi4’s to always keep their network open (F9 - Activity - Network Activity Always Available)

Actually, I have to wonder if it is so deep in a calculation / simulation that it does not reply to the request for a page to be rendered in time. (Thus the browser just hanging). The 4gb devices attempt more jobs concurrently, and that may be playing a role. Also, ensure you have a high quality power supply, and I have added a heatsink and fan to my 4gb Pi, as mine gets very hot. Hope that helps!


Original 3.1Amp Raspberry Pi power supply and case with fan and heatsink

And again an Abort, followed by an automatic re-initialize

And a connection-refused

I will take my 4GB unit offline as it does not contribute to Rosetta@home properly this way. At least the 2GB units keep calculating and sending results, but when the 4GB unit throws away its work all the time it is of no use to the project.

@pe0ter interesting results. Could you try to see if you can achieve stability by limiting the Pi 4 4GB device to 3 or even 2 tasks? You’d have to modify the code and deploy the app in the manual way but it would be a valuable test that we could roll out to everyone and increase overall contributions if we can fix it.
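One way such a test could be set up, sketched here under assumptions: BOINC reads an optional `app_config.xml` from a project's directory, and its `project_max_concurrent` element caps how many tasks of that project run at once. The helper below only builds the file content. The function name is made up, and where the Rosetta project directory sits inside the balena container is not something this thread confirms.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks: int) -> str:
    """Return app_config.xml content capping concurrent tasks for one project."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    root = ET.Element("app_config")
    # project_max_concurrent limits running tasks across all apps of the project.
    ET.SubElement(root, "project_max_concurrent").text = str(max_tasks)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Example: limit a 4GB Pi to 3 tasks, as suggested above.
    print(build_app_config(3))
```

The client has to re-read its configuration files (or be restarted) before a new limit takes effect.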

I was running my 4 x RPi4 4GB with 8GB SD cards, but have rebooted them with 16GB cards which arrived today. Since then 3 of them are showing the web interface no problem [so far] but one of them is refusing connections so I am not sure what it is doing.

The other thing I noticed is that earlier one of them was flashing the green SD LED four times repeatedly. I powered it off and reseated the SD card and it seems OK now. Is this a known signal?

They are in an open rack with fan cooling but there is no sign of there being a problem.

@chrisys
Yesterday evening I created a new SD card image and used it on my RPi4 which I had installed into a housing with 7" RPi-touchdisplay and a bigger FAN. It is possible to connect keyboard and mouse to Pi but they do not work on this image, it is clearly designed to be a remote server. But the screen should show the same info the external browser does.
It booted fine and collected 8 tasks, 4 being executed.
After some 10-15min it froze, showing a red ‘offline’ on the top-line. From that point it was not possible to refresh communication in the browser, connection refused.
A reboot did not help, it ran up to an empty task screen with a red ‘offline’ on the top, I could refresh the browser connection but it did not want to connect to the server.
I then tried a slow/verified format of my SD card to FAT32 to make sure the card had no issues, then used balena to reimage with latest image and startup the RPI4 again.
No problem collecting 8 tasks, running 4 but after some 10-15 min freeze. After again some 5 min display faded away it rebooted by itself and showed the task screen but did not calculate, then again an automatic reboot with same results.
Next a new reimage of the SD card, and once it had collected its 8 tasks I started suspending calculations via the F9 menu - task - suspend (S). This worked, but the little bugger instantly started new tasks from its standby list as soon as I suspended one. Eventually I had him running only 1 task with 7 suspended.
The result was the same, after some 10-15 min freeze with red offline. After manual reboot empty task screen and no remote connection possible.

So now I have run out of options. It cannot be a CPU-load problem as it still crashes with only one task running, so I suspect it to be a memory-load problem.

Would it be possible for you guys to create an image where, before I download it, I can set the number of tasks I would like my RPi4 to download and run, so I can control a bit what it is doing?
Currently I can set the Wifi SSID and password, and an extra field should not be an issue: you DO control the number of tasks when memory is only 2GB, so it must be possible to set it?

@pe0ter While trying to figure out how to use an SSD yesterday (failed because RPi4 still needs a boot SD card), I had keyboard and mouse connected (do nothing) but the monitor on HDMI was showing same screen as the web interface after booting.

I too am losing the web interface (I only have RPi4 4GB) but every so often it comes back, and since I put 16GB SD cards in they are lasting longer and recovering better than before (this is not scientific as it might be a coincidence). The trouble is figuring out which of my 4 RPi4 to put the monitor on, as it would be easier to see what it is doing when the web interface dies (temporarily or otherwise). I will give it a go but it means rebooting one of them as I have to get it out of the rack to connect it.

@vinntec Thanks for the heads-up.
I am only using 16GB SD cards, SanDisk Ultra, as they are the best ‘fit-for-all’ solution for most RPi applications I am playing with. So I have no experience with smaller cards in this application.
Yesterday evening I had a situation where the RPi4 7" display showed a task screen with a red ‘offline’ and no tasks, but the web-browser connection showed it WAS running 4 tasks and updating its progress, at least for some 5 min, before it froze again. When it freezes, the processor cooling fins feel cold.
I have a power socket power monitor somewhere, I will try to find it today and connect my 4 RPi4’s to it so I can see if they work based on power consumption. Just hope I remember where I left it…

update 11:30 CET

Found it, pulled the power plugs and hooked-up 3x RPi4-2GB and 1x RPi4-4GB to my power monitor. After reboot total power load was about 18.5W.
The 3 RPi4-2GB units continued calculating one task each where they had left off, but the RPi4-4GB showed online with 7 tasks suspended. (Where had the 8th task gone? Was it completed overnight or is it lost? The unit showed offline this morning.)

So I activated 1 task and watched it run for some 15 min, no problem, power had increased some 0.4W, fluctuating a bit. Then I activated a second task and let it run, again no problem, and power increased by about another 0.4W. Same for the 3rd task.
At the moment everything seems to be running stable, total power consumption fluctuates between 19.8 and 20.5W, I’ll leave it for now.
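As a rough sanity check on those readings, the figures can be added up. This is only arithmetic on the numbers quoted above and assumes each task adds the same 0.4W.

```python
# Figures quoted in the post above.
baseline_w = 18.5   # total draw after reboot, 4GB unit with all tasks suspended
per_task_w = 0.4    # approximate increase seen for each activated task
active_tasks = 3

# Expected total once three tasks are running on the 4GB unit.
expected_w = baseline_w + per_task_w * active_tasks
print(f"expected total: {expected_w:.1f} W")
```

That gives about 19.7W, just under the observed 19.8 to 20.5W range, so the readings are consistent with three tasks actually running.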

update 19:20 CET

RPi4-4GB had finished its 3 given tasks and uploaded the results, after which I activated 3 other tasks for it. Seems to run flawlessly now (fingers crossed)


So far I have been running one of my RPi4 4GB with a monitor attached and it is working fine, except that it reboots every so often (on web it goes offline but can be reconnected quickly). I can’t tell if this is normal or not as after reboot the messages are lost - I am using standard image so have a very limited view of things. A different RPi4 4GB is refusing web connections currently. 7 running altogether now - 4 x RPi4 4GB + 3 x RPi3.

@chrisys @vinntec
This morning I found my RPi4-4G stuck again, after finishing the 3 tasks given last night it had grabbed itself a full load of tasks, started executing 4 of them and got stuck after a while. It seems RPi4-4G runs stable executing only 3 tasks but is likely to get stuck with 4 tasks.

That’s interesting as two of mine are now running 3 tasks, so maybe this tweak has already been made?