Atd jobs getting almost all CPU resources RASPBERRY

Hey!

I’m running Resin on a RaspberryPi3. I have the same image commit running on many devices (>10). The CPU was getting very slow in some devices, so I debug it using htop. Thats when I saw that there was a process consuming almost 80% of the CPU time and the 4 cores were almost at 98% each when normally they should be at 70% top. That process was atd.

So I run atq to see the jobs running and there were a lot of strange jobs running. I removed a bunch of them using atrm jobID and the CPU rested. To see what the Jobs were I used at -c jobID and it seems to be some Resin stuff. The weird thing again is that all jobs throw the same output.

An example of what I got:

root@94e7c0d:/usr/src/app# atq
68      Wed Nov 22 10:00:00 2017 a root
21      Wed Nov 22 09:00:00 2017 a root
64      Wed Nov 22 10:00:00 2017 a root
26      Wed Nov 22 10:00:00 2017 a root
27      Wed Nov 22 09:00:00 2017 a root
72      Wed Nov 22 10:00:00 2017 a root
74      Wed Nov 22 10:00:00 2017 a root
35      Wed Nov 22 09:00:00 2017 a root
78      Wed Nov 22 10:00:00 2017 a root
52      Wed Nov 22 10:00:00 2017 a root
51      Wed Nov 22 09:00:00 2017 a root
47      Wed Nov 22 09:00:00 2017 a root
61      Wed Nov 22 09:00:00 2017 a root
29      Wed Nov 22 09:00:00 2017 a root
41      Wed Nov 22 09:00:00 2017 a root
9       Wed Nov 22 09:00:00 2017 a root
46      Wed Nov 22 10:00:00 2017 a root
54      Wed Nov 22 10:00:00 2017 a root
20      Wed Nov 22 10:00:00 2017 a root
50      Wed Nov 22 10:00:00 2017 a root
58      Wed Nov 22 10:00:00 2017 a root
80      Wed Nov 22 10:00:00 2017 a root
19      Wed Nov 22 09:00:00 2017 a root
18      Wed Nov 22 10:00:00 2017 a root
76      Wed Nov 22 10:00:00 2017 a root
30      Wed Nov 22 10:00:00 2017 a root
38      Wed Nov 22 10:00:00 2017 a root
75      Wed Nov 22 09:00:00 2017 a root
66      Wed Nov 22 10:00:00 2017 a root
10      Wed Nov 22 10:00:00 2017 a root
13      Wed Nov 22 09:00:00 2017 a root
15      Wed Nov 22 09:00:00 2017 a root
53      Wed Nov 22 09:00:00 2017 a root
60      Wed Nov 22 10:00:00 2017 a root
48      Wed Nov 22 10:00:00 2017 a root
28      Wed Nov 22 10:00:00 2017 a root
57      Wed Nov 22 09:00:00 2017 a root
11      Wed Nov 22 09:00:00 2017 a root
12      Wed Nov 22 10:00:00 2017 a root
43      Wed Nov 22 09:00:00 2017 a root
17      Wed Nov 22 09:00:00 2017 a root
39      Wed Nov 22 09:00:00 2017 a root
25      Wed Nov 22 09:00:00 2017 a root
24      Wed Nov 22 10:00:00 2017 a root
69      Wed Nov 22 09:00:00 2017 a root
31      Wed Nov 22 09:00:00 2017 a root
70      Wed Nov 22 10:00:00 2017 a root
67      Wed Nov 22 09:00:00 2017 a root
65      Wed Nov 22 09:00:00 2017 a root
42      Wed Nov 22 10:00:00 2017 a root
71      Wed Nov 22 09:00:00 2017 a root
40      Wed Nov 22 10:00:00 2017 a root
34      Wed Nov 22 10:00:00 2017 a root
56      Wed Nov 22 10:00:00 2017 a root
77      Wed Nov 22 09:00:00 2017 a root
45      Wed Nov 22 09:00:00 2017 a root
79      Wed Nov 22 09:00:00 2017 a root
44      Wed Nov 22 10:00:00 2017 a root
62      Wed Nov 22 10:00:00 2017 a root
37      Wed Nov 22 09:00:00 2017 a root
16      Wed Nov 22 10:00:00 2017 a root
63      Wed Nov 22 09:00:00 2017 a root
59      Wed Nov 22 09:00:00 2017 a root
32      Wed Nov 22 10:00:00 2017 a root
23      Wed Nov 22 09:00:00 2017 a root
22      Wed Nov 22 10:00:00 2017 a root
36      Wed Nov 22 10:00:00 2017 a root
root@94e7c0d:/usr/src/app# at -c 11
#!/bin/sh
# atrun uid=0 gid=0
# mail root 0
umask 22
NODE_VERSION=6.9.1; export NODE_VERSION
TIMEZONE=America/Los_Angeles; export TIMEZONE
RESIN_SUPERVISOR_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx; export RESIN_SUPERVISOR_API_KEY
OPERATION_START_HOUR=0900; export OPERATION_START_HOUR
HOSTNAME=94e7c0d; export HOSTNAME
RESIN_APP_ID=538246; export RESIN_APP_ID
BASE_REPO=cristian-nvidia-net-basic-kiwibot_real_local_3-0.0.3; export BASE_REPO
RESIN=1; export RESIN
RESIN_DEVICE_TYPE=raspberrypi3; export RESIN_DEVICE_TYPE
RESIN_SUPERVISOR_ADDRESS=http://127.0.0.1:48484; export RESIN_SUPERVISOR_ADDRESS
RESIN_WIFI_CONNECT_VERSION=2.0.7; export RESIN_WIFI_CONNECT_VERSION
USER=root; export USER
RESIN_SUPERVISOR_HOST=127.0.0.1; export RESIN_SUPERVISOR_HOST
RESIN_DEVICE_UUID=xxxxxxxxxxxxxxxxxxxxx; export RESIN_DEVICE_UUID
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin; export PATH
RESIN_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxx; export RESIN_API_KEY
PWD=/usr/src/app; export PWD
RESIN_SUPERVISOR_VERSION=6.1.3; export RESIN_SUPERVISOR_VERSION
RESIN_APP_RELEASE=xxxxxxxxxxxxxxxxxxxxxxxxxx; export RESIN_APP_RELEASE
RESIN_APP_NAME=Staging; export RESIN_APP_NAME
ROS_HOSTNAME=localhost; export ROS_HOSTNAME
RESIN_DEVICE_NAME_AT_INIT=kiwibot1; export RESIN_DEVICE_NAME_AT_INIT
OPERATION_FINISH_HOUR=1000; export OPERATION_FINISH_HOUR
SHLVL=1; export SHLVL
HOME=/root; export HOME
DEVICE_TYPE=raspberrypi3; export DEVICE_TYPE
RESIN_HOST_OS_VERSION=Resin\ OS\ 2.3.0+rev1; export RESIN_HOST_OS_VERSION
SOUNDS_URL=xxxxxxxxx; export SOUNDS_URL
SERVER_URL=xxxxxxxxxx; export SERVER_URL
DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket; export DBUS_SYSTEM_BUS_ADDRESS
KIWIBOT_ID=1; export KIWIBOT_ID
RESIN_SUPERVISOR_PORT=48484; export RESIN_SUPERVISOR_PORT
cd /usr/src/app || {
         echo 'Execution directory inaccessible' >&2
         exit 1
}
#! /bin/bash

# This script will run at OPERATION_START_HOUR and at OPERATION_FINISH_HOUR. It will update the update lock for resin
# If the robot is in operational hours, prevent the robot to update

LOG_ROUTE=/data/updates-lock.log
LOCK_FILE_ROUTE=/tmp/resin/resin-updates.lock

CURRENT_HOUR="$(date +%H%M)"
if [ ${CURRENT_HOUR} -ge "${OPERATION_START_HOUR}" ] && [ ${CURRENT_HOUR} -lt "${OPERATION_FINISH_HOUR}" ]; then
        echo "$(date)    OPERATIONAL HOURS, WILL NOT UPDATE" >> ${LOG_ROUTE}
        if [ ! -f ${LOCK_FILE_ROUTE} ]; then
                echo "$(date)    CREATING resin-updates.lock" >> ${LOG_ROUTE}
                touch ${LOCK_FILE_ROUTE}
        fi
else
        echo "$(date)    OUT OF OPERATIONAL HOURS. WILL UPDATE IF UPDATE IS AVAILABLE" >> ${LOG_ROUTE}
        if [ -f ${LOCK_FILE_ROUTE} ]; then
                echo "$(date)    REMOVING resin-updates.lock" >> ${LOG_ROUTE}
                rm ${LOCK_FILE_ROUTE}
        fi
fi

All jobs output the same… Any idea whats happening? Why are so many jobs running and why is a random thing and not happen in all devices??

Thanks in advice!!

Hi @charlie, what resinOS version are you using?

For debugging would suggest flashing a .dev version of resinOS on the device (see more details what dev images are in the docs) and then you can connect to the hostOS with resin-cli: resin local ssh --host to investigate further.

Haven’t really seen cases when the host processes are taking over like that, the only time the system should be using any appreciable CPU or other resources is when it’s downloading the application update or host OS updates and applying them, but that’s not the case here, I guess.

I am using Resin OS 2.3.0+rev1. I havent used the .dev version yet, because I will not know what to check there.
Indeed there was not downloading the application update or host OS updates. It was some atd jobs messing around with the CPU and the content of them is some RESIN configuration stuff.

Well, the hunch was that in the host OS you might be able to check more what processes those jobs belong to? Wondering if there’s any more information on this, because it would be good to know a bit more specific info on what’s the resin service that might be causing the issue in your case…

Thanks! I indeed didnt realize something very important. I have a resin-start.sh that is executing at startup. That file launchs this:

# Create/delete lockfile on startup
/bin/bash -c "sh /usr/src/app/resin-updates.sh"

atd
# Create/delete lockfile when the operation starts/finish
at ${OPERATION_START_HOUR} -f /usr/src/app/resin-updates.sh
at ${OPERATION_FINISH_HOUR} -f /usr/src/app/resin-updates.sh

And resin-updates.sh looks like this:

#! /bin/bash

# This script will run at OPERATION_START_HOUR and at OPERATION_FINISH_HOUR. It will update the update lock for resin
# If the robot is in operational hours, prevent the robot to update

LOG_ROUTE=/data/updates-lock.log
LOCK_FILE_ROUTE=/tmp/resin/resin-updates.lock

CURRENT_HOUR="$(date +%H%M)"
if [ ${CURRENT_HOUR} -ge "${OPERATION_START_HOUR}" ] && [ ${CURRENT_HOUR} -lt "${OPERATION_FINISH_HOUR}" ]; then
	echo "$(date)    OPERATIONAL HOURS, WILL NOT UPDATE" >> ${LOG_ROUTE}
	if [ ! -f ${LOCK_FILE_ROUTE} ]; then
		echo "$(date)    CREATING resin-updates.lock" >> ${LOG_ROUTE}
        	touch ${LOCK_FILE_ROUTE}
	fi
else
	echo "$(date)    OUT OF OPERATIONAL HOURS. WILL UPDATE IF UPDATE IS AVAILABLE" >> ${LOG_ROUTE}
	if [ -f ${LOCK_FILE_ROUTE} ]; then
		echo "$(date)    REMOVING resin-updates.lock" >> ${LOG_ROUTE}
        	rm ${LOCK_FILE_ROUTE}
	fi
fi

The thing as said, sometimes randomly atd takes control of almost all CPU by having a lot of atd processes. But that happens not often and apparently random

So, if I understand correctly, all those tasks are coming from your user application?

I think they are coming from the user application. If you see first post it says root so I think its from the application indeed

Hi,
I would suggest to add some extra logs in your resin-updates.sh to monitor & verify how often it runs.
It might also be useful to log and clear the at daemon’s queue each time the resin-updates.sh runs.