Builder not using cache for unmodified build stages?

We have a large multicontainer application with several services. A few of them use Docker multistage builds to minimise the size of resulting images. I noticed recently that quite often whenever I make a small change to one service, the TypeScript transpilation step is ran for most services that didn’t have a single change. It seems to me that the reason is the builder doesn’t reuse cache for certain stages in the multistage build. An example of what I mean:

[radar-capture]    ---> 2f7e82b05296
[radar-capture]   Removing intermediate container a09e99ed0e54
[radar-capture]   Step 15/26 : RUN gulp build
[radar-capture]    ---> Running in 322e3b4b3922
[radar-capture]   [14:46:20] 
[radar-capture]   Using gulpfile /usr/src/app/gulpfile.js
[radar-capture]   [14:46:20] Starting 'build'...
[radar-capture]   [14:46:20] Starting 'clean'...
[radar-capture]   [14:46:20] Finished 'clean' after 7.15 ms
[radar-capture]   [14:46:20] Starting 'misc'...
[radar-capture]   [14:46:20] Finished 'misc' after 51 ms
[radar-capture]   [14:46:20] 
[radar-capture]   Starting 'config'...
[radar-capture]   [14:46:20] Finished 'config' after 19 ms
[radar-capture]   [14:46:20] Starting 'build'...
[radar-capture]   Files:                 173
[radar-capture]   Lines:               47440
[radar-capture]   Nodes:              196011
[radar-capture]   Identifiers:         65643
[radar-capture]   Symbols:             61008
[radar-capture]   Types:               17075
[radar-capture]   Memory used:        65792K
[radar-capture]   I/O Read time:       0.03s
[radar-capture]   Parse time:          1.02s
[radar-capture]   Program time:        1.55s
[radar-capture]   Bind time:           0.74s
[radar-capture]   Check time:          2.68s
[radar-capture]   transformTime time:  0.10s
[radar-capture]   Source Map time:     0.04s
[radar-capture]   commentTime time:    0.04s
[radar-capture]   I/O Write time:      0.03s
[radar-capture]   printTime time:      0.46s
[radar-capture]   Emit time:           0.46s
[radar-capture]   Total time:          5.43s
[radar-capture]   [14:46:27] Finished 'build' after 6.84 s
[radar-capture]   [14:46:27] Finished 'build' after 6.92 s
[radar-capture]    ---> 00599371819e
[radar-capture]   Removing intermediate container 322e3b4b3922
[radar-capture]   Step 16/26 : FROM balenalib/fincm3-alpine-node:10.14 as buildresult
[radar-capture]    ---> bf12ed2d51ab
[radar-capture]   Step 17/26 : WORKDIR /usr/src/app
[radar-capture]   Using cache
[radar-capture]    ---> 76717fc1a58b
[radar-capture]   Step 18/26 : COPY --from=dependencies /usr/src/app/node_modules node_modules
[radar-capture]   Using cache
[radar-capture]    ---> 15406bde0a86
[radar-capture]   Step 19/26 : COPY --from=transpile /usr/src/app/build ./
[radar-capture]   Using cache
[radar-capture]    ---> ef40490b6b61
[radar-capture]   Step 20/26 : COPY start.sh .
[radar-capture]   Using cache

The initial lines come from a transpile stage that shouldn’t be re-ran because there was no change to the files the Dockerfile depends on. This is reinforced by the fact that actually once that stage finishes, the layers for the next stage (buildresult) are read from cache.
Thoughts?

Hi,

can you share what happened before the RUN gulp build step?

[radar-capture]    ---> 2f7e82b05296
[radar-capture]   Removing intermediate container a09e99ed0e54
[radar-capture]   Step 15/26 : RUN gulp build

I can tell (from what I see) (Removing intermediate container a09e99ed0e54), step 14 was executed and thus the RUN gulp build was invalidated and executed again. It’s hard to say, because we don’t know what happened.

This is reinforced by the fact that actually once that stage finishes, the layers for the next stage (buildresult) are read from cache.

Check the Leverage build cache section. Especially this part:

For the ADD and COPY instructions, the contents of the file(s) in the image are examined and a checksum is calculated for each file. The last-modified and last-accessed times of the file(s) are not considered in these checksums. During the cache lookup, the checksum is compared against the checksum in the existing images. If anything has changed in the file(s), such as the contents and metadata, then the cache is invalidated.

In other words, for the COPY & ADD, checksums are calculated (+ permissions, …) and if they do match, cache is used even if the previous step was executed.

Here’s the full build output from a service (Please note that it’s a different service than before - I chose one with a bit cleaner output - but the same principle applies):

Step 1/25 : FROM balenalib/fincm3-alpine-node:10.14 as prebuild
 ---> bf12ed2d51ab
Step 2/25 : WORKDIR /usr/src/app
Using cache
 ---> 76717fc1a58b
Step 3/25 : RUN install_packages git
 ---> Running in 239cfb2c4975
fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/main/armhf/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/community/armhf/APKINDEX.tar.gz
(1/5) Installing nghttp2-libs (1.32.0-r0)
(2/5) Installing libssh2 (1.8.2-r0)
(3/5) Installing libcurl (7.61.1-r2)
(4/5) Installing pcre2 (10.31-r0)
(5/5) Installing git (2.18.1-r0)
Executing busybox-1.28.4-r2.trigger
OK: 52 MiB in 71 packages
 ---> 218ddf0a6db0
Removing intermediate container 239cfb2c4975
Step 4/25 : COPY package.json package-lock.json .npmrc ./
 ---> 0da83cc786b1
Removing intermediate container 364b7188b836
Step 5/25 : FROM prebuild as dependencies
 ---> 0da83cc786b1
Step 6/25 : RUN JOBS=MAX npm install --production
 ---> Running in 6af673670019
npm WARN basestation-orchestration@1.0.0 No repository field.
npm WARN basestation-orchestration@1.0.0 No license field.

added 313 packages from 288 contributors and audited 83822 packages in 14.832s
found 5 vulnerabilities (2 moderate, 3 high)
  run `npm audit fix` to fix them, or `npm audit` for details
 ---> 260aac88f3b9
Removing intermediate container 6af673670019
Step 7/25 : FROM prebuild as transpile
 ---> 0da83cc786b1
Step 8/25 : ARG ENV=development
 ---> Running in fa526ec4f0fa
 ---> fb0e08261088
Removing intermediate container fa526ec4f0fa
Step 9/25 : RUN JOBS=MAX npm install -g gulp-cli
 ---> Running in 0cd5acc55709
/usr/local/bin/gulp -> /usr/local/lib/node_modules/gulp-cli/bin/gulp.js
+ gulp-cli@2.2.0
added 236 packages from 157 contributors in 13.068s
 ---> 3fc5e89d6b57
Removing intermediate container 0cd5acc55709
Step 10/25 : RUN JOBS=MAX npm install
 ---> Running in f6e2cf6fa665
> typemoq@2.1.0 postinstall /usr/src/app/node_modules/typemoq
> postinstall-build dist
npm WARN basestation-orchestration@1.0.0 No repository field.
npm
 WARN basestation-orchestration@1.0.0 No license field.

npm WARN optional SKIPPING OPTIONAL DEPENDENCY: fsevents@1.2.7 (node_modules/fsevents):

npm WARN
 
notsup SKIPPING OPTIONAL DEPENDENCY: Unsupported platform for fsevents@1.2.7: wanted {"os":"darwin","arch":"any"} (current: {"os":"linux","arch":"arm"})



added 847 packages from 630 contributors and audited 83822 packages in 39.184s
found 5 vulnerabilities (2 moderate, 3 high)
  run `npm audit fix` to fix them, or `npm audit` for details
 ---> 6c3f26b8cb5e
Removing intermediate container f6e2cf6fa665
Step 11/25 : COPY config config/
 ---> d96d232cb84e
Removing intermediate container 4a955585c55f
Step 12/25 : COPY src src/
 ---> 1fc346cb931f
Removing intermediate container 9a3ce1510f3c
Step 13/25 : COPY environments environments/
 ---> da5d61029a07
Removing intermediate container 9aea0f2d6dd6
Step 14/25 : COPY gulpfile.js ioc.config tsconfig.json ./
 ---> 85f1aed53893
Removing intermediate container e8a9b0539a2c
Step 15/25 : RUN gulp build
 ---> Running in 360daffc4d9a
[15:46:14] 
Using gulpfile /usr/src/app/gulpfile.js
[15:46:14] 
Starting 'build'...
[15:46:14] 
Starting 'clean'...
[15:46:14] Finished 'clean' after 30 ms
[15:46:14] Starting 'misc'...
[15:46:14] Finished 'misc' after 68 ms
[15:46:14] Starting 'config'...
[15:46:14] Finished 'config' after 25 ms
[15:46:14] 
Starting 'build'...
Files:                 178
Lines:               49220
Nodes:              208250
Identifiers:         69981
Symbols:             64609
Types:               18301
Memory used:        74015K
I/O Read time:       0.04s
Parse time:          1.11s
Program time:        1.66s
Bind time:           0.80s
Check time:          3.07s
transformTime time:  0.11s
Source Map time:     0.05s
commentTime time:    0.08s
I/O Write time:      0.35s
printTime time:      1.00s
Emit time:           1.00s
Total time:          6.53s
[15:46:22] Finished 'build' after 8.01 s
[15:46:22] 
Starting 'rmSpec'...
[15:46:22] Finished 'rmSpec' after 94 ms
[15:46:22] Finished 'build' after 8.25 s
 ---> 5adcc86cb12d
Removing intermediate container 360daffc4d9a
Step 16/25 : FROM balenalib/fincm3-alpine-node:10.14 as buildresult
 ---> bf12ed2d51ab
Step 17/25 : WORKDIR /usr/src/app
Using cache
 ---> 76717fc1a58b
Step 18/25 : COPY --from=dependencies /usr/src/app/node_modules node_modules
Using cache
 ---> 8696db7ee71e
Step 19/25 : COPY --from=transpile /usr/src/app/build ./
Using cache
 ---> 3d7f45a7555e
Step 20/25 : COPY start.sh .
Using cache
 ---> b377b09ef5da
Step 21/25 : ARG ENV=development
Using cache
 ---> cf70d7c47f22
Step 22/25 : ARG DEBUG_PORT=9229
Using cache
 ---> ee09f71b5a70
Step 23/25 : ENV NODE_ENV $ENV
Using cache
 ---> 04862c759c6a
Step 24/25 : ENV DEBUG_PORT $DEBUG_PORT
Using cache
 ---> c0ee77bcdae0
Step 25/25 : CMD bash start.sh
Using cache
 ---> 6372f942021c
Successfully built 6372f942021c

and the whole Dockerfile.template here:

FROM balenalib/%%BALENA_MACHINE_NAME%%-alpine-node:10.14 as prebuild
WORKDIR /usr/src/app
# === add build build dependencies (system packages) here
# git is always required
RUN install_packages git
# ===
COPY package.json package-lock.json .npmrc ./

FROM prebuild as dependencies
RUN JOBS=MAX npm install --production
# === add runtime dependencies here

# ===

FROM prebuild as transpile
ARG ENV=development
RUN JOBS=MAX npm install -g gulp-cli
RUN JOBS=MAX npm install
COPY config config/
COPY src src/
COPY environments environments/
COPY gulpfile.js ioc.config tsconfig.json ./
RUN gulp build

FROM balenalib/%%BALENA_MACHINE_NAME%%-alpine-node:10.14 as buildresult
WORKDIR /usr/src/app
COPY --from=dependencies /usr/src/app/node_modules node_modules
COPY --from=transpile /usr/src/app/build ./
COPY start.sh .
# === add runtime set-up steps here
# e.g. ENV DBUS_SYSTEM_BUS_ADDRESS unix:path=/host/run/dbus/system_bus_socket
ARG ENV=development
ARG DEBUG_PORT=9229

ENV NODE_ENV=$ENV
ENV DEBUG_PORT=$DEBUG_PORT
# ===
CMD ["bash", "start.sh"]

I guess the main point was that the changes I have made were to a completely different service, and to my knowledge none of the files involved in building this services had any changes. Despite that, all steps from step 3 through to step 16 were actually re-run.

The cache seems to have been invalidated at the point where we RUN install_packages ... - and I am sure there were no changes to the list of packages installed. install_packages is a Balena-specific helper script. I am not sure how that works, but could there be something about it that invalidates the cache?

Are you 100% sure that you didn’t make a change in the following step?

Step 3/26 : RUN install_packages git python make gcc g++ linux-headers udev

Caching hit / miss decision for the RUN is based only on the command (string comparison), ie. the install_packages git python make gcc g++ linux-headers udev.

If you did make a change, it’s okay that all other steps were executed up to COPY (it uses different cache hit / miss mechanism).

But if you didn’t make a change in the step 3, it’s weird.

Yes, apologies, I also noticed the error just now, I posted the wrong Dockerfile contents. Will update the question shortly. Sorry for the confusion!

I edited my comment with the full build log. Now I chose a different service (a bit less steps / warnings) but the same problem appears there. If you can, I would suggest removing the latest comments in the thread to avoid confusion :slight_smile:

Okay, same question. You sure that you didn’t change the following step?

Step 3/25 : RUN install_packages git

Yes, I am sure. In fact, the Dockerfile itself hardly ever changes… Since most of the variation is just in the source and dependencies. We have a total of 11 services running on the device, with 8 of them simple Node+TypeScript projects based on that same or very similar Dockerfile structure. The other 3 don’t have multi-stage builds. Just to double check, I did another balena push today, with code that I knew contained no changes to any of the services. The 3 services took 1 second each. All the Node projects took around 3 minutes 8 seconds each. The behaviour was exactly the same for each of the Node projects - all steps from 3 to 16 were re-ran (hence the 3 minutes running time) and all the other steps, including the last built stage, were read from cache.

The differences I can see between the two groups of services is that the 8 long running ones have multi-stage builds where the 3 others don’t, the 8 are Node projects as opposed to the 3 being Python, and the 8 use balena’s install_packages script in the third step, as opposed to Alpine’s apk add explicitly used by the other 3. I cannot see other factors that could somehow affect this (and even those I listed seem a bit unlikely). I will switch to apk add to see what happens, but I agree it’s weird. Any ideas for something we’re doing wrong that could affect this?

hey @maciej.ldc,

I just tried and install_packages step is cached properly so it’s super weird here.
The command we use in install_packages is apk add --no-cache which equals apk update in the beginning and rm -rf /var/cache/apk/* in the end so it can invalidate the cache if an index that is changed.

Can you please try to do balena push then add make a small change to the Dockerfile.template after third step and push again to see if that step is cached or not?

The test you tried, was that in a multi-stage build? I don’t know if those factors interact but just to be sure that the situation is the same.

Also, could you please give an example of the change I should make, based on the Dockerfile I posted above? Right now I am struggling to see what precisely we are going to test by that, and so I am not really sure what change to make / where to make it. Should it be right after the third step?

Here is the Dockerfile I used for my test

FROM balenalib/armv7hf-alpine:3.8 as prebuild
RUN install_packages git
COPY . ./

FROM prebuild as dependencies
RUN echo 123

FROM prebuild as transpile
COPY . /

FROM balenalib/armv7hf-alpine:3.8 as buildresult
RUN echo 123
CMD ["bash", "start.sh"]

I don’t have the source so can’t perform the node build steps as yours but I think the situation is the same. And here is the build log:

[main]     Step 1/10 : FROM balenalib/armv7hf-alpine:3.8 as prebuild
[main]      ---> 1855cde3590d
[main]     Step 2/10 : RUN install_packages git
[main]     Using cache
[main]      ---> 1825dd2e6a56
[main]     Step 3/10 : COPY . ./
[main]      ---> 8e37534aa2d0
[main]     Removing intermediate container efe30a0afca8
[main]     Step 4/10 : FROM prebuild as dependencies
[main]      ---> 8e37534aa2d0
[main]     Step 5/10 : RUN echo 123
[main]      ---> Running in b092f9d6bf86
[main]     123
[main]      ---> 8a8e4d4659e5
[main]     Removing intermediate container b092f9d6bf86
[main]     Step 6/10 : FROM prebuild as transpile
[main]      ---> 8e37534aa2d0
[main]     Step 7/10 : COPY . /
[main]      ---> 52e4286826ea
[main]     Removing intermediate container 76bced16437f
[main]     Step 8/10 : FROM balenalib/armv7hf-alpine:3.8 as buildresult
[main]      ---> 1855cde3590d
[main]     Step 9/10 : RUN echo 123
[main]      ---> Running in 4fae55b7f214
[main]     123
[main]      ---> b29484c98485
[main]     Removing intermediate container 4fae55b7f214
[main]     Step 10/10 : CMD bash start.sh
[main]      ---> Running in b3924114ac47
[main]      ---> ece541f84491
[main]     Removing intermediate container b3924114ac47
[main]     Successfully built ece541f84491

you can see my second step is cached, it’s because I already did run a normal build (not multi-stage) before. Then I changed my Dockerfile a bit by adding an echo right after install_packages step.

FROM balenalib/armv7hf-alpine:3.8 as prebuild
RUN install_packages git
RUN echo 'new line added here!'
COPY . ./

FROM prebuild as dependencies
RUN echo 123

FROM prebuild as transpile
COPY . /

FROM balenalib/armv7hf-alpine:3.8 as buildresult
RUN echo 123
CMD ["bash", "start.sh"]

And here is the build log:

[Success]  Successfully pulled cache images
[main]     Step 1/11 : FROM balenalib/armv7hf-alpine:3.8 as prebuild
[main]      ---> 1855cde3590d
[main]     Step 2/11 : RUN install_packages git
[main]     Using cache
[main]      ---> 1825dd2e6a56
[main]     Step 3/11 : RUN echo 'new line added here!'
[main]      ---> Running in 706b6553bb14
[main]     new line added here!
[main]      ---> b5e6c335fae9
[main]     Removing intermediate container 706b6553bb14
[main]     Step 4/11 : COPY . ./
[main]      ---> a4cc360a31f6
[main]     Removing intermediate container f144e7d7d63b
[main]     Step 5/11 : FROM prebuild as dependencies
[main]      ---> a4cc360a31f6
[main]     Step 6/11 : RUN echo 123
[main]      ---> Running in 91deb356d73f
[main]     123
[main]      ---> b848503ac0c3
[main]     Removing intermediate container 91deb356d73f
[main]     Step 7/11 : FROM prebuild as transpile
[main]      ---> a4cc360a31f6
[main]     Step 8/11 : COPY . /
[main]      ---> 8e41501ef7ff
[main]     Removing intermediate container dde405e48468
[main]     Step 9/11 : FROM balenalib/armv7hf-alpine:3.8 as buildresult
[main]      ---> 1855cde3590d
[main]     Step 10/11 : RUN echo 123
[main]     Using cache
[main]      ---> 20d848387505
[main]     Step 11/11 : CMD bash start.sh
[main]     Using cache
[main]      ---> ece541f84491
[main]     Successfully built ece541f84491

As you can see, everything is cached properly here.

I am now wondering if this has something to do with https://github.com/moby/moby/issues/30081 - WORKDIR invalidating cache.

Just to double check, can you try setting WORKDIR in those initial build stages?

Hello there,

I’ve just attempted the same build using the Dockerfile Trong posted above, including adding a WORKDIR /usr/src/app before the RUN install_packages git command. Unfortunately, I see exactly the same result as he does, where the cache is still used for those commands and only gets busted on a new RUN command after it.

Have you been able to try the same test, it would be extremely useful to know whether it does or doesn’t work for you, to help us continue to try and track this down.

Best regards, Heds

Hi all,

I did a little experiment and have arrived at the weirdest result.
I basically switched steps 2 and 3:
it used to be

FROM balenalib/%%BALENA_MACHINE_NAME%%-alpine-node:10.14 as prebuild
WORKDIR /usr/src/app
# === add build build dependencies (system packages) here
# git is always required
RUN install_packages git
# ===
COPY package.json package-lock.json .npmrc ./

and now it’s

FROM balenalib/%%BALENA_MACHINE_NAME%%-alpine-node:10.14 as prebuild
# === add build build dependencies (system packages) here
# git is always required
RUN install_packages git
# ===
WORKDIR /usr/src/app
COPY package.json package-lock.json .npmrc ./

I did that on all relevant services, and it fixes the caching problem: pushing the changes once and then pushing them immediately again with no changes to see if the caching works - it does indeed… Reverting one of the services to the old ordering and again pushing twice shows that it’s broken once again for that service. The reason for why this happens is beyond me! For now I am happy to have fixed the timings issue and need to work on my project, but I will try to create a reproducible example based on our project template asap, so you can dig into it…

This is very strange. Using either definition does not break the cache for me. Could you maybe give us the full Dockerfile? Maybe there is something else that we are missing.

The full Dockerfile is in the thread above, but I think it will make more sense if I provide a whole project that resembles the one I am actually working on. I only need to set it up by removing company-specific code so it will take me a while.
The discussion in the moby issue on GitHub that I linked to previously (https://github.com/moby/moby/issues/30081) mentioned a similar problem with WORKDIR breaking cache, and people there said it didn’t reproduce on simpler Dockerfiles but did on more complex ones - perhaps this is also the case here.

Thanks for digging into this @maciej-ldc , if we can get a easy way to reproduce, it will make it much much simpler for us to get it fixed, so please do let us know once you have the project set up and we can get cracking :slight_smile: