2 multi-container Apps with the same code; one works and the other doesn't (image download problems?)

I am trying to deploy the same code (identical git hash) to two Applications. The first one works: the multi-container deployment came up and everything is fine. The second one doesn't. There are some environment variables that differ between the two applications, but the only other significant difference I can see is that the connected devices are on different wifi networks. In the failing application, one device is on a rather poor VDSL connection and the other has a reasonably good VDSL connection.

The device with the poor connection is failing with a lot of logs like this:

10.11.18 10:57:28 (+1100) Downloading image 'registry2.balena-cloud.com/v2/aa5a2db06d3e56b4cf0da4cc64a7090f@sha256:98ee3a2a1bba61e2978e75c9d77a628152b6e88cb8d7dde89df7eebb6bdfb839'
10.11.18 10:57:55 (+1100) Failed to download image 'registry2.balena-cloud.com/v2/aa5a2db06d3e56b4cf0da4cc64a7090f@sha256:98ee3a2a1bba61e2978e75c9d77a628152b6e88cb8d7dde89df7eebb6bdfb839' due to '(HTTP code 500) server error - Get https://registry2.balena-cloud.com/v2/v2/aa5a2db06d3e56b4cf0da4cc64a7090f/manifests/sha256:98ee3a2a1bba61e2978e75c9d77a628152b6e88cb8d7dde89df7eebb6bdfb839: net/http: TLS handshake timeout '

And, predictably, the image downloads keep restarting. Sometimes the download restarts when the dashboard reaches 100% (same error), sometimes it’s part-way through download when it fails. Usually the image downloads seem to fail independently, but occasionally all 7 images fail in synchrony.

The failing device with the faster internet connection has downloaded all 7 images, but it only starts one successfully; two keep restarting (postgres and redis) with complaints about missing files. The other 4 containers are not even trying to start (status “Downloaded”), but that makes sense given the dependency tree between the containers.
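For context, the dependency tree I mentioned is expressed in the compose file with `depends_on`. A minimal illustrative fragment (the `app` service name here is invented, not from my actual compose file):

```yaml
# Illustrative only: containers listed under depends_on are started first,
# so if postgres/redis never come up, dependents stay in "Downloaded".
app:
  depends_on:
    - postgres
    - redis
```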

I’ve tried waiting patiently, rebooting, restarting, and poking buttons randomly. I’m not sure what else to try… can anyone give me some hints?

Hi, I’m afraid that given the error code, and the fact that it’s a network with known connectivity issues, there’s not much we can do about that, really.

Are you seeing issues also on the device with the decent internet connection?

Well, the Application whose devices both have good connections is still OK. In the Application with one bad and one mediocre connection (totally different networks, in different parts of town), both devices have now downloaded all the container images (the bad one took another 9 hours). However, they are failing to start their containers: postgres and redis are stuck in restart loops, and the containers that depend on them are not starting at all.

Postgres is looping like this:

10.11.18 20:17:03 (+1100)  postgres  LOG:  skipping missing configuration file "/var/lib/postgresql/data/postgresql.auto.conf"
10.11.18 20:17:03 (+1100)  postgres  postgres: could not find the database system
10.11.18 20:17:03 (+1100)  postgres  Expected to find it in the directory "/var/lib/postgresql/data",
10.11.18 20:17:03 (+1100)  postgres  but could not open file "/var/lib/postgresql/data/global/pg_control": No such file or directory

And redis starts, then exits, like this:

10.11.18 20:19:05 (+1100)  redis  1:C 10 Nov 09:19:05.026 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
10.11.18 20:19:05 (+1100)  redis  1:M 10 Nov 09:19:05.032 # Warning: 32 bit instance detected but no memory limit set. Setting 3 GB maxmemory limit with 'noeviction' policy now.
10.11.18 20:19:05 (+1100)  redis                  _._
10.11.18 20:19:05 (+1100)  redis             _.-``__ ''-._
10.11.18 20:19:05 (+1100)  redis        _.-``    `.  `_.  ''-._           Redis 3.2.12 (00000000/0) 32 bit
10.11.18 20:19:05 (+1100)  redis    .-`` .-```.  ```\/    _.,_ ''-._
10.11.18 20:19:05 (+1100)  redis   (    '      ,       .-`  | `,    )     Running in standalone mode
10.11.18 20:19:05 (+1100)  redis   |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
10.11.18 20:19:05 (+1100)  redis   |    `-._   `._    /     _.-'    |     PID: 1
10.11.18 20:19:05 (+1100)  redis    `-._    `-._  `-./  _.-'    _.-'
10.11.18 20:19:05 (+1100)  redis   |`-._`-._    `-.__.-'    _.-'_.-'|
10.11.18 20:19:05 (+1100)  redis   |    `-._`-._        _.-'_.-'    |           http://redis.io
10.11.18 20:19:05 (+1100)  redis    `-._    `-._`-.__.-'_.-'    _.-'
10.11.18 20:19:05 (+1100)  redis   |`-._`-._    `-.__.-'    _.-'_.-'|
10.11.18 20:19:05 (+1100)  redis   |    `-._`-._        _.-'_.-'    |
10.11.18 20:19:05 (+1100)  redis    `-._    `-._`-.__.-'_.-'    _.-'
10.11.18 20:19:05 (+1100)  redis        `-._    `-.__.-'    _.-'
10.11.18 20:19:05 (+1100)  redis            `-._        _.-'  
10.11.18 20:19:05 (+1100)  redis                `-.__.-'
10.11.18 20:19:05 (+1100)  redis
10.11.18 20:19:05 (+1100)  redis  1:M 10 Nov 09:19:05.034 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
10.11.18 20:19:05 (+1100)  redis  1:M 10 Nov 09:19:05.035 # Server started, Redis version 3.2.12
10.11.18 20:19:05 (+1100)  redis  1:M 10 Nov 09:19:05.039 * DB loaded from disk: 0.004 seconds
10.11.18 20:19:05 (+1100)  redis  1:M 10 Nov 09:19:05.039 * The server is now ready to accept connections on port 6379
10.11.18 20:19:06 (+1100)  redis  1:signal-handler (1541841546) Received SIGTERM scheduling shutdown...
10.11.18 20:19:06 (+1100)  redis  1:M 10 Nov 09:19:06.244 # User requested shutdown...
10.11.18 20:19:06 (+1100)  redis  1:M 10 Nov 09:19:06.244 * Saving the final RDB snapshot before exiting.

I don’t understand these loops. The docker-compose configuration for these containers is simple:

redis:
  container_name: redis
  image: arm32v7/redis:3.2-stretch
  volumes:
    - redis-data:/data
  restart: always

postgres:
  container_name: postgres
  image: arm32v7/postgres:9
  volumes:
    - postgres-data:/var/lib/postgresql/data
  restart: always

I was wondering if somehow the volumes were corrupted, misconfigured, or wrong in some way. Restarting/rebooting doesn’t help.
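In case it helps anyone debugging the same thing, here is a sketch of how one might check the volumes from the device's host OS. This assumes SSH access to the host, that the engine binary on balenaOS is `balena-engine` (plain `docker` works the same way on a regular Docker host), and that the supervisor names volumes `<appId>_<volumeName>`; the app ID below is hypothetical.

```shell
# List the named volumes the supervisor created on this device:
balena-engine volume ls

# Find where the postgres volume lives on disk (123456 is a made-up app ID;
# substitute your own):
MOUNT=$(balena-engine volume inspect --format '{{ .Mountpoint }}' 123456_postgres-data)

# A fully initialised postgres data directory contains global/pg_control,
# which is exactly the file the error message says is missing:
ls -l "$MOUNT/global/pg_control"
```

If `pg_control` is absent but the directory is non-empty, the data directory was only partially initialised, which would match the looping behaviour above.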

Are delta downloads enabled? If not, can you enable them and see if that helps the download on those machines with bad connectivity?
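For anyone following along: deltas are controlled by a fleet-wide configuration variable, which can be set from the dashboard (Application → Configuration) or via the CLI. A hedged sketch, assuming the CLI's `env add` command and an application named `myApp`; note the variable carried the `RESIN_` prefix at the time of this thread and `BALENA_` on newer supervisors, so check your supervisor version.

```shell
# Enable binary delta downloads for the whole application
# (variable name depends on supervisor version: RESIN_ vs BALENA_ prefix).
balena env add RESIN_SUPERVISOR_DELTA 1 --application myApp
```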