Logs taking longer to load

Hey,

I just wanted to report a small issue that has been cropping up for me recently.

When I go to troubleshoot a location I normally will open up a bunch of devices from that location in a bunch of different tabs. I do this so I can audit the logs and run terminal commands quickly on batches of devices.

However, on about every 1 in 2 devices I am getting “No logs yet” in the logs pannel.

This message will often resolve within 5-10 minutes, and sometimes faster, but it still slows me down and even worse is highly inconsistent between refreshes and different devices.

I also suspected that the log feed was also failing on me for a certain amount of time yesterday.

Thanks and I hope the summit is going well!

-Thomas

Hello there,

Thanks for your report. We’ve actually seen this internally as well, and we’re currently working on investigating and resolving this. We’ll let you know as soon as we have more information for you!

Best regards,

Heds

Thanks @hedss

It never seems to cripple me, they always catch up eventually. :slight_smile:

Best of luck to you!
-Thomas

Hi Thomas,

This issue has been resolved, and you should now see log speeds greatly improved. Please let us know if this is not the case!

Best regards,

Heds

@hedss

Log speeds are beyond 10x improved. A report I whipped the other day up to read all my devices logs and report a basic status used to take around 4-5 hours to run. It just ran in just over 2 minutes. Fantastic!

Great job as always!
-Thomas

Hey @hedss

The issues is prevalent again.

Is it possible to get this added to the status page?

-Thomas

Hi, can you share with us support access to a device with slow logs? We’d like to investigate a bit more if possible, as we didn’t find anything suspicious on our end

Hey @thundron:

Finding one device to share with you is hard because the problem normally resolves itself and then doesn’t come back (maybe I am warming up the cache?)

I haven’t had time today to confirm that this list is constant between runs, but I added code to time and record the time it takes to pull logs.

But here are the access times per device:
https://pastebin.com/gYfaShYC

There is a long tail of good access times, but many that took much longer than the rest.

The code snippet used to pull logs:

def get_logs(device, trys=3):
    logs = None
    if trys <= 0:
        return  logs
    try:
        logs = balena.logs.history(device['uuid'], 200)
    except requests.exceptions.ReadTimeout:
        sleep_time = math.pow(2, 3-trys + 2)
        time.sleep(sleep_time)
        logs = get_logs(device, trys - 1)

    return logs

As you can see there is a retry, system built-in, of 4, 8, and 16 seconds.

The use case for this is reading all the logs every day and extracting metadata. It isn’t a bit deal but slows things down a few hours.

-Thomas

Hey @tacLog,

Thanks for all the info, we will investigate this further and let you know what we find. In the meantime, can you provide more details about the devices that have slow logs? Are they newly provisioned devices? Is it always the same devices that are slow, or it varies randomly? Any additional info would help a lot.

Thanks!

@sradevski

I wish I could provide more details.

Today when the report ran the average time it took to pull logs as was just alot lower today:
Here is the updated list

I think we might just be measuring load at any given time because I can’t find a pattern.

I will update, tomorrow, when I have time to do a more in-depth analysis.

Hey, just to check the pastebin logs are numbered by latency right, not by order of querying? Just want to make sure it’s not a pattern where it gets slower after you’ve run a load of queries. Apart from that we’re planning to roll out some changes that should improve load generally but I also want to track down if there’s any remaining performance issues in the implementation itself, so knowing if there’s a way to trigger the slowness would be huge for me

Hey @_Page

Yes I sorted them. Today, I will output two lists, one in the query order one sorted.

I will look for something reproducible as I collect more information.

-Thomas

Hey @_Page

I wish I followed up on this sooner.

No matter what I do, I can no longer reproduce this issue. All logs now load in sub-second times.

If you rolled something out, great job! If you didn’t, I am sorry, this will be a hard one to trace down.

Ether way,
Nice job!

-Thomas

Hey Thomas, it’s good to hear that this is not an issue anymore. We scaled up our redis instances, which is probably why you cannot reproduce it anymore. Do let us know if the issue appears again and you can reliably reproduce it.

Have a nice day!