Almost a decade ago I ran into a couple of problems on a SaaS tool I was supporting where our logs started to change qualitatively, but not in a way that printed more error-level messages. When someone complained about incorrect behavior in our code, we could see clearly in the logs that the wrong code paths were being taken, and when that had started, because people had carefully logged everything that was happening! They just hadn't logged it at the "Error" level, so we never got alerts about it. And it would have been impractical to alert on those messages, because they weren't errors; it was just that, due to other code changes, we were now going down the wrong path for this span.
So for a hackweek I built a tool to tokenize all of our log messages, then grabbed two days of our logs and built a gigantic n-dimensional vector for every five-minute chunk, calculated the Euclidean ("pythagorean") distance between each pair of those chunks, and looked at the ones with the biggest differences, the most outlying five-minute chunks. They were all from 8-8:30 AM CET on both days (our company and most of our customers were US-based; I was just looking at which time zones lined up with the interesting time). I said "okay, this looks interesting, let me see what is happening in the logs then," but it was impossible to figure out what the statistics were seeing. The math thinks in ways that human brains don't: it views the entire dataset simultaneously, and human brains just can't keep five minutes of busy log files in working memory, but humans build narratives, and the math can't understand that. So I ended up getting frustrated and giving up on the project, because explaining things in terms I could understand and start debugging was the whole point of the project!
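Roughly, the core of it looked something like this. This is a sketch of the idea rather than the original code (which I no longer have), using numpy: count each known message type per five-minute window, then rank windows by how far their count vector sits from all the others.

```python
# A sketch of the idea, not the original code: count each known log-message
# type per five-minute window, then rank windows by their total Euclidean
# ("pythagorean") distance from every other window.
import numpy as np

def rank_outlier_windows(counts: np.ndarray) -> np.ndarray:
    """counts has shape (n_windows, n_message_types)."""
    # Pairwise Euclidean distances between window count vectors.
    diffs = counts[:, None, :] - counts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Windows with the largest summed distance to everything else come first.
    return np.argsort(dists.sum(axis=1))[::-1]

# rank_outlier_windows(window_counts)[:5] -> the five weirdest five-minute chunks
```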
Wow, I'm actually really interested in how you did that. Did the differences between chunks correspond to changes in word/token frequencies? Or did you embed the logs into some other space first? This is a pretty nifty idea; the only analyses I've run on production logs were just simple SQL queries.
I wrote a Python script that parsed all of our back-end code, looking for every log message, and used that as the basis for a giant n-dimensional vector representing every possible log message. Then I looked at every five-minute chunk of our logs and incremented the appropriate row in the vector for every message printed. Then I used a clustering algorithm (I don't remember which one, sorry; it was something from PyTorch, I'm pretty sure, and I'm not at that company any more, so I don't have the code) to identify the five vectors that were farthest from every other vector. Originally I computed the n-dimensional Euclidean ("pythag") distance between each pair of five-minute chunks and selected the vectors with the highest combined distance, but the real clustering algorithm was a cleaner choice.
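The vector-building step was shaped roughly like this. It's a sketch with made-up conventions, not the real parsing: I'm assuming log calls look like logger.info("literal message") in the source and that each log line starts with an ISO timestamp followed by the message text.

```python
# A sketch of the vector-building step. Assumptions for illustration: log
# calls look like logger.info("literal message") in the Python source, and
# each log line starts with an ISO timestamp followed by the message text.
import re
from collections import defaultdict
from datetime import datetime
from pathlib import Path

LOG_CALL = re.compile(r'logger\.\w+\(\s*"([^"]+)"')  # assumed logging style

def build_vocabulary(src_root):
    """Map every log-message template found in the source tree to a row index."""
    templates = set()
    for path in Path(src_root).rglob("*.py"):
        templates.update(LOG_CALL.findall(path.read_text(errors="ignore")))
    return {template: i for i, template in enumerate(sorted(templates))}

def count_by_window(log_lines, vocab, minutes=5):
    """Return {window_start: count vector over templates} for the given lines."""
    windows = defaultdict(lambda: [0] * len(vocab))
    for line in log_lines:
        timestamp, _, message = line.partition(" ")
        ts = datetime.fromisoformat(timestamp)
        bucket = ts.replace(minute=ts.minute - ts.minute % minutes,
                            second=0, microsecond=0)
        # Crude matching: attribute the line to the first template it starts with.
        for template, idx in vocab.items():
            if message.startswith(template):
                windows[bucket][idx] += 1
                break
    return windows
```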
I did throw away a couple of message types: basically all the traffic from our uptime and health checks, because it seemed like it would distort the data (if our health-checker went down, that was the health-checker's fault and we ought to get an actual alarm for it; this alert would just be a duplicate). Dunno, that might have been a mistake. It was a hackweek project, and I'm not saying it was perfect. In fact, as I said, it never provided anything useful because it couldn't be explained!
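The filtering itself was nothing fancier than dropping matching lines before counting. Something like this, with the patterns made up for illustration:

```python
# A sketch of the health-check filtering, with made-up patterns: drop
# uptime-probe and health-check lines before they ever get counted.
import re

HEALTH_CHECK = re.compile(r"/healthz|/ping\b|uptime-check")  # assumed patterns

def drop_health_checks(log_lines):
    return (line for line in log_lines if not HEALTH_CHECK.search(line))
```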
In our actual log files on the two days I looked at, all five of the outlier five-minute chunks were from 8-8:30 AM Central European Time (our logs were in UTC; I just looked at different time zones, and that seemed the most likely source of interesting behavior). And then when I looked at those times in the logs, and also when I eyeballed the ~1000-dimensional vectors for them, I couldn't tell what the clustering algorithm was seeing, because it was 'thinking' so differently from how I do. It wasn't like one of the rows in the vector was suddenly 150 and then went to 0 outside that half hour; it was hard to see any patterns at all. So I couldn't set this up to drive alerts or anything, because it would have produced a whole lot of wild goose chases without a lot of further refinement.
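Eyeballing the vectors basically meant looking at which rows differed most from the rest, something along these lines (a sketch, with the vocab mapping from earlier assumed), and even the top contributors didn't add up to a story:

```python
# A sketch of the per-row inspection described above: compare one outlier
# window's counts against the mean of all the other windows and list the
# message types that contribute most to the distance.
import numpy as np

def top_contributors(counts, outlier_idx, vocab, k=10):
    others = np.delete(counts, outlier_idx, axis=0).mean(axis=0)
    delta = counts[outlier_idx] - others
    order = np.argsort(np.abs(delta))[::-1][:k]
    index_to_template = {i: t for t, i in vocab.items()}
    return [(index_to_template[i], float(delta[i])) for i in order]
```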
Someone more expert in ML than I am recommended trying to fine-tune an LLM to predict the next word (or maybe even the next log message, depending on window size and how verbose the log messages are) and potentially alerting when the difference between predicted and actual got too large. Maybe that would work, I dunno. But that's what I would have tried the next hackweek, if I hadn't moved on from that company.
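If anyone wants to try that route, the shape of it would be something like this. This is a sketch of the suggestion, not anything we built; the model name and threshold are placeholders standing in for a model actually fine-tuned on historical logs.

```python
# A sketch of the suggested follow-up, not something we built: score a
# window of recent log text with a causal language model (distilgpt2 here
# as a stand-in for a model fine-tuned on historical logs) and alert when
# the model finds the text too surprising.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "distilgpt2"  # placeholder for a log-fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def window_perplexity(log_text: str) -> float:
    inputs = tokenizer(log_text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return math.exp(out.loss.item())

# if window_perplexity(last_window_of_logs) > THRESHOLD: page someone
```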