Almost a decade ago I ran into a couple of problems on a SaaS tool I was supporting where our logs started to change qualitatively, but not in a way that printed more error-level messages. When someone complained about incorrect behavior in our code, we could see clearly in the logs that the wrong code paths were being taken, and when that had started, because people had carefully logged everything that was happening! They just hadn't logged it at the "Error" level, so we never got alerts about it. And it would have been impractical to alert on those messages, because they weren't errors; it was just that, due to other code changes, we were now going down the wrong path for this span.
So for a hackweek I built a tool to tokenize all of our log messages, then grabbed two days of our logs and built a gigantic n-dimensional vector for every five-minute chunk, calculated the Euclidean ("pythagorean") distance between each pair of those chunks, and looked at the ones with the biggest differences, the most outlying five-minute chunks. They were all from 8-8:30 AM CET on both days (our company and most of our customers were US-based; I was just looking at which time zones lined up with the interesting time). I said "okay, this looks interesting, let me see what is happening in the logs then," but it was impossible to figure out what the statistics were seeing. The math thinks in ways that human brains don't: it views the entire dataset simultaneously, and human brains just can't keep five minutes of busy log files in working memory, but humans build narratives, and the math can't understand that. So I ended up getting frustrated and giving up on the project, because explaining things in terms I could understand and start debugging was the whole point of the project!
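Roughly, the core of it looked something like this. This is a sketch of the idea rather than the original code (which I no longer have), using numpy: count each known message type per five-minute window, then rank windows by how far their count vector sits from all the others.

```python
# A sketch of the idea, not the original code: count each known log-message
# type per five-minute window, then rank windows by their total Euclidean
# ("pythagorean") distance from every other window.
import numpy as np

def rank_outlier_windows(counts: np.ndarray) -> np.ndarray:
    """counts has shape (n_windows, n_message_types)."""
    # Pairwise Euclidean distances between window count vectors.
    diffs = counts[:, None, :] - counts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Windows with the largest summed distance to everything else come first.
    return np.argsort(dists.sum(axis=1))[::-1]

# rank_outlier_windows(window_counts)[:5] -> the five weirdest five-minute chunks
```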
Wow, I'm actually really interested in how you did that. Did the differences between chunks correspond to changes in word/token frequencies? Or did you embed the logs into some other space first? This is a pretty nifty idea; the only analyses I've run on production logs were just simple SQL queries.
I wrote a Python script that parsed all of our back-end code, looking for every log message, and used that as the basis for a giant n-dimensional vector representing every possible log message. Then I looked at every five-minute chunk of our logs and incremented the appropriate row in the vector for every message printed. Then I used a clustering algorithm (I don't remember which one, sorry; it was something from PyTorch, I'm pretty sure, and I'm not at that company any more, so I don't have the code) to identify the five vectors that were farthest from every other vector. Originally I computed the n-dimensional Euclidean ("pythag") distance between each pair of five-minute chunks and selected the vectors with the highest combined distance, but the real clustering algorithm was a cleaner choice.
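The vector-building step was shaped roughly like this. It's a sketch with made-up conventions, not the real parsing: I'm assuming log calls look like logger.info("literal message") in the source and that each log line starts with an ISO timestamp followed by the message text.

```python
# A sketch of the vector-building step. Assumptions for illustration: log
# calls look like logger.info("literal message") in the Python source, and
# each log line starts with an ISO timestamp followed by the message text.
import re
from collections import defaultdict
from datetime import datetime
from pathlib import Path

LOG_CALL = re.compile(r'logger\.\w+\(\s*"([^"]+)"')  # assumed logging style

def build_vocabulary(src_root):
    """Map every log-message template found in the source tree to a row index."""
    templates = set()
    for path in Path(src_root).rglob("*.py"):
        templates.update(LOG_CALL.findall(path.read_text(errors="ignore")))
    return {template: i for i, template in enumerate(sorted(templates))}

def count_by_window(log_lines, vocab, minutes=5):
    """Return {window_start: count vector over templates} for the given lines."""
    windows = defaultdict(lambda: [0] * len(vocab))
    for line in log_lines:
        timestamp, _, message = line.partition(" ")
        ts = datetime.fromisoformat(timestamp)
        bucket = ts.replace(minute=ts.minute - ts.minute % minutes,
                            second=0, microsecond=0)
        # Crude matching: attribute the line to the first template it starts with.
        for template, idx in vocab.items():
            if message.startswith(template):
                windows[bucket][idx] += 1
                break
    return windows
```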
I did throw away a couple of message types: basically all the traffic from our uptime and health checks, because it seemed like it would distort the data (if our health-checker went down, that was the health-checker's fault and we ought to get an actual alarm for it; this alert would just be a duplicate). Dunno, that might have been a mistake. It was a hackweek project, and I'm not saying it was perfect. In fact, as I said, it never provided anything useful because it couldn't be explained!
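The filtering itself was nothing fancier than dropping matching lines before counting. Something like this, with the patterns made up for illustration:

```python
# A sketch of the health-check filtering, with made-up patterns: drop
# uptime-probe and health-check lines before they ever get counted.
import re

HEALTH_CHECK = re.compile(r"/healthz|/ping\b|uptime-check")  # assumed patterns

def drop_health_checks(log_lines):
    return (line for line in log_lines if not HEALTH_CHECK.search(line))
```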
In our actual log files on the two days I looked at, all five of the outlier five-minute chunks were from 8-8:30 AM Central European Time (our logs were in UTC; I just looked at different time zones, and that seemed the most likely source of interesting behavior). And then when I looked at those times in the logs, and also when I eyeballed the ~1000-dimensional vectors for them, I couldn't tell what the clustering algorithm was seeing, because it was 'thinking' so differently from how I do. It wasn't like one of the rows in the vector was suddenly 150 and then went to 0 outside that half hour; it was hard to see any patterns at all. So I couldn't set this up to drive alerts or anything, because it would have produced a whole lot of wild goose chases without a lot of further refinement.
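Eyeballing the vectors basically meant looking at which rows differed most from the rest, something along these lines (a sketch, with the vocab mapping from earlier assumed), and even the top contributors didn't add up to a story:

```python
# A sketch of the per-row inspection described above: compare one outlier
# window's counts against the mean of all the other windows and list the
# message types that contribute most to the distance.
import numpy as np

def top_contributors(counts, outlier_idx, vocab, k=10):
    others = np.delete(counts, outlier_idx, axis=0).mean(axis=0)
    delta = counts[outlier_idx] - others
    order = np.argsort(np.abs(delta))[::-1][:k]
    index_to_template = {i: t for t, i in vocab.items()}
    return [(index_to_template[i], float(delta[i])) for i in order]
```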
Someone more expert in ML than I am recommended trying to fine-tune an LLM to predict the next word (or maybe even the next log message, depending on window size and how verbose the log messages are) and potentially alerting when the difference between predicted and actual got too large. Maybe that would work, I dunno. But that's what I would have tried the next hackweek, if I hadn't moved on from that company.
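If anyone wants to try that route, the shape of it would be something like this. This is a sketch of the suggestion, not anything we built; the model name and threshold are placeholders standing in for a model actually fine-tuned on historical logs.

```python
# A sketch of the suggested follow-up, not something we built: score a
# window of recent log text with a causal language model (distilgpt2 here
# as a stand-in for a model fine-tuned on historical logs) and alert when
# the model finds the text too surprising.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "distilgpt2"  # placeholder for a log-fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def window_perplexity(log_text: str) -> float:
    inputs = tokenizer(log_text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return math.exp(out.loss.item())

# if window_perplexity(last_window_of_logs) > THRESHOLD: page someone
```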