Full disclosure, I work at Foxglove right now. Before joining, I spent over seve...

Lazaruscv · 2025-10-03T18:27:46 1759516066

This is super insightful, thank you for laying it out so clearly. Your point about the error surfacing way after it first occurred is exactly the sort of issue we’re interested in tackling. Foxglove is doing a great job with visualization and aggregation; what we’re thinking is more of a complementary diagnostic layer that:

• Correlates syslogs with mcap/bag file anomalies automatically

• Flags when a hardware failure might have begun (not just when it manifests)

• Surfaces probable root causes instead of leaving teams to manually chase timestamps

From your experience across 50+ clients, which do you think is the bigger timesink: data triage across multiple logs/files or interpreting what the signals actually mean once you’ve found them?

msadowski · 2025-10-03T18:52:12 1759517532

In my case, it’s definitely the data triage. Once I see the signal, I usually have ideas on what’s happening but I’ve been doing this for 11 years.

Maybe there could be value in signal interpretation for purely software engineers but I reckon it would be hard for such team to build robots.

Lazaruscv · 2025-10-03T19:05:14 1759518314

Our current thinking is to focus heavily on automating triage across syslogs and bag/mcap files, since that’s where the hours really get burned, even for experienced folks. For interpretation, we see it more as an assistive layer (e.g., surfacing “likely causes” or linking to past incidents), rather than trying to replace domain expertise.

Do you think there are specific triage workflows where even a small automation (say, correlating error timestamps across syslog and bag files) would save meaningful time?

msadowski · 2025-10-04T05:34:18 1759556058

One thing that comes to mind is checking the timestamps across sensors and other topics. Two cases come to mind:

* I was setting up Ouster lidar to use gos time, don’t remember the details now but it was reporting the time ~32 seconds in the past (probably some leap seconds setting?)

* I had a ROS node misbehaving in some weird ways - it turned out there was a service call to insert something into db and for some reason the db started taking 5+ minutes to complete which wasn’t really appropriate for a blocking call

I think the timing is one thing that needs to be consistently done right on every platform. The other issues I came across were very application specific.