I tried to reproduce this - turns out the affected files weren't in the recently released data sets, but were other files on the DOJ site (now taken down).

I guess the big takeaway is to scrape everything ASAP when it comes out. I haven't found any meaningful differences yet, but the file hashes of the published data set zip files available today differ from those in the snapshot Archive.org took a few days ago.
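For anyone who wants to run the same kind of comparison, here is a minimal sketch of hashing two local copies of a download and listing the files that differ. The choice of SHA-256, the function names, and the assumption that both copies sit in flat directories with matching file names are all mine for illustration, not a description of how the snapshots were actually compared.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large zip files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(old_dir, new_dir):
    """Names of files present in both directories whose digests differ."""
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    changed = []
    for old_path in sorted(old_dir.iterdir()):
        new_path = new_dir / old_path.name
        if old_path.is_file() and new_path.is_file():
            if sha256_of(old_path) != sha256_of(new_path):
                changed.append(old_path.name)
    return changed
```

A differing hash only says the bytes changed. A zip can be rebuilt with new timestamps and identical contents, so the next step would be to extract both copies and hash the member files individually.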

I did write a small tool that detects, logs, and dumps the text of affected PDFs, since redaction by drawing black boxes and redaction by dark-colored highlights are both programmatically detectable. Pretty trivial to do. Happy Holidays to anyone else who has the day off!
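To give a rough idea of the detection logic (this is a sketch, not the actual tool): assume some PDF library has already given you, per page, the filled rectangles with their fill colors and the words with their bounding boxes. Any word that sits mostly under a dark fill is text that is still in the file beneath the "redaction". The input shapes, the function names, and the two thresholds below are hypothetical choices for illustration.

```python
def is_dark(rgb, threshold=0.2):
    """True if every RGB component (0.0 to 1.0) is at or below the threshold."""
    return rgb is not None and all(c <= threshold for c in rgb)


def overlap_fraction(word_box, rect):
    """Fraction of the word's bounding-box area that lies inside rect."""
    wx0, wy0, wx1, wy1 = word_box
    rx0, ry0, rx1, ry1 = rect
    width = min(wx1, rx1) - max(wx0, rx0)
    height = min(wy1, ry1) - max(wy0, ry0)
    area = (wx1 - wx0) * (wy1 - wy0)
    if width <= 0 or height <= 0 or area <= 0:
        return 0.0
    return (width * height) / area


def covered_words(words, fills, min_overlap=0.5):
    """Return the text of words mostly covered by a dark filled rectangle.

    words: list of ((x0, y0, x1, y1), text) pairs  (assumed input shape)
    fills: list of ((x0, y0, x1, y1), (r, g, b)) pairs  (assumed input shape)
    """
    dark_rects = [rect for rect, rgb in fills if is_dark(rgb)]
    return [
        text
        for box, text in words
        if any(overlap_fraction(box, rect) >= min_overlap for rect in dark_rects)
    ]
```

Dark highlight annotations would need to be collected separately from drawn rectangles, since most libraries expose annotations and page drawing operations through different calls, but once they are reduced to a box plus a color the same overlap check applies.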