The problem is that a sufficient black box description of a system is way more e...

The problem is that a sufficient black box description of a system is way more elaborate then the white box description of the system or even a rigorous description of all acceptable white boxes (a proof). Unit tests contain enough information to distinguish an almost correct system from a more correct one, but there is way more information needed to even arrive at the almost correct system. Also even the knowledge which traits likely separate an almost correct one from the correct one likely requires a lot of white box knowledge.

Unit tests are the correct tool, because going from an almost correct one to a correct one is hard, because it implies the failure rate to be zero and the lower you go the harder it is to reduce the failure rate any further. But when your constraint is not infinitesimal small failure rate, but reaching expressiveness fast, then a naive implementation or a mathematical model are a much denser representation of the information, and thus easier to generate. In practical terms, it is much easier to encode the slightly incorrect preconception you have in your mind, then try to enumerate all the cases in which a statistically generated system might deviate from the preconception you already had in your head.