I feel like a neural network is probably overkill for this task and a suboptimal substitute for a theoretical understanding of optical illusions, but I can't argue with results.
Most of them are not “illusions” in the sense of perceiving two identical segments as different lengths because of tricks of human perception; they are ambigrams. They rely on humans’ ability to think of any three dots as two eyes and a mouth.
They also “copy,” as these networks do so often that they draw copyright strikes; they were either prompted with existing solutions or learned them wholesale during training:
* The penguin and giraffe one is a previously known ambigram, for example.
* The old lady turning into a dress is obviously based on the classic pencil drawing in which a similar old lady, hiding in her collar, turns into a young lady looking over her shoulder [0]. The network, however, reinterpreted the “young lady” as a white dress, because color-matching the two different body parts from the pencil outline and rendering them photorealistically would have been much harder otherwise. There are photorealistic interpretations, though [1].
I’m more impressed by the radically new ones, like the fire flipping into a face—but most of those rely on having two distinct parts of the image be meaningful in their own context, and not relevant otherwise.
The black-and-white inversion man/woman is impressive because the two interpretations are not on separate parts of the image. That’s where you can interpret the quality of the effect as the model having learned how humans perceive and pay attention to dark and light contrasts differently. That one captures an understanding of perception.
I can do you even better. I've made an entire game [1] based on multistable perception [2]. Sugihara outdid me by finding optical illusions with triple interpretations [3]. A solid half of M. C. Escher's work was a study of tilings in which the negative space and the object space are interchangeable [4].
These things aren't a mystery. There are principles you can work from to produce such multistable illusions in formulaic, computer-generated ways without resorting to the technical debt of a neural net. But, as with so much in modern times, training a neural net gets results faster than distilling a true understanding and then translating that understanding into code.
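
To make "formulaic" concrete, here is a minimal sketch of the simplest multistable figure I know of, the Necker cube. It is only an illustration, not the code behind the game or Sugihara's method. The names `place`, `cube_edges`, `drawing` and the `SHEAR` constant are made up for this example. The idea is that flipping which face counts as nearer changes where each 3D corner lands, yet leaves the set of drawn lines identical, so the picture cannot tell the viewer which reading is right.

```python
from itertools import product

SHEAR = 0.5  # assumed oblique-projection factor; any nonzero value works


def place(corner, depth_sign):
    """Project a unit-cube corner to 2D under one depth reading (+1 or -1)."""
    x, y, z = corner
    depth = depth_sign * (z - 0.5)  # flip which face counts as nearer
    return (x + SHEAR * depth, y + SHEAR * depth)


def cube_edges():
    """The 12 cube edges: corner pairs differing in exactly one coordinate."""
    corners = list(product((0, 1), repeat=3))
    return [
        (a, b)
        for i, a in enumerate(corners)
        for b in corners[i + 1:]
        if sum(p != q for p, q in zip(a, b)) == 1
    ]


def drawing(depth_sign):
    """The set of 2D line segments a viewer sees for one depth reading."""
    return {
        frozenset((place(a, depth_sign), place(b, depth_sign)))
        for a, b in cube_edges()
    }


if __name__ == "__main__":
    same_picture = drawing(+1) == drawing(-1)
    corner_moves = place((0, 0, 0), +1) != place((0, 0, 0), -1)
    print(same_picture, corner_moves)
```

Both checks come out true: the two depth readings disagree about where corner (0, 0, 0) sits, but they produce exactly the same twelve line segments. That ambiguity is a property of the projection itself, which is why it can be generated by construction rather than learned.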