The Bouba Kiki Effect and Stable Diffusion
The bouba/kiki effect is an old one. First described in 1924, it refers to the fact that most people tend to associate the made-up word “Bouba” with a round shape, while “Kiki” is associated with a sharp, spiky one.
The effect has been broadly replicated across many (but not all!) cultures. While this is interesting from a linguistic and neurological standpoint, this is (mostly) a computer science blog. So, what does it mean for us computer scientists?
In machine learning terms, the bouba/kiki effect is a generalization effect across domains: for some reason we generalize an imagined property of a specific word or sound to a shape that might be described by that sound. And not only that, we do it systematically.
If we train a machine learning model and then present it with data it has never seen before, its prediction could go either way. More formally, for a datapoint that is out of distribution, the prediction of a trained model should depend strongly on initialization, while the prediction for a datapoint that is in distribution should (optimally) depend on the data used for training.
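To make this concrete, here is a minimal sketch (not from the original post) of that intuition: train the same small classifier with several random seeds and compare how much its predictions disagree on in-distribution versus out-of-distribution points. The toy data and the scikit-learn model are purely illustrative choices.

```python
# Sketch: prediction variance across random seeds, in- vs. out-of-distribution.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# In-distribution training data: two Gaussian blobs around (-2, 0) and (2, 0).
X_train = np.vstack([rng.normal([-2, 0], 0.5, (200, 2)),
                     rng.normal([2, 0], 0.5, (200, 2))])
y_train = np.array([0] * 200 + [1] * 200)

X_in = rng.normal([2, 0], 0.5, (50, 2))    # in-distribution test points
X_out = rng.normal([0, 8], 0.5, (50, 2))   # far away from any training blob

# Train the same architecture with different initializations (seeds).
preds_in, preds_out = [], []
for seed in range(10):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=seed)
    clf.fit(X_train, y_train)
    preds_in.append(clf.predict_proba(X_in)[:, 1])
    preds_out.append(clf.predict_proba(X_out)[:, 1])

# Variance across seeds: typically near zero for in-distribution points,
# noticeably larger out of distribution -- the prediction "could go either way".
print("mean seed-variance, in-distribution: ", np.var(preds_in, axis=0).mean())
print("mean seed-variance, out-of-distribution:", np.var(preds_out, axis=0).mean())
```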
This means that some information from the vowels and sounds “bleeds” over into our categorization. Bouba is not completely out of distribution.
But what does this mean for modern deep learning models? Current state-of-the-art models are already able to hold conversations with you. How much bleed-over exists between the words used and the concepts they encode? This is a complex and hotly debated topic.
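One hedged way to probe such bleed-over, at least at the level of single words, is to look at a text encoder like CLIP and ask whether the embedding of the made-up word “kiki” sits closer to descriptions of spiky shapes than to round ones. The checkpoint name and the exact phrasings below are my own assumptions for illustration, not a rigorous experiment.

```python
# Sketch: does CLIP's text encoder place "kiki" nearer to spiky descriptions?
import torch
from transformers import CLIPModel, CLIPTokenizer

model_name = "openai/clip-vit-base-patch32"  # illustrative checkpoint choice
model = CLIPModel.from_pretrained(model_name)
tokenizer = CLIPTokenizer.from_pretrained(model_name)

texts = ["a bouba", "a kiki",
         "a soft, round, blobby shape", "a sharp, spiky, jagged shape"]
inputs = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize for cosine similarity

sim = emb @ emb.T
print("bouba vs round:", sim[0, 2].item(), " bouba vs spiky:", sim[0, 3].item())
print("kiki  vs round:", sim[1, 2].item(), " kiki  vs spiky:", sim[1, 3].item())
```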
For example, diffusion models are regularly criticised for reproducing stereotypes in their generations. If you ask Stable Diffusion for an image of a pilot or a scientist, it tends to produce male-bodied people. Back when this technology was first introduced, the developers of Dall-E began appending additional words to user prompts to make the generated images more diverse.
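As a rough sketch of that prompt-augmentation idea (the descriptor list and the Stable Diffusion checkpoint below are assumptions for illustration, not the actual Dall-E implementation), one could randomly append a demographic descriptor before handing the prompt to a diffusers pipeline:

```python
# Sketch: naive prompt augmentation to diversify generations.
import random
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def augment_prompt(prompt: str) -> str:
    """Randomly append a descriptor to counteract the model's default bias."""
    descriptors = ["a woman", "a man", "a non-binary person",
                   "a Black person", "an East Asian person", "an elderly person"]
    return f"{prompt}, {random.choice(descriptors)}"

user_prompt = "a portrait photo of a pilot"
image = pipe(augment_prompt(user_prompt)).images[0]
image.save("pilot.png")
```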
But in the abstract, this is simply a kind of bouba/kiki effect: the model believes that a “pilot” tends to look a certain way. It is an error of generalization.
From here we are faced with a problem: how do we find and then disentangle such effects, and in which cases should we do so? If you ask an image generation model for a picture of a Kiki, what do you want it to produce?