Abstract

The currently leading artificial neural network (ANN) models of the visual ventral stream — which are derived from a combination of performance optimization and robustification methods — have demonstrated a remarkable degree of behavioral alignment with humans on visual categorization tasks. Extending previous work, we show that not only can these models guide image perturbations that change the induced human category percepts, but they can also enhance human ability to accurately report the original ground truth. Furthermore, we find that the same models can also be used out of the box to predict the proportion of correct human responses to individual images, providing a simple, human-aligned estimator of the relative difficulty of each image. Motivated by these observations, we propose to augment visual learning in humans in a way that improves human categorization accuracy at test time. Our learning augmentation approach consists of (i) selecting images based on their model-estimated recognition difficulty, and (ii) using image perturbations that aid recognition for novice learners. We find that combining these model-based strategies gives rise to test-time categorization accuracy gains of 33-72% relative to control subjects without these interventions, despite using the same number of training feedback trials. Surprisingly, beyond the accuracy gain, the training time for the augmented learning group was also shorter by 20-23%. We demonstrate the efficacy of our approach in a fine-grained categorization task with natural images, as well as tasks in two clinically relevant image domains — histology and dermoscopy — where visual learning is notoriously challenging. To the best of our knowledge, this is the first application of ANNs to increase visual learning performance in humans by enhancing category-specific features.


Figure: ANN models can predict image difficulty for humans and increase recognizability through “enhancement.”
The left-hand side shows the relationship between ANN “confidence” and the probability of a human choosing the correct ground-truth category (“bird,” “lizard,” etc.) after viewing an image for 17 milliseconds. This shows that the model can predict which images are easy for humans to categorize and which are more difficult. The right-hand side shows how human recognition accuracy increases as “enhancement” perturbations from an ANN grow larger in magnitude: the more enhancement, the easier the images become for humans to categorize.

Predicting image difficulty

We showed that robustified convolutional neural networks (CNNs) can make accurate predictions of image difficulty for humans. Specifically, the pre-softmax logit activation score corresponding to the ground truth class for a given image is strongly correlated with the rate of correct recognition by humans who are shown the image for 17 milliseconds (see left-hand side of figure above). This simple approach predicts image difficulty more accurately than previously developed metrics, including the equivalent logit scores from a non-robust model. Accurate image difficulty predictions enable us to select images at appropriate levels of difficulty for novice learners of visual tasks.
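To make this concrete, the sketch below (in PyTorch, with hypothetical names such as difficulty_score) shows how such an estimate can be read off a robustified classifier: the score is derived directly from the pre-softmax logit of the ground-truth class, with lower ground-truth logits indicating harder images. The exact scoring and calibration used in our experiments may differ.

import torch

@torch.no_grad()
def difficulty_score(model, image, label):
    # 'model' is assumed to be a robustified classifier that returns
    # pre-softmax logits; 'image' is a (C, H, W) tensor and 'label' is the
    # integer index of the ground-truth class. Names and sign convention
    # here are illustrative, not an exact specification of the method.
    logits = model(image.unsqueeze(0))      # shape: (1, num_classes)
    gt_logit = logits[0, label].item()
    # A higher ground-truth logit means the model finds the image easier,
    # so we negate it to obtain a difficulty value (larger = harder).
    return -gt_logit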

Enhancing images to reduce difficulty

Previous work has shown that robustified CNNs can be used to generate small-magnitude image perturbations that strongly disrupt image recognition by humans. We apply a similar approach to enhance images by optimizing the pixel values of an image to maximize the model-derived logit score of the ground truth class, the same value that we also use to predict image difficulty. We demonstrated for the first time that models can be used to augment human visual category perception by effectively making images easier to recognize as a particular category (see right-hand side of figure above).
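The enhancement can be thought of as projected gradient ascent on the ground-truth logit, constrained to an ℓ₂ ball around the original image. The following sketch assumes a PyTorch classifier operating on images scaled to [0, 1] and uses illustrative hyperparameters; it conveys the idea rather than reproducing the exact optimizer from the paper.

import torch

def enhance_image(model, image, label, eps=15.0, steps=30, step_size=1.0):
    # Maximize the model's ground-truth logit while keeping the perturbation
    # within an L2 budget 'eps' around the original image. All names and
    # hyperparameters here are illustrative assumptions.
    original = image.detach()
    x = original.clone()
    for _ in range(steps):
        x = x.clone().requires_grad_(True)
        gt_logit = model(x.unsqueeze(0))[0, label]
        grad, = torch.autograd.grad(gt_logit, x)
        # Normalized-gradient ascent step on the ground-truth logit.
        x = x.detach() + step_size * grad / (grad.norm() + 1e-12)
        # Project back onto the L2 ball of radius eps, then keep pixels valid.
        delta = x - original
        if delta.norm() > eps:
            delta = delta * (eps / delta.norm())
        x = (original + delta).clamp(0.0, 1.0)
    return x.detach()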

Interactive visualization:

Below, you can explore how our enhancement approach affects images from different datasets¹. Try adjusting the perturbation budget to see how the degree of enhancement affects the images. You can also see the difficulty score for each image predicted by a robust CNN (∈ [0,1], normalized by class). The difficulty scores correspond to the original, unmodified images, not to the enhanced versions.
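For reference, one straightforward way to map raw logit-based scores to the [0, 1], class-normalized values shown here is to min-max rescale the scores within each class. This is only a plausible reading of "normalized by class"; the exact normalization used for the visualization may differ.

import numpy as np

def normalize_by_class(scores, labels):
    # Min-max rescale difficulty scores to [0, 1] separately for each class.
    # 'scores' and 'labels' are parallel arrays over the image set.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    normalized = np.empty_like(scores)
    for c in np.unique(labels):
        mask = labels == c
        lo, hi = scores[mask].min(), scores[mask].max()
        normalized[mask] = (scores[mask] - lo) / (hi - lo + 1e-12)
    return normalized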

[Interactive demo: adjustable perturbation budget (ℓ₂ norm), shown at ε = 15]

Boosting image category learning in humans

Humans learn most effectively when they are given an appropriate degree of challenge. Some visual tasks, such as interpreting certain kinds of medical images, are simply too difficult for humans to efficiently learn by practicing with typical examples. Using difficulty prediction and image enhancement as tools, we algorithmically designed curricula that start at a very easy level of difficulty and gradually increase the difficulty as learning progresses. Specifically, we select model-identified "easy" images for the beginning of the learning process, and subsequently allow more and more difficult images to be selected. We also enhance images (making them easier to recognize) with relatively large perturbations early on, and then gradually reduce the size of the perturbations. We showed that this strategy, which we call Logit-Weighted Image Selection and Enhancement (L-WISE), enables humans to learn faster and achieve higher scores when tested on held-out examples (unmodified, randomly selected images) across three different image categorization tasks (see figure below). Two of the tasks are relevant to clinical medicine: distinguishing among several different skin lesion types in dermoscopy images, and distinguishing between benign and malignant tissue in colon histology images.
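The schedule can be summarized in a few lines of Python: as training progresses, the pool of admissible images expands from the easiest examples toward the full set, while the enhancement budget shrinks toward zero. The linear ramps, the 20% starting pool, and the helper names below are illustrative assumptions rather than the parameters used in our experiments.

import numpy as np

def lwise_schedule(difficulty_scores, n_trials, eps_max=15.0, rng=None):
    # Returns a list of (image_index, enhancement_budget) pairs forming an
    # "easy to hard" curriculum in the spirit of L-WISE. The specific
    # fractions and linear schedules are illustrative, not the paper's values.
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(difficulty_scores)        # easiest images first
    trials = []
    for t in range(n_trials):
        progress = t / max(n_trials - 1, 1)      # goes from 0 to 1
        # Early trials sample only the easiest 20% of images; later trials
        # may draw from the entire pool.
        pool_size = max(1, int((0.2 + 0.8 * progress) * len(order)))
        idx = int(rng.choice(order[:pool_size]))
        # The enhancement budget decays linearly from eps_max to 0.
        eps = eps_max * (1.0 - progress)
        trials.append((idx, eps))
    return trials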

Figure: We applied difficulty prediction and image enhancement to create an "easy to hard" sequence of images, enabling humans to learn visual tasks more quickly and to higher levels of accuracy.
The left-hand panels illustrate our "L-WISE" learning assistance approach: we limit the inherent difficulty of examples presented early in the curriculum (top-left), and enhance images with larger-magnitude perturbations early in the curriculum before decreasing the perturbation magnitude as learning progresses (bottom-left). The right-hand panel shows how participants assisted by L-WISE completed the curriculum with shorter training times and higher final test accuracy (on unmodified, randomly selected images) across three different visual tasks.

BibTeX (preprint)

@inproceedings{talbot2025wise,
  title={L-WISE: Boosting Human Visual Category Learning Through Model-Based Image Selection And Enhancement},
  author={Talbot, Morgan B and Kreiman, Gabriel and DiCarlo, James J and Gaziv, Guy},
  booktitle={International Conference on Learning Representations},
  year={2025}
}


Footnotes
  1. Dataset sources:
    ImageNet (animals): Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
    iNaturalist (moths): Grant Van Horn and Oisin Mac Aodha. iNat Challenge 2021 - FGVC8. Kaggle, 2021.
    HAM10000 (dermoscopy): Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset, a large collection of multisource dermatoscopic images of common pigmented skin lesions. Scientific data, 5(1):1–9, 2018.
    MHIST (histology): Jerry Wei, Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, Michael Baker, Naofumi Tomita, Lorenzo Torresani, et al. A petri dish for histopathology image analysis. In Artificial Intelligence in Medicine: 19th International Conference on Artificial Intelligence in Medicine, AIME 2021, Virtual Event, June 15–18, 2021, Proceedings, pp. 11–24. Springer, 2021.
Acknowledgements

This work was supported in part by Harvard Medical School under the Dean’s Innovation Award for the Use of Artificial Intelligence, in part by Massachusetts Institute of Technology through the David and Beatrice Yamron Fellowship, in part by the National Institute of General Medical Sciences under Award T32GM144273, in part by the National Institutes of Health under Grant R01EY026025, and in part by the National Science Foundation under Grant CCF-1231216. The content is solely the responsibility of the authors and does not necessarily represent the official views of any of the above organizations. The authors would like to thank Andrei Barbu, Roy Ganz, Katherine Harvey, Michael J. Lee, Richard N. Mitchell, and Luke Rosedahl for sharing their helpful insights into our work at various times.