Human EEG and artificial neural networks reveal disentangled representations and processing timelines of object real-world size and depth in natural images
Abstract
Human brains can accurately perceive and process the real-world size of objects despite vast differences in viewing distance and perspective. While previous studies have investigated this phenomenon, distinguishing the processing of real-world size from that of other visual properties, such as depth, has been challenging. Here, we combine human EEG recordings from the THINGS EEG2 dataset, which uses more ecologically valid naturalistic stimuli, with representational similarity analysis to disentangle neural representations of object real-world size from retinal size and perceived depth, leveraging recent datasets and modeling approaches to address confounds not fully resolved in previous work. We report a representational timeline of visual object processing: object real-world depth is processed first, followed by retinal size, and finally real-world size. Additionally, we input both these naturalistic images and object-only images without natural background into artificial neural networks. Consistent with the human EEG findings, we successfully disentangled representations of object real-world size from retinal size and real-world depth in all three types of artificial neural networks (visual-only ResNet, visual-language CLIP, and language-only Word2Vec). Moreover, our multi-modal representational comparison framework across human EEG and artificial neural networks reveals real-world size as a stable, higher-level dimension in object space that incorporates both visual and semantic information. Our research provides a temporally resolved characterization of how key object properties (real-world size, depth, and retinal size) are represented in the brain, offering new insights into object space and the construction of more brain-like visual models.
Introduction
Imagine you are viewing an apple tree while walking around an orchard: as you change your perspective and distance, the retinal size of the apple you plan to pick varies, but you still perceive the apple as having a constant real-world size. How do our brains extract object real-world size information during object recognition to allow us to understand the complex world? Behavioral studies have demonstrated that perceived real-world size is represented as a physical property of objects, revealing same-size priming effects (Setti et al., 2009), familiar-size Stroop effects (Konkle and Oliva, 2012a; Long and Konkle, 2017), and canonical visual size effects (Chen et al., 2022; Konkle and Oliva, 2011). Human neuroimaging studies have also found evidence of object real-world size representation (Huang et al., 2022; Khaligh-Razavi et al., 2018; Konkle and Caramazza, 2013; Konkle and Oliva, 2012b; Luo et al., 2023; Quek et al., 2023; Wang et al., 2022a). These findings suggest that real-world size is a fundamental dimension of object representation.
However, previous studies on object real-world size have faced several challenges. Firstly, the perception of an object’s real-world size is closely related to the perception of its real-world distance in depth. For instance, imagine you are looking at photos of an apple and a basketball: if the two photos were zoomed such that the apple and the basketball filled exactly the same retinal (image) size, you could still easily perceive that the apple is the physically smaller real-world object. But you would simultaneously infer that the apple is therefore located closer to you (or the camera) than the basketball. In previous neuroimaging studies of perceived real-world size (Huang et al., 2022; Konkle and Caramazza, 2013; Konkle and Oliva, 2012b), researchers presented images of familiar objects zoomed and cropped such that they occupied the same retinal size, finding that neural responses in ventral temporal cortex reflected the perceived real-world size (e.g. an apple smaller than a car). However, although these studies controlled the retinal size of objects, the intrinsic correlation between real-world size and real-world depth in these images meant that the influence of perceived real-world depth could not be entirely isolated when examining the effects of real-world size. This makes it difficult to ascertain whether the results were driven by neural representations of perceived real-world size and/or perceived real-world depth. MEG and EEG studies focused on the temporal processing of object size representations (Khaligh-Razavi et al., 2018; Wang et al., 2022a) have been similarly susceptible to this limitation. Indeed, one recent study (Quek et al., 2023) provided evidence that perceived real-world depth can influence real-world size representations, further underscoring the need to investigate real-world size representations in the brain in isolation. Secondly, the stimuli used in these studies were cropped objects shown against a plain white or gray background, which are not particularly naturalistic. A growing number of studies and datasets have highlighted the important role of naturalistic context in object recognition (Allen et al., 2022; Gifford et al., 2022; Grootswagers et al., 2022; Hebart et al., 2019; Stoinski et al., 2024). In ecological contexts, inferring the real-world size or distance of an object likely relies on a combination of bottom-up visual information and top-down knowledge about canonical sizes of familiar objects. Incorporating naturalistic background context in experimental stimuli may therefore yield more accurate assessments of the relative influences of visual shape representations (Bracci et al., 2017; Bracci and Op de Beeck, 2016; Proklova et al., 2016) and higher-level semantic information (Doerig et al., 2022; Huth et al., 2012; Wang et al., 2022b). Furthermore, most previous studies have categorized size rather coarsely, for example merely distinguishing big from small objects (Khaligh-Razavi et al., 2018; Konkle and Oliva, 2012b; Wang et al., 2022a) or dividing object size into seven levels from small to big. A more continuous measure of size is needed to characterize the representation of object size in the brain in finer detail.
Admittedly, a few fMRI studies have used natural images and finer-grained size measurements to explore more precisely how object real-world size is encoded in different brain areas (Luo et al., 2023; Troiani et al., 2014). However, no study has yet comprehensively overcome all of these challenges or established a clear processing timeline for object retinal size, real-world size, and real-world depth in human visual perception.
In the current study, we overcome these challenges by combining human EEG recordings, naturalistic stimulus images, artificial neural networks, and computational modeling approaches including representational similarity analysis (RSA) and partial correlation analysis to distinguish the neural representations of object real-world size, retinal size, and real-world depth. We applied our integrated computational approach to an open EEG dataset, THINGS EEG2 (Gifford et al., 2022). Firstly, the visual image stimuli used in this dataset are more naturalistic and include objects that vary in real-world size, depth, and retinal size. This allows us to employ a multi-model representational similarity analysis to investigate relatively unconfounded representations of object real-world size, partialing out – and simultaneously exploring – these confounding features. Secondly, we are able to explore the neural dynamics of object feature processing in a more ecological context based on natural images in human object recognition. Thirdly, instead of categorizing object size into discrete levels, we applied a more continuous measure based on detailed behavioral measurements from an online size rating task, allowing us to more finely decode the representation of object size in the brain.
We first focus on unfolding the neural dynamics of statistically isolated object real-world size representations. The temporal resolution of EEG gives us the opportunity to investigate the representational time course of visual object processing, asking whether processing of perceived object real-world size precedes or follows processing of perceived depth, if these two properties are in fact processed independently.
We then attempt to further explore the underlying mechanisms of how human brains process object size and depth in natural images by integrating artificial neural networks (ANNs). In the domain of cognitive computational neuroscience, ANNs offer a complementary tool to study visual object recognition, and an increasing number of studies suggest that ANNs exhibit representations similar to those of the human visual system (Cichy et al., 2016; Güçlü and van Gerven, 2015; Yamins et al., 2014; Yamins and DiCarlo, 2016). Indeed, a recent study found that ANNs also represent real-world size (Huang et al., 2022); however, the use of a fixed retinal-size image dataset with the same cropped objects described above makes it similarly challenging to ascertain whether those results reflected real-world size and/or depth. Additionally, some recent work indicates that artificial neural networks incorporating semantic embedding and multimodal neural components might more accurately reflect human visual representations within visual areas and even the hippocampus, compared to vision-only networks (Choksi et al., 2022a; Choksi et al., 2022b; Conwell et al., 2022; Doerig et al., 2022; Jozwik et al., 2023; Wang et al., 2022b). Given that perception of real-world size may incorporate both bottom-up visual and top-down semantic knowledge about familiar objects, these models offer a further opportunity to investigate this question. By utilizing both visual and visual-semantic models, as well as different layers within these models, we can extract various image features: low-level visual information from early layers, and higher-level information, including both visual and semantic features, from late layers.
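As a concrete illustration of the language-only feature space mentioned above, a semantic dissimilarity matrix can be built from pretrained word embeddings of object concept names. The sketch below is illustrative only; the pretrained model, the example concepts, and the cosine-distance metric are assumptions rather than the exact procedure used in this study.

```python
# Illustrative sketch only: building a semantic RDM from pretrained Word2Vec
# embeddings of object concept names. The pretrained model, example concepts,
# and cosine-distance metric are assumptions, not the study's exact procedure.
import numpy as np
import gensim.downloader as api
from scipy.spatial.distance import pdist, squareform

w2v = api.load("word2vec-google-news-300")          # pretrained 300-d word vectors

concepts = ["apple", "basketball", "car", "chair"]  # placeholder object concepts
vectors = np.stack([w2v[c] for c in concepts])      # one embedding per concept

# Pairwise cosine dissimilarity between concept embeddings
semantic_rdm = squareform(pdist(vectors, metric="cosine"))
```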
The integrated computational approach of cross-modal representational comparison that we take in the current study allows us to examine how representations of perceived real-world size and depth emerge in both human brains and artificial neural networks. Unraveling the internal representations of object size and depth features in both human brains and ANNs enables us to investigate how distinct spatial properties—retinal size, real-world depth, and real-world size—are encoded across systems, and to uncover the representational mechanisms and temporal dynamics through which real-world size emerges as a potentially higher-level, semantically grounded feature.
Results
We conducted a cross-modal representational similarity analysis (Figures 1 and 2, see Materials and methods section for details) comparing the patterns of human brain activation while participants viewed naturalistic object images (timepoint-by-timepoint decoding of EEG data), the output of different layers of artificial neural networks and semantic language models fed the same stimuli (ANN and Word2Vec models), and hypothetical patterns of representational similarity based on behavioral and mathematical measurements of different visual image properties (perceived real-world object size, displayed retinal object size, and inferred real-world object depth).
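For concreteness, the sketch below shows one way a hypothesis-based RDM could be constructed from per-image property values such as behavioral real-world size ratings. It is a minimal sketch under stated assumptions, not the exact procedure used here; the absolute-difference dissimilarity metric and the placeholder numbers are illustrative.

```python
# A minimal sketch (not the authors' code) of constructing a hypothesis-based
# RDM from per-image property values, e.g. behavioral real-world size ratings.
# The absolute-difference dissimilarity and the numbers below are illustrative.
import numpy as np

def hypothesis_rdm(values):
    """Pairwise dissimilarity: absolute difference between per-image values."""
    values = np.asarray(values, dtype=float)
    return np.abs(values[:, None] - values[None, :])   # (n_images, n_images)

rng = np.random.default_rng(0)
size_ratings = rng.uniform(0, 10, size=200)            # placeholder ratings for 200 images
rdm_real_world_size = hypothesis_rdm(size_ratings)
```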
Dynamic representations of object size and depth in human brains
To explore if and when human brains contain distinct representations of perceived real-world size, retinal size, and real-world depth, we constructed timepoint-by-timepoint EEG neural RDMs (Figure 2A), and compared these to three hypothesis-based RDMs corresponding to different visual image properties (Figure 2B). Firstly, we confirmed that the hypothesis-based RDMs were indeed correlated with each other (Figure 3A). Without accounting for these confounding variables, Spearman correlations between the EEG RDMs and each hypothesis-based RDM revealed overlapping periods of representational similarity (Figure 3B). In particular, representational similarity with real-world size (from 90 to 120ms and from 170 to 240ms) overlapped with the significant time windows of other features, including retinal size from 70 to 210ms, and real-world depth from 60 to 130ms and from 180 to 230ms. But critically, with the partial correlations, we isolated their independent representations. The partial correlation results reveal a relatively unconfounded representation of object real-world size in the human brain from 170 to 240ms after stimulus onset, independent of retinal size and real-world depth, which themselves showed significant representational similarity in different time windows (retinal size from 90 to 200ms, and real-world depth from 60 to 130ms and 270–300ms; Figure 3D).
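The sketch below illustrates, under assumed data shapes and variable names, how such a timepoint-by-timepoint partial Spearman correlation could be computed by rank-transforming the vectorized RDMs and regressing out the covariate RDMs. It is a simplified outline, not the exact analysis code.

```python
# A simplified sketch (assumed shapes and names, not the exact analysis code)
# of a timepoint-by-timepoint partial Spearman correlation between EEG RDMs
# and the real-world size RDM, controlling for retinal size and depth RDMs.
import numpy as np
from scipy.stats import rankdata, pearsonr

def lower_tri(rdm):
    """Vectorize the lower triangle (excluding the diagonal) of an RDM."""
    i, j = np.tril_indices(rdm.shape[0], k=-1)
    return rdm[i, j]

def partial_spearman(x, y, covars):
    """Spearman correlation between x and y after regressing out covariates."""
    ranks = [rankdata(v) for v in (x, y, *covars)]
    design = np.column_stack([np.ones_like(ranks[0]), *ranks[2:]])  # intercept + covariate ranks
    residuals = []
    for r in ranks[:2]:
        beta, *_ = np.linalg.lstsq(design, r, rcond=None)
        residuals.append(r - design @ beta)
    return pearsonr(residuals[0], residuals[1])[0]

def size_timecourse(eeg_rdms, rdm_size, rdm_retinal, rdm_depth):
    """eeg_rdms: (n_timepoints, n_images, n_images); hypothesis RDMs: (n_images, n_images)."""
    covars = [lower_tri(rdm_retinal), lower_tri(rdm_depth)]
    return np.array([partial_spearman(lower_tri(t), lower_tri(rdm_size), covars)
                     for t in eeg_rdms])
```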
Peak latency results showed that neural representations of real-world size, retinal size, and real-world depth reached their peaks at different latencies after stimulus onset (real-world depth: ~87ms, retinal size: ~138ms, real-world size: ~206ms, Figure 3C). The representation of real-world size had a significantly later peak latency than that of both retinal size, t(9)=4.30, p=0.002, and real-world depth, t(9)=18.58, p<0.001. Retinal size representation, in turn, had a significantly later peak latency than real-world depth, t(9)=3.72, p=0.005. These varying peak latencies imply an encoding order for distinct visual features, transitioning from real-world depth through retinal size, and then to real-world size.
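A simplified sketch of this kind of peak-latency analysis is given below, with placeholder data and an assumed time axis; it is illustrative rather than the published analysis code.

```python
# Illustrative sketch of the peak-latency comparison: per-subject peak latencies
# are taken from the similarity time courses and compared with paired t-tests.
# The epoch, sampling step, and placeholder data are assumptions.
import numpy as np
from scipy.stats import ttest_rel

def peak_latencies(timecourses, times):
    """Latency (ms) of the maximum similarity value for each subject.

    timecourses: (n_subjects, n_timepoints); times: (n_timepoints,) in ms."""
    return times[np.argmax(timecourses, axis=1)]

times = np.arange(-200, 801, 10)                # assumed time axis in ms
rng = np.random.default_rng(1)
size_tc = rng.normal(size=(10, len(times)))     # placeholder data, 10 subjects
depth_tc = rng.normal(size=(10, len(times)))

t_stat, p_value = ttest_rel(peak_latencies(size_tc, times),
                            peak_latencies(depth_tc, times))
```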
Artificial neural networks also reflect distinct representations of object size and depth
To test how ANNs process these visual properties, we input the same stimulus images into ANN models and extracted their latent features from early and late layers (Figure 2C), and then conducted comparisons between the ANN RDMs and hypothesis-based RDMs. Parallel to our findings of dissociable representations of real-world size, retinal size, and real-world depth in the human brain signal, we also found dissociable representations of these visual features in ANNs (Figure 3E). Our partial correlation RSA analysis showed that early layers of both ResNet and CLIP had significant real-world depth and retinal size representations, whereas the late layers of both ANNs were dominated by real-world size representations, though there was also weaker retinal size representation in the late layer of ResNet and real-world depth representation in the late layer of CLIP (additional results of the extended analysis of multiple layers in ResNet and CLIP are shown in Figure 3—figure supplement 1 (https://elifesciences.org/articles/98117/figures#fig3s1)). The detailed statistical results are shown in Supplementary file 1, table A (https://elifesciences.org/articles/98117/figures#supp1).
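As an illustration of this feature-extraction step, the sketch below shows one way to obtain early- and late-layer activations from a pretrained ResNet-50 with forward hooks and convert them into RDMs. The layer choices follow the layer names referred to later in the text (ResNet.maxpool and ResNet.avgpool), while the 1 - Spearman-correlation dissimilarity is an assumed, commonly used choice rather than necessarily the exact metric used in this study.

```python
# A minimal sketch (assumptions noted) of extracting early- and late-layer
# activations from a pretrained ResNet-50 with forward hooks and converting
# them into RDMs. The 1 - Spearman-correlation distance is one common choice,
# not necessarily the exact metric used in this study.
import torch
from torchvision import models
from scipy.stats import spearmanr

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

activations = {}
def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().flatten(start_dim=1)  # (n_images, n_features)
    return hook

model.maxpool.register_forward_hook(save_activation("early"))   # ResNet.maxpool
model.avgpool.register_forward_hook(save_activation("late"))    # ResNet.avgpool

def layer_rdms(images):
    """images: preprocessed tensor of shape (n_images, 3, 224, 224)."""
    with torch.no_grad():
        model(images)
    rdms = {}
    for name, feats in activations.items():
        rho, _ = spearmanr(feats.numpy(), axis=1)   # image-by-image rank correlation
        rdms[name] = 1.0 - rho
    return rdms
```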
Thus, ANNs provide another approach to understanding the formation of different visual features, offering results convergent with the EEG representational analysis: retinal size was reflected most strongly in the early layers of ANNs, whereas object real-world size representations did not emerge until the late layers, consistent with a potential role of higher-level visual information, such as the semantic information of object concepts.
Representational similarity between human EEG and artificial neural networks
To directly examine the representational similarity between ANNs and human EEG signals, we compared the timepoint-by-timepoint EEG neural RDMs and the ANN RDMs. This analysis allowed us to assess how different stages of visual processing in the human brain align temporally with hierarchical representations in ANNs. As shown in Figure 3F, the early layer representations of both ResNet and CLIP (ResNet.maxpool layer and CLIP.visual.avgpool) showed significant correlations with early EEG time windows (early layer of ResNet: 40–280ms, early layer of CLIP: 50–130ms and 160–260ms), while the late layers (ResNet.avgpool layer and CLIP.visual.attnpool layer) showed correlations extending into later time windows (late layer of ResNet: 80–300ms, late layer of CLIP: 70–300ms). Although there is substantial temporal overlap between early and late model layers, the overall pattern suggests a rough correspondence between model hierarchy and neural processing stages.
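The sketch below outlines, under assumed data shapes, how such a timepoint-by-timepoint comparison between EEG RDMs and a fixed ANN-layer RDM could be computed; the variable names are hypothetical.

```python
# A sketch (assumed data shapes) of the timepoint-by-timepoint comparison
# between EEG RDMs and a fixed ANN-layer RDM via Spearman correlation.
import numpy as np
from scipy.stats import spearmanr

def lower_tri(rdm):
    i, j = np.tril_indices(rdm.shape[0], k=-1)
    return rdm[i, j]

def eeg_ann_similarity(eeg_rdms, ann_rdm):
    """eeg_rdms: (n_timepoints, n_images, n_images); ann_rdm: (n_images, n_images)."""
    ann_vec = lower_tri(ann_rdm)
    return np.array([spearmanr(lower_tri(t), ann_vec)[0] for t in eeg_rdms])

# e.g. similarity to an early vs. a late ResNet layer (hypothetical RDM variables):
# sim_early = eeg_ann_similarity(eeg_rdms, rdm_resnet_maxpool)
# sim_late  = eeg_ann_similarity(eeg_rdms, rdm_resnet_avgpool)
```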
We further extended this analysis across intermediate layers of both ResNet and CLIP models (from early to late, ResNet: ResNet.maxpool, ResNet.layer1, ResNet.layer2, ResNet.layer3, ResNet.layer4, ResNet.avgpool; from early to late, CLIP: CLIP.visual.avgpool, CLIP.visual.layer1, CLIP.visual.layer2, CLIP.visual.layer3, CLIP.visual.layer4, CLIP.visual.attnpool). The results, shown in Figure 3—figure supplement 2 (https://elifesciences.org/articles/98117/figures#fig3s2), reveal a consistent trend: early layers exhibit higher similarity to early EEG time points, and deeper layers show increased similarity to later EEG stages. This pattern of early-to-late correspondence aligns with previous findings that convolutional neural networks exhibit hierarchical representations similar to those in the visual cortex (Cichy et al., 2016; Güçlü and van Gerven, 2015; Kietzmann et al., 2019; Yamins and DiCarlo, 2016): both the early stage of brain processing and the early layers of the ANN encode lower-level visual information, while the late stage of the brain and the late layers of the ANN encode higher-level visual information. Notably, early brain responses showed stronger similarity to early ResNet layers than to CLIP layers, consistent with prior work suggesting that early visual processing is more closely aligned with purely visual models (Greene and Hansen, 2020). In contrast, at later time windows, brain activity more closely resembled late CLIP layers, possibly reflecting the integration of visual and semantic information. However, it is also possible that these differences between ResNet and CLIP reflect differences in training data scale and domain.
To contextualize how much of the shared variance between EEG and ANN representations is driven by the specific visual object features we tested above, we conducted a partial correlation analysis between EEG RDMs and ANN RDMs controlling for the three hypothesis-based RDMs (Figure 3—figure supplement 3 (https://elifesciences.org/articles/98117/figures#fig3s3)). The EEG × ANN similarity results remained largely unchanged, suggesting that ANN representations capture rich additional representational structure beyond these features. Similarly, a