
IEEE TRANSACTIONS ON CYBERNETICS, VOL. 45, NO. 8, AUGUST 2015

Learning Computational Models of Video Memorability from fMRI Brain Imaging

Junwei Han, Changyuan Chen, Ling Shao, Senior Member, IEEE, Xintao Hu, Jungong Han, and Tianming Liu

Abstract—Generally, various visual media are unequally memorable by the human brain. This paper looks into a new direction of modeling the memorability of video clips and automatically predicting how memorable they are by learning from brain functional magnetic resonance imaging (fMRI). We propose a novel computational framework that integrates the power of low-level audiovisual features and brain activity decoding via fMRI. Initially, a user study experiment is performed to create a ground truth database for measuring video memorability, and a set of effective low-level audiovisual features is examined on this database. Then, human subjects' brain fMRI data are obtained while they are watching the video clips. The fMRI-derived features that convey the brain activity of memorizing videos are extracted using a universal brain reference system. Finally, because fMRI scanning is expensive and time-consuming, a computational model is learned on our benchmark dataset with the objective of maximizing the correlation between the low-level audiovisual features and the fMRI-derived features using joint subspace learning. The learned model can then automatically predict the memorability of videos without fMRI scans. Evaluations on publicly available image and video databases demonstrate the effectiveness of the proposed framework.

Index Terms—Audiovisual features, brain imaging, semantic gap, video memorability (VM).

I. INTRODUCTION

The fast-growing digital video media make our life more colorful. We may enjoy dozens or even hundreds of video clips every day while watching TV, browsing the Internet, reading multimedia magazines, watching films, etc. However, the memorability of various videos is generally not equal. We can remember some videos for a long time while forgetting others quite soon.

Manuscript received August 19, 2013; revised April 4, 2014 and August 12, 2014; accepted September 1, 2014. Date of publication October 9, 2014; date of current version July 15, 2015. This work was supported by the National Science Foundation of China under Grant 61103061, Grant 91120005, and Grant 61473231. This paper was recommended by Associate Editor M. Cetin. (Corresponding author: JG. Han.) J. Han, C. Chen, and X. Hu are with the School of Automation, Northwestern Polytechnical University, Xi’an 710072, China (e-mail: [email protected]; [email protected]; [email protected]). JG. Han is with Civolution Technology, Eindhoven 5656AE, The Netherlands (e-mail: [email protected]). L. Shao is with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle NE1 8ST, U.K. (e-mail: [email protected]). T. Liu is with Cortical Architecture Imaging and Discovery Laboratory, Department of Computer Science, University of Georgia, Athens, GA 30602 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2014.2358647

Two interesting questions naturally arise from this phenomenon: "What makes a video memorable?" and "Can we build computational models to predict video memorability (VM)?" The study of these questions will benefit a wide range of people, including not only ordinary users but also professionals such as artists, photographers, advertisers, and makers of video sharing websites, games, movies, and multimedia magazines. Although understanding visual memorability is particularly noteworthy, the majority of current studies approach it from the psychological or cognitive science point of view, and as far as we know, very little work has been done from the perspectives of computational vision and brain imaging.

From the psychological perspective, several works [1], [2] conducted large-scale experiments to demonstrate that human subjects are able to store visual stimuli in long-term memory. For instance, Konkle et al. [1] used scene images as the visual stimuli in user study experiments. They discovered that observers can retain specific scenes, even with relatively detailed representations, for a long time. Another work [2] used a movie episode as the stimulus to test memory performance and revealed that viewers can still recall much of the movie's content after a few months.

In contrast, the examination of visual memorability from the perspective of computational science is still in its infancy. Only a few recent papers [5]–[7], [25], [35], [36], [39], [40] have explored image memorability. Specifically, pioneering work was recently presented by Isola et al. [5], [6] and Khosla et al. [40] with the primary purpose of building computational models to predict image memorability from low-level visual features. First, a memory game experiment was conducted to build the ground truth for measuring image memorability: viewers were asked to watch photographs and detect repeated presentations, and image memorability was defined as the rate of correct detections of repeats by subjects. Afterwards, by treating image memorability as one type of intrinsic image property, such as aesthetics [8], [11], photographic quality [10], saliency [13], and interestingness [8], prediction models were trained to map visual features to memorability scores. Additionally, a recent paper [7] adopted the same ground truth data as [5] and [6] to build models that predict the memorability of image regions. Khosla et al. [39] built a model to explore various visual features for predicting face memorability.


Fig. 1. Number of memorable and forgettable video examples.

Fig. 2. Two examples of functional connectivity matrices over 358 brain regions of interest (ROIs) from one subject for two video clips. Video clip of (a) a memorable commercial and (b) a forgettable weather report.

Almost all existing studies on visual memorability have been devoted to modeling image memorability, whereas to the best of our knowledge, there is neither published work that comprehensively explores VM nor a video benchmark established for the particular purpose of predicting VM. Videos contain a synthesis of visual, aural, textual, and motion information and generally convey richer media content than images. Static visual features [5]–[7], [39], [40] alone, such as color and object statistics, are typically inadequate to describe a rich video clip. In contrast to image memorability, more factors may affect the memorability of a video, for example, motion and audio features.

Fig. 1 illustrates a number of examples that explain the VM problem (each row represents a video clip example). The top two examples, a commercial clip and a slam dunk clip, are highly memorable. The reasons why they are remembered might be that they contain recognizable superstars (for instance, "Kobe Bryant"), striking actions (for instance, a slam dunk), interesting new products (for instance, a car), or exciting soundtracks. On the contrary, the bottom two examples in Fig. 1 may not be that memorable because they mainly consist of uninteresting objects, peaceful scenes, and drowsy soundtracks. Additionally, VM is also influenced by personal experience. For example, a video of an event involving family members might be highly memorable, whereas a similar event video involving unknown persons might be easily forgotten.

A few recent works [5]–[7], [25], [35], [36], [39], [40] have shown that image memorability is an intrinsic image property consistent across different subjects and that it can be quantitatively measured. These works built computational models to predict the memorability of images from low-level visual features automatically generated by computer algorithms and high-level image semantic attributes manually generated by human experts. They motivate us to develop a computational model to predict the memorability of video clips based on features automatically extracted by computer algorithms. The prediction of VM may benefit a wide array of applications, including video search (e.g., ranking more memorable videos earlier in retrieved results), video summarization (e.g., extracting more memorable video segments as the highlight), and video advertisement or movie trailer making. As shown in [5]–[7], a prediction model fully depending on low-level visual features may achieve unsatisfactory results.

The inherent gap between the limited descriptive ability of low-level features and the richness of high-level semantics perceived by humans is the key bottleneck in inferring visual memorability. Essentially, humans are the ultimate end users and evaluators of video content. Quantitative modeling of the interactions between video streams and the brain's responses can provide powerful guidance for video semantic cognition. A few initial studies [18]–[20], [32] have been performed in the literature to decode brain activity under natural stimuli for vision tasks such as image retrieval and classification. Recently, an influential paper [22] adopted functional magnetic resonance imaging (fMRI) experiments to show that there are high temporal correlations between relevant fMRI signals and semantic content in a movie stream. Another work [23] demonstrated that the level of control over viewers' brain activity varied with changes in a movie's content, editing, and directing style. Furthermore, Hasson et al. [21] conducted a subsequent-memory experiment for watching movies and successfully identified brain regions whose responses showed higher intersubject correlations for remembered movie segments than for unremembered ones. In summary, Hasson et al. [21]–[23] provided direct and strong evidence that the analysis of fMRI data enables us to more reliably and quantitatively account for the mechanisms of understanding and memorizing videos in human brains, which is the underlying inspiration and premise of this paper.

Fig. 2 shows two exemplar functional connectivity matrices corresponding to two different types of video shots (a memorable commercial video versus a forgettable weather report video). It can be seen that their connectivity patterns are quite different. More specifically, the functional interactions are globally much stronger and involve more brain regions when watching the commercial video clip than when watching the weather report video clip. This observation offers the clue that we may adopt brain activity reflected by fMRI data to quantitatively predict the memorability of a video.

In this paper, we present a novel methodology for modeling VM by combining the power of low-level audiovisual features with fMRI-derived features that reveal brain activity. As depicted in Fig. 3, a user study experiment is first constructed on the publicly available Text REtrieval Conference Video Retrieval Evaluation (TRECVID) videos to create a benchmark database for measuring VM. A number of low-level audiovisual features are selected and evaluated on the benchmark database.


Fig. 3. Architecture of the proposed computational model of VM.

Afterwards, an experiment is designed to acquire fMRI data while human subjects are watching the video clips. Based on a publicly available brain reference and localization system called dense individualized and common connectivity-based cortical landmarks (DICCCOL) [24], a set of brain ROIs involved in the memorization of video clips is identified and localized. The functional connectivities between each pair of identified ROIs are utilized as fMRI-derived attributes that are highly informative for VM. Finally, given the fact that fMRI scanning is expensive and time-consuming, a computational model (also called the prediction model) is learned on our VM benchmark dataset with the objective of maximizing the correlation between the low-level audiovisual features and the fMRI-derived features using canonical correlation analysis (CCA). The learned computational model is then able to predict the memorability of any test video without fMRI scans.

The contributions of the proposed work are fourfold.
1) A ground truth dataset based on user study experiments is developed for investigating VM. To the best of our knowledge, it is among the earliest databases in the area of VM computation.
2) fMRI-derived high-level features that convey the brain activity of memorizing videos are extracted in a natural-stimulus fMRI experiment.
3) A simple yet effective computational model is proposed to predict VM by combining the power of audiovisual features and fMRI-derived features.
4) A number of automatically extracted audiovisual features are examined for predicting VM, a few of which, such as saliency dispersion (SD), object occurrence (OO), and object attribute (OA), are introduced for the first time for the VM task.
Promising results have been achieved in our experiments.

II. DATABASE OF VM

Inspired by [6], we conducted a memory game experiment to build a ground truth database for quantitatively measuring VM. A number of subjects were invited to view video clips and detect repeated presentations.

VM was quantified as the percentage of correct detections of repeats by the subjects.

A. Ground Truth Construction

Because TRECVID is a common, well-known, and publicly available video dataset, we selected 2418 video clips from TRECVID 2005 to construct our video dataset. The length of each video clip is in the range of 15 to 30 s. These video clips cover a variety of semantic categories such as basketball games, soccer games, tennis games, horse racing, weather reports, and commercials on cars, food, and health care. From this dataset, 222 video clips were randomly selected as "targets" and the remaining 2196 videos were used as "fillers."

The user study consisted of two stages: 1) a free viewing stage and 2) a memory testing stage. Forty-four video sequences, called free viewing sequences, were composed using video clips from our dataset. Each free viewing sequence consisted of 30 randomly ordered video clips with a 2-s gap between clips. Each of the first 42 free viewing sequences included five target video clips and 25 filler video clips; the last two free viewing sequences included six target video clips and 24 filler video clips. Each of the 222 target video clips appeared in the 44 free viewing sequences exactly once. The filler video clips in each free viewing sequence were randomly selected in sequence from the 2196 filler video clips. The role of the fillers was to ensure that viewers were unaware of the targets. Each free viewing sequence corresponds to a memory testing sequence. Each memory testing sequence also consisted of 30 randomly ordered video clips; it was composed of the same set of target video clips as its corresponding free viewing sequence and a different set of filler video clips.

In total, 20 healthy adult volunteers were invited to participate in our user study experiment. Volunteers performed the experiment independently, and none of them was previously familiar with the video clips. In the free viewing stage, the free viewing sequences were shown to every subject, with a break of 5 min between sequences. The memory testing stage was performed two days later.


Fig. 4. (a) Key frames of a number of video samples with high, moderate, and low memorability scores, respectively. (b) Distribution of memorability scores.

In this stage, the memory testing video sequences were shown to each subject, again with a break of 5 min between sequences. While watching the memory testing sequences, the subject was asked to press the space bar whenever he/she saw an identical repeat of a video clip (indicating that he/she had watched that clip in the free viewing stage). Considering that a subject may lose concentration after watching videos for too long, we conducted the user study in five separate sessions, each using one fifth of the 44 free viewing sequences and the corresponding memory testing sequences. The time interval between these five sessions was one week. After collecting the data from all sessions, a "memorability score" was assigned to each target video clip. Following the methods in [5] and [6], it is defined as the hit rate, i.e., the percentage of participants who correctly detected the repeat. Fig. 4(a) shows key frames of video samples (each row represents a video clip sample) with high (top two rows), moderate (middle two rows), and low (bottom two rows) memorability scores, respectively. Fig. 4(b) shows the distribution of the memorability scores of the 222 target video clips. The average memorability score was 61.22%. From the examples shown in Fig. 4, it is reasonable to conclude that the two commercial video examples are more memorable than the sports and weather report examples.

B. Consistency Analysis

Following [6], we performed two tests to evaluate human consistency in the memory game experiment. In the first test, we randomly split our 20 participants into two independent halves (group 1 and group 2) and calculated how well the VM scores from the first half of the participants match the VM scores from the second half. Averaging over 20 random split-half trials, we obtained a Spearman's rank correlation of 0.70 between these two sets of scores. In the second test, all participants were again randomly split into two nonoverlapping halves, group 1 and group 2. Consistency was evaluated using a variant of a precision-recall task: videos were sorted in decreasing order of the scores given by group 1, and the cumulative average memorability according to group 2 was calculated as we moved across this ordering. This test was repeated 20 times and the mean result is plotted in Fig. 5.
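The split-half consistency check described above is straightforward to reproduce. The following Python sketch is illustrative only: the hit matrix, group assignment, and number of trials are placeholder assumptions standing in for our actual experimental records, not the exact analysis code.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def split_half_consistency(hits, n_trials=20):
    """Average split-half Spearman correlation of memorability scores.

    `hits` is assumed to be an (n_participants, n_targets) 0/1 matrix, where
    hits[p, v] = 1 if participant p correctly detected the repeat of target v;
    a video's memorability score is its hit rate over a group of participants.
    """
    n_participants = hits.shape[0]
    rhos = []
    for _ in range(n_trials):
        perm = rng.permutation(n_participants)
        group1, group2 = perm[: n_participants // 2], perm[n_participants // 2:]
        scores1 = hits[group1].mean(axis=0)   # VM scores from group 1
        scores2 = hits[group2].mean(axis=0)   # VM scores from group 2
        rho, _ = spearmanr(scores1, scores2)
        rhos.append(rho)
    return float(np.mean(rhos))
```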


Fig. 5. Measure of human consistency. Left: videos were sorted by memorability scores from participants in one group and plotted against the cumulative average memorability scores according to participants in the other group. Right: Spearman’s rank correlation between two subject groups as a function of the mean number of scores per image.

As can be seen from the results of the above two tests, our constructed VM database reflects high human consistency, and it is feasible to examine VM on it. In addition, our tests also demonstrate that the videos remembered by one person are more likely to be remembered by others.

III. AUDIOVISUAL FEATURE COMPUTATION

A variety of multimedia features may be used for predicting VM. In this paper, we extracted and examined three types of relevant features: 1) static visual features; 2) dynamic visual features; and 3) audio features. Since our objective is to build an automatic computational model for VM, all features can be obtained automatically by computer algorithms.

A. Static Visual Features

Isola et al. [6] have comprehensively evaluated a number of static visual features on the task of image memorability. In this paper, we propose some visual features that achieve promising performance, as demonstrated in Section VI-A. All static visual features are extracted on key frames. Briefly, key frames are still images that reasonably represent the content of video clips in an abstracted manner. TRECVID 2005 provides key frames for each video shot with good accuracy, so we used the key frames provided by TRECVID 2005 in this paper. According to [37], these key frames were extracted by a group at Dublin City University using an automatic approach. For a video clip, we use the mean of the predicted VM scores of all key frames belonging to that clip as its predicted VM score.

1) SD: Human visual attention mechanisms can select a subset of interesting inputs in the visual field for further cognition. In the field of computer vision, saliency detection models [3], [4], [13] are generally used to quantitatively predict the attended locations in an image. Recent works [25], [35] have shown that attention-based features are highly important for visual memorability. In this paper, we propose a new attention-based feature called SD and use it to predict VM. A similar idea has been applied in [33]; however, instead of using interobserver fixation congruency [33], our SD is calculated on a single saliency map obtained by a visual saliency detection model.


Fig. 6. Two groups of examples with low and high entropy, respectively. The top two rows show the examples with low entropy, where the first and second rows show the original images and their corresponding saliency maps, respectively. Similarly, the bottom two rows show the examples with high entropy.

Our specific rationale is as follows. If the salient locations in a saliency map are concentrated, it implies that there are a few meaningful objects in the image which could attract the viewers' interest. In contrast, when the salient locations are dispersed all over the image, it is hard to find interesting objects in it. As pointed out in [35], there are relationships between visual memorability and the interesting objects in an image.

We adopted an entropy-based approach to quantify SD (see the illustrative sketch following Section III-B). Given a key frame, the algorithm in [13] was first applied to yield a saliency map. The saliency map was then binarized and salient regions were segmented. For each salient region i with saliency value S_i (the sum of the saliency values of all pixels in the region), the probability was defined as P_i = S_i / Σ_j S_j. The SD of the key frame was then calculated as H = −Σ_i P_i log P_i. Low entropy indicates that there might be a few meaningful objects in the image. Fig. 6 shows two groups of samples with low and high entropy values, respectively.

2) OO: Isola et al. [5], [6] have demonstrated that object-based features are effective for predicting image memorability, but their object-based features were manually labeled. In this paper, we applied object bank [12] to automatically derive features that describe the occurrences of a variety of objects in an image. Unlike the original work [12], which aimed to yield a high-level object-based image representation for scene categorization, we are interested in characterizing the occurrences of various objects in the images. As illustrated in Fig. 7, given an image, we concatenate the maximum responses of each object detector across multiple scales and across the whole image to form a histogram-like feature vector, where each element approximately implies the likelihood of a certain object being present in the image. Our current implementation uses 208 object detectors [12], which results in a 208-dimensional descriptor of OOs for each key frame.

3) OA: The OO features quantify the presence of various objects in a visual scene. However, two objects with the same label (e.g., "clothes") may be quite different in material or size.

Fig. 7. Flowchart of generating OO features using object bank [12].

We believe that OAs, such as parts, materials, shapes, and sizes, also play an important role in recognizing objects and further understanding visual content. Hence, there should be a high correlation between OAs and VM. In this paper, we extracted OA features to describe a video key frame by using the trained attribute classifiers developed in [31]. OAs consist of semantic attributes and discriminative attributes; semantic attributes further include shape, part, and material attributes.

4) Scene Complexity (SC): As pointed out in [33], the visual complexity of a scene may correlate with observers' understanding of a picture and may thus influence visual memorability. Following [33], three features were extracted to represent the SC of a video frame: the sum of the entropy values of the wavelet subbands, the number of regions, and the number of contours.

5) Background Simplicity (BS): As indicated in [27], to reduce the attention distraction caused by the background, photographers often make backgrounds simple. Following [27], we use the color distribution, measured by the color histogram of the background, to quantify BS.

6) Color: Color is an important feature of images. In this paper, we extracted four color features, namely saturation, brightness, colorfulness, and contrast, to examine the relationship between color features and VM. Brightness is a measure of the amplitude of the light wave; saturation is a measure of vividness; colorfulness measures a color's difference from gray; and contrast measures the local luminance variation in a surrounding area. The details of implementing these color features can be found in [26].

B. Dynamic Visual Features

Due to their outstanding performance, we utilized the sparse spatio-temporal features proposed in [14] to characterize the motion information. They are built on space-time interest points defined by gradient and optical flow structure in a fixed window around 3-D corner features. The rationale here is that salient motion patterns could be related to VM. In our current implementation, a 100-dimensional feature vector was employed to represent the dynamic visual information in a video clip.
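To make the SD feature of Section III-A concrete, the sketch below computes the entropy H = −Σ_i P_i log P_i over segmented salient regions. It is a minimal illustration, assuming a saliency map in [0, 1] produced by an external detector such as [13] and a simple threshold-based segmentation; the threshold value is an arbitrary placeholder.

```python
import numpy as np
from scipy import ndimage

def saliency_dispersion(saliency_map, threshold=0.5):
    """Entropy-based saliency dispersion (SD) of one key frame."""
    # Binarize the saliency map and segment connected salient regions.
    binary = saliency_map >= threshold
    labels, n_regions = ndimage.label(binary)
    if n_regions == 0:
        return 0.0
    # S_i: sum of saliency values of all pixels in region i.
    region_saliency = ndimage.sum(saliency_map, labels,
                                  index=np.arange(1, n_regions + 1))
    # P_i = S_i / sum_j S_j, then H = -sum_i P_i log P_i.
    p = np.asarray(region_saliency, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())
```

A low value of H corresponds to a few concentrated salient regions, which, as discussed above, tends to indicate the presence of a small number of interesting objects.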


C. Audio Features

We adopted the conventional mel-frequency cepstral coefficient (MFCC) [16] features, based on the short-term power spectrum of a sound, as the audio features. A 13-dimensional MFCC feature vector was generated to represent each video clip. A variety of studies have demonstrated the effectiveness of MFCC features in characterizing audio streams. In addition, we also extracted energy, brightness, entropy, roughness, novelty, low-energy rate, zero-crossing rate, roll-off, and pitch features. All these features were obtained using the MIRtoolbox [29].

IV. fMRI-DERIVED FEATURE COMPUTATION

In brief, we leverage fMRI neuroimaging techniques to monitor and quantify the brain's responses to video stream stimuli, and subsequently to identify relevant brain networks and ROIs involved in the comprehension and memorization of video content. Eventually, semantic features reflecting the functional brain activity of forming memory are derived.

A. fMRI Data Acquisition Under Natural Stimuli

To explore brain memorization of video content, we designed an experiment to perform fMRI scanning while subjects are watching video stimuli. Three healthy adults (who did not participate in the memory game described in Section II) were recruited for this experiment. In total, 93 video clips were randomly selected from the 222 target videos used in Section II and presented to the three subjects during fMRI scanning via MRI-compatible goggles. It should be mentioned that these three subjects were not given any task while they were scanned. MRI data were acquired on a GE 3T Signa HDx MRI system. The multimodal diffusion tensor imaging (DTI) and fMRI scans were performed in three separate scan sessions for each subject. DTI scans were performed for each participant with an isotropic spatial resolution of 2 × 2 × 2 mm. The fMRI parameters were as follows: 4-mm slice thickness, 64 × 64 matrix, 220-mm field of view, 30 slices, echo time = 25 ms, array spatial sensitivity encoding technique = 2, and repetition time = 1.5 s. Precise synchronization between movie viewing and the fMRI scan was achieved via the E-Prime software. The preprocessing of the fMRI data [20] included skull removal, spatial smoothing, motion correction, slice time correction, temporal prewhitening, and global drift removal.

B. Relevant Brain ROI Localization and fMRI-Derived Feature Extraction

In principle, the human brain's function is realized via large-scale structural and functional connectivities. The functional connectivities and interactions among relevant brain networks reflect the brain's comprehension and memorization of video stimuli. Traditional task-based fMRI has been widely regarded as a benchmark approach to localize functionally specialized brain ROIs.


However, as indicated in [32], traditional task-based fMRI analysis is incapable of measuring brain responses under natural stimuli for two reasons. First, it is very hard to acquire large-scale task-based fMRI data for the same group of subjects due to cost and time limitations. Second, task-based fMRI normally exploits relatively simple tasks or well-controlled stimuli, while natural stimuli can be complex, and the brain activity for comprehending them involves a large number of functional networks such as attention, emotion, vision, and working memory. In this paper, we adopted a data-driven approach that combines a publicly available brain reference system called DICCCOL [24] with a discriminative learning scheme to localize large-scale relevant functional networks under natural stimuli.

Briefly, DICCCOL [24] is a universal and individualized brain reference system that can identify 358 consistent and corresponding structural landmarks in multiple brains and populations. Each identified landmark is optimized to possess maximal group-wise consistency of DTI-derived fiber connection patterns. This set of 358 structural brain landmarks has been validated in a variety of healthy populations, demonstrating remarkable reproducibility and predictability. We applied DICCCOL [24] to localize the 358 ROIs in the scanned subjects using their DTI data. fMRI signals were extracted for each of these 358 ROIs after a linear transform from the ROIs in the DTI space to the fMRI image space.

We adopted the commonly used functional connectivities of brain ROIs to model the human brain's responses to free viewing of video streams. Typically, the functional connectivity between two ROIs is measured as the Pearson correlation coefficient between their representative fMRI time series. As a result, for each scanned video clip, we constructed a 358 × 358 connectivity matrix, which essentially has 63 903 distinct elements (the connectivity matrix is symmetric), to represent the brain's responses to the video stimuli.

However, not all 358 ROIs are involved in memorizing videos. We aim to identify those intrinsic ROIs whose responses are more relevant for discriminating more memorable videos from less memorable ones. A discriminative learning strategy using feature selection was thus adopted. First, we divided the range (0%–100%) of the VM score into five equal intervals, and the target video clips were classified into five classes according to the interval into which their ground truth VM score falls. Afterwards, feature selection was applied to the connectivity matrices of the target videos with the objective of maximizing classification accuracy. Since the feature dimensionality is relatively large (63 903), we performed feature selection in two stages. In the first stage, a statistical analysis of variance (ANOVA) test was used to remove large numbers of irrelevant features by assessing each feature dimension's relevance individually; after this stage, the feature vector was still large (over 3000 dimensions) and contained much redundant information. In the second stage, an algorithm based on improved sparse logistic regression [17] was applied to the features selected in the first stage to further refine the results. After this stage, the dimensionality was reduced to a few tens.
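The fMRI-derived feature pipeline described above can be sketched as follows. This is a simplified illustration, not our exact implementation: the ROI time series are assumed to come from DICCCOL-localized ROIs, and scikit-learn's ANOVA screening plus an L1-penalized logistic regression stand in for the improved sparse logistic regression of [17].

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

def connectivity_features(roi_signals):
    """Upper-triangular Pearson connectivity of ROI time series.

    `roi_signals` is assumed to be an (n_rois, n_timepoints) array of
    preprocessed fMRI signals, one row per DICCCOL ROI; with 358 ROIs this
    yields 358 * 357 / 2 = 63903 connectivity features per video clip.
    """
    corr = np.corrcoef(roi_signals)            # symmetric (n_rois, n_rois)
    iu = np.triu_indices_from(corr, k=1)       # keep each ROI pair once
    return corr[iu]

def select_relevant_connections(X, y, n_anova=3000, C=0.1):
    """Two-stage selection: ANOVA screening, then sparse logistic regression.

    X: (n_videos, n_connections) connectivity features; y: five-class VM labels.
    """
    anova = SelectKBest(f_classif, k=n_anova).fit(X, y)
    X_screened = anova.transform(X)
    sparse_lr = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    sparse_lr.fit(X_screened, y)
    kept = np.any(sparse_lr.coef_ != 0, axis=0)  # connections with nonzero weight
    return anova, sparse_lr, kept
```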


Fig. 8. (a) Top ten brain networks involved in the comprehension and memorization of videos. The percentage is calculated as the ratio between the number of identified ROIs belonging to a functional brain network and the number of all identified ROIs. (b) Visualization of the locations of the identified brain ROIs for the comprehension of videos. The spheres (both green and white) are the DICCCOL landmarks; the green ones are the landmarks involved in the identified functional connections. The blue lines bridging ROIs indicate their functional connectivity.

C. Mapping of Brain Networks and ROIs Involved in the Memorization of Videos

We used the publicly available DICCCOL system [24] and the discriminative learning method described in Section IV-B to jointly identify a collection of brain ROIs involved in the comprehension and memorization of video clips. We also assigned the selected ROIs to different brain networks according to the DICCCOL system's functional annotations [24]. Fig. 8(a) lists the top ten functional networks with the largest percentages of selected ROIs. The result indicates that the selected functional networks are reasonable and meaningful according to current neuroscience knowledge. Specifically, the attention, speech, semantics, and emotion systems are among the most relevant brain networks for the comprehension and memorization of videos. In addition, the spatial locations of those ROIs are shown in Fig. 8(b). A wide range of cortical regions, such as the precentral and postcentral gyri, superior occipital gyrus, and superior frontal gyrus, and the interactions among them, might be involved in the comprehension and memorization of videos. In summary, our results suggest that the functional interactions of large-scale brain networks are engaged in video comprehension and that our discriminative learning can select the functionally relevant brain networks. Although comprehensively exploring the functional mechanisms of video memorization is beyond the scope of this paper, the cortical ROIs and the functional interactions among them identified in our experiments may provide novel insights for further study of this problem.

V. PREDICTION MODEL

As shown in Fig. 3, once the fMRI-derived features and audiovisual features are obtained, we train a support vector regression (SVR) [38] model to map these features to memorability scores.

Here, the SVR model is essentially a prediction function trained to estimate the memorability score of a test video clip. Although the fMRI-derived features are powerful, acquiring them via fMRI scanning is expensive; it is impractical to carry out fMRI scans for all videos when a large-scale video collection is given. Therefore, our prediction model has two objectives. First, it should combine the power of low-level audiovisual features and high-level fMRI-derived features. Second, it should reliably predict the memorability score of a test video without fMRI scans. To fulfill these objectives, we perform subspace learning using CCA, which preserves the correlation between the audiovisual and fMRI-derived feature spaces. Based on the obtained CCA model, videos associated with audiovisual features only can then be projected onto the learned subspace, and the SVR prediction model works in this subspace.

CCA aims to find basis vectors for two sets of features such that the correlation between the projections of the features onto these basis vectors is maximized. Denote by X the fMRI-derived feature vector and by Y the audiovisual feature vector of the video samples. The original Barnett-Preisendorfer CCA [9], performed in the empirical orthogonal function (EOF) space to avoid the inversion of singular matrices, is an effective way to compute CCA. In this paper, the EOF space is the principal component eigenspace. Therefore, principal component analysis (PCA) is first performed on X and Y

\[
\mathrm{pc}_X = X E_X, \qquad \mathrm{pc}_Y = Y E_Y \tag{1}
\]

where pc_X and pc_Y are the principal components of X and Y, and E_X and E_Y are the eigenvector matrices of the covariance matrices of X and Y, respectively. CCA is then reformulated to extract the correlated modes between X and Y by seeking a set of transformation vector pairs A_i and B_i that give the canonical variates μ_i and ν_i the maximum correlation

\[
\mu_i = \mathrm{pc}_X^{T} A_i, \qquad \nu_i = \mathrm{pc}_Y^{T} B_i. \tag{2}
\]

The correlation between μ_i and ν_i can be computed by

\[
r_i = \frac{A_i^{T} C_{\mathrm{pc}_X \mathrm{pc}_Y} B_i}{\sqrt{A_i^{T} C_{\mathrm{pc}_X \mathrm{pc}_X} A_i \, B_i^{T} C_{\mathrm{pc}_Y \mathrm{pc}_Y} B_i}} \tag{3}
\]

where C_{pc_X pc_Y} is the cross-covariance matrix of pc_X and pc_Y, and C_{pc_X pc_X} and C_{pc_Y pc_Y} are the auto-covariance matrices. r_i is maximized by setting ∂r_i/∂A_i = ∂r_i/∂B_i = 0, which leads to the eigenvalue problems

\[
\begin{aligned}
C_{\mathrm{pc}_X \mathrm{pc}_X}^{-1} C_{\mathrm{pc}_X \mathrm{pc}_Y} C_{\mathrm{pc}_Y \mathrm{pc}_Y}^{-1} C_{\mathrm{pc}_Y \mathrm{pc}_X} A_i &= r_i^2 A_i \\
C_{\mathrm{pc}_Y \mathrm{pc}_Y}^{-1} C_{\mathrm{pc}_Y \mathrm{pc}_X} C_{\mathrm{pc}_X \mathrm{pc}_X}^{-1} C_{\mathrm{pc}_X \mathrm{pc}_Y} B_i &= r_i^2 B_i.
\end{aligned} \tag{4}
\]

By solving the eigenvalue problems in (4), we obtain the ordered correlations {r_1, r_2, . . . , r_n} and the corresponding transformation vectors A = [A_1, A_2, . . . , A_n] and B = [B_1, B_2, . . . , B_n]. The corresponding sets of canonical variates can be expressed as U = [μ_1, μ_2, . . . , μ_n]^T and V = [ν_1, ν_2, . . . , ν_n]^T. The transforms E_Y and B are the learned PCA and CCA models, respectively. In practice, a video sample is represented by first applying E_Y to describe it in the principal component (EOF) space and then applying B to represent it in the latent space.
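The full training and prediction pipeline of Section V can be summarized with off-the-shelf components. The sketch below is only an approximation of the method: scikit-learn's PCA, iterative CCA, and SVR replace the BP-CCA solution of (1)-(4), and all array names, shapes, and hyperparameters are illustrative assumptions rather than our actual settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA
from sklearn.svm import SVR

def train_vm_model(X_fmri, Y_av, Y_all, m_all, n_components=10):
    """Learn the shared subspace from paired samples, then train SVR in it.

    X_fmri: (n_scanned, d_fmri) fMRI-derived features of the scanned videos.
    Y_av:   (n_scanned, d_av)   audiovisual features of the same videos.
    Y_all:  (n_videos,  d_av)   audiovisual features of all labeled videos.
    m_all:  (n_videos,)         ground truth memorability scores.
    """
    # PCA first (the EOF space), then CCA between the two projected feature sets.
    pca_x = PCA(n_components=n_components).fit(X_fmri)
    pca_y = PCA(n_components=n_components).fit(Y_av)
    cca = CCA(n_components=n_components)
    cca.fit(pca_x.transform(X_fmri), pca_y.transform(Y_av))

    def project(Y):
        # Audiovisual-only videos are mapped into the learned latent space.
        return pca_y.transform(Y) @ cca.y_rotations_

    svr = SVR(kernel="rbf").fit(project(Y_all), m_all)
    return project, svr

# Predicting a new clip needs only its audiovisual features:
# project, svr = train_vm_model(X_fmri, Y_av, Y_all, m_all)
# memorability = svr.predict(project(Y_new))
```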


VI. EXPERIMENTS

A. Evaluating Static Visual Features on the Image Memorability Database

In this paper, we developed six static visual features. The first three (SD, OO, and OA) are new features designed for visual memorability prediction, while the other three (BS, SC, and Color) have been used in existing applications such as image attractiveness prediction, image quality evaluation, and image interestingness assessment. In this section, we used the image memorability benchmark proposed in [5] and [6] to evaluate these six features. This benchmark contains 2222 images with ground truth memorability scores obtained using a repeat detection experiment. We trained the prediction model via SVR [38] using each individual feature and the combination of features. Our experimental settings and evaluation metrics followed [6]. Specifically, we split the set of 2222 images and the participant set into two independent, random halves. The prediction model was trained on one half of the images and tested on the other half. The prediction trials were repeated 25 times and the average results were taken. During the training of each SVR, we utilized the radial basis function kernel and performed a grid search to tune the cost and ε hyperparameters [38].

Similar to [6], we adopted two methods to evaluate the performance of our predictions. First, we calculated the Spearman's rank correlation (ρ) between the predicted and ground truth memorabilities. Second, we sorted images by predicted score, selected various ranges (for example, top 20, top 100, bottom 100, and bottom 20) of images in this order, and examined the average ground truth memorability on these ranges. We also compared the performance of the proposed features with that of the low-level features in [6] and [7]. Table I summarizes the comparison, in which the results for [6] and [7] are directly excerpted from those papers. The three proposed static visual features (SD, OO, and OA) achieve promising results; in particular, the OO feature is quite effective for measuring image memorability. The combination of the three proposed features improves over using all global features in [6] by 0.03 (measured by ρ). The three features that have been used in existing works (BS, SC, and Color) are easy to implement and can be computed quickly, which is appropriate for processing large-scale video data. Combining them with our three proposed features (SD, OO, and OA) yields performance comparable to [7] when predicting image memorability.

B. Evaluating fMRI-Derived Features

Section IV-B proposed fMRI-derived features that reflect the human brain's responses while memorizing video clips. In this experiment, we quantitatively evaluated their performance. Since 93 video clips have corresponding fMRI data, we conducted the experiment on those 93 video clips. Following the procedure in Section IV-B, we extracted fMRI-derived features based on each subject's fMRI data.


Then, the leave-one-out cross-validation strategy was utilized to measure the performance: a single sample from the dataset was used as the validation data and the remaining samples as the training data, and this was repeated so that each sample was used once as the validation data. As in Section VI-A, we adopted the Spearman's rank correlation (ρ) between the predicted and ground truth memorabilities and the average ground truth memorability over various ranges of sorted videos (e.g., top 20, top 100, bottom 100, and bottom 20) as evaluation metrics. For comparison, we also computed the prediction performance using only the low-level audiovisual features. In addition, we assessed whether the prediction results of the fMRI-derived features are consistent across the three subjects who attended the fMRI scanning (we call these subjects fMRI subjects); the evaluation was repeated for each of the three fMRI subjects.

Table II shows the comparison results. Two observations can be made. First, the fMRI-derived features are quite effective for VM prediction: their prediction accuracy is about 0.2 (measured by ρ) higher than that of the low-level audiovisual features, which is a significant improvement. Second, the prediction performance corresponding to each fMRI subject is generally good and the prediction variance across fMRI subjects is rather small, which shows that the fMRI-derived feature extraction proposed in this paper is relatively robust across subjects.

C. Evaluating the Prediction Model on the VM Database

We have proposed a CCA-based computational model for VM prediction to cope with the problem that most video clips have no fMRI data. In this section, we evaluate the proposed prediction model on our VM database containing 222 video clips with ground truth data. First, the 93 videos with both fMRI-derived features and all audiovisual features were used to build the CCA-based subspace as described in Section V; building this subspace is a purely unsupervised procedure. Then, the prediction model was trained in the CCA-based subspace. The leave-one-out cross-validation strategy was exploited to measure the prediction performance: a single sample from the 222 videos was used as the validation data and the remaining samples as the training data. Fig. 9 shows a number of successful and failed prediction samples.

Another experiment was performed following a setting similar to [6]. We first used the 93 videos with both fMRI-derived features and audiovisual features to learn a CCA-based subspace in an unsupervised fashion. Then, the prediction model was trained in the CCA-based subspace. We randomly split the 222 videos into two independent sets: a training set with 70% of the videos and a testing set with the remaining 30%. The prediction model was trained on the training set and tested on the testing set. The prediction trials were repeated 25 times and the average results were calculated.


TABLE I
PERFORMANCE COMPARISONS BY USING OUR DEVELOPED FEATURES AND THE FEATURES IN [6] AND [7]. "T" INDICATES "TOP" AND "B" INDICATES "BOTTOM"

TABLE II
PERFORMANCE COMPARISONS BY USING fMRI-DERIVED FEATURES AND AUDIOVISUAL FEATURES

Fig. 9. Examples of the predictions obtained by our model. Top four rows show four video clips with good prediction. Bottom four rows show video examples with bad prediction.

For quantitative evaluation, similar to [6], we used the Spearman's rank correlation (ρ) and the average ground truth memorability on various ranges of sorted videos as the performance metrics; these metrics are described in Section VI-B. We compared the performance of models trained using each individual audiovisual feature, the combination of all visual features, the combination of all audiovisual features, and the proposed prediction model combining fMRI-derived features and all audiovisual features, respectively. Additionally, to verify the model consistency between the fMRI-derived features obtained from different fMRI subjects, we evaluated the computational model using each subject's fMRI data. Table III gives the comparison results when using the leave-one-out cross-validation strategy, and Table IV shows the comparison results when using 70% of the data for training and 30% for testing.

There are a few interesting observations from the experimental results. First, the dynamic motion features and the audio features are quite effective for VM prediction.

Among the three static features (BS, SC, and Color) that have been widely used in previous works, the SC and color features show promising performance for predicting VM. The three proposed static features (OO, OA, and SD) also show reasonably good prediction performance. Among all static features, the OO feature, which approximately indicates the likelihood that each object is present in the video, achieves the best prediction capability. We further investigated the importance of different objects: the idea is to learn the prediction model (SVR) based only on the OO feature and then sort the learned weight vector. Normally, a higher weight indicates that the corresponding object has a higher correlation with VM. According to our experiment, objects with higher weights include people, plant, lion, mouse, swing, tool, baseball, ski, truck, and art gallery. On the contrary, objects with lower weights include counter, stage, wall, ocean, floor, computer monitor, electric switch, bus stop, mountain, and button. Second, we calculated the correlation between the SD feature and VM and found that they are negatively correlated, which indicates that video frames with only a few objects of interest may be more memorable; a similar conclusion has also been drawn in [25] for image memorability. Third, the proposed computational model based on fMRI-derived features and audiovisual features improves on the model using low-level audiovisual features only by approximately 0.08 on average (measured by ρ). Fourth, the prediction variance across the three fMRI subjects is small, which shows the robustness of the proposed model.

We also investigated the correlation between VM and two additional factors: 1) video length and 2) video category. Based on our benchmark database with 222 videos, we first calculated the correlation between the video length (measured in seconds) and the ground truth VM score; its value is 0.16, which means video length does not correlate strongly with VM. Second, we calculated the average memorability score of each video category. The results are shown in Table V. As can be seen, commercial videos are generally more memorable than other categories such as sports videos and weather report videos.
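For completeness, the two evaluation metrics used throughout Section VI can be computed as in the following sketch; the range sizes and variable names are placeholders, not our exact evaluation script.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_predictions(pred, gt, ranges=(20, 100)):
    """Spearman's rho between predicted and ground truth memorability, plus the
    average ground truth memorability of the top-N and bottom-N videos ranked
    by predicted score."""
    rho, _ = spearmanr(pred, gt)
    order = np.argsort(pred)[::-1]            # most memorable (predicted) first
    report = {"rho": float(rho)}
    for n in ranges:
        report[f"top_{n}"] = float(np.mean(gt[order[:n]]))
        report[f"bottom_{n}"] = float(np.mean(gt[order[-n:]]))
    return report
```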


TABLE III
PERFORMANCE COMPARISONS BY USING VARIOUS AUDIOVISUAL FEATURES AND THE PROPOSED COMPUTATIONAL MODEL, BASED ON THE LEAVE-ONE-OUT CROSS-VALIDATION STRATEGY

TABLE IV
PERFORMANCE COMPARISONS BY USING VARIOUS AUDIOVISUAL FEATURES AND THE PROPOSED COMPUTATIONAL MODEL, BASED ON THE STRATEGY OF 70% TRAINING DATA AND 30% TESTING DATA

TABLE V
AVERAGE VM SCORES OF VARIOUS VIDEO CATEGORIES

VII. CONCLUSION

This paper presented a novel framework that combines low-level audiovisual features and fMRI-measured functional brain activity via subspace learning to tackle the problem of predicting VM. Evaluations on publicly available image and video datasets demonstrated its effectiveness. Altogether, our computational framework demonstrated the great promise of combining machine intelligence and human cognitive power to narrow the long-standing "semantic gap" in computer vision. This paper echoed the insightful vision in [34] that "neuroscientists and cognitive psychologists are only beginning to discover and, in some cases, validate abstract functional architectures of the human mind. However, even the relatively abstract models available from today's measurement techniques promise to provide us with new insight and inspire innovative processing architectures and machine learning strategies."

The goal of this paper was to develop a model to predict the memorability of videos using algorithmically extracted features. We investigated the feasibility of two kinds of features, fMRI-derived features and audiovisual features, for predicting VM. Our experimental results have shown that fMRI-derived features, calculated as the functional connectivity between selected brain ROIs from a few brain networks such as attention, speech, semantics, and emotion, can predict memorability with good performance.

We also tested various audiovisual features and found that dynamic motion features and audio features are effective for VM prediction. Among the set of static visual features, the OO feature, which indicates the occurrence likelihood of each object in the video, showed the best prediction capability.

This paper is our initial attempt to predict VM with current computer vision techniques, and overall this line of research is still in its infancy. Although these types of prediction models are quite useful from an engineering point of view, they are often uninterpretable: based on the low-level audiovisual features used in this paper, it is difficult to intrinsically understand what makes a video memorable. To better understand VM, we may need to follow [5] and annotate a collection of human-understandable, interpretable, and semantic attributes for videos and then discover factors that are highly informative about VM. This will be one of our future works. Other future work will further improve the proposed approach in the following aspects. First, we will largely expand the scale of our benchmark video database and construct large-scale user study experiments, which can serve as independent reproducibility studies. Second, more novel neuroscientific findings regarding the neural mechanisms of video perception and comprehension will be integrated into our computational framework in order to extract more comprehensive and systematic measurements of the brain's functional responses to video stimuli. Finally, it is vital for researchers to share resources, e.g., the TRECVID dataset and the DICCCOL system used in this paper, to independently cross-validate computational algorithms from different labs in the future.


REFERENCES

[1] T. Konkle, T. F. Brady, G. A. Alvarez, and A. Oliva, "Scene memory is more detailed than you think: The role of categories in visual long-term memory," Psychol. Sci., vol. 21, no. 11, pp. 1551–1556, 2010.
[2] O. Furman, N. Dorfman, U. Hasson, L. Davachi, and Y. Dudai, "They saw a movie: Long-term memory for an extended audiovisual narrative," Learn. Memory, vol. 14, no. 6, pp. 457–467, 2007.
[3] Q. Wang, Y. Yuan, P. Yan, and X. Li, "Saliency detection by multiple-instance learning," IEEE Trans. Cybern., vol. 43, no. 3, pp. 660–672, Apr. 2013.
[4] Z. Yücel et al., "Joint attention by gaze interpolation and saliency," IEEE Trans. Cybern., vol. 43, no. 2, pp. 829–842, Jun. 2013.
[5] P. Isola, D. Parikh, A. Torralba, and A. Oliva, "Understanding the intrinsic memorability of images," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2011, pp. 2429–2437.
[6] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva, "What makes a photograph memorable?" IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1469–1482, Jul. 2014.
[7] A. Khosla, J. Xiao, A. Torralba, and A. Oliva, "Memorability of image regions," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2012, pp. 296–304.
[8] S. Dhar, V. Ordonez, and T. L. Berg, "High level describable attributes for predicting aesthetics and interestingness," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, 2011, pp. 1657–1664.
[9] T. Barnett and R. Preisendorfer, "Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis," Mon. Weather Rev., vol. 115, no. 9, pp. 1825–1850, 1987.
[10] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Washington, DC, USA, 2006, pp. 419–426.
[11] A. K. Moorthy, P. Obrador, and N. Oliver, "Towards computational models of the visual aesthetic appeal of consumer videos," in Proc. Eur. Conf. Comput. Vis., Heraklion, Greece, 2010, pp. 1–14.
[12] L.-J. Li, H. Su, Y. Lim, and L. Fei-Fei, "Objects as attributes for scene classification," in Proc. Eur. Conf. Comput. Vis., Heraklion, Greece, 2010, pp. 57–69.
[13] J. Tilke, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in Proc. IEEE Int. Conf. Comput. Vis., Kyoto, Japan, 2009, pp. 2106–2113.
[14] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proc. IEEE Int. Workshop Vis. Surveillance Perform. Eval. Tracking Surveillance, 2005, pp. 65–72.
[15] Q.-S. Sun, S.-G. Zeng, Y. Liu, P.-A. Heng, and D.-S. Xia, "A new method of feature fusion and its application in image recognition," Pattern Recognit., vol. 38, pp. 2437–2448, Dec. 2005.
[16] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980.
[17] G. C. Cawley and N. L. Talbot, "Gene selection in cancer classification using sparse logistic regression with Bayesian regularization," Bioinformatics, vol. 22, no. 19, pp. 2348–2355, 2006.
[18] J. Wang et al., "Brain state decoding for rapid image retrieval," in Proc. ACM Int. Conf. Multimedia, New York, NY, USA, 2009, pp. 945–954.
[19] A. Kapoor, P. Shenoy, and D. Tan, "Combining brain computer interfaces with vision for object categorization," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Anchorage, AK, USA, 2008, pp. 1–8.
[20] X. Hu et al., "Bridging low-level features and high-level semantics via fMRI brain imaging for video classification," in Proc. ACM Int. Conf. Multimedia, Firenze, Italy, 2010, pp. 451–460.
[21] U. Hasson, O. Furman, D. Clark, Y. Dudai, and L. Davachi, "Enhanced intersubject correlations during movie viewing correlate with successful episodic encoding," Neuron, vol. 57, no. 3, pp. 452–462, 2008.
[22] U. Hasson, Y. Nir, I. Levy, G. Fuhrmann, and R. Malach, "Intersubject synchronization of cortical activity during natural vision," Science, vol. 303, no. 5664, pp. 1634–1640, 2004.
[23] U. Hasson et al., "Neurocinematics: The neuroscience of film," Projections, vol. 2, no. 1, pp. 1–26, 2008.
[24] D. Zhu et al., "DICCCOL: Dense individualized and common connectivity-based cortical landmarks," Cereb. Cortex, vol. 23, pp. 786–800, Apr. 2013.
[25] M. Mancas and O. Le Meur, "Memorability of natural scenes: The role of attention," in Proc. IEEE Int. Conf. Image Process., Melbourne, VIC, Australia, 2013, pp. 196–200.
[26] J. S. Pedro and S. Siersdorfer, "Ranking and classifying attractiveness of photos in folksonomies," in Proc. Int. Conf. World Wide Web, Madrid, Spain, 2009, pp. 771–780.
[27] Y. Luo and X. Tang, "Photo and video quality evaluation: Focusing on the subject," in Proc. Eur. Conf. Comput. Vis., Marseille, France, 2008, pp. 386–399.
[28] S. S. Cheung and A. Zakhor, "Efficient video similarity measurement and search," in Proc. IEEE Int. Conf. Image Process., 2000, pp. 85–88.
[29] O. Lartillot and P. Toiviainen, "A MATLAB toolbox for musical feature extraction from audio," in Proc. Int. Conf. Digit. Audio Effects, Bordeaux, France, 2007, pp. 237–244.
[30] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, "A user attention model for video summarization," in Proc. ACM Int. Conf. Multimedia, Juan-les-Pins, France, 2002, pp. 533–542.
[31] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing objects by their attributes," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Miami, FL, USA, 2009, pp. 1778–1785.
[32] J. Han et al., "Representing and retrieving video shots in human-centric brain imaging space," IEEE Trans. Image Process., vol. 22, no. 7, pp. 2723–2736, Jul. 2013.
[33] O. Le Meur, T. Baccino, and A. Roumy, "Prediction of the inter-observer visual congruency (IOVC) and application to image ranking," in Proc. ACM Int. Conf. Multimedia, El Paso, TX, USA, 2011, pp. 373–382.
[34] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Trans. Multimedia Comput., Commun. Appl., vol. 2, pp. 1–19, Feb. 2006.
[35] B. Celikkale, A. Erdem, and E. Erdem, "Visual attention-driven spatial pooling for image memorability," in Proc. IEEE Workshop Int. Conf. Comput. Vis. Pattern Recognit., Portland, OR, USA, 2013, pp. 976–983.
[36] J. Kim, S. Yoon, and V. Pavlovic, "Relative spatial features for image memorability," in Proc. ACM Int. Conf. Multimedia, Barcelona, Spain, 2013, pp. 761–764.
[37] TRECVID Database, 2005. [Online]. Available: http://www-nlpir.nist.gov/projects/tv2005/
[38] C. Chang and C. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 27, pp. 1–27, 2011.
[39] A. Khosla, W. Bainbridge, A. Torralba, and A. Oliva, "Modifying the memorability of face photographs," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 3200–3207.
[40] A. Khosla, J. Xiao, P. Isola, A. Torralba, and A. Oliva, "Image memorability and visual inception," in Proc. SIGGRAPH Asia, New York, NY, USA, 2012, pp. 1–4.

Junwei Han received the Ph.D. degree from Northwestern Polytechnical University, Xi’an, China, in 2003. He is currently a Professor with Northwestern Polytechnical University. His current research interests include computer vision and multimedia processing.

Changyuan Chen is currently pursuing the master's degree at Northwestern Polytechnical University, Xi'an, China. His current research interests include multimedia information retrieval.


Ling Shao (M'09–SM'10) received the B.Eng. degree in electronic and information engineering from the University of Science and Technology of China, Hefei, China, the M.Sc. degree in medical image analysis and the Ph.D. (D.Phil.) degree in computer vision from the University of Oxford, Oxford, U.K. He is currently a Full Professor with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle, U.K. He was a Senior Lecturer with the Department of Electronic and Electrical Engineering at the University of Sheffield, Sheffield, U.K., from 2009 to 2014, and a Senior Scientist with Philips Research, Eindhoven, The Netherlands, from 2005 to 2009. His current research interests include computer vision, image/video processing, pattern recognition, and machine learning. He has authored/co-authored over 130 academic papers in refereed journals and conference proceedings and over ten EU/U.S. patents. Dr. Shao has been an Associate (or Guest) Editor of the IEEE TRANSACTIONS ON CYBERNETICS, Information Sciences, Pattern Recognition, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, and several other journals. He is a fellow of the British Computer Society and the IET.

Xintao Hu received the M.S. and Ph.D. degrees from Northwestern Polytechnical University (NWPU), Xi'an, China, in 2005 and 2011, respectively. He is currently an Associate Professor with the School of Automation, NWPU. His current research interests include computational brain imaging and its application in computer vision.


Jungong Han received the Ph.D. degree in telecommunication and information systems from Xidian University, Xi'an, China, in 2004. During his Ph.D. studies, he spent one year with the Internet Media Group of Microsoft Research Asia, Beijing, China. From 2005 to 2010, he was with the Signal Processing Systems Group, Technical University of Eindhoven, Eindhoven, The Netherlands. In 2010, he joined the Multiagent and Adaptive Computation Group, Centre for Mathematics and Computer Science, Amsterdam, The Netherlands. In 2012, he became a Senior Scientist with Civolution Technology, Eindhoven (a combination of Philips Content Identification and Thomson STS). His current research interests include multimedia content identification, multisensor data fusion, and computer vision. He has authored and co-authored over 70 papers. Dr. Han is an Associate Editor of Elsevier Neurocomputing.

Tianming Liu received the Ph.D. degree in computer science from Shanghai Jiaotong University, Shanghai, China, in 2002. He is currently an Associate Professor of Computer Science at the University of Georgia (UGA), Athens, GA, USA. His current research interests include computational brain imaging. Before he moved to UGA, he was a faculty member with the Weill Medical College of Cornell University, New York, NY, USA, and Harvard Medical School, Boston, MA, USA. Prof. Liu is a recipient of the Microsoft Fellowship Award and the NIH NIBIB K01 Career Award.
