Biblio

Filters: Keyword is image representation
2022-02-09
Guo, Hao, Dolhansky, Brian, Hsin, Eric, Dinh, Phong, Ferrer, Cristian Canton, Wang, Song.  2021.  Deep Poisoning: Towards Robust Image Data Sharing against Visual Disclosure. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). :686–696.
Because each has limited training data, different entities addressing the same vision task based on certain sensitive images may not be able to train a robust deep network. This paper introduces a new vision task where various entities share task-specific image data to enlarge each other's training data volume without visually disclosing sensitive contents (e.g. illegal images). Then, we present a new structure-based training regime to enable different entities to learn task-specific and reconstruction-proof image representations for image data sharing. Specifically, each entity learns a private Deep Poisoning Module (DPM) and inserts it into a pre-trained deep network, which is designed to perform the specific vision task. The DPM deliberately poisons convolutional image features to prevent image reconstructions, while ensuring that the altered image data is functionally equivalent to the non-poisoned data for the specific vision task. Given this equivalence, the poisoned features shared from one entity could be used by another entity for further model refinement. Experimental results on image classification prove the efficacy of the proposed method.
2021-03-04
Carlini, N., Farid, H..  2020.  Evading Deepfake-Image Detectors with White- and Black-Box Attacks. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). :2804–2813.

It is now possible to synthesize highly realistic images of people who do not exist. Such content has, for example, been implicated in the creation of fraudulent social media profiles responsible for disinformation campaigns. Significant efforts are, therefore, being deployed to detect synthetically-generated content. One popular forensic approach trains a neural network to distinguish real from synthetic content. We show that such forensic classifiers are vulnerable to a range of attacks that reduce the classifier to near-0% accuracy. We develop five attack case studies on a state-of-the-art classifier that achieves an area under the ROC curve (AUC) of 0.95 on almost all existing image generators, when only trained on one generator. With full access to the classifier, we can flip the lowest bit of each pixel in an image to reduce the classifier's AUC to 0.0005; perturb 1% of the image area to reduce the classifier's AUC to 0.08; or add a single noise pattern in the synthesizer's latent space to reduce the classifier's AUC to 0.17. We also develop a black-box attack that, with no access to the target classifier, reduces the AUC to 0.22. These attacks reveal significant vulnerabilities of certain image-forensic classifiers.

2021-02-08
Nisperos, Z. A., Gerardo, B., Hernandez, A..  2020.  Key Generation for Zero Steganography Using DNA Sequences. 2020 12th International Conference on Electronics, Computers and Artificial Intelligence (ECAI). :1–6.
Some of the key challenges in steganography are imperceptibility and resistance to detection of steganalysis algorithms. Zero steganography is an approach to data hiding such that the cover image is not modified. This paper focuses on the generation of stego-key, which is an essential component of this steganographic approach. This approach utilizes DNA sequences and shifting and flipping operations in its binary code representation. Experimental results show that the key generation algorithm has a low cracking probability. The algorithm satisfies the avalanche criterion.
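
The avalanche criterion mentioned above can be checked empirically: flipping a single input bit should flip roughly half of the output bits. The sketch below measures that ratio for an arbitrary byte-to-byte function. It does not reproduce the paper's DNA-based key generation; SHA-256 is used purely as a stand-in, and all function names are hypothetical.

```python
import hashlib


def flip_bit(data, index):
    """Return a copy of `data` (bytes) with the bit at position `index` inverted."""
    out = bytearray(data)
    out[index // 8] ^= 1 << (index % 8)
    return bytes(out)


def hamming_distance(a, b):
    """Count the differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def avalanche_ratio(func, message):
    """Average fraction of output bits that change per single-bit input flip."""
    base = func(message)
    output_bits = len(base) * 8
    input_bits = len(message) * 8
    flipped = sum(
        hamming_distance(base, func(flip_bit(message, i)))
        for i in range(input_bits)
    )
    return flipped / (input_bits * output_bits)


def sha256_digest(message):
    """Stand-in for a key generator (assumption: NOT the paper's algorithm)."""
    return hashlib.sha256(message).digest()
```

A ratio near 0.5 indicates a strong avalanche effect, while the identity function, which changes exactly one output bit per input flip, scores far lower.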
2020-12-11
Friedrich, T., Menzel, S..  2019.  Standardization of Gram Matrix for Improved 3D Neural Style Transfer. 2019 IEEE Symposium Series on Computational Intelligence (SSCI). :1375–1382.

Neural Style Transfer based on convolutional neural networks has produced visually appealing results for image and video data in the recent years where e.g. the content of a photo and the style of a painting are merged to a novel piece of digital art. In practical engineering development, we utilize 3D objects as standard for optimizing digital shapes. Since these objects can be represented as binary 3D voxel representation, we propose to extend the Neural Style Transfer method to 3D geometries in analogy to 2D pixel representations. In a series of experiments, we first evaluate traditional Neural Style Transfer on 2D binary monochromatic images. We show that this method produces reasonable results on binary images lacking color information and even improve them by introducing a standardized Gram matrix based loss function for style. For an application of Neural Style Transfer on 3D voxel primitives, we trained several classifier networks demonstrating the importance of a meaningful convolutional network architecture. The standardization of the Gram matrix again strongly contributes to visually improved, less noisy results. We conclude that Neural Style Transfer extended by a standardization of the Gram matrix is a promising approach for generating novel 3D voxelized objects and expect future improvements with increasing graphics memory availability for finer object resolutions.
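
A minimal sketch of a Gram-matrix style loss with an optional standardization step. The abstract does not give the exact formula, so standardizing each feature channel to zero mean and unit variance before forming the Gram matrix is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def standardize(channel):
    """Shift one flattened feature channel to zero mean and unit variance."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    std = math.sqrt(var) or 1.0  # constant channel: avoid dividing by zero
    return [(x - mean) / std for x in channel]


def gram_matrix(features, standardized=False):
    """features: list of C channels, each a flattened list of N activations.

    Returns the C x C matrix of channel inner products divided by N.
    The standardization variant is an assumption, not the paper's definition.
    """
    if standardized:
        features = [standardize(ch) for ch in features]
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


def style_loss(feat_a, feat_b, standardized=False):
    """Sum of squared differences between the two Gram matrices."""
    ga = gram_matrix(feat_a, standardized)
    gb = gram_matrix(feat_b, standardized)
    return sum((x - y) ** 2 for ra, rb in zip(ga, gb) for x, y in zip(ra, rb))
```

With standardization, the Gram matrix becomes a channel correlation matrix, so the loss no longer depends on the overall scale or offset of the activations.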

2020-12-07
Zhang, Y., Zhang, Y., Cai, W..  2018.  Separating Style and Content for Generalized Style Transfer. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. :8447–8455.

Neural style transfer has drawn broad attention in recent years. However, most existing methods aim to explicitly model the transformation between different styles, and the learned model is thus not generalizable to new styles. We here attempt to separate the representations for styles and contents, and propose a generalized style transfer network consisting of style encoder, content encoder, mixer and decoder. The style encoder and content encoder are used to extract the style and content factors from the style reference images and content reference images, respectively. The mixer employs a bilinear model to integrate the above two factors and finally feeds it into a decoder to generate images with target style and content. To separate the style features and content features, we leverage the conditional dependence of styles and contents given an image. During training, the encoder network learns to extract styles and contents from two sets of reference images in limited size, one with shared style and the other with shared content. This learning framework allows simultaneous style transfer among multiple styles and can be deemed as a special 'multi-task' learning scenario. The encoders are expected to capture the underlying features for different styles and contents which is generalizable to new styles and contents. For validation, we applied the proposed algorithm to the Chinese Typeface transfer problem. Extensive experiment results on character generation have demonstrated the effectiveness and robustness of our method.
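
The bilinear mixing step can be illustrated with a small sketch. The tensor layout, the dimensions, and the function name are assumptions for this example; the abstract states only that a bilinear model integrates the style and content factors.

```python
def bilinear_mix(style, content, weights):
    """Combine a style vector and a content vector with a bilinear form.

    weights[k][i][j] couples style[i] with content[j] for output unit k.
    This layout is an assumption made for the sketch.
    """
    return [
        sum(
            style[i] * w_k[i][j] * content[j]
            for i in range(len(style))
            for j in range(len(content))
        )
        for w_k in weights
    ]
```

Each output unit is linear in the style factor when the content factor is held fixed, and vice versa, which is what lets the two factors be learned separately and recombined.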

2020-10-05
Kumar, Suren, Dhiman, Vikas, Koch, Parker A, Corso, Jason J..  2018.  Learning Compositional Sparse Bimodal Models. IEEE Transactions on Pattern Analysis and Machine Intelligence. 40:1032–1044.

Various perceptual domains have underlying compositional semantics that are rarely captured in current models. We suspect this is because directly learning the compositional structure has evaded these models. Yet, the compositional structure of a given domain can be grounded in a separate domain thereby simplifying its learning. To that end, we propose a new approach to modeling bimodal perceptual domains that explicitly relates distinct projections across each modality and then jointly learns a bimodal sparse representation. The resulting model enables compositionality across these distinct projections and hence can generalize to unobserved percepts spanned by this compositional basis. For example, our model can be trained on red triangles and blue squares; yet, implicitly will also have learned red squares and blue triangles. The structure of the projections and hence the compositional basis is learned automatically; no assumption is made on the ordering of the compositional elements in either modality. Although our modeling paradigm is general, we explicitly focus on a tabletop building-blocks setting. To test our model, we have acquired a new bimodal dataset comprising images and spoken utterances of colored shapes (blocks) in the tabletop setting. Our experiments demonstrate the benefits of explicitly leveraging compositionality in both quantitative and human evaluation studies.

Lee, Haanvid, Jung, Minju, Tani, Jun.  2018.  Recognition of Visually Perceived Compositional Human Actions by Multiple Spatio-Temporal Scales Recurrent Neural Networks. IEEE Transactions on Cognitive and Developmental Systems. 10:1058–1069.

We investigate a deep learning model for action recognition that simultaneously extracts spatio-temporal information from raw RGB input data. The proposed multiple spatio-temporal scales recurrent neural network (MSTRNN) model is derived by combining multiple timescale recurrent dynamics with a conventional convolutional neural network model. The architecture of the proposed model imposes both spatial and temporal constraints simultaneously on its neural activities. The constraints vary, with multiple scales in different layers. As suggested by the principle of upward and downward causation, it is assumed that the network can develop a functional hierarchy using its constraints during training. To evaluate and observe the characteristics of the proposed model, we use three human action datasets consisting of different primitive actions and different compositionality levels. The performance capabilities of the MSTRNN model on these datasets are compared with those of other representative deep learning models used in the field. The results show that the MSTRNN outperforms baseline models while using fewer parameters. The characteristics of the proposed model are observed by analyzing its internal representation properties. The analysis clarifies how the spatio-temporal constraints of the MSTRNN model help it extract critical spatio-temporal information relevant to its given tasks.

Chakraborty, Anit, Dutta, Sayandip, Bhattacharyya, Siddhartha, Platos, Jan, Snasel, Vaclav.  2018.  Reinforcement Learning inspired Deep Learned Compositional Model for Decision Making in Tracking. 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN). :158–163.

We formulate a tracker that performs incessant decision making in order to track objects that may undergo challenges such as partial occlusion, a moving camera, or a cluttered background. In the process, the agent must decide whether to keep tracking the object when it is occluded or has temporarily moved out of the frame, based on its prediction from the previous location, or to reinitialize the tracker based on the belief that the target has been lost. Instead of heuristic methods, we depend on reward- and penalty-based training that helps the agent reach an optimal solution via this partially observable Markov decision process (POMDP). Furthermore, we employ a deeply learned compositional model to estimate human pose in order to better handle occlusion without needing human input. By learning the compositionality of human bodies via a deep neural network, the agent can make better decisions on the presence or absence of a human in a frame under occlusion. We adapt a skeleton-based part representation and do away with the large spatial state requirement. This especially helps in cases where the orientation of the target in focus is unorthodox. Finally, we demonstrate that deep reinforcement learning based training coupled with pose estimation capabilities allows us to train and tag multiple large video datasets much more quickly than previous works.

2020-07-30
Perez, Claudio A., Estévez, Pablo A, Galdames, Francisco J., Schulz, Daniel A., Perez, Juan P., Bastías, Diego, Vilar, Daniel R..  2018.  Trademark Image Retrieval Using a Combination of Deep Convolutional Neural Networks. 2018 International Joint Conference on Neural Networks (IJCNN). :1–7.
Trademarks are recognizable images and/or words used to distinguish various products or services. They become associated with the reputation, innovation, quality, and warranty of the products. Countries around the world have offices for industrial/intellectual property (IP) registration. A new trademark image in application for registration should be distinct from all the registered trademarks. Due to the volume of trademark registration applications and the size of the databases containing existing trademarks, it is impossible for humans to make all the comparisons visually. Therefore, technological tools are essential for this task. In this work we use a pre-trained, publicly available Convolutional Neural Network (CNN) VGG19 that was trained on the ImageNet database. We adapted the VGG19 for the trademark image retrieval (TIR) task by fine tuning the network using two different databases. The VGG19v was trained with a database organized with trademark images using visual similarities, and the VGG19c was trained using trademarks organized by using conceptual similarities. The database for the VGG19v was built using trademarks downloaded from the WEB, and organized by visual similarity according to experts from the IP office. The database for the VGG19c was built using trademark images from the United States Patent and Trademarks Office and organized according to the Vienna conceptual protocol. The TIR was assessed using the normalized average rank for a test set from the METU database that has 922,926 trademark images. We computed the normalized average ranks for VGG19v, VGG19c, and for a combination of both networks. Our method achieved significantly better results on the METU database than those published previously.
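
The normalized average rank used for evaluation can be sketched as follows, assuming the definition commonly used in image retrieval work: 0 means every relevant image is ranked first, and values near 0.5 correspond to random ordering. The abstract does not spell out the formula, so this definition and the function name are assumptions.

```python
def normalized_average_rank(relevant_ranks, database_size):
    """Normalized average rank of the relevant items for one query.

    relevant_ranks: 1-based positions at which the relevant images were
    returned. database_size: total number of images that were ranked.
    Assumed formula: (sum(R_i) - N_rel * (N_rel + 1) / 2) / (N * N_rel).
    """
    n_rel = len(relevant_ranks)
    if n_rel == 0 or database_size <= 0:
        raise ValueError("need at least one relevant item and a non-empty database")
    best_possible = n_rel * (n_rel + 1) / 2  # rank sum of a perfect retrieval
    return (sum(relevant_ranks) - best_possible) / (database_size * n_rel)
```

For example, with 100 images and three relevant ones returned at ranks 1, 2, and 3, the score is 0.0; if they are returned last, at ranks 98, 99, and 100, the score is 0.97.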
2020-05-22
Dubey, Abhimanyu, Maaten, Laurens van der, Yalniz, Zeki, Li, Yixuan, Mahajan, Dhruv.  2019.  Defense Against Adversarial Images Using Web-Scale Nearest-Neighbor Search. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). :8759–8768.
A plethora of recent work has shown that convolutional networks are not robust to adversarial images: images that are created by perturbing a sample from the data distribution so as to maximize the loss on the perturbed example. In this work, we hypothesize that adversarial perturbations move the image away from the image manifold in the sense that there exists no physical process that could have produced the adversarial image. This hypothesis suggests that a successful defense mechanism against adversarial images should aim to project the images back onto the image manifold. We study such defense mechanisms, which approximate the projection onto the unknown image manifold by a nearest-neighbor search against a web-scale image database containing tens of billions of images. Empirical evaluations of this defense strategy on ImageNet suggest that it is very effective in attack settings in which the adversary does not have access to the image database. We also propose two novel attack methods to break nearest-neighbor defense settings and show conditions under which nearest-neighbor defense fails. We perform a series of ablation experiments, which suggest that there is a trade-off between robustness and accuracy as we use features from deeper in the network, that a large index size (hundreds of millions) is crucial to get good performance, and that careful construction of the database is crucial for robustness against nearest-neighbor attacks.
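
The idea of approximating the projection onto the image manifold by nearest-neighbor search can be sketched as a k-nearest-neighbor vote in feature space. This is a toy illustration under stated assumptions: brute-force search stands in for a web-scale index, feature extraction is out of scope, and an unweighted majority vote is used, which may differ from the paper's weighting.

```python
import math
from collections import Counter


def knn_predict(query, database, k=3):
    """Classify `query` by majority vote over its k nearest database entries.

    database: list of (feature_vector, label) pairs assumed to come from
    natural, unperturbed images. Brute-force Euclidean search is an
    assumption standing in for a large-scale approximate index.
    """
    if k <= 0 or not database:
        raise ValueError("k must be positive and the database non-empty")
    nearest = sorted(database, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Because the prediction depends only on labels of stored natural images, a small perturbation changes the output only if it moves the query's features close to a different neighborhood.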
2019-06-24
Naeem, H., Guo, B., Naeem, M. R..  2018.  A light-weight malware static visual analysis for IoT infrastructure. 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD). :240–244.

Recently, the strong trend toward the Internet of Things (IoT) and an exponential increase in automated tools have been helping malware producers to target IoT devices. Traditional security solutions against malware are infeasible because of the low computing power available for large-scale data in the IoT environment. The number of malware samples and their variants is increasing due to continuous malware attacks. Consequently, improving the performance of malware analysis is a critical requirement to stop the rapid expansion of malicious attacks in the IoT environment. To solve this problem, this paper proposes a novel framework for classifying malware in the IoT environment. To achieve fine-grained malware classification in the suggested framework, the malware image classification system (MICS) is designed to represent the malware image globally and locally. MICS first converts the suspicious program into a gray-scale image and then captures hybrid local and global malware features to perform malware family classification. Preliminary experimental outcomes of MICS are quite promising, with 97.4% classification accuracy on 9342 suspicious Windows programs from 25 families. The experimental results indicate that the proposed framework is capable of processing large-scale IoT malware.
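
The first step described above, converting a program into a gray-scale image, can be sketched by mapping each byte to one pixel intensity. The fixed row width and zero padding are assumptions for this example; the abstract does not state how MICS chooses the image dimensions.

```python
def bytes_to_grayscale(data, width=16):
    """Lay out raw bytes row by row as a 2D grid of 0-255 pixel intensities.

    The last row is zero-padded to `width`. Both the width and the padding
    value are assumptions made for this sketch.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)
    pixels.extend([0] * ((-len(pixels)) % width))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]
```

The resulting grid can be handed to any image feature extractor; reused code sections tend to appear as similar textures across samples of the same family.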

2019-06-10
Kornish, D., Geary, J., Sansing, V., Ezekiel, S., Pearlstein, L., Njilla, L..  2018.  Malware Classification Using Deep Convolutional Neural Networks. 2018 IEEE Applied Imagery Pattern Recognition Workshop (AIPR). :1–6.

In recent years, deep convolution neural networks (DCNNs) have won many contests in machine learning, object detection, and pattern recognition. Furthermore, deep learning techniques achieved exceptional performance in image classification, reaching accuracy levels beyond human capability. Malware variants from similar categories often contain similarities due to code reuse. Converting malware samples into images can cause these patterns to manifest as image features, which can be exploited for DCNN classification. Techniques for converting malware binaries into images for visualization and classification have been reported in the literature, and while these methods do reach a high level of classification accuracy on training datasets, they tend to be vulnerable to overfitting and perform poorly on previously unseen samples. In this paper, we explore and document a variety of techniques for representing malware binaries as images with the goal of discovering a format best suited for deep learning. We implement a database for malware binaries from several families, stored in hexadecimal format. These malware samples are converted into images using various approaches and are used to train a neural network to recognize visual patterns in the input and classify malware based on the feature vectors. Each image type is assessed using a variety of learning models, such as transfer learning with existing DCNN architectures and feature extraction for support vector machine classifier training. Each technique is evaluated in terms of classification accuracy, result consistency, and time per trial. Our preliminary results indicate that improved image representation has the potential to enable more effective classification of new malware.

2019-01-21
Kos, J., Fischer, I., Song, D..  2018.  Adversarial Examples for Generative Models. 2018 IEEE Security and Privacy Workshops (SPW). :36–42.

We explore methods of producing adversarial examples on deep generative models such as the variational autoencoder (VAE) and the VAE-GAN. Deep learning architectures are known to be vulnerable to adversarial examples, but previous work has focused on the application of adversarial examples to classification tasks. Deep generative models have recently become popular due to their ability to model input data distributions and generate realistic examples from those distributions. We present three classes of attacks on the VAE and VAE-GAN architectures and demonstrate them against networks trained on MNIST, SVHN and CelebA. Our first attack leverages classification-based adversaries by attaching a classifier to the trained encoder of the target generative model, which can then be used to indirectly manipulate the latent representation. Our second attack directly uses the VAE loss function to generate a target reconstruction image from the adversarial example. Our third attack moves beyond relying on classification or the standard loss for the gradient and directly optimizes against differences in source and target latent representations. We also motivate why an attacker might be interested in deploying such techniques against a target generative network.

2018-11-19
Wang, X., Oxholm, G., Zhang, D., Wang, Y..  2017.  Multimodal Transfer: A Hierarchical Deep Convolutional Neural Network for Fast Artistic Style Transfer. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). :7178–7186.

Transferring artistic styles onto everyday photographs has become an extremely popular task in both academia and industry. Recently, offline training has replaced online iterative optimization, enabling nearly real-time stylization. When those stylization networks are applied directly to high-resolution images, however, the style of localized regions often appears less similar to the desired artistic style. This is because the transfer process fails to capture small, intricate textures and maintain correct texture scales of the artworks. Here we propose a multimodal convolutional neural network that takes into consideration faithful representations of both color and luminance channels, and performs stylization hierarchically with multiple losses of increasing scales. Compared to state-of-the-art networks, our network can also perform style transfer in nearly real-time by performing much more sophisticated training offline. By properly handling style and texture cues at multiple scales using several modalities, we can transfer not just large-scale, obvious style cues but also subtle, exquisite ones. That is, our scheme can generate results that are visually pleasing and more similar to multiple desired artistic styles with color and texture cues at multiple scales.

Li, P., Zhao, L., Xu, D., Lu, D..  2018.  Incorporating Multiscale Contextual Loss for Image Style Transfer. 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC). :241–245.

In this paper, we propose to impose a multiscale contextual loss for image style transfer based on Convolutional Neural Networks (CNN). In the traditional optimization framework, a new stylized image is synthesized by constraining the high-level CNN features to be similar to those of a content image and the lower-level CNN features to be similar to those of a style image, which, however, appears to lose many details of the content image, presenting unpleasing and inconsistent distortions or artifacts. The proposed multiscale contextual loss, named Haar loss, is responsible for preserving the lost details by matching the features derived from the content image and the synthesized image via wavelet transform. It enables the synthesized image to better retain the semantic information of the content image. More specifically, the unpleasant distortions can be effectively alleviated while the style can be well preserved. In the experiments, we show the visually more consistent and simultaneously well-stylized images generated by incorporating the multiscale contextual loss.

2018-05-02
Li, F., Jiang, M., Zhang, Z..  2017.  An adaptive sparse representation model by block dictionary and swarm intelligence. 2017 2nd IEEE International Conference on Computational Intelligence and Applications (ICCIA). :200–203.

Pattern recognition in the sparse representation (SR) framework has been very successful. In this model, the test sample can be represented as a sparse linear combination of training samples by solving a norm-regularized least squares problem. However, the value of the regularization parameter is always the same for the whole dictionary. To enhance the group concentration of the coefficients and also to improve the sparsity, we propose a new SR model called adaptive sparse representation classifier (ASRC). In ASRC, a term that strengthens the sparse coefficients is added to the objective function. The model is solved by the artificial bee colony (ABC) algorithm with variable step to speed up the convergence. Also, a partition strategy for large-scale dictionaries is adopted to lighten the bees' load and remove irrelevant groups. Through different data sets, we empirically demonstrate the property of the new model and its recognition performance.

2018-05-01
Zhao, H., Ren, J., Pei, Z., Cai, Z., Dai, Q., Wei, W..  2017.  Compressive Sensing Based Feature Residual for Image Steganalysis Detection. 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData). :1096–1100.

Based on the feature analysis of image content, this paper proposes a novel steganalytic method for grayscale images in the spatial domain. In this work, we first investigate the directional lifting wavelet transform (DLWT) as a sparse representation in the compressive sensing (CS) domain. Then a block CS (BCS) measurement matrix is designed by using the generalized Gaussian distribution (GGD) model, in which the measurement matrix can be used to sense the DLWT coefficients of images to reflect the feature residual introduced by steganography. Extensive experiments show that the proposed CS-based scheme is feasible and universal for detecting steganography in the spatial domain.

2018-04-04
Bao, D., Yang, F., Jiang, Q., Li, S., He, X..  2017.  Block RLS algorithm for surveillance video processing based on image sparse representation. 2017 29th Chinese Control And Decision Conference (CCDC). :2195–2200.

A block recursive least squares (BRLS) algorithm for dictionary learning in compressed sensing systems is developed for surveillance video processing. The new method uses image blocks directly and iteratively to train dictionaries via the BRLS algorithm, which differs from classical methods that require transforming blocks to columns first and then supplying all training blocks at one time. Since the background in surveillance video is almost fixed, the residual of the foreground can be represented sparsely and reconstructed with background subtraction directly. The new method and framework are applied in real image and surveillance video processing. Simulation results show that the new method achieves better representation performance than classical ones in both image and surveillance video.

Parchami, M., Bashbaghi, S., Granger, E..  2017.  CNNs with cross-correlation matching for face recognition in video surveillance using a single training sample per person. 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). :1–6.

In video surveillance, face recognition (FR) systems seek to detect individuals of interest appearing over a distributed network of cameras. Still-to-video FR systems match faces captured in videos under challenging conditions against facial models, often designed using one reference still per individual. Although CNNs can achieve among the highest levels of accuracy in many real-world FR applications, state-of-the-art CNNs that are suitable for still-to-video FR, like trunk-branch ensemble (TBE) CNNs, represent complex solutions for real-time applications. In this paper, an efficient CNN architecture is proposed for accurate still-to-video FR from a single reference still. The CCM-CNN is based on new cross-correlation matching (CCM) and triplet-loss optimization methods that provide discriminant face representations. The matching pipeline exploits a matrix Hadamard product followed by a fully connected layer inspired by adaptive weighted cross-correlation. A triplet-based training approach is proposed to optimize the CCM-CNN parameters such that the inter-class variations are increased, while enhancing robustness to intra-class variations. To further improve robustness, the network is fine-tuned using synthetically-generated faces based on still and videos of non-target individuals. Experiments on videos from the COX Face and Chokepoint datasets indicate that the CCM-CNN can achieve a high level of accuracy that is comparable to TBE-CNN and HaarNet, but with a significantly lower time and memory complexity. It may therefore represent the better trade-off between accuracy and complexity for real-time video surveillance applications.
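
The triplet-based training objective can be illustrated with the standard triplet loss, which pushes an anchor embedding closer to a same-identity sample than to a different-identity sample by at least a margin. The squared Euclidean distance and the margin value are assumptions; the abstract does not give the exact formulation used for the CCM-CNN.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor and positive share an identity; negative is a different identity.
    The margin of 0.2 is an illustrative default, not a value from the paper.
    """
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )
```

The loss is zero once the negative is farther from the anchor than the positive by more than the margin, so training effort concentrates on the triplets that still violate the constraint.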

2018-02-28
Su, J. C., Wu, C., Jiang, H., Maji, S..  2017.  Reasoning About Fine-Grained Attribute Phrases Using Reference Games. 2017 IEEE International Conference on Computer Vision (ICCV). :418–427.

We present a framework for learning to describe fine-grained visual differences between instances using attribute phrases. Attribute phrases capture distinguishing aspects of an object (e.g., “propeller on the nose” or “door near the wing” for airplanes) in a compositional manner. Instances within a category can be described by a set of these phrases and collectively they span the space of semantic attributes for a category. We collect a large dataset of such phrases by asking annotators to describe several visual differences between a pair of instances within a category. We then learn to describe and ground these phrases to images in the context of a reference game between a speaker and a listener. The goal of a speaker is to describe attributes of an image that allow the listener to correctly identify it within a pair. Data collected in a pairwise manner improves the ability of the speaker to generate, and the ability of the listener to interpret, visual descriptions. Moreover, due to the compositionality of attribute phrases, the trained listeners can interpret descriptions not seen during training for image retrieval, and the speakers can generate attribute-based explanations for differences between previously unseen categories. We also show that embedding an image into the semantic space of attribute phrases derived from listeners offers a 20% improvement in accuracy over existing attribute-based representations on the FGVC-aircraft dataset.

2017-03-08
Kerl, C., Stückler, J., Cremers, D..  2015.  Dense Continuous-Time Tracking and Mapping with Rolling Shutter RGB-D Cameras. 2015 IEEE International Conference on Computer Vision (ICCV). :2264–2272.

We propose a dense continuous-time tracking and mapping method for RGB-D cameras. We parametrize the camera trajectory using continuous B-splines and optimize the trajectory through dense, direct image alignment. Our method also directly models rolling shutter in both RGB and depth images within the optimization, which improves tracking and reconstruction quality for low-cost CMOS sensors. Using a continuous trajectory representation has a number of advantages over a discrete-time representation (e.g. camera poses at the frame interval). With splines, fewer variables need to be optimized than with a discrete representation, since the trajectory can be represented with fewer control points than frames. Splines also naturally include smoothness constraints on derivatives of the trajectory estimate. Finally, the continuous trajectory representation makes it possible to compensate for rolling shutter effects, since a pose estimate is available at any exposure time of an image. Our approach demonstrates superior quality in tracking and reconstruction compared to approaches with discrete-time or global shutter assumptions.
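
The benefit of a spline trajectory, a position estimate at any time from a few control points, can be sketched with a uniform cubic B-spline. This example interpolates translation only; handling camera rotation requires additional machinery that the sketch omits, and the uniform knot spacing is an assumption.

```python
def bspline_point(control_points, t):
    """Evaluate a uniform cubic B-spline at t in [0, len(control_points) - 3].

    control_points: sequence of equal-length coordinate tuples (positions
    only; rotations are outside the scope of this sketch).
    """
    n_segments = len(control_points) - 3
    if n_segments < 1:
        raise ValueError("need at least four control points")
    if not 0.0 <= t <= n_segments:
        raise ValueError("t is outside the spline's parameter range")
    i = min(int(t), n_segments - 1)  # index of the active segment
    u = t - i                        # local parameter in [0, 1]
    basis = (
        (1 - u) ** 3 / 6,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6,
        u ** 3 / 6,
    )
    window = control_points[i:i + 4]
    dim = len(window[0])
    return tuple(sum(b * p[d] for b, p in zip(basis, window)) for d in range(dim))
```

Because each image row of a rolling shutter sensor is exposed at a slightly different time, a row's position can be queried by calling the function with that row's own timestamp instead of one timestamp per frame.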

2015-05-01
Hammoud, R.I., Sahin, C.S., Blasch, E.P., Rhodes, B.J..  2014.  Multi-source Multi-modal Activity Recognition in Aerial Video Surveillance. Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on. :237–244.

Recognizing activities in wide aerial/overhead imagery remains a challenging problem due in part to low-resolution video and cluttered scenes with a large number of moving objects. In the context of this research, we deal with two un-synchronized data sources collected in real-world operating scenarios: full-motion videos (FMV) and analyst call-outs (ACO) in the form of chat messages (voice-to-text) made by a human watching the streamed FMV from an aerial platform. We present a multi-source multi-modal activity/event recognition system for surveillance applications, consisting of: (1) detecting and tracking multiple dynamic targets from a moving platform, (2) representing FMV target tracks and chat messages as graphs of attributes, (3) associating FMV tracks and chat messages using a probabilistic graph-based matching approach, and (4) detecting spatial-temporal activity boundaries. We also present an activity pattern learning framework which uses the multi-source associated data as training to index a large archive of FMV videos. Finally, we describe a multi-intelligence user interface for querying an index of activities of interest (AOIs) by movement type and geo-location, and for playing-back a summary of associated text (ACO) and activity video segments of targets-of-interest (TOIs) (in both pixel and geo-coordinates). Such tools help the end-user to quickly search, browse, and prepare mission reports from multi-source data.