I'm a Computer Science PhD student at USC's VIMAL, interested in how we understand observations from multiple modalities (e.g. images, audio signals, and written texts), and how we extract and build representations of the semantic content that is invariant across those multimodal observations.
Before starting my PhD, I studied Mathematics and EECS (Electrical Engineering and Computer Science) at MIT for my Bachelor's and Master's. Along the way, I interned at Keecker, a French robotics startup, and at academic research labs in MIT's CSAIL, Media Lab, and McGovern Institute, and at INRIA. After my Master's, I worked at Apple as a co-op for 9 months.
My research interest lies at the intersection of representation learning and information theory, inspired by the way our perceptual system integrates multimodal sensory inputs by identifying invariant semantics. I am interested in understanding how semantic information flows while we process observations from multiple modalities, using tools from deep learning and thermodynamic approaches to information flow.
My current guiding question is:
How do we extract shared semantics from observations expressed in vastly different representational forms (e.g. images, sounds, written texts), and how do we create/actualize various forms of observations, starting from the semantics we want to communicate?
I approach this question from an information-processing point of view and am developing generative models with disentangled representations to jointly learn the analysis and synthesis processes of multimodal data. My most recent work introduces a generative model with adversarial training that learns spatial semantics from map tiles collected from diverse sources, such as satellites, Google Street Map, and custom rendering engines.
Currently, I am exploring different ways to understand our proposed model, in particular, by measuring semantic information and studying the flow of information between the latent partitions.
It's exciting to see how the ideas and tools in thermodynamics can help quantify and visualize this flow of semantic information in our model :)
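To make the measurement side concrete, here is a minimal sketch of one common way to estimate a lower bound on the mutual information \(I(z_a; z_b)\) between two latent partitions, using a MINE-style critic (Belghazi et al., 2018). The toy data, dimensions, and training settings here are illustrative assumptions, not our model:

```python
# Illustrative sketch (toy data, not our actual model): estimate a lower
# bound on the mutual information between two latent partitions using the
# Donsker-Varadhan representation, as in MINE (Belghazi et al., 2018).
import math
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Scores pairs of latent partitions; high for jointly sampled pairs."""
    def __init__(self, dim_a, dim_b, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_a, z_b):
        return self.net(torch.cat([z_a, z_b], dim=-1)).squeeze(-1)

def mi_lower_bound(critic, z_a, z_b):
    # Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
    # Marginal samples come from shuffling one partition across the batch.
    t_joint = critic(z_a, z_b)
    t_marg = critic(z_a, z_b[torch.randperm(z_b.size(0))])
    return t_joint.mean() - (torch.logsumexp(t_marg, dim=0) - math.log(t_marg.size(0)))

# Toy check: two partially correlated 8-d partitions should give a
# clearly positive estimate.
torch.manual_seed(0)
critic = Critic(8, 8)
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
for _ in range(1000):
    z_a = torch.randn(256, 8)
    z_b = z_a + 0.5 * torch.randn(256, 8)   # shares information with z_a
    loss = -mi_lower_bound(critic, z_a, z_b)  # maximize the bound
    opt.zero_grad(); loss.backward(); opt.step()
print(f"MI lower bound ~ {-loss.item():.2f} nats")
```

Tracking a bound like this between latent partitions as training progresses is one candidate proxy for the "flow" of semantic information mentioned above.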
My journey started from noticing our own ability to (i) break down a complex observation into multiple chunks of smaller, abstract concepts and (ii) create a new idea by playing with and recombining those conceptual building blocks. For instance, we can catch a glimpse of this dance between abstraction and synthesis in a video of Picasso drawing live.
More specifically, I'm intrigued by how seamlessly we extract common semantic content from observations in vastly different representational forms (such as languages, images, gestures or sounds, and infinitely many forms within each modality), and, conversely, how a piece of semantic content can be expressed in various forms without losing its (overall) meaning. (Hmm... coarse-graining?)
My exploration starts with the hypothesis that a phenomenon in reality, from which our observations stem, contains semantic potential ("potential" as in potential energy in Physics, or, going further up the stream, as in Aristotle's "Potentiality and actuality". This idea influenced Leibniz to develop the science of "dynamics", and learning about that influence sheds light on what Leibniz was struggling to articulate with ideas like 'power' and 'action'. Contemplate: Aristotle's "potentiality:actuality" vs. Leibniz's "power:action".)
I wonder:
For instance, consider the following observations: \(X^A\) is an image of a dog barking at the door, \(X^B\) is a recording of a dog barking, and \(X^C\) is a sentence describing the scene in English. The semantic content shared among the observations is "there is a dog barking", and each observation is the result of expressing (synonyms: rendering, stylizing) that semantic content in a form proper to its modality (i.e. image, sound, and written English, respectively).
My question at the representational level is: how do we separate the underlying, shared semantic content from the information about domain-specific variations?
Now let's flip the question and consider the process of synthesis. I start with a concept that I'd like to express and communicate; for example, I want to actualize the idea of "a dog barking at the door". If I ask you to express this content as an image, as sounds, and as an English sentence, what would be the process of such domain-specific actualization of semantic information?
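One way to sketch this analysis/synthesis split formally (my own shorthand here, not a settled formulation): give each observation \(X^m\) from modality \(m\) a latent code factored into a shared semantic part \(z_c\) and a modality-specific part \(z^m_s\),

\[ p_\theta\left(X^m\right) = \int p_\theta\left(X^m \mid z_c, z^m_s\right)\, p\left(z_c\right)\, p\left(z^m_s\right)\, dz_c\, dz^m_s. \]

Analysis is then inference of the shared code, \(q_\phi\left(z_c, z^m_s \mid X^m\right)\), and synthesis is holding \(z_c\) fixed while drawing a fresh modality-specific code for the target modality: \(X^{m'} \sim p_\theta\left(\cdot \mid z_c, z^{m'}_s\right)\).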
The breakdown of the main components of my questions looks as follows:
- Geometry of a modality space: imposes the geometric constraints that an instance must satisfy to be a valid observation of that modality
In pursuit of this computational model of understanding and generating multimodal data, I am developing generative models with disentangled representations to jointly learn the analysis and synthesis processes of complex, high-dimensional data (e.g. satellite images, knowledge bases) with compact and "meaningful" representations. I'm working with Prof. Wael Abd-Almageed at ISI's VIMAL, focusing on various types of generative models for this goal. My project with Prof. Yao-Yi Chiang and Prof. Craig Knoblock tackles this line of questions using geospatial data, and aims to learn spatial semantics from data that are collected from diverse sources (e.g. satellites, Google Street Map, historical maps) and stored in diverse formats (e.g. images, graphs). This work has potential applications such as global-scale urban environment analysis, automated map synthesis, and systems for monitoring environmental changes.
Within the domain of representation learning, I’m most interested in variational inference methods, especially recent developments in deep generative models such as variational autoencoders (VAEs) and the idea of adversarial training.
Using a VAE-variant and adversarial training, I'm investigating how we can build a model that extracts what is invariant across a dataset of heterogeneous representations. One of my current projects pursues this question in the domain of spatial informatics, using our new dataset of map tiles from diverse sources.
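As a rough illustration of the model family (a sketch under stated assumptions, not the actual project code: tiles flattened to 784-d vectors, three hypothetical sources, arbitrary layer sizes), here is a VAE whose latent splits into a shared part and a source-specific part, with a discriminator adversarially scrubbing source identity out of the shared part:

```python
# Minimal sketch of a "split-latent" VAE with adversarial invariance.
# All dimensions, the 3-source setup, and the flattened-tile input are
# illustrative assumptions, not the actual project configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_IN, D_SHARED, D_STYLE, N_SOURCES = 784, 16, 8, 3

class SplitVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(D_IN, 256), nn.ReLU())
        d_z = D_SHARED + D_STYLE
        self.mu = nn.Linear(256, d_z)
        self.logvar = nn.Linear(256, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(),
                                 nn.Linear(256, D_IN))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar, z[:, :D_SHARED]  # shared partition

vae = SplitVAE()
disc = nn.Sequential(nn.Linear(D_SHARED, 64), nn.ReLU(),
                     nn.Linear(64, N_SOURCES))  # predicts the tile's source
opt_vae = torch.optim.Adam(vae.parameters(), lr=1e-3)
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-3)

def train_step(x, source_id, adv_weight=1.0):
    # 1) Discriminator: learn to identify the source from z_shared.
    _, _, _, z_shared = vae(x)
    d_loss = F.cross_entropy(disc(z_shared.detach()), source_id)
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) VAE: reconstruction + KL, while *confusing* the discriminator
    #    (drive its predictions toward uniform over sources).
    recon, mu, logvar, z_shared = vae(x)
    rec = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    log_probs = F.log_softmax(disc(z_shared), dim=-1)
    adv = -log_probs.mean()  # cross-entropy against a uniform target
    g_loss = rec + kl + adv_weight * adv
    opt_vae.zero_grad(); g_loss.backward(); opt_vae.step()
    return d_loss.item(), g_loss.item()

# Toy usage with random "tiles" and source labels:
x = torch.rand(64, D_IN)
src = torch.randint(0, N_SOURCES, (64,))
print(train_step(x, src))
```

One design choice worth noting: rather than reversing the discriminator's gradient, the encoder here minimizes cross-entropy against a uniform target over sources, a common "confusion" formulation of the same adversarial idea.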
More about next steps...
My interests center around the understanding of complex data and the processes through which such understanding happens. Here are some snapshots along that journey:
- Interactive visualizations that represent high-dimensional data accurately and efficiently
- Learning dynamics of neural networks
See more details here.
More importantly, I'm practicing to:
- observe without being entangled in what is personal
- look at small thoughts carefully
- not rush
- spend most of my time on what matters most
- be gentle
- be slow
- be curious
- question
- relax in discomforts
- greet what is as what is
- nothing more, nothing less
- stay open
‘Your act was unwise,’ I exclaimed, ‘as you see by the outcome.’
He solemnly eyed me.
‘When choosing the course of my action,’ said he, ‘I had not the outcome to guide me.’
- Ambrose Bierce

Intention and attention.