This GitHub repository hosts materials for the ACM MM '25 Tutorial titled "Multimodal Learning for Spatio-Temporal Data Mining (MM4ST)". All resources (slides, references) are publicly available to support attendees and researchers interested in multimodal spatio-temporal data mining.
Spatio-temporal data mining (STDM) is increasingly critical in multimedia, fueled by multimodal data from remote sensing, IoT sensors, social media, and mobile devices. Traditional single-modal methods fail to capture real-world complexity, while multimodal integration enables richer, more accurate insights. This half-day tutorial provides a comprehensive overview of STDM fundamentals, multimodal data challenges, advanced modeling techniques, and future research directions—equipping attendees to build scalable, robust spatio-temporal mining solutions.
- Conference: 33rd ACM International Conference on Multimedia (MM '25)
- Date: October 27–31, 2025
- Location: Dublin, Ireland
- DOI: https://doi.org/10.1145/3746027.3759204
- Website: https://mm4st.netlify.app/
The tutorial is structured into four core parts: STDM fundamentals, multimodal spatio-temporal data and its challenges, advanced modeling techniques, and future research directions. The outline below summarizes the key topics.
- Real-world application scenarios and value of STDM:
  - Geological disaster response (rainfall data → disaster prediction → precautionary measures).
  - Cloud resource autoscaling (user load data → future load prediction → scaling decisions).
  - Navigation optimization (traffic flow data → congestion identification → path selection).
  - New materials/drug design (molecular structure data → property prediction → candidate screening).
- Core goal: Address "big challenges in big cities" via ST data modeling (a win-win-win for the environment, people, and cities).
- Definition of Spatio-Temporal (ST) Data: Integrates spatial (location), temporal (time), and event-related information to capture dynamic phenomena.
- ST Data Intelligence Framework: From urban sensing/data acquisition → data management → analytics → application (urban planning, traffic relief, pollution reduction).
- ST Data Taxonomy (illustrated in the sketch after this list):
  - Spatially and temporally static data (e.g., POI distributions, road networks).
  - Spatially static and temporally dynamic data (e.g., weather/AQI station data, traffic flow).
  - Spatially and temporally dynamic data (e.g., human mobility trajectories, animal migration).
- Key data types/sources: geographical, traffic, social media, demographic, and environmental data.
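
As a concrete, hypothetical illustration of this taxonomy, the sketch below models the three classes as simple Python containers. All class and field names are illustrative, not from the tutorial:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StaticSTData:
    """Spatially and temporally static, e.g., POI distributions or road networks."""
    locations: np.ndarray   # (N, 2) lon/lat coordinates, fixed over time
    attributes: np.ndarray  # (N, F) static features per location

@dataclass
class SpatialStaticTemporalDynamic:
    """Fixed sensors with evolving readings, e.g., AQI stations or traffic detectors."""
    locations: np.ndarray   # (N, 2) fixed sensor positions
    readings: np.ndarray    # (T, N, F) time-varying measurements

@dataclass
class FullyDynamicSTData:
    """Both location and signal change over time, e.g., mobility trajectories."""
    trajectories: np.ndarray  # (T, N, 2) per-entity positions over time
    features: np.ndarray      # (T, N, F) per-entity features over time
```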
- Limitations of single-modality methods: Inability to address real-world complexity (e.g., incomplete information, lack of causality).
- Value of multimodal fusion: Complementary data sources boost accuracy and context (e.g., air quality inference with AQI, traffic, land use data; noise diagnosis with POIs, road networks, check-in data).
- Research gap: Traditional multimodal learning focuses on single-domain, aligned data and therefore fails in cross-domain ST scenarios.
- Cross-domain data fusion definition: Integrating data from different domains (collected for different problems, originally unaligned) to extract knowledge.
- Future trend: Multimodal ST learning shifts from solving digital-world problems to physical-world challenges (e.g., AQI inference, urban management).
- Classic cross-domain data fusion methods:
  - Stage-based data fusion.
  - Feature-level-based data fusion (feature concatenation + regularization, DNN-based; see the sketch after this list).
  - Semantic meaning-based fusion (multi-view learning, similarity-based, probabilistic dependency-based, transfer learning-based).
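
Below is a minimal sketch of the DNN-based feature-level variant, assuming two pre-extracted per-region feature vectors are concatenated and passed through a regularized MLP. Class names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Concatenate per-modality features, then learn a joint representation.

    A toy stand-in for DNN-based feature-level fusion; the architectures
    discussed in the tutorial may differ.
    """
    def __init__(self, dim_a: int, dim_b: int, hidden: int, out: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),  # regularization on the fused features
            nn.Linear(hidden, out),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([feat_a, feat_b], dim=-1))

# e.g., 64-d traffic features + 32-d land-use features per region
model = FeatureLevelFusion(dim_a=64, dim_b=32, hidden=128, out=1)
pred = model(torch.randn(8, 64), torch.randn(8, 32))  # (8, 1)
```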
- Core fusion paradigms:
  - Feature-based fusion (feature addition/multiplication, concatenation, graph-based fusion).
  - Alignment-based fusion (cross-attention, encoder-based alignment).
  - Contrast-based fusion (multimodal contrastive learning, e.g., CLIP, UrbanCLIP; see the sketch after this list).
  - Generation-based fusion (autoregression, masked modeling, diffusion-based, e.g., GeoMAN, DiffSTG).
- Key design trade-offs among these paradigms and their adaptation to real-world applications.
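
To make the contrast-based paradigm concrete, here is a minimal CLIP-style symmetric InfoNCE sketch over paired embeddings from two modalities. This is a generic illustration, not the CLIP or UrbanCLIP implementation:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (a_i, b_i) pairs are positives,
    all other pairings in the batch serve as negatives."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(a.size(0))  # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# e.g., satellite-image embeddings vs. region-description embeddings
loss = clip_style_loss(torch.randn(16, 256), torch.randn(16, 256))
```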
- Advantages of language integration: Provides context, interpretability, and robustness (addresses traditional methods’ limitations).
- Core challenges: Data heterogeneity (numerical TS vs. discrete text), temporal misalignment/asynchrony, and noisy or irrelevant context.
- Fusion stages:
  - Input level: Unify TS and text via prompts (e.g., Time-MQA, Time-LLM).
  - Intermediate level: Aggregate embeddings (mean/concatenation) plus alignment (self-attention, cross-attention, gating, graph convolution, contrastive learning; see the sketch after this list).
  - Output level: Project multimodal outputs into a unified space.
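
A minimal sketch of intermediate-level fusion via cross-attention, in which time-series tokens (queries) attend to text tokens (keys/values). Dimensions and names are illustrative; this is not the Time-LLM implementation:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """TS embeddings attend to text embeddings, then a residual connection
    preserves the original TS signal."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, ts_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # ts_tokens: (B, L_ts, d), text_tokens: (B, L_text, d)
        fused, _ = self.attn(query=ts_tokens, key=text_tokens, value=text_tokens)
        return self.norm(ts_tokens + fused)

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 96, 128), torch.randn(2, 32, 128))  # (2, 96, 128)
```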
- Advantages of vision models: Continuous pixel representations, support for multivariate TS, compact encoding of long series, and intuitive human understanding.
- TS-to-Vision Transformation Methods (the tutorial covers eight major types; highlights below, with a sketch after this list):
  - Line Plot: Matches human perception (suitable for UTS/MTS with few variates).
  - Heatmap (UVH/MVH): Straightforward encoding for UTS/MTS (e.g., TimEHR, TimesNet, TimeMixer++).
  - Spectrogram (STFT/Wavelet/Filterbank): Encodes the time-frequency plane (for UTS, e.g., audio TS analysis).
  - GAF (Gramian Angular Field): Encodes temporal correlations (e.g., anomaly detection with CNNs).
  - RP (Recurrence Plot): Captures periodicity/chaos (flexible image size via tuning m/τ).
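
For concreteness, the sketch below implements two of these transforms in NumPy: a Gramian Angular Summation Field and a binary recurrence plot. The delay-embedding parameters m and τ follow the standard convention; the threshold eps and all other values are illustrative:

```python
import numpy as np

def gasf(x: np.ndarray) -> np.ndarray:
    """Gramian Angular Summation Field of a univariate series."""
    # Rescale to [-1, 1] so arccos is defined.
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1
    phi = np.arccos(np.clip(x, -1, 1))          # polar-angle encoding
    return np.cos(phi[:, None] + phi[None, :])  # (T, T) image

def recurrence_plot(x: np.ndarray, m: int = 3, tau: int = 1,
                    eps: float = 0.1) -> np.ndarray:
    """Binary recurrence plot from a delay embedding with parameters (m, tau).

    Tuning m and tau changes the embedding length, hence the image size.
    """
    n = len(x) - (m - 1) * tau
    emb = np.stack([x[i * tau : i * tau + n] for i in range(m)], axis=1)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    return (dists <= eps).astype(np.uint8)      # (n, n) image

t = np.linspace(0, 4 * np.pi, 96)
series = np.sin(t)
img_gaf, img_rp = gasf(series), recurrence_plot(series)
```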
- Key findings: Self-supervised vision models (MAE/SimMIM) outperform supervised ones; heatmaps/GAFs excel in time-series classification (TSC) and forecasting (TSF); decoder components are critical for prediction.
- Core takeaways: Multimodal ST fusion unlocks physical-world problem-solving; language/vision modalities complement traditional TS models.
- Future research directions:
  - Enhance vision encoders for TSF (e.g., via distillation) and mitigate period-based imaging bias.
  - Optimize TS-to-vision mapping to resolve information-density misalignment.
  - Reduce computational costs via compression and efficient attention.
  - Explore multimodal TS analysis with VLM agents.
  - Advance cross-domain knowledge transfer in ST multimodal learning.
| Name | Affiliation | Email |
|---|---|---|
| Siru Zhong | The Hong Kong University of Science and Technology (Guangzhou) | szhong691@connect.hkust-gz.edu.cn |
| Xixuan Hao | The Hong Kong University of Science and Technology (Guangzhou) | xhao390@connect.hkust-gz.edu.cn |
| Hao Miao | Aalborg University | haom@cs.aau.dk |
| Yan Zhao | University of Electronic Science and Technology of China | zhaoyan@uestc.edu.cn |
| Qingsong Wen | Squirrel Ai Learning | qingsongedu@gmail.com |
| Roger Zimmermann | National University of Singapore | rogerz@comp.nus.edu.sg |
| Yuxuan Liang (Corresponding) | The Hong Kong University of Science and Technology (Guangzhou) | yuxliang@outlook.com |
- WWW’25: Web-Centric Human Mobility Analytics (Sydney, Australia)
- KDD’24: Foundation Models for Time Series (Barcelona, Spain)
- ICDM’23: Time Series Analysis (Shanghai, China)
- KDD’22: Robust Time Series Analysis and Applications (Washington DC, USA)
- Service Route and Time Prediction in Instant Delivery (TKDE’24)
- Time Series Transformers (IJCAI’23, Paper Digest Most Influential Paper)
- Time Series Data Augmentation (IJCAI’21, Paper Digest Most Influential Paper)
- Self-supervised Learning and LLMs for Time Series (TPAMI’24, IJCAI’24)
- Cross-Domain Data Fusion in Urban Computing (Inf. Fusion’24)
- Multimodal ST Learning: UrbanVLP (AAAI’25), SatCLE (WWW’25), UrbanCross (MM’24)
- Applications: AirRadar (AAAI’25), DynST (KDD’25), PatchSTG (KDD’25)
- Time Series Analysis: FEDformer (ICML’22), AirFormer (AAAI’23), Time-LLM (ICLR’24)
All materials are provided under the ACM Copyright License. Personal or classroom use is permitted without fee, provided copies are not distributed for profit or commercial advantage and include the full citation.
