This GitHub repository hosts materials for the ACM MM '25 Tutorial titled "Multimodal Learning for Spatio-Temporal Data Mining (MM4ST)". All resources (slides, references) are publicly available to support attendees and researchers interested in multimodal spatio-temporal data mining.
Spatio-temporal data mining (STDM) is increasingly critical in multimedia, fueled by multimodal data from remote sensing, IoT sensors, social media, and mobile devices. Traditional single-modal methods fail to capture real-world complexity, while multimodal integration enables richer, more accurate insights. This half-day tutorial provides a comprehensive overview of STDM fundamentals, multimodal data challenges, advanced modeling techniques, and future research directions—equipping attendees to build scalable, robust spatio-temporal mining solutions.
- Conference: 33rd ACM International Conference on Multimedia (MM '25)
- Date: October 27–31, 2025
- Location: Dublin, Ireland
- DOI: https://doi.org/10.1145/3746027.3759204
- Website: https://mm4st.netlify.app/
The tutorial is structured into four core parts: STDM fundamentals, multimodal spatio-temporal data and its challenges, advanced modeling techniques, and future research directions. The outline below summarizes the key topics.
- Real-world application scenarios and value of STDM:
  - Geological disaster response (rainfall data → disaster prediction → precautionary measures).
  - Cloud resource autoscaling (user load data → future load prediction → scaling decisions).
  - Navigation optimization (traffic flow data → congestion identification → path selection).
  - New materials/drug design (molecular structure data → property prediction → candidate screening).
- Core goal: Address "big challenges in big cities" via ST data modeling (a win-win-win for the environment, people, and cities).
- Definition of Spatio-Temporal (ST) Data: Integrates spatial (location), temporal (time), and event-related information to capture dynamic phenomena.
- ST Data Intelligence Framework: From urban sensing/data acquisition → data management → analytics → application (urban planning, traffic relief, pollution reduction).
- ST Data Taxonomy (illustrated in the sketch after this list):
  - Spatially and temporally static data (e.g., POI distributions, road networks).
  - Spatially static and temporally dynamic data (e.g., weather/AQI station data, traffic flow).
  - Spatially and temporally dynamic data (e.g., human mobility trajectories, animal migration).
- Key data types/sources: geographical, traffic, social media, demographic, and environmental data.
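
As a concrete, hypothetical illustration of this taxonomy, the sketch below models the three classes as simple Python containers. All class and field names are illustrative, not from the tutorial:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StaticSTData:
    """Spatially and temporally static, e.g., POI distributions or road networks."""
    locations: np.ndarray   # (N, 2) lon/lat coordinates, fixed over time
    attributes: np.ndarray  # (N, F) static features per location

@dataclass
class SpatialStaticTemporalDynamic:
    """Fixed sensors with evolving readings, e.g., AQI stations or traffic detectors."""
    locations: np.ndarray   # (N, 2) fixed sensor positions
    readings: np.ndarray    # (T, N, F) time-varying measurements

@dataclass
class FullyDynamicSTData:
    """Both location and signal change over time, e.g., mobility trajectories."""
    trajectories: np.ndarray  # (T, N, 2) per-entity positions over time
    features: np.ndarray      # (T, N, F) per-entity features over time
```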
- Limitations of single-modality methods: Inability to address real-world complexity (e.g., incomplete information, lack of causality).
- Value of multimodal fusion: Complementary data sources boost accuracy and context (e.g., air quality inference with AQI, traffic, land use data; noise diagnosis with POIs, road networks, check-in data).
- Research gap: Traditional multimodal learning focuses on single-domain, aligned data and therefore fails in cross-domain ST scenarios.
- Cross-domain data fusion definition: Integrating data from different domains (collected for different problems, originally unaligned) to extract knowledge.
- Future trend: Multimodal ST learning shifts from solving digital-world problems to physical-world challenges (e.g., AQI inference, urban management).
- Classic cross-domain data fusion methods:
  - Stage-based data fusion.
  - Feature-level-based data fusion (feature concatenation + regularization, DNN-based; see the sketch after this list).
  - Semantic meaning-based fusion (multi-view learning, similarity-based, probabilistic dependency-based, transfer learning-based).
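
Below is a minimal sketch of the DNN-based feature-level variant, assuming two pre-extracted per-region feature vectors are concatenated and passed through a regularized MLP. Class names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Concatenate per-modality features, then learn a joint representation.

    A toy stand-in for DNN-based feature-level fusion; the architectures
    discussed in the tutorial may differ.
    """
    def __init__(self, dim_a: int, dim_b: int, hidden: int, out: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),  # regularization on the fused features
            nn.Linear(hidden, out),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([feat_a, feat_b], dim=-1))

# e.g., 64-d traffic features + 32-d land-use features per region
model = FeatureLevelFusion(dim_a=64, dim_b=32, hidden=128, out=1)
pred = model(torch.randn(8, 64), torch.randn(8, 32))  # (8, 1)
```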
- Core fusion paradigms:
  - Feature-based fusion (feature addition/multiplication, concatenation, graph-based fusion).
  - Alignment-based fusion (cross-attention, encoder-based alignment).
  - Contrast-based fusion (multimodal contrastive learning, e.g., CLIP, UrbanCLIP; see the sketch after this list).
  - Generation-based fusion (autoregression, masked modeling, diffusion-based, e.g., GeoMAN, DiffSTG).
- Key design trade-offs among these paradigms and their adaptation to real-world applications.
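
To make the contrast-based paradigm concrete, here is a minimal CLIP-style symmetric InfoNCE sketch over paired embeddings from two modalities. This is a generic illustration, not the CLIP or UrbanCLIP implementation:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (a_i, b_i) pairs are positives,
    all other pairings in the batch serve as negatives."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(a.size(0))  # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# e.g., satellite-image embeddings vs. region-description embeddings
loss = clip_style_loss(torch.randn(16, 256), torch.randn(16, 256))
```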
- Advantages of language integration: Provides context, interpretability, and robustness (addresses traditional methods’ limitations).
- Core challenges: Data heterogeneity (numerical TS vs. discrete text), temporal misalignment/asynchrony, and noisy or irrelevant context.
- Fusion stages:
  - Input level: Unify TS and text via prompts (e.g., Time-MQA, Time-LLM).
  - Intermediate level: Aggregate embeddings (mean/concatenation) plus alignment (self-attention, cross-attention, gating, graph convolution, contrastive learning; see the sketch after this list).
  - Output level: Project multimodal outputs into a unified space.
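
A minimal sketch of intermediate-level fusion via cross-attention, in which time-series tokens (queries) attend to text tokens (keys/values). Dimensions and names are illustrative; this is not the Time-LLM implementation:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """TS embeddings attend to text embeddings, then a residual connection
    preserves the original TS signal."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, ts_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # ts_tokens: (B, L_ts, d), text_tokens: (B, L_text, d)
        fused, _ = self.attn(query=ts_tokens, key=text_tokens, value=text_tokens)
        return self.norm(ts_tokens + fused)

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 96, 128), torch.randn(2, 32, 128))  # (2, 96, 128)
```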
- Advantages of vision models: Continuous pixel representations, support for multivariate TS, compact encoding of long series, and intuitive human understanding.
- TS-to-Vision Transformation Methods (the tutorial covers eight major types; highlights below, with a sketch after this list):
  - Line Plot: Matches human perception (suitable for UTS/MTS with few variates).
  - Heatmap (UVH/MVH): Straightforward encoding for UTS/MTS (e.g., TimEHR, TimesNet, TimeMixer++).
  - Spectrogram (STFT/Wavelet/Filterbank): Encodes the time-frequency plane (for UTS, e.g., audio TS analysis).
  - GAF (Gramian Angular Field): Encodes temporal correlations (e.g., anomaly detection with CNNs).
  - RP (Recurrence Plot): Captures periodicity/chaos (flexible image size via tuning m/τ).
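
For concreteness, the sketch below implements two of these transforms in NumPy: a Gramian Angular Summation Field and a binary recurrence plot. The delay-embedding parameters m and τ follow the standard convention; the threshold eps and all other values are illustrative:

```python
import numpy as np

def gasf(x: np.ndarray) -> np.ndarray:
    """Gramian Angular Summation Field of a univariate series."""
    # Rescale to [-1, 1] so arccos is defined.
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1
    phi = np.arccos(np.clip(x, -1, 1))          # polar-angle encoding
    return np.cos(phi[:, None] + phi[None, :])  # (T, T) image

def recurrence_plot(x: np.ndarray, m: int = 3, tau: int = 1,
                    eps: float = 0.1) -> np.ndarray:
    """Binary recurrence plot from a delay embedding with parameters (m, tau).

    Tuning m and tau changes the embedding length, hence the image size.
    """
    n = len(x) - (m - 1) * tau
    emb = np.stack([x[i * tau : i * tau + n] for i in range(m)], axis=1)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    return (dists <= eps).astype(np.uint8)      # (n, n) image

t = np.linspace(0, 4 * np.pi, 96)
series = np.sin(t)
img_gaf, img_rp = gasf(series), recurrence_plot(series)
```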
- Key findings: Self-supervised vision models (MAE/SimMIM) outperform supervised ones; heatmaps/GAFs excel in time-series classification (TSC) and forecasting (TSF); decoder components are critical for prediction.
- Core takeaways: Multimodal ST fusion unlocks physical-world problem-solving; language/vision modalities complement traditional TS models.
- Future research directions:
  - Enhance vision encoders for TSF (e.g., via distillation) and mitigate period-based imaging bias.
  - Optimize TS-to-vision mapping to resolve information-density misalignment.
  - Reduce computational costs via compression and efficient attention.
  - Explore multimodal TS analysis with VLM agents.
  - Advance cross-domain knowledge transfer in ST multimodal learning.
| Name | Affiliation | Email |
|---|---|---|
| Siru Zhong | The Hong Kong University of Science and Technology (Guangzhou) | szhong691@connect.hkust-gz.edu.cn |
| Xixuan Hao | The Hong Kong University of Science and Technology (Guangzhou) | xhao390@connect.hkust-gz.edu.cn |
| Hao Miao | Aalborg University | haom@cs.aau.dk |
| Yan Zhao | University of Electronic Science and Technology of China | zhaoyan@uestc.edu.cn |
| Qingsong Wen | Squirrel Ai Learning | qingsongedu@gmail.com |
| Roger Zimmermann | National University of Singapore | rogerz@comp.nus.edu.sg |
| Yuxuan Liang (Corresponding) | The Hong Kong University of Science and Technology (Guangzhou) | yuxliang@outlook.com |
- WWW’25: Web-Centric Human Mobility Analytics (Sydney, Australia)
- KDD’24: Foundation Models for Time Series (Barcelona, Spain)
- ICDM’23: Time Series Analysis (Shanghai, China)
- KDD’22: Robust Time Series Analysis and Applications (Washington DC, USA)
- Service Route and Time Prediction in Instant Delivery (TKDE’24)
- Time Series Transformers (IJCAI’23, Paper Digest Most Influential Paper)
- Time Series Data Augmentation (IJCAI’21, Paper Digest Most Influential Paper)
- Self-supervised Learning and LLMs for Time Series (TPAMI’24, IJCAI’24)
- Cross-Domain Data Fusion in Urban Computing (Inf. Fusion’24)
- Multimodal ST Learning: UrbanVLP (AAAI’25), SatCLE (WWW’25), UrbanCross (MM’24)
- Applications: AirRadar (AAAI’25), DynST (KDD’25), PatchSTG (KDD’25)
- Time Series Analysis: FEDformer (ICML’22), AirFormer (AAAI’23), Time-LLM (ICLR’24)
All materials are provided under the ACM Copyright License. Personal or classroom use is permitted without fee, provided copies are not distributed for profit or commercial advantage and include the full citation.
