MM4ST: Multimodal Learning for Spatio-Temporal Data Mining

This GitHub repository hosts materials for the ACM MM '25 Tutorial titled "Multimodal Learning for Spatio-Temporal Data Mining (MM4ST)". All resources (slides, references) are publicly available to support attendees and researchers interested in multimodal spatio-temporal data mining.

Tutorial Overview

Spatio-temporal data mining (STDM) is increasingly critical in multimedia, fueled by multimodal data from remote sensing, IoT sensors, social media, and mobile devices. Traditional single-modal methods fail to capture real-world complexity, while multimodal integration enables richer, more accurate insights. This half-day tutorial provides a comprehensive overview of STDM fundamentals, multimodal data challenges, advanced modeling techniques, and future research directions—equipping attendees to build scalable, robust spatio-temporal mining solutions.

Tutorial Details

Tutorial Outline

The tutorial is structured into six core parts:

Part 1: Background & Examples

  • Real-world application scenarios and value of STDM:
    • Geological disaster response (rainfall data → disaster prediction → precautionary measures).
    • Cloud resource autoscaling (user load data → future load prediction → scaling decisions).
    • Navigation optimization (traffic flow data → congestion identification → path selection).
    • New materials/drugs design (molecular structure data → property prediction → candidate screening).
  • Core goal: Address "big challenges in big cities" via ST data modeling (win-win-win for environment, people, and cities).

Part 2: Foundation of ST Data

  • Definition of Spatio-Temporal (ST) Data: Integrates spatial (location), temporal (time), and event-related information to capture dynamic phenomena.
  • ST Data Intelligence Framework: From urban sensing/data acquisition → data management → analytics → application (urban planning, traffic relief, pollution reduction).
  • ST Data Taxonomy:
    • Spatially and temporally static data (e.g., POI distributions, road networks).
    • Spatially static and temporally dynamic data (e.g., weather/AQI station data, traffic flow).
    • Spatially and temporally dynamic data (e.g., human mobility trajectories, animal migration).
  • Key data types/sources: Geographical, traffic, social media, demographic, environment data, etc.
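The taxonomy above maps naturally onto array representations. As a minimal sketch (all names, shapes, and values here are illustrative toy data, not from the tutorial), spatially static / temporally dynamic data such as fixed-station AQI readings can be held as a (stations × time × features) tensor alongside static station coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: 5 fixed AQI stations observed over 24 hours,
# each reporting 3 features (e.g., PM2.5, temperature, humidity).
num_stations, num_steps, num_features = 5, 24, 3

# Spatially static part: station coordinates never change over time.
station_coords = rng.random((num_stations, 2))  # toy (lat, lon) values

# Temporally dynamic part: a (stations x time x features) tensor.
readings = rng.random((num_stations, num_steps, num_features))

# A typical STDM query: per-station daily mean of feature 0 (e.g., PM2.5).
mean_pm25 = readings[:, :, 0].mean(axis=1)
print(mean_pm25.shape)  # (5,)
```

Spatially and temporally dynamic data (e.g., trajectories) would instead use variable-length sequences of (time, lat, lon) tuples per moving object.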

Part 3: Why Multimodal ST Data Fusion

  • Limitations of single-modality methods: Inability to address real-world complexity (e.g., incomplete information, lack of causality).
  • Value of multimodal fusion: Complementary data sources boost accuracy and context (e.g., air quality inference with AQI, traffic, land use data; noise diagnosis with POIs, road networks, check-in data).
  • Research gap: Traditional multimodal learning focuses on single-domain, aligned data—fails in cross-domain ST scenarios.
  • Cross-domain data fusion definition: Integrating data from different domains (collected for different problems, originally unaligned) to extract knowledge.
  • Future trend: Multimodal ST learning shifts from solving digital-world problems to physical-world challenges (e.g., AQI inference, urban management).

Part 4: Principle of ST Multimodal Fusion

Machine Learning Era Fusion Methods

  • Stage-based data fusion.
  • Feature-level-based data fusion (feature concatenation + regularization, DNN-based).
  • Semantic meaning-based fusion (multi-view learning, similarity-based, probabilistic dependency-based, transfer learning-based).
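To make the feature-level paradigm concrete, here is a minimal sketch of feature concatenation plus regularization, using toy data and closed-form ridge regression (the modality names and shapes are assumptions for illustration, not the tutorial's datasets):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical modalities observed at the same 100 locations:
X_traffic = rng.normal(size=(100, 4))   # e.g., traffic-flow features
X_poi = rng.normal(size=(100, 6))       # e.g., POI-category counts
y = rng.normal(size=100)                # target, e.g., measured AQI

# Feature-level fusion: simple concatenation along the feature axis.
X = np.concatenate([X_traffic, X_poi], axis=1)  # shape (100, 10)

# Ridge regularization in closed form: w = (X^T X + lam*I)^{-1} X^T y.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w.shape)  # (10,)
```

The DNN-based variant replaces the linear model with a shared network over the concatenated features; the concatenation step itself is unchanged.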

Deep Learning Era Fusion Methods

  • Core fusion paradigms:
    • Feature-based fusion (feature addition/multiplication, concatenation, graph-based fusion).
    • Alignment-based fusion (cross-attention, encoder-based alignment).
    • Contrast-based fusion (multimodal contrastive learning, e.g., CLIP, UrbanCLIP).
    • Generation-based fusion (autoregression, masked modeling, diffusion-based, e.g., GeoMAN, DiffSTG).
  • Key design trade-offs and real-world application adaptations.
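As an illustration of alignment-based fusion, the following is a minimal single-head cross-attention sketch in NumPy, where one modality's tokens query another's. Learned projection matrices (W_q, W_k, W_v) and multi-head structure are omitted for brevity; token counts and dimensions are toy assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Single-head cross-attention: one modality attends over another."""
    Q, K, V = queries, keys_values, keys_values  # shared K/V for simplicity
    scores = Q @ K.T / np.sqrt(d_k)   # (n_q, n_kv) similarity matrix
    weights = softmax(scores, axis=-1)  # each query's attention distribution
    return weights @ V                 # fused features, shape (n_q, d)

rng = np.random.default_rng(0)
ts_tokens = rng.normal(size=(12, 8))   # e.g., 12 time-series patch embeddings
img_tokens = rng.normal(size=(16, 8))  # e.g., 16 satellite-image patch embeddings

fused = cross_attention(ts_tokens, img_tokens, d_k=8)
print(fused.shape)  # (12, 8)
```

Contrast-based methods like CLIP-style training instead align the two token spaces with a contrastive loss rather than attending across them at inference time.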

Part 5: Visual/Language Knowledge Transfer

Subpart 5.1: Language-enhanced Spatio-Temporal Analysis

  • Advantages of language integration: Provides context, interpretability, and robustness (addresses traditional methods’ limitations).
  • Core challenges: Data heterogeneity (numerical TS vs. discrete text), temporal alignment asynchrony, noise/irrelevant context.
  • Fusion stages:
    • Input level: Unify TS and text via prompts (e.g., Time-MQA, Time-LLM).
    • Intermediate level: Aggregate embeddings (mean/concatenation) + alignment (self-attention, cross-attention, gating, graph convolution, contrastive learning).
    • Output level: Project multimodal outputs to a unified space.
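Input-level fusion can be as simple as serializing numeric values into a prompt alongside textual context, in the spirit of prompt-based methods like Time-LLM. A minimal sketch (the function and prompt format are hypothetical, not an API from those papers):

```python
def ts_to_prompt(values, context):
    """Serialize a numeric series plus textual context into one prompt string."""
    series = ", ".join(f"{v:.1f}" for v in values)
    return (
        f"Context: {context}\n"
        f"Recent observations: {series}\n"
        "Task: predict the next value."
    )

prompt = ts_to_prompt(
    [21.5, 22.0, 23.4],
    "Hourly temperature from a sensor near a highway.",
)
print(prompt)
```

The resulting string can be fed to any language model; intermediate-level methods instead align the model's embeddings with the time-series encoder's.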

Subpart 5.2: Vision-enhanced Spatio-Temporal Analysis

  • Advantages of vision models: Continuous pixel sequences, support for multivariate TS, compact long-TS encoding, intuitive understanding.
  • TS-to-Vision Transformation Methods (8 major types, including):
    • Line Plot: Matches human perception (suitable for UTS/MTS with few variates).
    • Heatmap (UVH/MVH): Straightforward for UTS/MTS (e.g., TimEHR, TimesNet, TimeMixer++).
    • Spectrogram (STFT/Wavelet/Filterbank): Encodes time-frequency space (for UTS, e.g., audio TS analysis).
    • GAF (Gramian Angular Field): Encodes temporal correlations (e.g., anomaly detection with CNNs).
    • RP (Recurrence Plot): Captures periodicity/chaos (flexible image size via tuning m/τ).
  • Key findings: Self-supervised vision models (MAE/SimMIM) outperform supervised ones; heatmaps/GAFs excel in TSC/TSF; decoder components are critical for prediction.
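To illustrate one of these transforms, here is a minimal Gramian Angular Summation Field sketch: the series is rescaled to [-1, 1], mapped to polar angles via arccos, and pairwise angle sums yield an image encoding temporal correlations (toy input; real pipelines typically also resample to a fixed image size):

```python
import numpy as np

def gramian_angular_field(x):
    """Gramian Angular Summation Field for a univariate series."""
    x = np.asarray(x, dtype=float)
    # Rescale to [-1, 1] so arccos is well defined.
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    phi = np.arccos(np.clip(x, -1.0, 1.0))      # polar-angle encoding
    # cos(phi_i + phi_j) encodes the pairwise temporal correlation structure.
    return np.cos(phi[:, None] + phi[None, :])

series = np.sin(np.linspace(0, 4 * np.pi, 32))
img = gramian_angular_field(series)
print(img.shape)  # (32, 32)
```

The resulting symmetric image can be fed directly to a CNN or self-supervised vision backbone, as in the anomaly-detection use case above.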

Part 6: Conclusions & Future Directions

  • Core takeaways: Multimodal ST fusion unlocks physical-world problem-solving; language/vision modalities complement traditional TS models.
  • Future research directions:
    • Enhance vision encoders for TSF (e.g., distillation) and mitigate period-based imaging bias.
    • Optimize TS-to-vision mapping to resolve information density misalignment.
    • Reduce computational costs via compression/efficient attention.
    • Explore multimodal TS analysis with VLM Agents.
    • Advance cross-domain knowledge transfer in ST multimodal learning.

Organizers

Name Affiliation Email
Siru Zhong The Hong Kong University of Science and Technology (Guangzhou) szhong691@connect.hkust-gz.edu.cn
Xixuan Hao The Hong Kong University of Science and Technology (Guangzhou) xhao390@connect.hkust-gz.edu.cn
Hao Miao Aalborg University haom@cs.aau.dk
Yan Zhao University of Electronic Science and Technology of China zhaoyan@uestc.edu.cn
Qingsong Wen Squirrel Ai Learning qingsongedu@gmail.com
Roger Zimmermann National University of Singapore rogerz@comp.nus.edu.sg
Yuxuan Liang (Corresponding) The Hong Kong University of Science and Technology (Guangzhou) yuxliang@outlook.com

Related Resources

Previous Tutorials by Organizers

  • WWW’25: Web-Centric Human Mobility Analytics (Sydney, Australia)
  • KDD’24: Foundation Models for Time Series (Barcelona, Spain)
  • ICDM’23: Time Series Analysis (Shanghai, China)
  • KDD’22: Robust Time Series Analysis and Applications (Washington DC, USA)

Surveys by Organizers

  • Service Route and Time Prediction in Instant Delivery (TKDE’24)
  • Time Series Transformers (IJCAI’23, Paper Digest Most Influential Paper)
  • Time Series Data Augmentation (IJCAI’21, Paper Digest Most Influential Paper)
  • Self-supervised Learning and LLMs for Time Series (TPAMI’24, IJCAI’24)
  • Cross-Domain Data Fusion in Urban Computing (Inf. Fusion’24)

Key Papers by Organizers

  • Multimodal ST Learning: UrbanVLP (AAAI’25), SatCLE (WWW’25), UrbanCross (MM’24)
  • Applications: Airadar (AAAI’25), DynST (KDD’25), PatchSTG (KDD’25)
  • Time Series Analysis: FEDformer (ICML’22), AirFormer (AAAI’23), Time-LLM (ICLR’24)

License

All materials are provided under the ACM Copyright License. Personal or classroom use is permitted without fee, provided copies are not distributed for profit or commercial advantage and include the full citation.
