diff --git a/docs/assemblies_tutorial.rst b/docs/assemblies_tutorial.rst new file mode 100644 index 0000000..4e856cd --- /dev/null +++ b/docs/assemblies_tutorial.rst @@ -0,0 +1,231 @@ +Understanding Assemblies in LITcoder +===================================== + +An **Assembly** is the core data structure in LITcoder that organizes and manages brain imaging data, stimuli, and metadata for encoding model training. It's the foundation that everything else builds upon. + +What is an Assembly? +-------------------- + +An assembly is a structured container that holds all the data needed to train encoding models: + +- **Brain Data**: Recordings aligned with stimuli +- **Stimuli**: Text or audio stimuli presented during the experiment +- **Timing Information**: Precise timing of when each stimulus was presented +- **Split Indices**: Maps each word/stimulus to its corresponding TR (repetition time) +- **Metadata**: Story names, subject information, and experimental parameters + +Think of an assembly as a well-organized database that contains everything needed to train a brain encoding model. + +Assembly Structure +------------------ + +An assembly contains several key components: + +**Stories**: List of story/run names + Each story represents a continuous experimental session (e.g., listening to a story) + +**Story Data**: Dictionary mapping story names to their data + Contains brain data, stimuli, timing, and metadata for each story + +**Timing Information**: + - `tr_times`: When each TR (repetition time) occurred + - `data_times`: Precise timing for each data point (word-level) + - `split_indices`: Maps each word to its corresponding TR + +**Brain Data**: + - Preprocessed fMRI data aligned with stimuli + - Shape: (n_timepoints, n_voxels/vertices) + +Working with Assemblies +----------------------- + +Let's explore how to work with assemblies using the LeBel assembly: + +.. 
code-block:: python + + from encoding.assembly.assembly_loader import load_assembly + + # Load the pre-packaged LeBel assembly + assembly = load_assembly("assembly_lebel_uts03.pkl") + + # Basic information + print(f"Assembly shape: {assembly.shape}") + print(f"Stories: {assembly.stories}") + print(f"Validation method: {assembly.get_validation_method()}") + +Key Assembly Methods +-------------------- + +Here are the most important methods for working with assemblies: + +**Data Access**: +- `get_stimuli()`: Get text stimuli for each story +- `get_brain_data()`: Get brain data for each story +- `get_split_indices()`: Get word-to-TR mapping +- `get_tr_times()`: Get TR timing information +- `get_data_times()`: Get precise word-level timing + +**Story-Specific Data**: +- `get_temporal_baseline(story_name)`: Get temporal baseline features +- `get_audio_path()`: Get audio file paths (for speech models) +- `get_words()`: Get individual words for each story +- `get_word_rates()`: Get pre-computed word rates + +**Metadata**: +- `get_validation_method()`: Get validation strategy ("inner" or "outer") +- `stories`: List of story names +- `story_data`: Dictionary of story-specific data + +Exploring Assembly Contents +--------------------------- + +Let's examine what's inside an assembly: + +.. 
code-block:: python + + # Load assembly + assembly = load_assembly("assembly_lebel_uts03.pkl") + + # Basic information + print("=== Assembly Overview ===") + print(f"Total presentations: {assembly.shape[0]}") + print(f"Number of voxels/vertices: {assembly.shape[1]}") + print(f"Stories: {assembly.stories}") + print(f"Validation method: {assembly.get_validation_method()}") + + # Explore each story + print("\n=== Story Details ===") + for story in assembly.stories: + story_data = assembly.story_data[story] + print(f"\nStory: {story}") + print(f" Brain data shape: {story_data.brain_data.shape}") + print(f" Number of stimuli: {len(story_data.stimuli)}") + print(f" Split indices: {len(story_data.split_indices)} words") + print(f" TR times: {len(story_data.tr_times)} TRs") + print(f" Data times: {len(story_data.data_times)} words") + + # Show first few stimuli + print(f" First 3 stimuli: {story_data.stimuli[:3]}") + + # Show split indices (these map words to TRs) + print(f" First 10 split indices: {story_data.split_indices[:10]}") + print(f" Last 10 split indices: {story_data.split_indices[-10:]}") + +Understanding the Data Flow +--------------------------- + +Here's how data flows through an assembly: + +1. **Stimuli Extraction**: Text is processed into features (embeddings, word rates, etc.) +2. **Timing Alignment**: Features are aligned with brain data using timing information +3. **Downsampling**: High-resolution features are downsampled to match brain data TR +4. **FIR Delays**: Temporal delays are applied to account for hemodynamic response +5. 
**Train/Test Split**: Data is split for proper evaluation + +Assembly Attributes +------------------- + +An assembly has several key attributes: + +**Shape**: (n_presentations, n_voxels/vertices) + Total number of timepoints and brain regions + +**Stories**: List of story names + Each story represents a continuous experimental session + +**Story Data**: Dictionary of story-specific data + Contains all the data for each story + +**Coordinates**: Metadata about presentations + Story IDs, stimulus IDs, etc. + +**Validation Method**: "inner" or "outer" + How the assembly handles train/test splits + +Working with Story Data +----------------------- + +Each story in an assembly contains: + +.. code-block:: python + + # Get data for a specific story + story_name = assembly.stories[0] + story_data = assembly.story_data[story_name] + + print(f"Story: {story_name}") + print(f" Brain data: {story_data.brain_data.shape}") + print(f" Stimuli: {len(story_data.stimuli)}") + print(f" Split indices: {len(story_data.split_indices)}") + print(f" TR times: {len(story_data.tr_times)}") + print(f" Data times: {len(story_data.data_times)}") + + # Access specific data + brain_data = story_data.brain_data + stimuli = story_data.stimuli + split_indices = story_data.split_indices + tr_times = story_data.tr_times + data_times = story_data.data_times + +Using Assemblies in Training +---------------------------- + +Here's how assemblies are used in the training pipeline: + +.. code-block:: python + + from encoding.assembly.assembly_loader import load_assembly + from encoding.features.factory import FeatureExtractorFactory + from encoding.downsample.downsampling import Downsampler + from encoding.models.nested_cv import NestedCVModel + from encoding.trainer import AbstractTrainer + + # 1. Load assembly + assembly = load_assembly("assembly_lebel_uts03.pkl") + + # 2. 
Create feature extractor + extractor = FeatureExtractorFactory.create_extractor( + modality="wordrate", + model_name="wordrate", + config={}, + cache_dir="cache", + ) + + # 3. Set up other components + downsampler = Downsampler() + model = NestedCVModel(model_name="ridge_regression") + + # 4. Configure training parameters + fir_delays = [1, 2, 3, 4] + trimming_config = { + "train_features_start": 10, + "train_features_end": -5, + "train_targets_start": 0, + "train_targets_end": None, + "test_features_start": 50, + "test_features_end": -5, + "test_targets_start": 40, + "test_targets_end": None, + } + + # 5. Create trainer + trainer = AbstractTrainer( + assembly=assembly, + feature_extractors=[extractor], + downsampler=downsampler, + model=model, + fir_delays=fir_delays, + trimming_config=trimming_config, + use_train_test_split=True, + logger_backend="wandb", + wandb_project_name="lebel-tutorial", + dataset_type="lebel", + results_dir="results", + ) + + # 6. Train the model + metrics = trainer.train() + print(f"Median correlation: {metrics.get('median_score', float('nan')):.4f}") + + +This understanding of assemblies is crucial for effectively using LITcoder. The assembly serves as the foundation for all encoding model training, providing the structured interface between your experimental data and the machine learning pipeline. diff --git a/docs/index.rst b/docs/index.rst index 0889095..7ce6243 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -10,6 +10,16 @@ Welcome to litcoder's documentation! installation quickstart +.. toctree:: + :maxdepth: 2 + :caption: Tutorials + + assemblies_tutorial + tutorial_wordrate + tutorial_language_models + tutorial_speech + tutorial_embeddings + .. 
toctree:: :maxdepth: 2 :caption: Guides @@ -33,4 +43,4 @@ Indices and tables * :ref:`genindex` * :ref:`modindex` -* :ref:`search` \ No newline at end of file +* :ref:`search` diff --git a/docs/quickstart.rst b/docs/quickstart.rst index ca4c5f3..8723996 100644 --- a/docs/quickstart.rst +++ b/docs/quickstart.rst @@ -1,43 +1,83 @@ Quickstart ========== -This minimal example wires an assembly, feature extractor, downsampler, model, and the trainer. +This minimal example shows how to train an encoding model using the LeBel assembly with word rate features. This is the simplest and fastest way to get started with LITcoder. .. code-block:: python - from encoding.assembly.assemblies import NarrativesAssembly + from encoding.assembly.assembly_loader import load_assembly from encoding.features.factory import FeatureExtractorFactory from encoding.downsample.downsampling import Downsampler - from encoding.models.ridge_regression import RidgeRegressionModel + from encoding.models.nested_cv import NestedCVModel from encoding.trainer import AbstractTrainer - # 1) Load data (example: Narratives-style assembly) - assembly = NarrativesAssembly(assembly_path="/path/to/narratives.h5") + # 1) Load prepackaged assembly + assembly_path = "assembly_lebel_uts03.pkl" + assembly = load_assembly(assembly_path) - # 2) Configure features (e.g., language model embeddings) - extractor = FeatureExtractorFactory.create_language_model( - model_name="gpt2", context_type="fullcontext", last_token=True + # 2) Configure components (wordrate-only) + extractor = FeatureExtractorFactory.create_extractor( + modality="wordrate", + model_name="wordrate", + config={}, + cache_dir="cache", ) - # 3) Downsampler - downsampler = Downsampler(method="linear") + downsampler = Downsampler() + model = NestedCVModel(model_name="ridge_regression") - # 4) Model - model = RidgeRegressionModel(n_alphas=20) + # FIR, downsampling, and trimming match our LeBel defaults + fir_delays = [1, 2, 3, 4] + trimming_config = { + 
"train_features_start": 10, "train_features_end": -5, + "train_targets_start": 0, "train_targets_end": None, + "test_features_start": 50, "test_features_end": -5, + "test_targets_start": 40, "test_targets_end": None, + } - # 5) Trainer + downsample_config = {} + + # 3) Train trainer = AbstractTrainer( assembly=assembly, feature_extractors=[extractor], downsampler=downsampler, model=model, - fir_delays=[0, 1, 2, 3, 4], - trimming_config={"features_start": 5, "targets_start": 5}, - use_train_test_split=False, - dataset_type="narratives", - logger_backend="tensorboard", + fir_delays=fir_delays, + trimming_config=trimming_config, + use_train_test_split=True, + logger_backend="wandb", + wandb_project_name="lebel-wordrate", + dataset_type="lebel", results_dir="results", + downsample_config=downsample_config, ) metrics = trainer.train() - print("Median correlation:", metrics["median_score"]) \ No newline at end of file + print({ + "median_correlation": metrics.get("median_score", float("nan")), + "n_significant": metrics.get("n_significant"), + }) + +Prerequisites +------------- + +Before running this example, you need to: + +1. **Download the LeBel assembly**: + + .. code-block:: bash + + gdown 1q-XLPjvhd8doGFhYBmeOkcenS9Y59x64 + +2. **Install LITcoder**: + + .. code-block:: bash + + git clone git@github.com:GT-LIT-Lab/litcoder_core.git + cd litcoder_core + conda create -n litcoder -y python=3.12.8 + conda activate litcoder + conda install pip + pip install -e . + diff --git a/docs/tutorial_embeddings.rst b/docs/tutorial_embeddings.rst new file mode 100644 index 0000000..5340290 --- /dev/null +++ b/docs/tutorial_embeddings.rst @@ -0,0 +1,157 @@ +Static Embeddings Tutorial +========================== + +This tutorial shows how to train encoding models using static word embeddings with the LeBel assembly. Static embeddings provide pre-trained word representations that can be highly predictive of brain activity. 
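Before working through the steps below, it helps to see what a static-embedding extractor conceptually does: look up one fixed vector per word, falling back to an out-of-vocabulary (OOV) strategy such as ``"copy_prev"`` when a word is missing. Here is a minimal, self-contained sketch with a toy vocabulary; the real extractor instead loads vectors from the file at ``vector_path``, and the values below are purely illustrative.

```python
import numpy as np

# Toy vocabulary standing in for a real Word2Vec/GloVe file (made-up values).
vectors = {
    "the": np.array([0.1, 0.2]),
    "dog": np.array([0.7, 0.1]),
    "ran": np.array([0.3, 0.9]),
}
dim = 2

def embed_words(words, oov_handling="copy_prev"):
    """Look up one vector per word, handling out-of-vocabulary words."""
    out = []
    prev = np.zeros(dim)  # fallback for an OOV word at position 0
    for w in words:
        if w in vectors:
            vec = vectors[w]
        elif oov_handling == "copy_prev":
            vec = prev  # reuse the previous word's embedding
        else:  # "zero"
            vec = np.zeros(dim)
        out.append(vec)
        prev = vec
    return np.stack(out)

feats = embed_words(["the", "dog", "zorbled", "ran"])
print(feats.shape)  # (4, 2); row 2 copies the vector for "dog"
```

The output has one row per word, which is exactly the word-level feature matrix that the downsampler later aligns with the brain data.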
+ +Overview +-------- + +Static embeddings capture semantic relationships between words using pre-trained models like Word2Vec or GloVe. These embeddings provide rich semantic representations that can be highly predictive of brain activity. + +Key Components +-------------- + +- **Assembly**: Pre-packaged LeBel assembly containing brain data and stimuli +- **Feature Extractor**: StaticEmbeddingFeatureExtractor using pre-trained embeddings +- **Embedding Models**: Word2Vec, GloVe, or other static embedding models +- **Downsampler**: Aligns word-level features with brain data timing +- **Model**: Ridge regression with nested cross-validation +- **Trainer**: AbstractTrainer orchestrates the entire pipeline + +Step-by-Step Tutorial +--------------------- + +1. **Load the Assembly** + + .. code-block:: python + + from encoding.assembly.assembly_loader import load_assembly + + # Load the pre-packaged LeBel assembly + assembly = load_assembly("assembly_lebel_uts03.pkl") + +2. **Create Static Embedding Feature Extractor** + + .. code-block:: python + + from encoding.features.factory import FeatureExtractorFactory + + # You need to provide the path to your embedding file + vector_path = "/path/to/your/embeddings.bin.gz" # Replace with your path + + extractor = FeatureExtractorFactory.create_extractor( + modality="embeddings", + model_name="word2vec", # Can be "word2vec", "glove", or any identifier + config={ + "vector_path": vector_path, + "binary": True, # Set to True for .bin files, False for .txt files + "lowercase": False, # Set to True if your embeddings expect lowercase tokens + "oov_handling": "copy_prev", # How to handle out-of-vocabulary words + "use_tqdm": True, # Show progress bar + }, + cache_dir="cache", + ) + +3. **Set Up Downsampler and Model** + + .. 
code-block:: python + + from encoding.downsample.downsampling import Downsampler + from encoding.models.nested_cv import NestedCVModel + + downsampler = Downsampler() + model = NestedCVModel(model_name="ridge_regression") + +4. **Configure Training Parameters** + + .. code-block:: python + + # FIR delays for hemodynamic response modeling + fir_delays = [1, 2, 3, 4] + + # Trimming configuration for LeBel dataset + trimming_config = { + "train_features_start": 10, + "train_features_end": -5, + "train_targets_start": 0, + "train_targets_end": None, + "test_features_start": 50, + "test_features_end": -5, + "test_targets_start": 40, + "test_targets_end": None, + } + + downsample_config = {} + +5. **Create and Run Trainer** + + .. code-block:: python + + from encoding.trainer import AbstractTrainer + + trainer = AbstractTrainer( + assembly=assembly, + feature_extractors=[extractor], + downsampler=downsampler, + model=model, + fir_delays=fir_delays, + trimming_config=trimming_config, + use_train_test_split=True, + logger_backend="wandb", + wandb_project_name="lebel-embeddings", + dataset_type="lebel", + results_dir="results", + downsample_config=downsample_config, + ) + + metrics = trainer.train() + print(f"Median correlation: {metrics.get('median_score', float('nan')):.4f}") + +Understanding Static Embeddings +------------------------------- + +Unlike contextual language model features, static embeddings assign each word a single fixed vector, so the extractor simply looks up one vector per word in the stimulus text. + +Key Parameters +-------------- + +- **modality**: "embeddings" - specifies the feature type +- **model_name**: "word2vec" - identifier for the extractor +- **vector_path**: Path to the embedding file +- **binary**: True for .bin files, False for .txt files +- **lowercase**: Whether to lowercase tokens before lookup +- **oov_handling**: How to handle out-of-vocabulary words +- **use_tqdm**: Whether to show progress bar +- **cache_dir**: "cache" - directory for caching + +Embedding Models +---------------- + +Supported embedding models include: +- **Word2Vec**: Google News vectors, custom Word2Vec models +- **GloVe**: Stanford GloVe 
embeddings +- **Custom embeddings**: Any compatible embedding format + +File Formats +------------ + +Supported file formats: +- **Binary files (.bin)**: Set `binary=True` +- **Text files (.txt)**: Set `binary=False` +- **Compressed files (.gz)**: Automatically handled + +OOV Handling +------------ + +Out-of-vocabulary (OOV) word handling strategies: +- **"copy_prev"**: Use the previous word's embedding +- **"zero"**: Use zero vector +- **"random"**: Use random vector +- **"mean"**: Use mean of all embeddings + +Choose based on your research question and data characteristics. + +Training Configuration +---------------------- + +- **fir_delays**: [1, 2, 3, 4] - temporal delays for hemodynamic response +- **trimming_config**: LeBel-specific trimming to avoid boundary effects +- **downsample_config**: {} - no additional downsampling configuration needed diff --git a/docs/tutorial_language_models.rst b/docs/tutorial_language_models.rst new file mode 100644 index 0000000..ce68b67 --- /dev/null +++ b/docs/tutorial_language_models.rst @@ -0,0 +1,159 @@ +Language Model Features Tutorial +================================ + +This tutorial shows how to train encoding models using language model features with the LeBel assembly. Language model features capture rich semantic representations from transformer models. + +Overview +-------- + +Language model features extract high-dimensional representations from transformer models like GPT-2. These features capture semantic, syntactic, and contextual information that can be highly predictive of brain activity. 
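One setting worth understanding before the walkthrough below is ``last_token=True``. Words are often split into several sub-word tokens, and one plausible reading of this option is that each word is represented by the activation of its final sub-word token. A small numpy sketch of that pooling step, with toy activations standing in for a real GPT-2 layer and a hypothetical token-to-word mapping:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations for 6 tokens with hidden size 8 (stand-in for a GPT-2 layer).
token_acts = rng.normal(size=(6, 8))

# Hypothetical mapping: "straw|berry pie is tas|ty" -> 4 words over 6 tokens.
token_to_word = np.array([0, 0, 1, 2, 3, 3])

# last_token pooling: keep the activation of each word's final sub-word token.
n_words = token_to_word.max() + 1
last_idx = [np.flatnonzero(token_to_word == w)[-1] for w in range(n_words)]
word_feats = token_acts[last_idx]

print(word_feats.shape)  # (4, 8): one feature vector per word
```

The result is word-aligned, which is what the downsampler needs in order to map features onto TRs.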
+ +Key Components +-------------- + +- **Assembly**: Pre-packaged LeBel assembly containing brain data and stimuli +- **Feature Extractor**: LanguageModelFeatureExtractor using transformer models +- **Caching**: Multi-layer activation caching for efficient training +- **Downsampler**: Aligns word-level features with brain data timing +- **Model**: Ridge regression with nested cross-validation +- **Trainer**: AbstractTrainer orchestrates the entire pipeline + +Step-by-Step Tutorial +--------------------- + +1. **Load the Assembly** + + .. code-block:: python + + from encoding.assembly.assembly_loader import load_assembly + + # Load the pre-packaged LeBel assembly + assembly = load_assembly("assembly_lebel_uts03.pkl") + +2. **Create Language Model Feature Extractor** + + .. code-block:: python + + from encoding.features.factory import FeatureExtractorFactory + + extractor = FeatureExtractorFactory.create_extractor( + modality="language_model", + model_name="gpt2-small", # Can be changed to other models + config={ + "model_name": "gpt2-small", + "layer_idx": 9, # Layer to extract features from + "last_token": True, # Use last token only + "lookback": 256, # Context lookback + "context_type": "fullcontext", + }, + cache_dir="cache_language_model", + ) + +3. **Set Up Downsampler and Model** + + .. code-block:: python + + from encoding.downsample.downsampling import Downsampler + from encoding.models.nested_cv import NestedCVModel + + downsampler = Downsampler() + model = NestedCVModel(model_name="ridge_regression") + +4. **Configure Training Parameters** + + .. 
code-block:: python + + # FIR delays for hemodynamic response modeling + fir_delays = [1, 2, 3, 4] + + # Trimming configuration for LeBel dataset + trimming_config = { + "train_features_start": 10, + "train_features_end": -5, + "train_targets_start": 0, + "train_targets_end": None, + "test_features_start": 50, + "test_features_end": -5, + "test_targets_start": 40, + "test_targets_end": None, + } + + downsample_config = {} + +5. **Create and Run Trainer** + + .. code-block:: python + + from encoding.trainer import AbstractTrainer + + trainer = AbstractTrainer( + assembly=assembly, + feature_extractors=[extractor], + downsampler=downsampler, + model=model, + fir_delays=fir_delays, + trimming_config=trimming_config, + use_train_test_split=True, + logger_backend="wandb", + wandb_project_name="lebel-language-model", + dataset_type="lebel", + results_dir="results", + layer_idx=9, # Pass layer_idx to trainer + lookback=256, # Pass lookback to trainer + ) + + metrics = trainer.train() + print(f"Median correlation: {metrics.get('median_score', float('nan')):.4f}") + +Understanding Language Model Features +------------------------------------- + +Language model features are extracted by: + +1. **Text Processing**: Each stimulus text is tokenized and processed +2. **Transformer Forward Pass**: The model processes the text through all layers +3. **Feature Extraction**: Features are extracted from the specified layer +4. **Caching**: Multi-layer activations are cached for efficiency +5. 
**Downsampling**: Features are aligned with brain data timing + +Key Parameters +-------------- + +- **modality**: "language_model" - specifies the feature type +- **model_name**: "gpt2-small" - transformer model to use +- **layer_idx**: 9 - which layer to extract features from +- **last_token**: True - use only the last token's features (we recommend using this) +- **lookback**: 256 - context window size +- **context_type**: "fullcontext" - how to handle context +- **cache_dir**: "cache_language_model" - directory for caching + +Model Options +------------- + +Supported models include: +- **gpt2-small**: Fast, good baseline +- **gpt2-medium**: Better performance, slower +- **facebook/opt-125m**: Alternative architecture +- **Other TransformerLens models**: Any compatible model from `TransformerLens model properties table `_ + + +Caching System +-------------- + +The language model extractor uses a sophisticated caching system: + +1. **Multi-layer caching**: All layers are cached together +2. **Lazy loading**: Layers are loaded on-demand +3. **Efficient storage**: Compressed storage of activations +4. **Cache validation**: Ensures cached data matches parameters + +This makes it efficient to experiment with different layers without recomputing features. + +Training Configuration +---------------------- + +- **fir_delays**: [1, 2, 3, 4] - temporal delays for hemodynamic response +- **trimming_config**: LeBel-specific trimming to avoid boundary effects +- **layer_idx**: 9 - which layer to use for training +- **lookback**: 256 - context window size + diff --git a/docs/tutorial_speech.rst b/docs/tutorial_speech.rst new file mode 100644 index 0000000..2aac6d9 --- /dev/null +++ b/docs/tutorial_speech.rst @@ -0,0 +1,176 @@ +Speech Features Tutorial +======================== + +This tutorial shows how to train encoding models using speech features with the LeBel assembly. Speech features extract representations from audio using speech models. 
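The extractor configuration shown below slides a context window over the audio: a feature vector is produced every ``chunk_size`` seconds, each computed from up to ``context_size`` seconds of preceding audio. That windowing arithmetic can be sketched in pure Python (this is one plausible reading of those parameters, not LITcoder's exact implementation):

```python
def chunk_windows(duration_s, chunk_size=0.1, context_size=16.0):
    """Return (start, end) times of each analysis window.

    A window ends every `chunk_size` seconds and covers up to
    `context_size` seconds of preceding audio (truncated at t=0).
    """
    windows = []
    t = chunk_size
    while t <= duration_s + 1e-9:
        windows.append((max(0.0, t - context_size), t))
        t += chunk_size
    return windows

# Small numbers for readability; the tutorial uses a 0.1 s stride, 16 s context.
wins = chunk_windows(duration_s=1.0, chunk_size=0.25, context_size=0.5)
print(wins)  # [(0.0, 0.25), (0.0, 0.5), (0.25, 0.75), (0.5, 1.0)]
```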
+ +Overview +-------- + +Speech features capture acoustic and linguistic information from audio stimuli using speech recognition models like Whisper or HuBERT. These features can be highly predictive of brain activity during audio-based experiments. + +Key Components +-------------- + +- **Assembly**: Pre-packaged LeBel assembly containing brain data and audio paths +- **Feature Extractor**: SpeechFeatureExtractor using speech recognition models +- **Audio Processing**: Chunking and resampling of audio files +- **Caching**: Multi-layer activation caching for efficient training +- **Downsampler**: Aligns audio-level features with brain data timing +- **Model**: Ridge regression with nested cross-validation +- **Trainer**: AbstractTrainer orchestrates the entire pipeline + +Step-by-Step Tutorial +--------------------- + +1. **Load the Assembly** + + .. code-block:: python + + from encoding.assembly.assembly_loader import load_assembly + + # Load the pre-packaged LeBel assembly + assembly = load_assembly("assembly_lebel_uts03.pkl") + +2. **Set Up Audio Paths** + + .. code-block:: python + + import os + + # Set up audio paths for speech model + base_audio_path = "/path/to/your/audio/files" # Replace with your audio path + + for story_name in assembly.stories: + # Assuming audio files are named like: story_name.wav + audio_file_path = os.path.join(base_audio_path, f"{story_name}.wav") + + # Set the audio path for this story + if hasattr(assembly, "story_data") and story_name in assembly.story_data: + assembly.story_data[story_name].audio_path = audio_file_path + print(f"Set audio path for {story_name}: {audio_file_path}") + +3. **Create Speech Feature Extractor** + + .. 
code-block:: python + + from encoding.features.factory import FeatureExtractorFactory + + extractor = FeatureExtractorFactory.create_extractor( + modality="speech", + model_name="openai/whisper-tiny", # Can be changed to other models + config={ + "model_name": "openai/whisper-tiny", + "chunk_size": 0.1, # seconds between chunk starts (stride) + "context_size": 16.0, # seconds of audio per window + "layer": 3, # Layer index to extract features from + "pool": "last", # Pooling method: 'last' or 'mean' + "target_sample_rate": 16000, # Target sample rate for audio + "device": "cuda", # Can be "cuda", "cpu" + }, + cache_dir="cache_speech", + ) + +4. **Set Up Downsampler and Model** + + .. code-block:: python + + from encoding.downsample.downsampling import Downsampler + from encoding.models.nested_cv import NestedCVModel + + downsampler = Downsampler() + model = NestedCVModel(model_name="ridge_regression") + +5. **Configure Training Parameters** + + .. code-block:: python + + # FIR delays for hemodynamic response modeling + fir_delays = [1, 2, 3, 4] + + # Trimming configuration for LeBel dataset + trimming_config = { + "train_features_start": 10, + "train_features_end": -5, + "train_targets_start": 0, + "train_targets_end": None, + "test_features_start": 50, + "test_features_end": -5, + "test_targets_start": 40, + "test_targets_end": None, + } + + downsample_config = {} + +6. **Create and Run Trainer** + + .. 
code-block:: python + + from encoding.trainer import AbstractTrainer + + trainer = AbstractTrainer( + assembly=assembly, + feature_extractors=[extractor], + downsampler=downsampler, + model=model, + fir_delays=fir_delays, + trimming_config=trimming_config, + use_train_test_split=True, + logger_backend="wandb", + wandb_project_name="lebel-speech-model", + dataset_type="lebel", + results_dir="results", + layer_idx=3, # Pass layer_idx to trainer + ) + + metrics = trainer.train() + print(f"Median correlation: {metrics.get('median_score', float('nan')):.4f}") + + +If you generate your own assembly, you usually will not need to change the wav paths. But since this assembly was generated in advance, we need to set the wav path for each story (a more detailed tutorial on this is coming soon!). + +Understanding Speech Features +----------------------------- + +Speech features are extracted by: + +1. **Audio Loading**: Audio files are loaded and resampled to the target sample rate +2. **Chunking**: Audio is divided into overlapping chunks for processing +3. **Model Forward Pass**: Each chunk is processed through the speech model +4. **Feature Extraction**: Features are extracted from the specified layer +5. **Pooling**: Features are pooled across time (last token or mean) +6. **Caching**: Multi-layer activations are cached for efficiency +7. 
**Downsampling**: Features are aligned with brain data timing + +Key Parameters +-------------- + +- **modality**: "speech" - specifies the feature type +- **model_name**: "openai/whisper-tiny" - speech model to use +- **chunk_size**: 0.1 - seconds between chunk starts (stride) +- **context_size**: 16.0 - seconds of audio per window +- **layer**: 3 - which layer to extract features from +- **pool**: "last" - pooling method ('last' or 'mean') +- **target_sample_rate**: 16000 - target sample rate for audio +- **device**: "cuda" - device to run the model on +- **cache_dir**: "cache_speech" - directory for caching + + +Caching System +-------------- + +The speech extractor uses a sophisticated caching system: + +1. **Multi-layer caching**: All layers are cached together +2. **Lazy loading**: Layers are loaded on-demand +3. **Efficient storage**: Compressed storage of activations +4. **Cache validation**: Ensures cached data matches parameters + +This makes it efficient to experiment with different layers without recomputing features. + +Training Configuration +---------------------- + +- **fir_delays**: [1, 2, 3, 4] - temporal delays for hemodynamic response +- **trimming_config**: LeBel-specific trimming to avoid boundary effects +- **layer_idx**: 3 - which layer to use for training + diff --git a/docs/tutorial_wordrate.rst b/docs/tutorial_wordrate.rst new file mode 100644 index 0000000..ed9d931 --- /dev/null +++ b/docs/tutorial_wordrate.rst @@ -0,0 +1,127 @@ +Word Rate Feature Tutorial +========================== + +This tutorial shows how to train encoding models using word rate features with the LeBel assembly. Word rate features are simple but effective baselines that measure the rate of word presentation. + +Overview +-------- + +Word rate features capture the temporal dynamics of language presentation by measuring how many words are presented per time unit. This is one of the simplest yet most effective features for brain encoding models. 
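Conceptually, the pre-computed word rate is just a per-TR word count derived from the assembly's word-to-TR mapping (``split_indices``). A toy sketch of that computation (illustrative indices, not real assembly data):

```python
import numpy as np

# Toy split_indices: the TR index each word falls into.
split_indices = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 4])
n_trs = 5

# Words per TR; minlength keeps TRs with no words (here TR 3) as zeros.
word_rate = np.bincount(split_indices, minlength=n_trs).astype(float)
print(word_rate)  # [3. 2. 4. 0. 1.]
```

The result has one value per TR, already aligned with the brain data, which is why no further processing is needed for this feature.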
+ +Key Components +-------------- + +- **Assembly**: Pre-packaged LeBel assembly containing brain data and stimuli +- **Feature Extractor**: WordRateFeatureExtractor for computing word presentation rates +- **Downsampler**: Aligns word-level features with brain data timing +- **Model**: Ridge regression with nested cross-validation +- **Trainer**: AbstractTrainer orchestrates the entire pipeline + +Step-by-Step Tutorial +--------------------- + +1. **Load the Assembly** + + .. code-block:: python + + from encoding.assembly.assembly_loader import load_assembly + + # Load the pre-packaged LeBel assembly + assembly = load_assembly("assembly_lebel_uts03.pkl") + +2. **Create Word Rate Feature Extractor** + + .. code-block:: python + + from encoding.features.factory import FeatureExtractorFactory + + extractor = FeatureExtractorFactory.create_extractor( + modality="wordrate", + model_name="wordrate", + config={}, + cache_dir="cache", + ) + +3. **Set Up Downsampler and Model** + + .. code-block:: python + + from encoding.downsample.downsampling import Downsampler + from encoding.models.nested_cv import NestedCVModel + + downsampler = Downsampler() + model = NestedCVModel(model_name="ridge_regression") + +4. **Configure Training Parameters** + + .. code-block:: python + + # FIR delays for hemodynamic response modeling + fir_delays = [1, 2, 3, 4] + + # Trimming configuration for LeBel dataset + trimming_config = { + "train_features_start": 10, + "train_features_end": -5, + "train_targets_start": 0, + "train_targets_end": None, + "test_features_start": 50, + "test_features_end": -5, + "test_targets_start": 40, + "test_targets_end": None, + } + + downsample_config = {} + +5. **Create and Run Trainer** + + .. 
code-block:: python + + from encoding.trainer import AbstractTrainer + + trainer = AbstractTrainer( + assembly=assembly, + feature_extractors=[extractor], + downsampler=downsampler, + model=model, + fir_delays=fir_delays, + trimming_config=trimming_config, + use_train_test_split=True, + logger_backend="wandb", + wandb_project_name="lebel-wordrate", + dataset_type="lebel", + results_dir="results", + downsample_config=downsample_config, + ) + + metrics = trainer.train() + print(f"Median correlation: {metrics.get('median_score', float('nan')):.4f}") + +Understanding Word Rate Features +-------------------------------- + +Word rate features are computed by: + +1. **Counting words per TR**: The assembly pre-computes word rates for each TR +2. **No additional processing needed**: Word rates are already aligned with brain data +3. **Simple but effective**: Captures temporal dynamics of language presentation + +The word rate extractor simply returns the pre-computed word rates from the assembly, making it the fastest feature type to compute. + +Key Parameters +-------------- + +- **modality**: "wordrate" - specifies the feature type +- **model_name**: "wordrate" - identifier for the extractor +- **config**: {} - no additional configuration needed +- **cache_dir**: "cache" - directory for caching (though word rates don't need caching) + +Training Configuration +---------------------- + +- **fir_delays**: [1, 2, 3, 4] - temporal delays to account for hemodynamic response +- **trimming_config**: LeBel-specific trimming to avoid boundary effects + + + +Word rate features provide an excellent foundation for understanding the LITcoder pipeline before moving to more complex feature types.
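As a closing note on the training configuration: applying ``fir_delays=[1, 2, 3, 4]`` amounts to stacking time-shifted copies of the feature matrix, so the regression can weight brain responses one to four TRs after each stimulus. A sketch of that construction (``apply_fir_delays`` is a hypothetical helper; the trainer applies the delays internally):

```python
import numpy as np

def apply_fir_delays(features, delays):
    """Stack time-shifted copies of `features`, one column block per delay."""
    n_trs, n_feats = features.shape
    delayed = np.zeros((n_trs, n_feats * len(delays)))
    for i, d in enumerate(delays):
        # Column block i holds the features from d TRs earlier (zero-padded).
        delayed[d:, i * n_feats:(i + 1) * n_feats] = features[: n_trs - d]
    return delayed

features = np.arange(6, dtype=float).reshape(6, 1)  # one feature over 6 TRs
X = apply_fir_delays(features, delays=[1, 2, 3, 4])
print(X.shape)  # (6, 4)
print(X[4])     # [3. 2. 1. 0.]: values from 1, 2, 3, and 4 TRs earlier
```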