Our library allows users to perform standalone data preprocessing, independent of tuning/training. This enables users to process raw datasets, apply basic or advanced data preprocessing, and save the resulting train and validation datasets in Parquet format inside the specified TrainingArguments.output_dir. When the --do_dataprocessing_only argument is specified, the datasets are split and saved as multiple shards, with the shard counts configured via --num_train_dataset_shards and --num_eval_dataset_shards.
Users can pass any data config or data arguments. The goal is to take the provided data config and generate a dataset that can be used directly for training, with no online processing required. As an example, see the data config below:
dataprocessor:
  type: default
  sampling_stopping_strategy: first_exhausted
  seed: 66
datasets:
  - name: dataset_1
    data_paths:
      - tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl
    data_handlers:
      - name: tokenize_and_apply_input_masking
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            input_field_name: input
            output_field_name: output

After preparing the data configuration YAML file, run the following example command to perform offline data preprocessing:
python -m tuning.sft_trainer \
--data_config_path /path/to/data_config.yaml \
--model_name_or_path "model_name" \
--max_seq_length 4096 \
--output_dir /path/to/output/directory \
--log_level info \
--num_train_dataset_shards 10 \
--num_eval_dataset_shards 1 \
--do_dataprocessing_only
Once the offline data processing is complete, the shards stored in output_dir can be reused for tuning, either by passing the directory through the --training_data_path flag or by listing it under the data_paths argument in a data config YAML, should the sharded datasets prove beneficial for training.
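For the data config route, a minimal follow-up config could list the saved Parquet shards under data_paths. This is an illustrative sketch only; the shard file names and the glob pattern are assumptions, so adjust them to the files actually written to your output directory:

dataprocessor:
  type: default
datasets:
  - name: preprocessed_dataset
    data_paths:
      # Assumed shard location and naming; point this at the Parquet files
      # produced in TrainingArguments.output_dir by the previous step.
      - /path/to/output/directory/*.parquet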
NOTE: Offline data preprocessing is not compatible with image datasets for vision models.
The following sample use case applies offline processing to a dataset with a chat template and then uses the processed dataset to train a model. The chat template is applied via the apply_tokenizer_chat_template handler, followed by additional data transformation handlers; a conceptual sketch of the template step is shown below.
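Conceptually, the apply_tokenizer_chat_template handler renders each conversation into a single text string using the tokenizer's chat template, the same operation the transformers API exposes directly. Below is a minimal standalone sketch using the tokenizer's built-in template (the data config below supplies its own template via the chat_template key); the example record, its field names, and the handler's internals are illustrative assumptions:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/test/models/granite-3.1-8b-instruct")

# One chat-formatted record, analogous to a row of the JSONL dataset
# (field names and contents are illustrative assumptions).
example = {
    "messages": [
        {"role": "user", "content": "Summarize this complaint."},
        {"role": "assistant", "content": "The customer reports a billing error."},
    ]
}

# Render the conversation to a single string with the model's chat template,
# mirroring what the handler writes into the "formatted_chat" column.
formatted_chat = tokenizer.apply_chat_template(example["messages"], tokenize=False)
print(formatted_chat)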
NOTE: Dataset streaming is not supported during offline data preprocessing. In the data config, the streaming argument should therefore be set to False or left unset.
dataprocessor:
  type: default
  sampling_stopping_strategy: first_exhausted
  seed: 66
  streaming: False
  chat_template: |
    {%- for message in messages['messages'] %}
    {%- if message['role'] == 'system' %}
    {{ '<|start_of_role|>system<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
    {%- elif message['role'] == 'user' %}
    {{ '<|start_of_role|>user<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
    {%- elif message['role'] == 'assistant' %}
    {{ '<|start_of_role|>assistant<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
    {%- elif message['role'] == 'tools' %}
    {{ '<|start_of_role|>tools<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
    {%- elif message['role'] == 'tool' %}
    {{ '<|start_of_role|>tool<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
    {%- elif message['role'] == 'documents' %}
    {{ '<|start_of_role|>documents<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
    {%- else %}
    {{ '<|start_of_role|>unknown<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
    {%- endif %}
    {%- endfor %}
datasets:
  - name: dataset_1
    data_paths:
      - "/app/arb30_100.jsonl"
    data_handlers:
      - name: apply_tokenizer_chat_template
        arguments:
          fn_kwargs:
            dataset_text_field: "formatted_chat"
      - name: tokenize
        arguments:
          batched: false
          fn_kwargs:
            dataset_text_field: "formatted_chat"
            truncation: False
            max_length: 4096
      - name: skip_large_text
        arguments:
          fn_kwargs:
            column_name: "input_ids"
            max_length: 4096
      - name: retain_columns
        arguments:
          columns:
            - "formatted_chat"

Here, tokenize runs with truncation disabled so that skip_large_text can drop any example whose input_ids exceed 4096 tokens; retain_columns then keeps only the formatted_chat column in the saved dataset.

Command to run the offline data processing:
python -m tuning.sft_trainer \
--data_config_path "data_config.yaml" \
--instruction_template "<|start_of_role|>user<|end_of_role|>" \
--max_seq_length "8192" \
--model_name_or_path "/test/models/granite-3.1-8b-instruct" \
--output_dir "/test/data/offline_processing_shards" \
--packing "False" \
--response_template "<|start_of_role|>assistant<|end_of_role|>" \
--split_batches "true" \
--use_flash_attn "true" \
--num_train_dataset_shards "10" \
--do_dataprocessing_only

The resulting shards are saved in the directory /test/data/offline_processing_shards, as specified by the --output_dir argument. These shards can then be used for tuning the model by pointing the training_data_path argument at the directory where the shards are stored, in this example /test/data/offline_processing_shards.
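Before launching the tuning run, the processed shards can be sanity-checked by loading them with the Hugging Face datasets library. This is a minimal sketch; the exact file layout inside the output directory (represented here by a recursive *.parquet glob) is an assumption:

import glob

from datasets import load_dataset

# Collect the Parquet shard files written by the offline processing step
# (assumed layout: shard files somewhere under the output directory).
shard_files = glob.glob(
    "/test/data/offline_processing_shards/**/*.parquet", recursive=True
)

# Load all shards as one training split and inspect the result.
ds = load_dataset("parquet", data_files=shard_files, split="train")
print(ds)               # row count and column names
print(ds.column_names)  # e.g. the retained "formatted_chat" column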
Command to run the tuning:
accelerate launch \
--num_processes=8 \
--dynamo_backend="no" \
--fsdp_auto_wrap_policy="TRANSFORMER_BASED_WRAP" \
--fsdp_cpu_ram_efficient_loading="true" \
--fsdp_forward_prefetch="false" \
--fsdp_offload_params="false" \
--fsdp_sharding_strategy="HYBRID_SHARD" \
--fsdp_state_dict_type="FULL_STATE_DICT" \
--fsdp_sync_module_states="true" \
--machine_rank="${RANK}" \
--main_process_ip="${MASTER_ADDR}" \
--main_process_port="${MASTER_PORT}" \
--mixed_precision="no" \
--num_machines="${WORLD_SIZE}" \
--rdzv_backend="static" \
--same_network \
--use_fsdp \
-m tuning.sft_trainer \
--training_data_path "/test/data/offline_processing_shards" \
--adam_beta1="0.9" \
--adam_beta2="0.98" \
--adam_epsilon="1e-10" \
--aim_repo="${AIMSTACK_DB}" \
--dataloader_drop_last="true" \
--dataset_text_field="random" \
--evaluation_strategy="no" \
--experiment="train-nb-g8b-r26-e0e88b40-dbd8-41ae-a744-c853959495f2" \
--gradient_accumulation_steps="1" \
--gradient_checkpointing="true" \
--include_tokens_per_second="false" \
--instruction_template="<|start_of_role|>user<|end_of_role|>" \
--learning_rate="1e-06" \
--logging_steps="1" \
--logging_strategy="steps" \
--lr_scheduler_type="cosine" \
--max_seq_length="8192" \
--max_steps="12400" \
--model_name_or_path="/test/models/granite-3.1-8b-instruct" \
--num_train_epochs="3" \
--optim="adamw_torch" \
--output_dir="/hfcache/data_mixing/data_mixing/wca_summ/run26_rb_mix" \
--packing="False" \
--per_device_train_batch_size="32" \
--response_template="<|start_of_role|>assistant<|end_of_role|>" \
--save_steps="100" \
--save_strategy="steps" \
--split_batches="true" \
--torch_dtype="bfloat16" \
--use_flash_attn="true" \
--use_reentrant="true" \
--warmup_ratio="0.1" \
--warmup_steps="200" \
--weight_decay="0.1"