@rgachuhi rgachuhi commented Oct 2, 2025

Overview

This PR extends the datashop export to include tutor log messages alongside the existing attempt_evaluated messages. The implementation reads tutor_message events from the xAPI data and converts them to the XML format that the datashop export expects.

Changes Made

1. Enhanced Data Processing Pipeline (dataset/dataset.py)

  • Added tutor message processing: Extended the generate_datashop() function to process both attempt_evaluated and tutor_message events
  • Parallel processing: Implemented chunked processing for tutor messages using the same parallel processing pattern as attempt_evaluated messages
  • Data partitioning: Added partitioning logic for tutor messages by section_id, user_id, and session_id to maintain consistency with existing data structure
  • Import updates: Added process_tutor_messages to the import statement from dataset.datashop

2. New Tutor Message Processing Functions (dataset/datashop.py)

process_tutor_message(j, lookup)

  • XML extraction: Extracts the XML content of an xAPI tutor_message event from j["result"]["message"]
  • Meta element replacement: Removes existing meta elements and generates new ones with proper user_id, session_id, time, and time_zone
  • XML sanitization: Applies proper XML escaping and formatting for datashop compatibility
  • Error handling: Catches XML parsing errors and logs a detailed message so that one bad message does not stop processing
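
The meta replacement step described above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the function name replace_meta and its parameters are hypothetical, and it assumes the payload is a single XML element whose meta elements are direct children of the root. It uses only the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def replace_meta(event, user_id, session_id, time, time_zone):
    """Return the tutor message XML with its <meta> element regenerated.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # The PR states that the XML payload is stored in j["result"]["message"].
    root = ET.fromstring(event["result"]["message"])

    # Assumption: existing <meta> elements are direct children of the root.
    for old_meta in root.findall("meta"):
        root.remove(old_meta)

    # Build a fresh <meta> with the four fields listed in the PR.
    meta = ET.Element("meta")
    for tag, value in (
        ("user_id", user_id),
        ("session_id", session_id),
        ("time", time),
        ("time_zone", time_zone),
    ):
        ET.SubElement(meta, tag).text = value

    # Put the new <meta> first, then serialize; ElementTree escapes
    # characters such as & and < in element text on output.
    root.insert(0, meta)
    return ET.tostring(root, encoding="unicode")
```

Building the element through ElementTree rather than string concatenation means escaping is applied at serialization time, which is one way to get the "XML sanitization" behavior; the PR itself may do this differently.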

process_tutor_messages(part_attempts, context)

  • Batch processing: Processes multiple tutor messages in a batch, similar to process_part_attempts
  • Context management: Maintains proper lookup context and anonymization settings
  • Integration: Plugs into the existing datashop processing pipeline; attempt_evaluated processing is unchanged

meta_xml(context)

  • Meta element generation: Creates properly formatted meta XML elements with user_id, session_id, time, and time_zone
  • Text sanitization: Uses existing sanitize_element_text() function for proper XML escaping
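
As a rough picture of what a meta generator like this produces, here is a small sketch. It is not the PR's meta_xml: the function name meta_xml_sketch and the context keys are assumptions, and xml.sax.saxutils.escape stands in for the project's sanitize_element_text(), whose exact behavior is not shown in this description.

```python
from xml.sax.saxutils import escape


def meta_xml_sketch(context):
    """Render a <meta> element from a dict (hypothetical key names)."""
    fields = ("user_id", "session_id", "time", "time_zone")
    lines = ["<meta>"]
    for name in fields:
        # escape() is a stand-in for sanitize_element_text(); it replaces
        # &, < and > with their XML entities.
        lines.append(f"  <{name}>{escape(str(context[name]))}</{name}>")
    lines.append("</meta>")
    return "\n".join(lines)
```

For example, calling meta_xml_sketch with a user_id of "a<b" yields a user_id line containing "a&lt;b", so the output stays well-formed even when identifiers contain markup characters.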

3. Enhanced Message Handling (dataset/datashop.py)

  • Multi-type support: Updated handle_datashop() to process both question activities (http://adlnet.gov/expapi/activities/question) and tutor messages (http://oli.cmu.edu/extensions/tutor_message)
  • Improved error handling: Added JSON parsing error handling and general exception handling for more robust processing
  • Code organization: Restructured the conditional logic so that each message type is handled in its own branch
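
The multi-type dispatch can be pictured with the sketch below. It is illustrative only: classify_statement is a hypothetical name, and it assumes the activity type URI is found at object.definition.type, which is where standard xAPI statements carry it. The two URIs are the ones quoted in this PR.

```python
QUESTION = "http://adlnet.gov/expapi/activities/question"
TUTOR_MESSAGE = "http://oli.cmu.edu/extensions/tutor_message"


def classify_statement(statement):
    """Return 'question', 'tutor_message', or 'skip' for a parsed statement."""
    # Assumption: the type URI lives at object.definition.type.
    activity_type = (
        statement.get("object", {}).get("definition", {}).get("type")
    )
    if activity_type == QUESTION:
        return "question"
    if activity_type == TUTOR_MESSAGE:
        return "tutor_message"
    # Anything else is ignored by this sketch.
    return "skip"
```

A caller would then route "question" statements to the existing attempt handling and "tutor_message" statements to the new tutor message handling.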

Technical Details

Data Flow

  1. Inventory lookup: Tutor messages are retrieved from S3 inventory using the tutor_message event type
  2. Chunked processing: Messages are processed in configurable chunks for memory efficiency
  3. Parallel processing: Each chunk is processed in parallel on Spark via parallel_map
  4. Partitioning: Messages are partitioned by section_id, user_id, and session_id for consistent grouping
  5. XML conversion: Each tutor message is converted to datashop XML format with proper meta elements
  6. Export integration: Converted messages are included in the final XML export alongside attempt_evaluated messages
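
Steps 2 and 4 of the flow above (chunking and partitioning) can be sketched in plain Python. This is a simplified stand-in, not the Spark code in dataset/dataset.py: the helper names chunked and partition are hypothetical, and messages are assumed to be dicts that already carry section_id, user_id, and session_id.

```python
from collections import defaultdict


def chunked(keys, size):
    """Yield consecutive slices of `keys`, each with at most `size` items."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def partition(messages):
    """Group messages by the (section_id, user_id, session_id) triple."""
    groups = defaultdict(list)
    for message in messages:
        key = (message["section_id"], message["user_id"], message["session_id"])
        groups[key].append(message)
    return dict(groups)
```

Chunking bounds how many inventory keys are in flight at once, while grouping on the same three fields used elsewhere in the export keeps messages from one student session together in the output.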

XML Structure

Tutor messages are converted to datashop-compatible XML with:

  • Proper meta elements containing user_id, session_id, time, and time_zone
  • Preserved original message content structure
  • Proper XML escaping and formatting
  • Consistent indentation for readability

Error Handling

  • JSON parsing errors are caught and logged without stopping processing
  • XML parsing errors are caught and logged with detailed error messages
  • General exceptions are caught and logged with stack traces for debugging
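
The three cases above follow a common "log and continue" pattern, sketched here under stated assumptions. The function parse_batch and the logger name are hypothetical, and the input is assumed to be a list of JSON strings; the actual handlers in dataset/datashop.py may be organized differently.

```python
import json
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger("datashop_sketch")  # hypothetical logger name


def parse_batch(raw_lines):
    """Return (parsed_roots, error_count), continuing past bad records."""
    parsed, errors = [], 0
    for line in raw_lines:
        try:
            event = json.loads(line)
            parsed.append(ET.fromstring(event["result"]["message"]))
        except json.JSONDecodeError as exc:
            errors += 1
            logger.error("Invalid JSON, record skipped: %s", exc)
        except ET.ParseError as exc:
            errors += 1
            logger.error("Invalid XML, record skipped: %s", exc)
        except Exception:
            errors += 1
            # logger.exception includes the stack trace in the log record.
            logger.exception("Unexpected error, record skipped")
    return parsed, errors
```

Catching the specific exceptions first keeps the log messages informative, and the final broad handler ensures that an unexpected failure in one record (for example a missing key) does not abort the rest of the batch.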

Benefits

  • Complete data export: Datashop exports now include both student attempts and tutor interactions
  • Consistent processing: Tutor messages follow the same processing patterns as attempt_evaluated messages
  • Maintainable code: The new functions sit alongside the existing ones in dataset/datashop.py and follow the same structure
  • Robust error handling: Malformed messages are caught and logged, so processing continues with the remaining data

Testing

The implementation maintains backward compatibility and follows existing patterns, ensuring that:

  • Existing attempt_evaluated processing continues to work unchanged
  • New tutor message processing integrates seamlessly
  • Error handling prevents processing failures from affecting other data

Files Modified

  • dataset/dataset.py: Extended datashop generation pipeline
  • dataset/datashop.py: Added tutor message processing functions and enhanced message handling

@darrensiegel darrensiegel merged commit 3836db4 into master Oct 3, 2025
1 check passed
@darrensiegel darrensiegel deleted the MER-4826-include-superactivity-log branch October 3, 2025 15:28