# GF-TADs Data Analysis and Visualization

This project extracts, analyzes, and visualizes data from GF-TADs (Global Framework for the Progressive Control of Transboundary Animal Diseases) documents. It provides the following features:

- PDF Processing: Extract text from PDF documents using multiple libraries (pdfplumber, PyPDF2)
- Entity Extraction: Identify What, When, Who, Where, Impact, and Objectives from unstructured text
- Natural Language Processing: Use spaCy and NLTK for advanced text analysis
- Confidence Scoring: Assign confidence scores to extracted information
- Temporal Analysis: Track activities and recommendations over time
- Organization Analysis: Identify key stakeholders and their involvement
- Objective Categorization: Classify activities by their strategic objectives
- Impact Assessment: Analyze the expected impact of different activities
- Interactive Dashboards: Comprehensive overview of all extracted data
- Timeline Visualizations: Track activities across meetings and years
- Word Clouds: Visual representation of key terms and concepts
- Network Analysis: Relationships between organizations and activities
- Statistical Analysis: Confidence scores and data quality metrics
- Streamlit Dashboard: User-friendly web interface for data analysis
- Real-time Processing: Extract and analyze data in real-time
- Export Capabilities: Download results in Excel, CSV, or JSON format
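
The PDF Processing feature above mentions two libraries. One way to combine them is a fallback chain that tries each extractor in turn and keeps the first non-empty result. The sketch below is an assumption about how that could look, not the project's actual implementation; the function names `extract_with_pdfplumber`, `extract_with_pypdf2`, and `extract_text_with_fallback` are hypothetical.

```python
from typing import Callable, Iterable, Optional


def extract_with_pdfplumber(path: str) -> str:
    """Extract text page by page with pdfplumber (imported lazily)."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_pypdf2(path: str) -> str:
    """Extract text page by page with PyPDF2 (imported lazily)."""
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_text_with_fallback(
    path: str,
    extractors: Iterable[Callable[[str], str]] = (
        extract_with_pdfplumber,
        extract_with_pypdf2,
    ),
) -> Optional[str]:
    """Return the first non-empty result, or None if every extractor fails."""
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # broad on purpose: each library raises its own errors
            print(f"{extractor.__name__} failed on {path}: {exc}")
            continue
        if text.strip():
            return text
    return None
```

Because the extractors are passed in as plain callables, a third library or an OCR step could be appended to the chain without changing the loop.
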

Dependencies (pinned versions):

- PyPDF2==3.0.1
- pdfplumber==0.10.0
- pandas==2.1.4
- numpy==1.25.2
- matplotlib==3.8.2
- seaborn==0.13.0
- plotly==5.17.0
- wordcloud==1.9.2
- spacy==3.7.2
- nltk==3.8.1
- scikit-learn==1.3.2
- textblob==0.17.1
- streamlit==1.29.0

Install the spaCy English model:

    python -m spacy download en_core_web_sm

NLTK data is downloaded automatically on first run.

Installation steps:

    # Clone or download the project files
    # Navigate to the project directory
    cd path/to/gftad/project

    # (Recommended) Create and activate a virtual environment (Windows):
    python -m venv .venv
    .venv\Scripts\activate

    # Install dependencies
    pip install --upgrade pip setuptools wheel
    pip install -r requirements.txt

    # Install spaCy model (if using spaCy features)
    python -m spacy download en_core_web_sm

Organize your GF-TADs documents in the following structure:

    your_project_folder/
    ├── GSC_Recommendations/
    │   ├── document1.pdf
    │   ├── document2.pdf
    │   └── ...
    ├── GSC_Reports/
    │   ├── report1.pdf
    │   ├── report2.pdf
    │   └── ...

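
Before running the extractor, it can help to confirm that the folder layout above is in place. The following is a minimal sketch using only the standard library; the helper name `find_pdfs` is hypothetical, and the folder names are taken from the layout shown above.

```python
from pathlib import Path
from typing import Dict, List

# Folder names taken from the layout above
EXPECTED_FOLDERS = ("GSC_Recommendations", "GSC_Reports")


def find_pdfs(root: str) -> Dict[str, List[Path]]:
    """Map each expected subfolder to its PDF files, sorted by path."""
    found: Dict[str, List[Path]] = {}
    for name in EXPECTED_FOLDERS:
        folder = Path(root) / name
        if folder.is_dir():
            # Match the extension case-insensitively (.pdf and .PDF)
            found[name] = sorted(
                p for p in folder.iterdir() if p.suffix.lower() == ".pdf"
            )
        else:
            found[name] = []  # missing folder: report it as empty rather than fail
    return found
```

Calling `find_pdfs("your_project_folder")` returns a dictionary with one entry per expected folder, so an empty list signals a missing or empty folder.
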

Start the Streamlit dashboard:

    streamlit run streamlit_app.py

Then open your browser to http://localhost:8501

To run extraction and visualization from Python:

    from data_extractor import GFTADsDataExtractor
    from visualizer import GFTADsVisualizer

    # Extract data
    extractor = GFTADsDataExtractor("path/to/your/data/folder")
    df = extractor.process_all_documents()
    output_file = extractor.save_extracted_data(df, 'excel')

    # Create visualizations
    visualizer = GFTADsVisualizer(df=df)
    visualizer.save_visualizations()

To extract data only:

    from data_extractor import GFTADsDataExtractor

    # Initialize extractor
    extractor = GFTADsDataExtractor(r"c:\path\to\your\gftad\folder")

    # Process all documents
    df = extractor.process_all_documents()

    # Save results
    output_file = extractor.save_extracted_data(df, format='excel')
    print(f"Data saved to: {output_file}")

To create visualizations from previously extracted data:

    from visualizer import GFTADsVisualizer

    # Create visualizer
    visualizer = GFTADsVisualizer("path/to/extracted_data.xlsx")

    # Generate overview dashboard
    overview = visualizer.create_overview_dashboard()
    overview.show()

    # Create timeline visualization
    timeline = visualizer.create_activity_timeline()
    timeline.show()

    # Save all visualizations
    visualizer.save_visualizations("output/folder")

To generate a summary report:

    # Generate comprehensive summary
    report = visualizer.generate_summary_report()
    print(f"Total activities: {report['total_activities']}")
    print(f"Average confidence: {report['avg_confidence']:.2f}")

The system generates several types of output files:

Data files:

- `gftads_extracted_data_YYYYMMDD_HHMMSS.xlsx` - Main data file with all extracted information
- `gftads_extracted_data_YYYYMMDD_HHMMSS.csv` - CSV version for broader compatibility
- `gftads_extracted_data_YYYYMMDD_HHMMSS.json` - JSON format for programmatic access

Visualization and report files:

- `overview_dashboard_YYYYMMDD_HHMMSS.html` - Interactive overview dashboard
- `activity_timeline_YYYYMMDD_HHMMSS.html` - Timeline visualization
- `objectives_analysis_YYYYMMDD_HHMMSS.html` - Objectives analysis
- `confidence_analysis_YYYYMMDD_HHMMSS.html` - Confidence score analysis
- `wordclouds_YYYYMMDD_HHMMSS.png` - Word cloud visualizations
- `summary_report_YYYYMMDD_HHMMSS.json` - Comprehensive summary report

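
The `YYYYMMDD_HHMMSS` suffix in the file names above corresponds to the `strftime` pattern `%Y%m%d_%H%M%S`. The sketch below shows how such names could be built and how the most recent output could be located; the helpers `timestamped_name` and `latest_output` are hypothetical and are not part of the project's API.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"  # matches YYYYMMDD_HHMMSS


def timestamped_name(prefix: str, extension: str, now: Optional[datetime] = None) -> str:
    """Build a name such as gftads_extracted_data_20240105_093000.xlsx."""
    stamp = (now or datetime.now()).strftime(TIMESTAMP_FORMAT)
    return f"{prefix}_{stamp}.{extension}"


def latest_output(folder: str, prefix: str, extension: str) -> Optional[Path]:
    """Return the newest matching file, or None if there is none.

    The timestamp is fixed-width, so sorting by name also sorts by time.
    """
    matches = sorted(Path(folder).glob(f"{prefix}_*.{extension}"))
    return matches[-1] if matches else None
```

For example, `latest_output("output/folder", "gftads_extracted_data", "xlsx")` would pick the most recent Excel export in that folder, assuming all files follow the naming scheme.
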

You can customize the keywords used for entity extraction by modifying the `setup_keywords()` method in the `GFTADsDataExtractor` class:

    self.keywords = {
        'time_indicators': ['by', 'before', 'after', 'during', ...],
        'action_verbs': ['develop', 'implement', 'establish', ...],
        'organizations': ['FAO', 'OIE', 'WHO', ...],
        # Add your custom keywords here
    }

To adjust how confidence scores are calculated for your requirements, modify the `calculate_confidence_score()` method.
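
To make the relationship between keywords and confidence concrete, here is a minimal sketch of keyword matching with a simple score. The scoring rule (share of keyword categories with at least one hit) is an assumption chosen for illustration; the project's `calculate_confidence_score()` may work differently. The names `keyword_hits` and `simple_confidence` are hypothetical.

```python
import re
from typing import Dict, List

# Example keyword lists; the categories mirror the ones shown above,
# but the entries here are illustrative only.
KEYWORDS: Dict[str, List[str]] = {
    "time_indicators": ["by", "before", "after", "during"],
    "action_verbs": ["develop", "implement", "establish"],
    "organizations": ["FAO", "OIE", "WHO"],
}


def keyword_hits(text: str, keywords: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return, per category, the keywords that occur as whole words in text."""
    hits: Dict[str, List[str]] = {}
    for category, words in keywords.items():
        hits[category] = [
            word
            for word in words
            if re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE)
        ]
    return hits


def simple_confidence(hits: Dict[str, List[str]]) -> float:
    """Hypothetical score: share of categories with at least one hit (0.0 to 1.0)."""
    if not hits:
        return 0.0
    return sum(1 for found in hits.values() if found) / len(hits)
```

Matching is case-insensitive here, which means the pronoun "who" would count as a hit for the acronym WHO. A case-sensitive pass for acronyms is one possible refinement.
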

Troubleshooting:

- PDF Extraction Fails
  - Ensure PDFs are not password-protected
  - Try different PDF processing libraries if one fails
- spaCy Model Not Found
  - Install the model with `python -m spacy download en_core_web_sm`
- Memory Issues with Large PDFs
  - Process documents in smaller batches
  - Increase system memory or use a machine with more RAM
- Poor Entity Extraction Quality
  - Adjust keywords in the configuration
  - Improve preprocessing steps
  - Consider using more advanced NLP models

Performance tips:

- Parallel Processing: For large document collections, consider implementing parallel processing
- Caching: Use caching for repeated analyses
- Preprocessing: Clean and preprocess text for better extraction quality
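
The parallel processing and caching tips above can be sketched with the standard library alone. This is one possible approach under stated assumptions, not something the project ships: `process_in_parallel` and `cached_word_count` are hypothetical names, and `process_one` stands for whatever per-document function you use.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Callable, Dict, Iterable


def process_in_parallel(
    paths: Iterable[str],
    process_one: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Apply process_one to every path concurrently and map path -> result."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result
        return dict(zip(path_list, pool.map(process_one, path_list)))


@lru_cache(maxsize=128)
def cached_word_count(text: str) -> int:
    """Stand-in for an expensive analysis step; repeated calls hit the cache."""
    return len(text.split())
```

Threads suit I/O-heavy work such as reading files. For CPU-bound parsing, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface, provided the worker function is defined at module level so it can be pickled.
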

You can extend the entity extraction by creating custom extraction methods:

    def extract_custom_entities(self, text):
        # Your custom extraction logic here; collect results in this list
        extracted_entities = []
        return extracted_entities

The extracted data can be integrated with:
- Business Intelligence tools (Power BI, Tableau)
- Database systems (SQL Server, PostgreSQL)
- Web applications and APIs
- Machine learning pipelines
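
As an illustration of the database option above, the sketch below loads extracted records into SQLite using only the standard library. The table name `activities` and the columns `document`, `what`, `who`, and `confidence` are assumptions made for this example; the real export may use different field names.

```python
import sqlite3
from contextlib import closing
from typing import Dict, Iterable


def save_records(db_path: str, records: Iterable[Dict[str, object]]) -> int:
    """Write records to an 'activities' table and return the total row count."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS activities ("
                "document TEXT, what TEXT, who TEXT, confidence REAL)"
            )
            # Named placeholders are filled from each record's keys
            conn.executemany(
                "INSERT INTO activities (document, what, who, confidence) "
                "VALUES (:document, :what, :who, :confidence)",
                list(records),
            )
        (count,) = conn.execute("SELECT COUNT(*) FROM activities").fetchone()
    return count
```

If the data is held in a pandas DataFrame with matching column names, `df.to_dict(orient="records")` produces the list-of-dictionaries shape this function expects.
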

Feel free to contribute to this project by:

- Adding new visualization types
- Improving entity extraction algorithms
- Enhancing the user interface
- Adding support for additional document formats

This project is provided as-is for educational and research purposes.

For questions or issues, check the troubleshooting section above or refer to the inline documentation in the code files.