cari-teknisi/scrape-web
Social Media Scraping Application

A comprehensive Laravel-based web application for scraping social media platforms using Python as an external scraping engine. This application provides a robust, scalable, and user-friendly interface for extracting data from popular social media platforms.

🚀 Features

Supported Platforms

  • Twitter/X - User profiles, tweets, hashtags
  • Instagram - User profiles, posts, hashtags
  • Facebook - Page posts, group posts
  • LinkedIn - User profiles, company posts
  • TikTok - User profiles, videos, hashtags
  • YouTube - Channel videos, search results, comments

Core Features

  • Multi-Platform Support - Scrape data from 6 major social media platforms
  • Anti-Bot Measures - User agent rotation, proxy support, random delays
  • Modular Architecture - Clean separation between Laravel backend and Python scraping engine
  • Real-time Logging - Comprehensive logging system for monitoring scraping operations
  • Database Storage - MySQL database for storing scraping results and metadata
  • Export Options - Export results in JSON, CSV, or Excel formats
  • Rate Limiting - Built-in rate limiting to respect platform policies
  • Error Handling - Robust error handling and retry mechanisms

πŸ—οΈ Architecture

Laravel Backend (PHP)
├── Controllers
│   ├── SocialMediaController.php
│   └── WebScrapingController.php
├── Services
│   ├── SocialMediaScrapingService.php
│   └── PythonScrapingService.php
└── Routes & Views

Python Scraping Engine
├── scrapers/
│   ├── social_media_scrapers.py
│   ├── base_scraper.py
│   ├── requests_scraper.py
│   ├── selenium_scraper.py
│   ├── playwright_scraper.py
│   └── scrapy_spider.py
├── utils/
│   ├── logger.py
│   └── anti_bot.py
├── database/
│   └── mysql_handler.py
└── main.py
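
The division of labor in the Python engine is easiest to see through `base_scraper.py`: each platform scraper would share a common interface that `main.py` can call uniformly. The sketch below is an assumption about that structure (the class and method names are ours, not necessarily the repository's actual code):

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Hypothetical common base class for the platform scrapers."""

    def __init__(self, platform: str, use_anti_bot: bool = True):
        self.platform = platform
        self.use_anti_bot = use_anti_bot

    @abstractmethod
    def scrape_profile(self, username: str) -> dict:
        """Return profile data for the given username."""

    @abstractmethod
    def scrape_content(self, target: str, max_items: int = 100) -> list:
        """Return up to max_items content entries for the target."""

class TwitterScraper(BaseScraper):
    """Minimal stand-in implementation, for illustration only."""

    def scrape_profile(self, username: str) -> dict:
        return {"platform": self.platform, "username": username}

    def scrape_content(self, target: str, max_items: int = 100) -> list:
        return [{"target": target, "index": i} for i in range(min(max_items, 3))]

scraper = TwitterScraper("twitter")
print(scraper.scrape_profile("example_user"))
```

With this shape, `selenium_scraper.py` and `playwright_scraper.py` could override only the fetching layer while the interface stays identical.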

📋 Requirements

System Requirements

  • PHP 8.2+
  • Python 3.8+
  • MySQL 8.0+
  • Composer
  • Node.js & NPM

Python Dependencies

  • requests
  • beautifulsoup4
  • selenium
  • playwright
  • scrapy
  • pandas
  • fake-useragent
  • loguru
  • pymysql

πŸ› οΈ Installation

1. Clone the Repository

git clone <repository-url>
cd scrape-web

2. Install PHP Dependencies

composer install

3. Install Node.js Dependencies

npm install

4. Set Up the Environment

cp .env.example .env
php artisan key:generate

5. Configure Database

Update your .env file with database credentials:

DB_CONNECTION=mysql
DB_HOST=127.0.0.1
DB_PORT=3306
DB_DATABASE=laravel
DB_USERNAME=laravel
DB_PASSWORD=laravel

PYTHON_PATH=python3

6. Set Up the Python Environment

cd python_scraper
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

7. Install Browser Drivers

# For Selenium
pip install webdriver-manager

# For Playwright
playwright install

8. Run Database Migrations

php artisan migrate

9. Start the Application

# Start Laravel development server
php artisan serve

# In another terminal, start the queue worker (optional)
php artisan queue:work

🎯 Usage

Web Interface

  1. Access the Application

    • Navigate to http://localhost:8000
    • You'll be redirected to the social media scraping dashboard
  2. Scrape User Profiles

    • Go to "Profile Scraping"
    • Select platform (Twitter, Instagram, etc.)
    • Enter username
    • Click "Execute Now" or "Schedule"
  3. Scrape Content

    • Go to "Content Scraping"
    • Select platform and content type
    • Enter target (username, hashtag, etc.)
    • Set maximum items to scrape
    • Execute or schedule the task

API Usage

Scrape Social Media Profile

curl -X POST http://localhost:8000/social-media/execute-profile \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "twitter",
    "username": "elonmusk"
  }'

Scrape Social Media Content

curl -X POST http://localhost:8000/social-media/execute-content \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "twitter",
    "content_type": "tweets",
    "target": "elonmusk",
    "max_items": 20
  }'
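
The same content endpoint can be called from Python. A small sketch, where the endpoint and field names come from the curl example above but the helper function is ours:

```python
import json

def build_content_request(platform, content_type, target, max_items=20):
    """Build the JSON body for POST /social-media/execute-content."""
    return {
        "platform": platform,
        "content_type": content_type,
        "target": target,
        "max_items": max_items,
    }

body = build_content_request("twitter", "tweets", "elonmusk")
print(json.dumps(body))

# To actually send it (requires the dev server from step 9 to be running):
# import requests
# requests.post("http://localhost:8000/social-media/execute-content", json=body)
```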

Python Script Usage

Scrape Profile

cd python_scraper
python main.py --action scrape_profile --platform twitter --username elonmusk

Scrape Content

python main.py --action scrape_content --platform twitter --content_type tweets --target elonmusk --max_items 20
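
The flags in these commands suggest a CLI along the following lines. This parser is a sketch reconstructed from the invocations above, not the repository's actual `main.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of a CLI matching the main.py invocations shown above."""
    parser = argparse.ArgumentParser(description="Social media scraping engine")
    parser.add_argument("--action", required=True,
                        choices=["scrape_profile", "scrape_content"])
    parser.add_argument("--platform", required=True)
    parser.add_argument("--username")        # used by scrape_profile
    parser.add_argument("--content_type")    # used by scrape_content
    parser.add_argument("--target")          # used by scrape_content
    parser.add_argument("--max_items", type=int, default=100)
    return parser

args = build_parser().parse_args(
    ["--action", "scrape_content", "--platform", "twitter",
     "--content_type", "tweets", "--target", "elonmusk",
     "--max_items", "20"])
print(args.action, args.platform, args.max_items)
```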

📊 Database Schema

Tables

  • scraping_tasks - Task metadata and status
  • scraping_results - Scraped data results
  • scraping_errors - Error logs
  • scraping_logs - Detailed operation logs

🔧 Configuration

Environment Variables

# Python Configuration
PYTHON_PATH=python3

# Database Configuration
DB_HOST=localhost
DB_PORT=3306
DB_DATABASE=laravel
DB_USERNAME=laravel
DB_PASSWORD=laravel

# Scraping Configuration
SCRAPING_ANTI_BOT=true
SCRAPING_USE_PROXIES=false
SCRAPING_MAX_ITEMS=100
SCRAPING_MAX_EXECUTION_TIME=300

# Platform-specific settings
SCRAPING_ALLOWED_PLATFORMS=twitter,instagram,facebook,linkedin,tiktok,youtube

Platform-Specific Settings

Each platform has specific rate limits and requirements:

  • Twitter: 300 requests per 15 minutes
  • Instagram: 200 requests per hour
  • Facebook: 200 requests per hour
  • LinkedIn: 100 requests per day
  • TikTok: 1000 requests per hour
  • YouTube: 10,000 requests per day
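
One simple way to honor these limits is to space requests evenly across each platform's window. The numbers below are taken from the table above; the dict layout and helper are an illustrative sketch, not the application's actual configuration format:

```python
# Per-platform limits from the table above: (max requests, window in seconds)
RATE_LIMITS = {
    "twitter":   (300, 15 * 60),
    "instagram": (200, 60 * 60),
    "facebook":  (200, 60 * 60),
    "linkedin":  (100, 24 * 60 * 60),
    "tiktok":    (1000, 60 * 60),
    "youtube":   (10_000, 24 * 60 * 60),
}

def min_delay_seconds(platform: str) -> float:
    """Smallest even spacing between requests that stays under the limit."""
    max_requests, window = RATE_LIMITS[platform]
    return window / max_requests

print(min_delay_seconds("twitter"))  # 3.0 seconds between Twitter requests
```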

πŸ›‘οΈ Anti-Bot Measures

The application includes several anti-bot measures:

  • User Agent Rotation - Random user agents for each request
  • Proxy Support - Optional proxy rotation
  • Random Delays - Configurable delays between requests
  • Session Rotation - Automatic session rotation
  • Rate Limiting - Built-in rate limiting per platform
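
The first three measures can be sketched in a few lines. The dependency list includes fake-useragent for fresh user-agent strings; this sketch uses a small static pool instead so it stays self-contained:

```python
import random
import time

# Static pool for illustration; the engine's fake-useragent dependency
# can supply fresh strings at runtime.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers() -> dict:
    """Pick a different user agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_sleep(min_s: float = 1.0, max_s: float = 3.0) -> float:
    """Sleep a random interval between requests; returns the delay used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

print(random_headers())
```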

πŸ“ Logging

The application provides comprehensive logging:

  • Console Logs - Real-time operation logs
  • File Logs - Persistent log files with rotation
  • Database Logs - Structured logging in database
  • Error Tracking - Detailed error logging and tracking
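
The console and rotating-file pieces can be reproduced with the standard library alone. The engine itself lists loguru as a dependency; this stdlib-only sketch is an equivalent for illustration, with names of our choosing:

```python
import logging
from logging.handlers import RotatingFileHandler

def make_logger(path: str = "scraper.log") -> logging.Logger:
    """Console + rotating file logger (the repo itself uses loguru)."""
    logger = logging.getLogger("scraper")
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    # Rotate after ~1 MB, keeping 3 old files.
    file_handler = RotatingFileHandler(path, maxBytes=1_000_000, backupCount=3)
    file_handler.setFormatter(fmt)
    console = logging.StreamHandler()
    console.setFormatter(fmt)
    logger.addHandler(file_handler)
    logger.addHandler(console)
    return logger

log = make_logger("scraper.log")
log.info("scraping started")
```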

🚨 Rate Limiting & Ethics

Rate Limiting

  • Respects platform-specific rate limits
  • Configurable delays between requests
  • Automatic retry with exponential backoff
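
Retry with exponential backoff follows a standard pattern: double the wait after every failed attempt, then re-raise once attempts are exhausted. A generic sketch (the function names are ours):

```python
import time

def retry_with_backoff(func, max_attempts: int = 4, base_delay: float = 1.0):
    """Call func, retrying on exception with delays of 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demo: fails twice, then succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("rate limited")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # ok
```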

Ethical Considerations

  • Only scrape publicly available data
  • Respect robots.txt files
  • Implement reasonable delays
  • Monitor for rate limiting responses
  • Use the application responsibly

πŸ” Monitoring & Analytics

Dashboard Features

  • Real-time scraping statistics
  • Platform-specific metrics
  • Success/failure rates
  • Recent tasks overview
  • Environment health checks

Export Options

  • JSON - Structured data export
  • CSV - Spreadsheet-friendly format
  • Excel - Advanced formatting options
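
JSON and CSV export need nothing beyond the standard library; Excel export additionally requires pandas (which is in the dependency list) plus an engine such as openpyxl. A stdlib-only sketch of the first two formats, with illustrative field names:

```python
import csv
import json

# Example result rows; field names are illustrative.
rows = [
    {"platform": "twitter", "username": "example", "followers": 123},
    {"platform": "instagram", "username": "example", "followers": 456},
]

def export_json(rows, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)

def export_csv(rows, path):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

export_json(rows, "results.json")
export_csv(rows, "results.csv")
```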

πŸ› Troubleshooting

Common Issues

  1. Python Not Found

    # Set correct Python path in .env
    PYTHON_PATH=/usr/bin/python3
  2. Dependencies Missing

    cd python_scraper
    pip install -r requirements.txt
  3. Browser Drivers

    # Install Playwright browsers
    playwright install
    
    # Install ChromeDriver
    pip install webdriver-manager
  4. Database Connection

    # Check database configuration
    php artisan config:cache
    php artisan migrate:status

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

This application is for educational and research purposes only. Users are responsible for:

  • Complying with platform Terms of Service
  • Respecting rate limits and robots.txt
  • Using the application ethically and legally
  • Obtaining necessary permissions for data collection

🆘 Support

For support and questions:

  • Check the troubleshooting section
  • Review the logs for error details
  • Ensure all dependencies are installed
  • Verify environment configuration

Note: This application is designed for legitimate data collection and research purposes. Always respect platform policies and use responsibly.
