HarrisonCaetanoCandido/proxy-scraper-api


Introduction

Web scraping tools are a practical way to extract large amounts of data efficiently. This challenge is to design an undetectable web scraper capable of bypassing Shopee’s anti-scraping mechanisms. The scraper should also be scalable, handling high volumes of data without compromising performance.

Task Description

Your challenge is to develop a scalable, undetectable API capable of retrieving detailed product data from Shopee Taiwan's item detail pages. The API must scrape and return the full JSON response from Shopee’s official get_pc or get_rw APIs while effectively bypassing detection and anti-scraping measures. The solution should ensure high accuracy, stability, and performance during testing.

The API must scrape item detail data from URLs adhering to the following format:

https://shopee.tw/a-i.{storeId}.{dealId}

For example: https://shopee.tw/a-i.178926468.21448123549
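As a minimal sketch of how a URL in this format could be validated and split into its two identifiers, the helper below uses a regular expression. The name `parseItemUrl` and the returned field names are hypothetical and not taken from the project's source; the sketch also assumes both identifiers are purely numeric, as in the example URL.

```typescript
// Hypothetical helper: extracts storeId and dealId from a Shopee Taiwan item URL.
// Assumption: both identifiers are numeric, as in the example URL above.
interface ItemIds {
  storeId: string;
  dealId: string;
}

const ITEM_URL_PATTERN = /^https:\/\/shopee\.tw\/a-i\.(\d+)\.(\d+)$/;

function parseItemUrl(url: string): ItemIds | null {
  const match = ITEM_URL_PATTERN.exec(url.trim());
  if (match === null) {
    return null; // URL does not follow the a-i.{storeId}.{dealId} format
  }
  // Identifiers are kept as strings so they are passed on exactly as received.
  return { storeId: match[1], dealId: match[2] };
}
```

Returning `null` for anything that does not match lets the API reject malformed input before any request is sent.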

Path Organization

scraper-challange/
- src/
    - utils/
        - fetch_error_content.ts
        - fetch_delay.ts
    - index.ts          # API entry point
    - scraper.ts        # Scraping and data processing
- dist/
- package.json
- package-lock.json
- tsconfig.json
- .env

Tech Spec

Express

Web framework used to build the API

TypeScript

Language used to write the API service code

Node

JavaScript runtime environment

ts-node package

Runs .ts files directly during local development

nodemon package

Automatically restarts the server on every file change

axios package

A promise-based HTTP client for Node.js and the browser

redis

A fast in-memory data store, used in this project to cache API responses

express-rate-limit

Limits clients to 100 requests per 15 minutes
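The 100-requests-per-15-minutes policy could be expressed as an options object like the one below. This is a sketch rather than the project's actual configuration: `windowMs` and `max` are option names accepted by express-rate-limit, but the variable name `rateLimitOptions` and the wiring shown in the comments are assumptions, and newer releases of the package may prefer a different name for `max`, so check the documentation of the installed version.

```typescript
// Options in the shape accepted by express-rate-limit's rateLimit() factory.
const rateLimitOptions = {
  windowMs: 15 * 60 * 1000, // 15-minute window, in milliseconds
  max: 100, // at most 100 requests per client within one window
};

// Assumed wiring inside src/index.ts (shown as comments, not executed here):
//   import rateLimit from "express-rate-limit";
//   app.use(rateLimit(rateLimitOptions));
```

By default express-rate-limit identifies clients by IP address, so the limit applies per caller rather than to the API as a whole.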

Local and Remote commands

npm run dev: Starts the Node server with TypeScript and nodemon

npm run build: Compiles the TypeScript to JavaScript and populates dist/

npm start: Runs the compiled code (used in production)

npm run clean: Removes the dist folder

Public routes - Endpoint

The endpoint accepts Shopee item URLs in the following format:

https://shopee.tw/a-i.{storeId}.{dealId}

Exponential Backoff

The exponential backoff is applied to each request individually and follows the equation below, where ^ denotes exponentiation and attempt starts at 1:

exp_backoff = initial_backoff * (2 ^ (attempt - 1));

The parameters take the following values:

  • Initial backoff: 1 second;
  • Max backoff: 60 seconds (upper bound on the computed delay).
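The backoff schedule above can be sketched in TypeScript as follows. The function and constant names are hypothetical, and the sketch assumes that the maximum backoff acts as a cap on the computed delay. Note that in TypeScript `^` is bitwise XOR, so exponentiation is written with `**`.

```typescript
const INITIAL_BACKOFF_MS = 1000; // 1 second
const MAX_BACKOFF_MS = 60000; // 60 seconds

// Hypothetical helper. attempt is 1-based: attempt 1 waits the initial backoff.
function backoffDelayMs(attempt: number): number {
  const uncapped = INITIAL_BACKOFF_MS * 2 ** (attempt - 1);
  // Assumption: the max backoff is an upper bound on the delay.
  return Math.min(uncapped, MAX_BACKOFF_MS);
}

// Resolves after the delay computed for the given attempt.
function waitForAttempt(attempt: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
}
```

With these values the delays for attempts 1 through 7 are 1, 2, 4, 8, 16, 32 and 60 seconds; the seventh would be 64 seconds without the cap.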

Fallback

If the get_pc request fails, the scraper falls back to the get_rw endpoint.
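One way to express this fallback is a small generic helper that runs a primary request and, if it rejects, runs a secondary one. This is a sketch under assumptions: `withFallback` and the `Fetcher` type are hypothetical names, and the actual get_pc and get_rw requests are left to the caller.

```typescript
// A function that performs one request and resolves with its result.
type Fetcher<T> = () => Promise<T>;

// Hypothetical helper: try the primary source, and use the fallback if it rejects.
async function withFallback<T>(primary: Fetcher<T>, fallback: Fetcher<T>): Promise<T> {
  try {
    return await primary();
  } catch {
    // The primary request failed (for example, an errored get_pc call),
    // so the result of the fallback request is returned instead.
    return fallback();
  }
}
```

A caller would pass the get_pc request as `primary` and the get_rw request as `fallback`. If both fail, the rejection from the fallback propagates to the caller.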

Cache

API responses are cached in Redis, an in-memory data store. It listens on port 6379 by default and can be installed on Debian or Ubuntu with:

sudo apt update
sudo apt install redis-server
sudo systemctl enable redis-server # start automatically at boot
sudo systemctl start redis-server
redis-cli ping # should reply PONG

sudo systemctl restart redis-server # restart Redis
redis-cli monitor # stream the commands the server receives

sudo systemctl status redis-server # show server status

sudo systemctl disable redis-server # do not start at boot
sudo systemctl stop redis-server # stop the server
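How cached responses might be read and written can be sketched with a cache-aside helper. Everything named here is an assumption for illustration: the `AsyncCache` interface, the `item:{storeId}:{dealId}` key scheme, and the time-to-live parameter do not come from the project's source. The in-memory `MemoryCache` stands in for Redis so the sketch runs without a server; a thin wrapper around a Redis client could implement the same interface.

```typescript
// Minimal async key-value contract; a Redis client wrapper could satisfy it.
interface AsyncCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory stand-in so the sketch runs without a Redis server.
class MemoryCache implements AsyncCache {
  private store = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.store.get(key);
    if (entry === undefined || entry.expiresAt <= Date.now()) {
      this.store.delete(key); // drop missing or expired entries
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

// Hypothetical key scheme: one cache entry per store/deal pair.
function cacheKey(storeId: string, dealId: string): string {
  return `item:${storeId}:${dealId}`;
}

// Cache-aside: return the cached JSON if present, otherwise fetch and store it.
async function getOrFetch(
  cache: AsyncCache,
  key: string,
  ttlSeconds: number,
  fetchJson: () => Promise<string>,
): Promise<string> {
  const cached = await cache.get(key);
  if (cached !== null) {
    return cached;
  }
  const fresh = await fetchJson();
  await cache.set(key, fresh, ttlSeconds);
  return fresh;
}
```

Serving repeated lookups of the same item from the cache reduces the number of upstream requests, which also helps the API stay within its own rate limit.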
