The challenge is to design a web scraper that bypasses Shopee’s anti-scraping mechanisms without being detected. The scraper should also scale to high volumes of data without degrading performance.
Concretely, the deliverable is a scalable, undetectable API that retrieves detailed product data from Shopee Taiwan's item detail pages. The API must scrape and return the full JSON response from Shopee’s official get_pc or get_rw APIs while avoiding detection and anti-scraping measures. The solution should remain accurate, stable, and performant during testing.
The API must scrape item detail data from URLs adhering to the following format:
https://shopee.tw/a-i.{storeId}.{dealId}
For example: https://shopee.tw/a-i.178926468.21448123549
scraper-challange/
- src/
  - utils/
    - fetch_error_content.ts
    - fetch_delay.ts
  - index.ts # Entry endpoint
  - scraper.ts # Scraping data treatment
- dist/
- package.json
- package-lock.json
- tsconfig.json
- .env
To build the API
To write the API service code
JavaScript runtime environment
To run .ts files directly on the local machine
To automatically restart the server on every file change
A promise-based HTTP client for the browser
A fast in-memory database, used in this project to cache API responses
Rate limiting: 100 requests per 15 minutes
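To make the 100 requests per 15 minutes limit concrete, here is a minimal sketch of a fixed-window counter. It is not the project's actual middleware. The class name `FixedWindowLimiter`, keying by client, and the fixed-window strategy itself are assumptions made for illustration.

```typescript
// Limits taken from the text above: 100 requests per 15 minutes.
const WINDOW_MS = 15 * 60 * 1000;
const MAX_REQUESTS = 100;

interface WindowState {
  windowStart: number; // timestamp (ms) at which the current window opened
  count: number;       // requests seen in the current window
}

// Hypothetical fixed-window limiter; one counter per client key (e.g. an IP).
class FixedWindowLimiter {
  private readonly clients = new Map<string, WindowState>();
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number = MAX_REQUESTS, windowMs: number = WINDOW_MS) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true if a request from `clientKey` at time `now` (ms) is allowed.
  allow(clientKey: string, now: number = Date.now()): boolean {
    const state = this.clients.get(clientKey);
    // First request from this client, or the previous window has expired.
    if (state === undefined || now - state.windowStart >= this.windowMs) {
      this.clients.set(clientKey, { windowStart: now, count: 1 });
      return true;
    }
    if (state.count >= this.maxRequests) {
      return false; // over the limit until the window resets
    }
    state.count += 1;
    return true;
  }
}
```

In a request handler, a `false` result would typically translate into an HTTP 429 response.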
npm run dev: Starts the Node server with TypeScript and nodemon
npm run build: Compiles the TypeScript to JavaScript and populates dist/
npm start: Runs the compiled code (used in production)
npm run clean: Removes the dist folder
Item URLs must follow this format:
https://shopee.tw/a-i.{storeId}.{dealId}
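As an illustration of how an incoming URL could be checked against this format, the sketch below extracts the two IDs with a regular expression. The helper name `parseItemUrl` and the decision to keep the IDs as strings are assumptions, not part of the project code.

```typescript
// Hypothetical result type: both IDs are kept as strings to avoid any
// numeric precision concerns with long identifiers.
interface ItemIds {
  storeId: string;
  dealId: string;
}

// Matches https://shopee.tw/a-i.{storeId}.{dealId} with numeric IDs only.
const ITEM_URL_PATTERN = /^https:\/\/shopee\.tw\/a-i\.(\d+)\.(\d+)$/;

// Returns the IDs when the URL matches the expected format, otherwise null.
function parseItemUrl(url: string): ItemIds | null {
  const match = ITEM_URL_PATTERN.exec(url.trim());
  if (match === null) {
    return null;
  }
  return { storeId: match[1], dealId: match[2] };
}

console.log(parseItemUrl("https://shopee.tw/a-i.178926468.21448123549"));
// { storeId: '178926468', dealId: '21448123549' }
```

Rejecting malformed URLs before any outbound request keeps invalid input from consuming retries or rate-limit budget.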
Exponential backoff is applied to each request individually, following the equation below:
exp_backoff = initial_backoff * (2 ^ (attempt - 1));
Where attempt is the 1-based attempt number and the parameters take the following values:
- Initial backoff: 1 second;
- Max backoff: 60 seconds;
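The equation and parameters above can be sketched as follows. Capping the delay at the maximum backoff is an assumption about how "Max backoff" is meant to apply, and `backoffDelayMs` and `withRetry` are hypothetical names, not functions from the project.

```typescript
const INITIAL_BACKOFF_MS = 1_000; // initial backoff: 1 second
const MAX_BACKOFF_MS = 60_000;    // max backoff: 60 seconds

// exp_backoff = initial_backoff * 2^(attempt - 1), with attempt starting at 1.
// Assumption: the result is capped at the maximum backoff.
function backoffDelayMs(attempt: number): number {
  const uncapped = INITIAL_BACKOFF_MS * 2 ** (attempt - 1);
  return Math.min(uncapped, MAX_BACKOFF_MS);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: runs `task`, waiting backoffDelayMs(attempt) between
// failed attempts, and rethrows the last error once attempts are exhausted.
async function withRetry<T>(task: () => Promise<T>, maxAttempts = 5): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}

console.log([1, 2, 3, 7, 8].map(backoffDelayMs));
// [ 1000, 2000, 4000, 60000, 60000 ]
```

With these values the uncapped delay first exceeds 60 seconds at attempt 7 (64 seconds), so every later attempt waits the maximum.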
Consume get_rw endpoint
Redis is used as the in-memory database service. It listens on port 6379 by default and can be installed with:
sudo apt update
sudo apt install redis-server
sudo systemctl enable redis-server # enable automatic initialization
sudo systemctl start redis-server
redis-cli ping # should reply PONG
sudo systemctl restart redis-server # restart redis
redis-cli monitor # streams the commands the server receives
sudo systemctl status redis-server # server status
sudo systemctl disable redis-server # disable server
sudo systemctl stop redis-server # stop server
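With Redis running, its role as a cache for API responses could look roughly like the cache-aside sketch below. The `KeyValueStore` interface, the `getCached` helper, and the TTL parameter are assumptions for illustration. `InMemoryStore` is only a stand-in so the example runs without a Redis connection. A real Redis client would need a thin adapter to this interface.

```typescript
// Hypothetical minimal contract for a string key-value store with expiry.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// Stand-in store backed by a Map, used here in place of Redis.
class InMemoryStore implements KeyValueStore {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (entry === undefined || entry.expiresAt <= Date.now()) {
      this.entries.delete(key); // drop missing or expired entries
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

// Cache-aside: return the cached JSON if present, otherwise call `load`,
// store its result under `key` for `ttlSeconds`, and return it.
async function getCached<T>(
  store: KeyValueStore,
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  const cached = await store.get(key);
  if (cached !== null) {
    return JSON.parse(cached) as T;
  }
  const fresh = await load();
  await store.set(key, JSON.stringify(fresh), ttlSeconds);
  return fresh;
}
```

A key built from the store and deal IDs would let repeated requests for the same item skip the upstream call until the entry expires.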