Several methods of using proxy for web page data scraping

Ana Quil

01-29-2022


Without proxies, a crawling operation of any scale is hardly feasible, so most crawler engineers rely on safe and stable proxy IPs. But are all worries gone once you have high-quality proxies? Not quite. You still need a sound scheme that allocates proxy resources effectively and keeps the work efficient.

Scheme 1: each process randomly takes a batch of IPs from the API interface and reuses them; when the batch fails, it calls the API again. The general logic is as follows:

1. Each process randomly retrieves a batch of IPs from the interface and tries them one after another to fetch data;

2. If the visit succeeds, continue to grab the next page;

3. If it fails, take a fresh batch of IPs from the interface and continue to try (see the sketch below).
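A minimal sketch of this batch-based logic in Python with the `requests` library. The extraction API URL, its one-`ip:port`-per-line response format, and the target URL are all hypothetical placeholders:

```python
import requests

API_URL = "http://proxy.example.com/api/get?num=100"  # hypothetical extraction API,
                                                      # assumed to return one ip:port per line

def get_ip_batch():
    """Call the extraction API once and return a list of 'ip:port' strings."""
    return requests.get(API_URL, timeout=5).text.split()

def fetch_with_batch(url):
    """Try the batch IP by IP; refill from the API only when the batch runs dry."""
    batch = get_ip_batch()
    while True:
        if not batch:                      # the whole batch failed: pull a fresh one
            batch = get_ip_batch()
            continue
        proxy = {"http": "http://" + batch[0], "https": "http://" + batch[0]}
        try:
            # 3-second connect timeout, 5-second read timeout, as discussed below
            resp = requests.get(url, proxies=proxy, timeout=(3, 5))
            resp.raise_for_status()
            return resp.text               # success: the caller grabs the next page
        except requests.RequestException:
            batch.pop(0)                   # give up on this IP and try the next one
```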

Disadvantages: every IP has an expiration time. If you extract 100 IPs and are only on the 20th when the batch expires, most of the rest may already be unusable. And with the HTTP request set to a 3-second connect timeout and a 5-second read timeout, each failed attempt can waste 3 to 8 seconds, time in which hundreds of successful fetches might otherwise have been completed.

Scheme 2: each process randomly takes a single IP from the interface and uses it for fetching. If it fails, call the API to get a new IP. The general logic is as follows:

1. Each process randomly retrieves one IP from the interface and uses it to fetch pages;

2. If the visit succeeds, continue to grab the next page;

3. If it fails, randomly take another IP from the interface and continue to try (see the sketch below).
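A minimal sketch of the single-IP variant, under the same assumptions (hypothetical API URL and response format). Note how every failure triggers another API call, which is exactly the pressure point described below:

```python
import requests

API_URL = "http://proxy.example.com/api/get?num=1"  # hypothetical: returns one ip:port

def get_one_ip():
    """Ask the extraction API for a single 'ip:port' string."""
    return requests.get(API_URL, timeout=5).text.strip()

def crawl(urls):
    ip = get_one_ip()
    for url in urls:
        while True:
            proxy = {"http": "http://" + ip, "https": "http://" + ip}
            try:
                resp = requests.get(url, proxies=proxy, timeout=(3, 5))
                resp.raise_for_status()
                break                      # success: keep this IP for the next page
            except requests.RequestException:
                ip = get_one_ip()          # failure: hit the API again for a new IP
```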

Disadvantages of this scheme: the API is called very frequently, which puts great pressure on the proxy server, affects the stability of the API interface, and may get the extraction rate-limited. This scheme is not suitable for long-term stable operation.

Scheme 3: first extract a large number of IPs and import them into a local database, then take IPs from the database as needed. The general logic is as follows:

1. Create a table in the database and write an import script. Decide how many IPs to request from the API per minute (consult your proxy IP service provider for a sensible rate) and import the IP list into the database;

2. Record fields such as the import time, IP, port, expiration time, and IP availability status;

3. Write a grab script that reads available IPs from the database; each process takes one IP from the database for its own use;

4. Perform the fetch, check the result, handle cookies, and so on. As soon as a captcha appears or the request fails, give up on that IP and replace it with another (see the sketch below).
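A minimal sketch of the database-backed scheme using Python's built-in `sqlite3`. The table layout follows the fields listed above; the API URL, the 5-minute lifetime, and the import rate are assumptions to agree with your proxy provider:

```python
import sqlite3
import time

import requests

API_URL = "http://proxy.example.com/api/get?num=50"  # hypothetical extraction API
DB_PATH = "proxies.db"

SCHEMA = """
CREATE TABLE IF NOT EXISTS proxy (
    ip          TEXT PRIMARY KEY,  -- 'ip:port'
    imported_at REAL,              -- unix timestamp of import
    expires_at  REAL,              -- unix timestamp of expiry
    available   INTEGER DEFAULT 1  -- 1 = usable, 0 = given up
)
"""

def import_batch(conn, ttl=300):
    """Import script: pull one batch from the API and store it.
    Run on a timer at whatever per-minute rate the provider recommends."""
    now = time.time()
    for line in requests.get(API_URL, timeout=5).text.split():
        conn.execute(
            "INSERT OR IGNORE INTO proxy (ip, imported_at, expires_at) VALUES (?, ?, ?)",
            (line, now, now + ttl),
        )
    conn.commit()

def get_available_ip(conn):
    """Grab script: read one live, unexpired IP from the database."""
    row = conn.execute(
        "SELECT ip FROM proxy WHERE available = 1 AND expires_at > ? LIMIT 1",
        (time.time(),),
    ).fetchone()
    return row[0] if row else None

def mark_bad(conn, ip):
    """Called when a fetch fails or a captcha appears: give up on this IP."""
    conn.execute("UPDATE proxy SET available = 0 WHERE ip = ?", (ip,))
    conn.commit()

conn = sqlite3.connect(DB_PATH)
conn.execute(SCHEMA)
```

Each crawler process then calls `get_available_ip` before fetching and `mark_bad` on any failure or captcha, so the extraction API is only hit by the importer, never by the crawlers themselves.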

This scheme avoids draining proxy server resources, allocates proxy usage effectively, and is more efficient and stable, keeping the crawling work persistent and reliable.
