How does a crawler use a proxy?

Crawling at any scale is hardly feasible without proxy IPs: requests from a single IP are quickly rate-limited or blocked, so most crawler operators rely on a pool of secure, stable proxy IPs. But a high-quality proxy pool alone is not enough; the crawler itself must be improved so that it allocates those IPs efficiently. Two common strategies are described below.


Strategy 1: each process pulls a random batch of IPs from the provider's API, reuses them until they fail, then calls the API again for a fresh batch.


The general logic is as follows:


1. Each process pulls a random batch of IPs from the API and tries them one by one against the target pages.


2. If a request succeeds, keep the same IP and move on to the next page.


3. If a request fails, discard that IP; once the batch is exhausted, fetch fresh IPs from the API and try again.
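The three steps above can be sketched in Python with only the standard library. The `PROXY_API` endpoint and its response format (one `ip:port` per line) are assumptions standing in for a real provider's interface:

```python
import random
import urllib.request

# Hypothetical provider endpoint returning one "ip:port" per line.
PROXY_API = "http://api.proxy-provider.example/get?num=100"

def fetch_pool():
    """Step 1: pull a fresh random batch of proxies from the API."""
    with urllib.request.urlopen(PROXY_API, timeout=5) as resp:
        return resp.read().decode().split()

def proxy_map(proxy):
    """Route both schemes through the given 'ip:port' proxy."""
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}

def proxy_opener(proxy):
    """Build a urllib opener that sends requests through the proxy."""
    return urllib.request.build_opener(
        urllib.request.ProxyHandler(proxy_map(proxy)))

def crawl(urls, fetch=fetch_pool):
    pool = fetch()
    for url in urls:
        while True:
            if not pool:               # step 3: batch exhausted, refill
                pool = fetch()
            proxy = random.choice(pool)
            try:
                page = proxy_opener(proxy).open(url, timeout=5).read()
                yield url, page        # step 2: success, next URL
                break
            except OSError:
                pool.remove(proxy)     # dead proxy: drop it and retry
```

`random.choice` spreads load across the batch, and removing a proxy on failure keeps the working set clean until the whole batch is used up.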


Disadvantage: every proxy IP has an expiry time. If you extract 100 and have only used 20 by the time they start expiring, the rest may already be unusable. Worse, with an HTTP connect timeout of 3 seconds and a read timeout of 5 seconds, a single bad proxy can stall a worker for 3 to 8 seconds, time in which it could otherwise have completed hundreds of fetches.
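That 3-to-8-second window comes from the two separate deadlines, connect and read. Since the standard library's `urlopen` only exposes a single timeout, the sketch below uses a raw socket to show the two limits independently; the host and path arguments are placeholders:

```python
import socket

def fetch(host, port=80, path="/", connect_timeout=3.0, read_timeout=5.0):
    """Fetch a page with separate connect and read deadlines.

    A dead proxy fails within connect_timeout; once connected, each
    recv blocks at most read_timeout seconds, so a single bad hop
    cannot hang a worker indefinitely.
    """
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        sock.settimeout(read_timeout)   # switch deadline once connected
        request = (f"GET {path} HTTP/1.1\r\nHost: {host}\r\n"
                   "Connection: close\r\n\r\n")
        sock.sendall(request.encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:                # server closed: response complete
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        sock.close()
```

Libraries such as requests accept a `(connect, read)` timeout tuple that expresses the same idea in one argument.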


Strategy 2: extract a large batch of IPs, import them into a local database, and have the crawler processes draw IPs from the database.


The general logic is as follows:


1. Create a table in the database and write an import script that calls the provider's API periodically (ask the proxy IP service provider how many requests per minute are allowed) and writes the returned IP list into the table.


2. For each proxy, record fields such as import time, IP address, port number, expiry time, and availability status.


3. Write the crawl script so that it reads available IPs from the database, with each process taking one IP to use.


4. Perform the fetch, check the result, handle cookies, and so on; as soon as a CAPTCHA or error page appears, discard that IP and switch to another.
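The table and the take/retire operations from the four steps above can be sketched with SQLite (any database works; the column names and the 5-minute TTL are illustrative assumptions, not prescribed by any provider):

```python
import sqlite3
import time

# Steps 1-2: a pool table recording import time, address, expiry, status.
SCHEMA = """
CREATE TABLE IF NOT EXISTS proxy_pool (
    ip          TEXT    NOT NULL,
    port        INTEGER NOT NULL,
    imported_at REAL    NOT NULL,   -- when the import script added it
    expires_at  REAL    NOT NULL,   -- provider-reported expiry time
    available   INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (ip, port)
)
"""

def import_proxies(conn, proxies, ttl=300):
    """Import script body: write one API batch of (ip, port) pairs."""
    now = time.time()
    conn.executemany(
        "INSERT OR REPLACE INTO proxy_pool VALUES (?, ?, ?, ?, 1)",
        [(ip, port, now, now + ttl) for ip, port in proxies])
    conn.commit()

def take_proxy(conn):
    """Step 3: hand a crawler process one live IP, or None if empty."""
    return conn.execute(
        "SELECT ip, port FROM proxy_pool"
        " WHERE available = 1 AND expires_at > ? LIMIT 1",
        (time.time(),)).fetchone()

def retire_proxy(conn, ip, port):
    """Step 4: a CAPTCHA or error page means the IP is burned."""
    conn.execute(
        "UPDATE proxy_pool SET available = 0 WHERE ip = ? AND port = ?",
        (ip, port))
    conn.commit()
```

A worker's loop is then: `take_proxy`, fetch through that IP, and on a CAPTCHA or error call `retire_proxy` and take another.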

If you want to try proxy IPs, you can visit the RoxLabs proxy IP official website to learn more; it provides high-anonymity, stable proxy IPs.