What are the anti-crawler strategies and countermeasures?

As the Internet has grown, the arms race between crawlers and anti-crawler measures has never stopped. This article covers several of the more complex anti-crawler strategies and how crawlers respond to them.

 

1. Data masquerading.

 

On a web page, a crawler can usually monitor network traffic and then simulate the requests a normal user would send. To counter this, some sites add complexity by disguising the data itself. For example, a site may display a price of $299 while the DOM tree holds different or scrambled values that CSS rearranges at render time. To recover the correct value, the crawler must apply the same calculations that the CSS rules perform. Crawlers have to be very careful here: as soon as the target site changes its rules, the extracted data becomes invalid.
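As a rough illustration of the kind of calculation involved, here is a minimal Python sketch. It assumes one hypothetical scheme: every digit sits in a fixed-width span, and a CSS `left` offset moves it to its visible position. The function name, the slot width, and the sample offsets are made up for the example; real sites use many different schemes.

```python
def decode_price(spans, slot_width):
    """Rebuild the visible text from (char, left_offset_px) pairs in DOM order.

    Assumption: each span is slot_width pixels wide, so its natural x
    position is index * slot_width, and CSS `left` shifts it from there.
    """
    positioned = [
        (index * slot_width + left, char)
        for index, (char, left) in enumerate(spans)
    ]
    # Read the characters in the order a user would see them on screen.
    positioned.sort(key=lambda item: item[0])
    return "".join(char for _, char in positioned)


# The DOM order reads "929", but the offsets make the browser show "299".
dom_spans = [("9", 10), ("2", -10), ("9", 0)]
print(decode_price(dom_spans, slot_width=10))
```

If the site changes the slot width or switches to another trick, a decoder like this silently produces wrong values, which is why such rules need constant checking.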

 

2. Parameter signature. The app computes a signature from the request parameters using an encryption algorithm.

 

Signatures usually incorporate a timestamp, and the timestamp is sent as a fixed parameter of every request, so a given signature is valid only for a short time. When the server receives the request, it verifies the parameters and the timestamp and checks whether the signature matches the one it computes itself. If they do not match, the request is rejected as illegitimate. The signing algorithm is usually hard to obtain from the app side, and decompiling the app is often required to recover it.
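The sign-and-verify flow described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the HMAC-SHA256 scheme, the `ts` parameter name, the secret, and the 60-second validity window are all hypothetical, since each app picks its own algorithm and keeps it hidden.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"demo-secret"   # hypothetical key embedded in the app
MAX_AGE_SECONDS = 60      # hypothetical validity window


def sign(params, timestamp, secret=SECRET):
    """Client side: sign the sorted parameters together with the timestamp."""
    payload = urlencode(sorted({**params, "ts": timestamp}.items()))
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()


def verify(params, timestamp, signature, now=None, secret=SECRET):
    """Server side: reject stale timestamps, then compare signatures."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_AGE_SECONDS:
        return False
    expected = sign(params, timestamp, secret)
    return hmac.compare_digest(expected, signature)


ts = int(time.time())
sig = sign({"q": "shoes", "page": "1"}, ts)
print(verify({"q": "shoes", "page": "1"}, ts, sig))   # signature matches
print(verify({"q": "boots", "page": "1"}, ts, sig))   # tampered parameter
```

A crawler that replays a captured request fails the timestamp check, and one that changes a parameter fails the signature check, so it has to reproduce the signing step itself.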

 

3. Hidden validation. Hidden validation is one of the most complex methods.

 

For example, to protect itself, a site may require a JavaScript request to a specific endpoint to obtain a token, so that each request carries a different token. Some sites even attach special request parameters to invisible images to determine whether the visitor is a real browser user. In such cases, requesting the API directly is often infeasible or very difficult, and the practical way around it is to use a tool such as headless Chrome to simulate the user's behavior.

 

4. Blocking debugging.

 

This anti-crawler strategy is a special one. As soon as you open the browser's developer console, the page triggers the debugger statement over and over, indefinitely. In this example, the site loads a script called leonid-tq-jq-v3-min.js that adds a debugger statement to all constructor functions, so the debugger fires whenever any object is created. The purpose is to protect the code by preventing unexpected scripts or programs from tracing and debugging it. You can get around this limitation by building a modified JS file with the debugger keyword removed, using mitmproxy to forward traffic, intercepting leonid-tq-jq-v3-min.js, and sending the modified file back to the browser.
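The interception step might look roughly like the following mitmproxy addon script. It is a minimal sketch, not a tested bypass for any particular site: the regular expression is naive (it would also strip the word "debugger" inside string literals), and it rewrites the file on the fly instead of serving a separately prepared copy.

```python
import re

# Script named in the example above; adjust for the real target.
TARGET_SCRIPT = "leonid-tq-jq-v3-min.js"

# Naive pattern: the bare keyword plus an optional trailing semicolon.
DEBUGGER_STATEMENT = re.compile(r"\bdebugger\b\s*;?")


def strip_debugger(js_source):
    """Return the JavaScript source with debugger statements removed."""
    return DEBUGGER_STATEMENT.sub("", js_source)


def response(flow):
    """mitmproxy hook, called once for every server response."""
    if TARGET_SCRIPT not in flow.request.path:
        return
    body = flow.response.text
    if body:
        # Hand the cleaned script back to the browser.
        flow.response.text = strip_debugger(body)
```

Saved under a name of your choice, for example strip_debugger.py, the script can be loaded with `mitmdump -s strip_debugger.py`, with the browser configured to send its traffic through the proxy.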

 

If you need multiple proxy IPs, we recommend RoxLabs proxy: https://www.roxlabs.io/?, which offers global residential proxies and a complimentary 500MB trial package for a limited time.