James Hunt
01-19-2022

There are always problems when crawling Amazon's product pages; Amazon's anti-crawler mechanism is quite effective. A first attempt with the get method of the requests library returns status code 503:
import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
r = requests.get(url)
print(r.status_code)
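One likely reason for the 503 is the default User-Agent that requests sends, which identifies the library itself and is trivial for a server to detect. A quick way to inspect it:

```python
import requests

# requests advertises itself in the default User-Agent header,
# e.g. "python-requests/2.x.y", which many sites block outright.
print(requests.utils.default_headers()["User-Agent"])
```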
This shows the crawl was blocked, so next we disguise the request as a browser by modifying the header information:
import requests

kv = {"user-agent": "Mozilla/5.0"}
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")
However, the result this time is still not the desired one: the server judges the request to be an automated program.
Then, following the prompt on that page, the relevant cookie fields are added.
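Instead of passing a cookie dict to every call, one option is a requests.Session, which persists cookies and headers across requests. A minimal sketch, reusing the cisession value from the code below:

```python
import requests

# A Session keeps cookies and headers between calls, so cookies set by
# the server (or added manually) are sent automatically on later requests.
s = requests.Session()
s.headers.update({"user-agent": "Mozilla/5.0"})
s.cookies.set("cisession", "19dfd70a27ec0eecf1fe3fc2e48b7f91c7c83c60")

# s.get(url) would now include both the header and the cookie.
print(s.cookies.get("cisession"))
```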
The final code is as follows:
import requests
from bs4 import BeautifulSoup

kv = {"user-agent": "Mozilla/5.0"}
cookie = {
    "cisession": "19dfd70a27ec0eecf1fe3fc2e48b7f91c7c83c60",
    "CNZZDATA100020196": "1815846425-1478580135-https%253A%252F%252Fwww.baidu.com%252F%7C1483922031",
    "Hm_lvt_f805f7762a9a237a0deac37015e9f6d9": "1482722012,1483926313",
    "Hm_lpvt_f805f7762a9a237a0deac37015e9f6d9": "1483926368",
}
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
r = requests.get(url, cookies=cookie, headers=kv)
r.encoding = r.apparent_encoding
print(r.text[:1000000])
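The final code imports BeautifulSoup but never uses it. If the fetch succeeds, r.text can be parsed with it; a minimal sketch, using a small inline HTML string as a stand-in for the real page:

```python
from bs4 import BeautifulSoup

# The markup below is a placeholder for r.text from the request above;
# the real Amazon page would of course be far larger.
html = "<html><head><title>Sample product</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)
```

With the real response, the same pattern (soup.find / soup.select) extracts the product fields of interest.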