How to deal with Python scraping Amazon Web page 502 error?

James Hunt

01-19-2022

Crawling Amazon's product pages often runs into problems, which suggests that Amazon's anti-crawler mechanism is quite effective. As a first attempt, fetching the page with the get method of the requests library returns status code 503:

import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
r = requests.get(url)
print(r.status_code)

This shows that the page was not fetched. The next step is to disguise the request as a browser by modifying the header information:

import requests

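# Send a browser-like User-Agent instead of the requests default
kv = {"user-agent": "Mozilla/5.0"}
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    r = requests.get(url, headers=kv)
    r.raise_for_status()  # raise an exception on 4xx/5xx status codes
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")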
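To see what the header change actually does, it can help to inspect the headers that requests would send, without contacting the server. The following is a minimal sketch; the helper name preview_headers is hypothetical and not part of the original post.

```python
import requests

def preview_headers(url, headers=None):
    """Return the headers requests would send for a GET, without sending it."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=headers)
    # prepare_request merges the session defaults with the per-request headers
    prepared = session.prepare_request(request)
    return prepared.headers  # case-insensitive mapping

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
default_headers = preview_headers(url)
custom_headers = preview_headers(url, {"user-agent": "Mozilla/5.0"})

# The default User-Agent announces the library itself (python-requests/<version>)
print(default_headers["User-Agent"])
# The custom value replaces it
print(custom_headers["User-Agent"])
```

The default value identifies the client as a script, which is likely one of the signals the server uses to reject the first request.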

This time the request goes through, but the returned page is not the product page: the server has identified the client as an automated program.
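Because a successful status code does not guarantee the product page, a crawler may want to check the body as well. The sketch below is one possible way to do that; the function name and the marker strings are assumptions chosen for illustration, so adjust them to the text of the page you actually receive.

```python
# Marker strings are assumptions for illustration, not a confirmed list
ROBOT_MARKERS = (
    "robot check",
    "automated access",
    "enter the characters you see below",
)

def looks_like_robot_check(html, markers=ROBOT_MARKERS):
    """Return True if the page text contains any of the given markers."""
    text = html.lower()
    return any(marker in text for marker in markers)

print(looks_like_robot_check("<title>Robot Check</title>"))
print(looks_like_robot_check("<title>Some product</title>"))
```

In the crawler above, this check could be applied to r.text before treating the response as a real product page.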

Following the prompt on that page, the relevant cookies are then added to the request:

The final code is as follows:

import requests

kv = {"user-agent": "Mozilla/5.0"}
# Cookies to send along with the request
cookie = {
    "cisession": "19dfd70a27ec0eecf1fe3fc2e48b7f91c7c83c60",
    "CNZZDATA100020196": "1815846425-1478580135-https%253A%252F%252Fwww.baidu.com%252F%7C1483922031",
    "Hm_lvt_f805f7762a9a237a0deac37015e9f6d9": "1482722012,1483926313",
    "Hm_lpvt_f805f7762a9a237a0deac37015e9f6d9": "1483926368",
}
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
r = requests.get(url, cookies=cookie, headers=kv)
r.encoding = r.apparent_encoding
print(r.text[:1000000])
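Cookies are often copied from the browser's developer tools as a single "name=value; name=value" string. Assuming that format, the sketch below turns such a string into the dict that the cookies argument expects; the helper name parse_cookie_header and the sample values are hypothetical.

```python
def parse_cookie_header(header):
    """Convert 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies

# Sample values are made up for illustration
cookie = parse_cookie_header("session-id=123; lang=en")
print(cookie)
```

The resulting dict can be passed directly as cookies=cookie in the requests.get call above.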
