How do I web scrape data without getting blocked?

James Hunt

01-14-2022

Programmatically extracting the data you need from websites, and putting it to good use as an external source of information in your projects, is currently one of the most hyped topics. This process is generally referred to as web scraping, and whether it is legal is a controversial question. Most big websites use anti-scraping tools to avoid being scraped.

When a website detects that a particular IP address is sending requests that don't seem 'human', it tends to set off an alarm, and the site will either check that you're human (by making you fill in a CAPTCHA) or block you.
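
A scraper can watch for those two outcomes and slow down when it sees them. The sketch below is only a heuristic: it assumes a block shows up as HTTP status 403 or 429, and that a CAPTCHA page contains the word "captcha" somewhere in its body. Real sites vary, and the names `looks_blocked` and `backoff_delay` are made up for this illustration.

```python
# Assumption: these status codes ("Forbidden", "Too Many Requests")
# are treated as signs of a block. Sites may use others.
BLOCK_STATUS_CODES = {403, 429}


def looks_blocked(status_code, body):
    """Heuristic: True if the response looks like a block or a CAPTCHA page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    # Assumption: CAPTCHA pages mention "captcha" somewhere in the HTML.
    return "captcha" in body.lower()


def backoff_delay(attempt, base=5.0, cap=300.0):
    """Seconds to wait before retry number `attempt` (0-based).

    The wait doubles on every attempt and never exceeds `cap`.
    """
    return min(cap, base * 2 ** attempt)
```

If `looks_blocked` returns True, waiting `backoff_delay(attempt)` seconds before the next try is gentler on the site than retrying immediately.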

A few things to note:

Never hit a website repeatedly while crawling. Add a delay after every set of requests

Vary the number of concurrent requests

Use proxies / IP rotation

Use geographically

Use proper user agents and change them from time to time

Schedule your cron jobs at odd hours, such as late nights or early mornings, when there is less traffic. This does not always work.
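
Several of the points above (delays between requests, proxy rotation, and changing the User-Agent header) can be combined in one small loop. This is a minimal sketch using only the Python standard library, not a complete crawler. The proxy addresses are placeholders, the User-Agent strings are examples, and the function names and the 2 to 5 second delay range are assumptions chosen for illustration.

```python
import itertools
import random
import time
import urllib.request

# Example values only; replace with real browser strings and your own proxies.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
]
PROXIES = [  # placeholder addresses, not real proxies
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]


def plan_requests(urls, proxies=PROXIES, user_agents=USER_AGENTS,
                  min_delay=2.0, max_delay=5.0, rng=random):
    """Yield (url, proxy, user_agent, delay) for each URL.

    Proxies rotate round-robin; the user agent and the delay are random.
    """
    proxy_cycle = itertools.cycle(proxies)
    for url in urls:
        yield (url, next(proxy_cycle), rng.choice(user_agents),
               rng.uniform(min_delay, max_delay))


def fetch_with_urllib(url, proxy, user_agent):
    """Fetch `url` through `proxy`, sending `user_agent` as the User-Agent."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", user_agent)]
    with opener.open(url, timeout=10) as response:
        return response.read()


def crawl(urls, fetch=fetch_with_urllib, sleep=time.sleep):
    """Fetch each URL in turn, pausing for a random delay after each one."""
    results = {}
    for url, proxy, user_agent, delay in plan_requests(urls):
        results[url] = fetch(url, proxy, user_agent)
        sleep(delay)  # wait before the next request
    return results
```

The `fetch` and `sleep` parameters exist so the loop can be tried out without touching the network, for example `crawl(urls, fetch=my_fake_fetch, sleep=lambda s: None)`.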

Scrapy, a Python web scraping framework, has a few more settings that help you avoid getting blocked.
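
The post does not say which settings it has in mind, so the fragment below is a guess at the relevant ones. The setting names are real Scrapy settings that go in a project's `settings.py`; the values are illustrative assumptions, not recommendations, and the `USER_AGENT` string is a placeholder.

```python
# settings.py (fragment) -- illustrative values, tune them per site

ROBOTSTXT_OBEY = True                 # respect the site's robots.txt

DOWNLOAD_DELAY = 3                    # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True       # wait 0.5x to 1.5x DOWNLOAD_DELAY

CONCURRENT_REQUESTS = 8               # total requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # requests in flight per domain

AUTOTHROTTLE_ENABLED = True           # adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 5          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60           # upper bound on the delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # average parallel requests per server

# Placeholder string; use one that identifies your crawler.
USER_AGENT = "example-crawler (+https://example.com/contact)"
```

Lower concurrency and a longer delay make the crawl slower but less likely to look like automated traffic.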
