James Hunt
01-14-2022
Programmatically extracting the data you need from websites, and putting it to good use as an external source of information in your projects, is currently one of the most hyped topics. This process is generally referred to as web scraping, and its legality is a controversial subject. Most big websites use anti-scraping tools to avoid being scraped.
When a website detects that a particular IP address is sending requests that don't seem 'human', it tends to raise an alarm: the site will either check that you're human (by making you fill in a CAPTCHA) or block you.
A few things to note:
Never hit a website repeatedly while crawling. Add a delay after every set of requests.
Limit the number of concurrent requests.
Use proxies / IP rotation.
Use geographically
Use proper user agents and change them from time to time.
Schedule your cron jobs at odd hours, such as late at night or early in the morning, when there is less traffic. This does not always work.
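A few of the tips above (pausing after every set of requests and rotating proxies) can be sketched with nothing but the Python standard library. This is a minimal illustration rather than a recommended implementation: the function names, the batch size, and the delay values are all hypothetical, and the proxy addresses and URLs you pass in are up to you. Rotating the User-Agent header alongside the proxy is an assumption about what the list intends.

```python
import itertools
import random
import time
import urllib.request


def jittered_delay(base=2.0, jitter=1.0):
    """Return a pause in seconds: a fixed base plus a random extra amount."""
    return base + random.uniform(0, jitter)


def build_request_plan(urls, user_agents, proxies, batch_size=5):
    """Pair each URL with the next user agent and proxy in rotation.

    'pause_after' is True on the last request of every batch, which is
    where the crawler should sleep before continuing.
    """
    ua_cycle = itertools.cycle(user_agents)
    proxy_cycle = itertools.cycle(proxies)
    plan = []
    for index, url in enumerate(urls):
        plan.append({
            "url": url,
            "user_agent": next(ua_cycle),
            "proxy": next(proxy_cycle),
            "pause_after": (index + 1) % batch_size == 0,
        })
    return plan


def fetch(entry, timeout=10):
    """Fetch one planned URL through its assigned proxy and user agent."""
    handler = urllib.request.ProxyHandler(
        {"http": entry["proxy"], "https": entry["proxy"]}
    )
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(
        entry["url"], headers={"User-Agent": entry["user_agent"]}
    )
    with opener.open(request, timeout=timeout) as response:
        return response.read()


def crawl(urls, user_agents, proxies, batch_size=5):
    """Fetch every URL, sleeping for a jittered delay after each batch."""
    pages = []
    for entry in build_request_plan(urls, user_agents, proxies, batch_size):
        pages.append(fetch(entry))
        if entry["pause_after"]:
            time.sleep(jittered_delay())
    return pages
```

Keeping the plan separate from the fetching makes it easy to inspect which proxy and user agent each URL will get before any request is sent.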
Scrapy, a Python web scraping framework, has a few more settings that help you avoid getting blocked.
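As a rough illustration, a project's settings.py might contain something like the following. The setting names are standard Scrapy settings, but the values are only illustrative assumptions, not tuned recommendations, and the user agent string and contact URL are placeholders.

```python
# settings.py (sketch): every value below is an illustrative assumption.

# Wait between consecutive requests to the same site, in seconds.
DOWNLOAD_DELAY = 2
# Vary the actual wait around DOWNLOAD_DELAY instead of using a fixed gap.
RANDOMIZE_DOWNLOAD_DELAY = True

# Lower the global and per-domain concurrency.
CONCURRENT_REQUESTS = 4
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let the AutoThrottle extension adapt the delay to server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Identify the crawler; the name and contact URL here are placeholders.
USER_AGENT = "example-crawler (+https://example.com/contact)"
```

Proxy and user agent rotation are not covered by these settings alone; in Scrapy they are usually handled in a downloader middleware.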