Web scraping
(Credits for these notes: https://en.wikipedia.org/wiki/Web_scraping, https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/, https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460)
Web scraping, as the name suggests, means extracting data (audio, video, pictures, numbers, text) from the web. Scraping a web page has two major steps:
- fetching: download the web pages
- extraction: parse, search, reformat, and clean the downloaded pages
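The two steps can be sketched with the Python standard library alone. This is a minimal illustration, not a full scraper: the names `fetch`, `LinkExtractor`, and `extract_links` are made up for this note, and the demonstration runs only the extraction step on an inline HTML string so it works without network access.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1, fetching: download a page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Step 2, extraction: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Parse the downloaded HTML and return the list of link targets."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Offline demonstration of the extraction step on an inline page.
sample_html = '<p><a href="/a.html">A</a> and <a href="/b.html">B</a></p>'
print(extract_links(sample_html))  # ['/a.html', '/b.html']
```

On a real site the two steps chain together as `extract_links(fetch(url))`.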
Web scraping is used in many areas:
- web indexing
- web mining and data mining
- online price change monitoring and price comparison
- product review scraping (to watch the competition)
- gathering real estate listings, weather data monitoring
- website change detection
- research
- tracking online presence and reputation
- web mashup
- web data integration
Scraping techniques:
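- Human copy and paste: manually copying the data, the simplest way
- Text pattern matching: use the Linux grep command or regular expressions (RegEx) to extract matching information
- HTTP programming: retrieve web pages by sending HTTP requests to the web server
- DOM parsing: embed a full browser to load pages, including content generated by client-side scripts, and parse them into a DOM tree to retrieve parts of the page
- Vertical aggregation: platforms that create and monitor automatic “bots” for specific verticals, so that no human does the work by hand
- Semantic annotation recognition: locate data using metadata or semantic markup in the page; a special case of DOM parsing when the annotations are embedded in the page itself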
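As a small illustration of text pattern matching, the sketch below pulls prices out of a fragment of HTML with a regular expression. The sample markup, the pattern, and the function name `extract_prices` are assumptions made for this example. Regular expressions work for simple, regular text like this, but tend to break on nested or irregular HTML, where a DOM parser is usually the safer choice.

```python
import re

# Matches a dollar sign followed by digits and an optional two-digit fraction.
PRICE_PATTERN = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_prices(text):
    """Return every price found in the text as a float."""
    return [float(match) for match in PRICE_PATTERN.findall(text)]


sample = "<li>Widget: $19.99</li><li>Gadget: $5</li>"
print(extract_prices(sample))  # [19.99, 5.0]
```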
Avoiding blocks while scraping
Most websites have anti-scraping mechanisms to stop abusive access. The basic rule for a crawler: ‘be nice’ and follow the website’s policies.
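One common place where a site publishes its crawling policy is its robots.txt file. The notes above do not mention it explicitly, so treat this as an assumption about what "following the policies" can look like in code. The sketch uses `urllib.robotparser` from the standard library; the rules and the crawler name `my-crawler` are made up, and the rules are parsed from an inline string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. Against a real site you would instead call
# parser.set_url("http://sample.com/robots.txt") followed by parser.read().
robots_txt = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_parser(robots_txt)
print(rules.can_fetch("my-crawler", "http://sample.com/a.html"))          # True
print(rules.can_fetch("my-crawler", "http://sample.com/private/b.html"))  # False
```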
A few tips:
- slow down your crawler and vary its crawling pattern
- send requests through proxies and rotate them
- rotate user agents and request headers
- use a headless browser (Selenium, Puppeteer, Playwright, etc.)
- avoid honeypot traps
- check whether the website's layout has changed
- avoid scraping behind a login
- use a CAPTCHA-solving service if you run into CAPTCHAs
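Two of these tips, slowing down and rotating user agents, can be sketched as follows. The user-agent strings are placeholders rather than real browser strings, and `build_request`, `polite_delay`, and `crawl` are hypothetical helpers written for this note. The `fetch` argument is whatever function actually sends the request (for example `urllib.request.urlopen`); the demonstration passes a stand-in so nothing goes over the network.

```python
import itertools
import random
import time
from urllib.request import Request

# Placeholder user-agent strings (assumptions, not real browser strings).
USER_AGENTS = [
    "ExampleBrowser/1.0 (Windows)",
    "ExampleBrowser/1.0 (Macintosh)",
    "ExampleBrowser/1.0 (Linux)",
]
_agent_cycle = itertools.cycle(USER_AGENTS)


def build_request(url):
    """Create a request that uses the next user agent in the rotation."""
    return Request(url, headers={"User-Agent": next(_agent_cycle)})


def polite_delay(base=2.0, jitter=1.0):
    """Return a randomized pause so requests are not evenly spaced."""
    return base + random.uniform(0, jitter)


def crawl(urls, fetch, base=2.0, jitter=1.0):
    """Call fetch(request) for each URL, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(polite_delay(base, jitter))
        results.append(fetch(build_request(url)))
    return results


# Stand-in fetch that only reports which user agent would have been sent.
urls = ["http://sample.com/a.html", "http://sample.com/b.html"]
agents = crawl(urls, fetch=lambda req: req.get_header("User-agent"),
               base=0.1, jitter=0.1)
print(agents)
```

Randomizing the delay, rather than sleeping a fixed interval, is one simple way to make the crawling pattern less uniform.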
When you are blocked
These signs suggest that your crawler may have been blocked:
- CAPTCHA pages
- Unusual content delivery delays
- Frequent responses with HTTP 404, 301, or 500 status codes
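These signs can be turned into a rough heuristic. The sketch below is only an illustration: the thresholds, the `is_suspicious` function, and the `BlockMonitor` class are assumptions made for this note. Because the signs are about frequent bad responses, not a single one, the monitor looks at the share of suspicious responses in a sliding window.

```python
from collections import deque

# Status codes listed in the notes as possible signs of blocking.
SUSPICIOUS_STATUS = {301, 404, 500}


def is_suspicious(status_code, body, elapsed_seconds, slow_after=10.0):
    """Flag a single response that shows one of the warning signs."""
    if status_code in SUSPICIOUS_STATUS:
        return True
    if "captcha" in body.lower():
        return True
    return elapsed_seconds > slow_after


class BlockMonitor:
    """Track recent responses and report when too many look suspicious."""

    def __init__(self, window=10, threshold=0.5):
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def record(self, status_code, body, elapsed_seconds):
        self.recent.append(is_suspicious(status_code, body, elapsed_seconds))

    def probably_blocked(self):
        if not self.recent:
            return False
        return sum(self.recent) / len(self.recent) >= self.threshold


monitor = BlockMonitor(window=4, threshold=0.5)
monitor.record(200, "<html>ok</html>", 0.3)
print(monitor.probably_blocked())  # False
monitor.record(200, "Please solve this CAPTCHA", 0.4)
monitor.record(404, "", 0.2)
print(monitor.probably_blocked())  # True
```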
Beautiful Soup
A popular Python library for parsing HTML and XML, widely used for web scraping.
Install it with: pip install beautifulsoup4
Steps for simple web scraping with Beautiful Soup
- inspect the website with the browser's developer tools
- write the Python code with Beautiful Soup
required libraries:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
- set the URL and send the request
url = 'http://sample.com/a.html'
response = requests.get(url)
- parse the HTML with BeautifulSoup
- extract the data
- pause between requests with time.sleep so the script behaves less like a crawler
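The parse, extract, and sleep steps can be sketched as below. To keep the example runnable without network access, it parses an inline HTML string instead of a downloaded page; with a real page the HTML would come from `response.text` after `requests.get(url)`. The sample markup, the `/data/` link filter, and the function name `extract_data_links` are assumptions made for this illustration.

```python
import time

from bs4 import BeautifulSoup

# Stand-in for response.text from requests.get(url).
sample_html = """
<html><body>
  <h1>Sample page</h1>
  <a href="/data/file1.txt">File 1</a>
  <a href="/data/file2.txt">File 2</a>
  <a href="/about.html">About</a>
</body></html>
"""


def extract_data_links(html):
    """Parse the HTML and return the href of every link under /data/."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        a["href"]
        for a in soup.find_all("a", href=True)
        if a["href"].startswith("/data/")
    ]


links = extract_data_links(sample_html)
for link in links:
    print(link)
    # A real crawler would download the link here; pause before the next one.
    time.sleep(0.5)
```

`find_all("a", href=True)` keeps only anchor tags that actually have an `href` attribute, which avoids a `KeyError` on anchors without one.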
Things I want to know more about
I would be interested to see a live demo, an implementation, to see how Beautiful Soup works.