Web scraping

(Notes credits from these links: https://en.wikipedia.org/wiki/Web_scraping) (https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/) (https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460)

web scraping, by its name, is extract data(audios, videos, pictures, numbers, words) from the internet web. major two steps of scraping a webpage:

fetching: Download web pages;

extraction: Parse, search, reform, clean of the Downloaded pages;

Web scraping can be used in a lot of popular areas:

web indexing
web mining and data mining
online price change monitoring and price comparison
product review scraping (to watch the competition)
gathering real estate listings, weather data monitoring
website change detection
research
tracking online presence and reputation,
web mashup
web data integration

Scraping techniques:

Human copy and paste: the simplest way
Text pattern matching: use Linux gre command and RegEx to extract matched info.
HTTP programming: retrieve web pages by post HTTP requests.
DOM parsing: use client side script to retrieve web contents
Vertical aggregation: create automatic “bots” to do man’s job
Semantic annotation recognizing: a special case of DOM parsing

Scraping avoid blocking

most of the websites have anti-scraping mechanisms for abusive access. crawler basic rule: ‘be nice’ and follow websites’ polices

A few tips

slower your crawler, make your crawler diverse
request through proxies and rotates them
rotate user agents,request headers
use headless browser(selenuim, puppeteer, playwright etc)
escape honey pots
check layout changes of websites
avoid logged in scraping
use captcha service if that is the case

When you are blocked

these are the signs that your crawler may get blocked:

CAPTCHA pages
Unusual content delivery delays
Frequent response with HTTP 404, 301 or 500’ errors
Beautiful Soup

Most popular web scraping python lib.

pip install beautifulsoup4.

Steps of simple Web scraping using beautiful soup

inspect the websites use browser tools
Python code with beautiful soup ```angular2html
required libs

import requests import urllib.request import time from bs4 import BeautifulSoup

- set url for request access
```angular2html
url = 'http://sample.com/a.html'
response = requests.get(url)

-Parse html with BeautifulSoup

extract data
set time sleep to act not like a crawler.

Things I want to know more

I will be interested to see a live demo, an implementation, to see how beatifulsoap works.

reading-notes

Software Development Reading Notes

Web scraping

Scraping avoid blocking

When you are blocked

Beautiful Soup

Steps of simple Web scraping using beautiful soup

required libs

Things I want to know more