reading-notes

Software Development Reading Notes


Web scraping

(Notes credited to these links: https://en.wikipedia.org/wiki/Web_scraping, https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/, https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460)

Web scraping, as the name suggests, is extracting data (audio, video, pictures, numbers, words) from the web. The two major steps of scraping a web page are:

Fetching: download the web pages.

Extraction: parse, search, reformat, and clean the downloaded pages.
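
The two steps can be sketched with only the Python standard library. This is a minimal illustration, not a full scraper: `fetch`, `LinkExtractor`, `extract_links`, and the sample HTML are hypothetical names and data chosen for this note, and the extraction step here only collects and cleans link targets.

```python
from html.parser import HTMLParser
from typing import List
from urllib.request import urlopen


def fetch(url: str, timeout: float = 10.0) -> str:
    """Fetching step: download a page and return its HTML as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Extraction step: collect the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # "Clean" the value by trimming stray whitespace.
                    self.links.append(value.strip())


def extract_links(html: str) -> List[str]:
    """Parse the downloaded HTML and return the cleaned link targets."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Offline demonstration on a made-up snippet; fetch(url) would supply real HTML.
sample_html = '<p><a href="/a.html">A</a> <a href=" /b.html ">B</a></p>'
links = extract_links(sample_html)
```

Keeping fetching and extraction in separate functions means the parsing logic can be tried on saved or inline HTML without touching the network.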

Web scraping can be used in a lot of popular areas:

Scraping techniques:

Avoiding blocks while scraping

Most websites have anti-scraping mechanisms to guard against abusive access. The basic rule for a crawler: 'be nice' and follow each website's policies.
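
One way to "be nice" in code is to read a site's robots.txt before crawling it. The sketch below uses the standard library's `urllib.robotparser`; the robots.txt content, the `my-bot` user agent, and the helper names `build_policy`, `allowed`, and `polite_delay` are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the site root (for example http://sample.com/robots.txt).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the policy lets this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser: RobotFileParser, user_agent: str,
                 default: float = 1.0) -> float:
    """Seconds to wait between requests, falling back to a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


policy = build_policy(ROBOTS_TXT)
```

A crawler could call `allowed(...)` before every request and `time.sleep(polite_delay(...))` between requests.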

A few tips

When you are blocked

These are signs that your crawler may have been blocked:
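
A rough sketch of checking a response for such signs. The specific signals used here are assumptions for illustration (HTTP status codes 403, 429, and 503, or a page body that mentions a CAPTCHA), and `looks_blocked` is a hypothetical helper, not part of any library.

```python
# Assumed block signals; adjust to what the target site actually returns.
SUSPICIOUS_STATUS_CODES = {403, 429, 503}


def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic: True if the response resembles a block page."""
    if status_code in SUSPICIOUS_STATUS_CODES:
        return True
    # A CAPTCHA challenge is often served with a normal 200 status.
    return "captcha" in body.lower()
```

With `requests`, this could be called as `looks_blocked(response.status_code, response.text)`, pausing or stopping the crawl when it returns `True`.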

Install Beautiful Soup: `pip install beautifulsoup4`

Steps for simple web scraping using Beautiful Soup
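
As a rough sketch of where these steps lead, the example below parses an inline HTML string so it runs offline. The HTML content is made up for illustration; in a live run it would be `response.text` from `requests.get(url)`. The parser used is Python's built-in `"html.parser"`, so only `beautifulsoup4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded page.
html = """
<html><body>
  <h1>Sample page</h1>
  <a href="/data/file1.txt">File 1</a>
  <a href="/data/file2.txt">File 2</a>
</body></html>
"""

# Parse the HTML into a searchable tree.
soup = BeautifulSoup(html, "html.parser")

# Search the tree: the first <h1> text and every link target.
title = soup.find("h1").get_text(strip=True)
links = [a["href"] for a in soup.find_all("a", href=True)]
```

`find` returns the first matching tag and `find_all` returns every match, so `title` holds the heading text and `links` holds the two relative paths.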

- Import the libraries and set the URL for the request
```python
import time
import urllib.request

import requests
from bs4 import BeautifulSoup

# Request the page
url = 'http://sample.com/a.html'
response = requests.get(url)
```

- Parse the HTML with BeautifulSoup

Things I want to know more

I would be interested in seeing a live demo, an actual implementation, to see how Beautiful Soup works.