Data Science Practicals
Web Scraping using Python
If you want to access data from a website or another online platform, you could copy and paste it into local files by hand, but web scraping automates the extraction so the data can be used directly in your research work and projects.
Web Scraping:
Web scraping is a technique employed to extract large amounts of data from websites, whereby the data is extracted and saved to a local file on a computer or to a database.
Applications of Web Scraping:
- Dynamic pricing
- Revenue optimization
- Competitor monitoring
- Product trend monitoring
- Brand and MAP compliance
Process of Web Scraping:
1. Identify the target website
2. Collect URLs of the pages where you want to extract data from
3. Make a request to these URLs to get the HTML of the page
4. Use locators to find the data in the HTML
5. Save the data in a JSON or CSV file or some other structured format (a minimal sketch of these steps follows below)
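A minimal sketch of these five steps; the URL and the locator used here are placeholders, not a specific site:
import csv
import requests
from bs4 import BeautifulSoup

URL = 'https://example.com/articles'               # steps 1-2: identify the target page
html = requests.get(URL).text                      # step 3: request the HTML of the page
soup = BeautifulSoup(html, 'html.parser')
records = [{'title': h.get_text(strip=True)}       # step 4: locate the data (hypothetical locator)
           for h in soup.find_all('h2', class_='title')]
with open('data.csv', 'w', newline='') as f:       # step 5: save to a CSV file
    writer = csv.DictWriter(f, fieldnames=['title'])
    writer.writeheader()
    writer.writerows(records)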
Tools for Scraping Data:
· Scrapy
· Scraper API
· ParseHub
· Mozenda
· Content Grabber
· Urllib
Python Libraries for Implementing Web Scraping
1. Requests:
Requests is the ideal choice when you are starting with web scraping and you have an API to work with. It is simple and does not take much practice to master. Requests also saves you from adding query strings to your URLs manually. Finally, it has very well written documentation and supports all of the RESTful methods (GET, POST, PUT, and DELETE).
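A minimal sketch of these points; httpbin.org is a public echo service used here only as a placeholder endpoint:
import requests

# the `params` argument builds the query string for you, no manual URL editing needed
response = requests.get('https://httpbin.org/get', params={'q': 'web scraping'})
print(response.status_code)   # 200 on success
print(response.url)           # the final URL, with the query string appended
data = response.json()        # parse a JSON (e.g. RESTful API) response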
2. Beautiful Soup:
Beautiful Soup is a Python library used to extract information from XML and HTML files. It works on top of a parser, which helps the programmer obtain data from an HTML document. Without a parser, we would probably fall back on regular expressions to match patterns in the text, which is neither an efficient nor a maintainable approach.
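For example, a small HTML snippet can be parsed directly (a minimal sketch, not tied to any particular website):
from bs4 import BeautifulSoup

html = "<h1>Quotes</h1><a href='/q/1'>First quote</a>"
soup = BeautifulSoup(html, 'html.parser')   # build the parse tree
print(soup.h1.text)       # -> Quotes
print(soup.a['href'])     # -> /q/1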
3. Selenium :
Selenium is an open-source, web-based automation tool. It provides a web driver, which means you can use it to open a webpage, click a button, and read the result. It is a powerful tool that was mainly written in Java to automate browser tests.
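A minimal Selenium sketch; it assumes a Chrome browser with a matching driver is available, and the URL and locator below are placeholders:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                      # start a browser session
driver.get('https://example.com')                # open a webpage
link = driver.find_element(By.TAG_NAME, 'a')     # locate an element (placeholder locator)
link.click()                                     # interact with the page
print(driver.page_source[:200])                  # rendered HTML after the click
driver.quit()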
4. Scrapy:
Scrapy is one of the most popular Python web scraping libraries right now. It is an open-source framework, so it is more than a library: it is a complete tool that you can use to scrape and crawl the web systematically.
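A minimal spider sketch; quotes.toscrape.com is a public practice site for scraping, and the CSS selectors below are specific to it:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # yield one record per quote block on the page
        for q in response.css('div.quote'):
            yield {
                'text': q.css('span.text::text').get(),
                'author': q.css('small.author::text').get(),
            }
The spider can be run with scrapy runspider, e.g. scrapy runspider quotes_spider.py -o quotes.json, which crawls the site and saves the results.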
5. Urllib:
Urllib is a Python library that allows the developer to open and parse resources over the HTTP or FTP protocols. Urllib offers several modules for dealing with and opening URLs, namely (a short sketch follows this list):
· urllib.request: opens and reads URLs.
· urllib.error: catches the exceptions raised by urllib.request.
· urllib.parse: parses URLs.
· urllib.robotparser: parses robots.txt files.
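A short sketch touching each of these modules; example.com is used only as a placeholder target:
from urllib import request, parse, robotparser

parts = parse.urlparse('https://example.com/search?q=web+scraping')   # urllib.parse
print(parts.netloc, parts.query)                                      # example.com  q=web+scraping
rp = robotparser.RobotFileParser('https://example.com/robots.txt')    # urllib.robotparser
rp.read()
print(rp.can_fetch('*', 'https://example.com/'))                      # is scraping allowed here?
with request.urlopen('https://example.com/') as resp:                 # urllib.request
    html = resp.read().decode('utf-8')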
Implementation Of Web Scraping:
There are mainly two ways to extract data from a website:
1. Use the API of the website (if one exists). For example, Facebook provides the Facebook Graph API, which allows retrieval of data posted on Facebook.
2. Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction.
Steps involved in web scraping:
1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use the third-party Python HTTP library requests.
2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most of the HTML data is nested, we cannot extract data simply through string processing. We need a parser that can create a nested/tree structure of the HTML data. There are many HTML parser libraries available; one of the most advanced is html5lib.
3. Now, all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will use another third-party Python library, Beautiful Soup. It is a Python library for pulling data out of HTML and XML files.
In the implementation we use the Python libraries described above:
import requests
from bs4 import BeautifulSoup
import csv
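Before extracting values, the page is fetched and parsed into a tree. The URL and the container class below are placeholders, since the practical's original target page is not shown here:
page = requests.get('https://example.com/quotes')        # placeholder URL
soup = BeautifulSoup(page.text, 'html5lib')              # html5lib parser, as discussed above
rows = soup.find_all('div', attrs={'class': 'quote'})    # hypothetical container class
quotes = []                                              # will hold one dict per quote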
Code for extracting values:
for row in rows:
    quote = {}
    quote['theme'] = row.h5.text                      # theme from the <h5> heading
    quote['url'] = row.a['href']                      # link from the anchor tag
    quote['img'] = row.img['src']                     # image source
    quote['lines'] = row.img['alt'].split(" #")[0]    # quote text, taken from the alt attribute
    quote['author'] = row.img['alt'].split(" #")[1]   # author name after the '#'
    quotes.append(quote)
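Finally, the extracted records can be written to a CSV file with the csv module imported above (the filename is a placeholder):
with open('quotes.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['theme', 'url', 'img', 'lines', 'author'])
    writer.writeheader()      # column headers
    writer.writerows(quotes)  # one row per extracted quote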
Output:
The CSV file above contains all of the data extracted from the website.
References:
For the code, go through the following GitHub link: