Web scraping is, in general, the extraction of data from websites. Almost any kind of data that is publicly available on the internet can be extracted with the right tool.
Scrapy is a web scraping framework for Python. Like the requests library and BeautifulSoup (both third-party packages), it is used to extract data from websites.
But unlike those tools, it can be used to build large and complex scraping projects.
When we need to scrape hundreds of pages or more, Scrapy is a good fit because it is built for that scale.
Depending on the target site and the network, it can scrape thousands of pages per minute.
Scrapy has 5 main components:
- Spiders – The Spider component is where we define which pages to crawl and what data to extract from them. Scrapy provides 5 kinds of Spider classes: Spider, CrawlSpider, XMLFeedSpider, CSVFeedSpider, and SitemapSpider.
- Pipelines – Pipelines perform data-related operations on the scraped items, such as:
  - Cleaning the data
  - Removing duplicates
  - Storing the data
- Middlewares – Middlewares run whenever a request is sent to a web page and whenever a response comes back. We can use them for tasks such as injecting custom headers and routing requests through proxies.
- Engine – The engine coordinates all of the other components and keeps the operations that pass between them consistent.
- Scheduler – The scheduler is a simple data structure that holds the pending requests. It is responsible for preserving the order of operations.
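To make the pipeline operations listed above more concrete, here is a minimal sketch of an item pipeline that cleans, de-duplicates, and stores items. The class name `CleanDedupeStorePipeline`, the `url` field, and the `items.jl` output file are assumptions made up for this example. The `process_item`, `open_spider`, and `close_spider` hooks follow Scrapy's classic pipeline interface, and the fallback `DropItem` class is only a stand-in so the sketch can also run where Scrapy is not installed.

```python
import json

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so this sketch also runs without Scrapy installed.
    class DropItem(Exception):
        pass


class CleanDedupeStorePipeline:
    """Hypothetical pipeline: cleans, de-duplicates, and stores items."""

    def __init__(self, path="items.jl"):
        self.path = path          # assumed output file (JSON Lines)
        self.seen_urls = set()    # URLs of items already processed
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # 1. Cleaning: strip surrounding whitespace from string fields.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }

        # 2. Removing duplicates: drop items whose URL was already seen.
        if cleaned["url"] in self.seen_urls:
            raise DropItem(f"duplicate item: {cleaned['url']}")
        self.seen_urls.add(cleaned["url"])

        # 3. Storing: append the item as one JSON line.
        self.file.write(json.dumps(cleaned) + "\n")
        return cleaned
```

In a real project, a class like this would typically be enabled through the `ITEM_PIPELINES` setting in `settings.py`, and Scrapy would call its methods for every item a spider yields.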
Almost all sites include a robots.txt file in their root directory. This file indicates which parts of the site we are allowed to scrape and which parts we are not.
This file is generally created by the site owner to tell search engine bots which parts of the site they may crawl and index.
This file contains 3 major elements:
- User-agent – Identifies the bot (crawler) that the rules that follow apply to.
- Allow – Specifies that the given URL path may be crawled and indexed.
- Disallow – Specifies that the given URL path may not be crawled or indexed.
You can check a real-world example of robots.txt at https://www.facebook.com/robots.txt.
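The User-agent, Allow, and Disallow rules can also be checked programmatically. The sketch below uses Python's standard `urllib.robotparser` module on a small, made-up robots.txt; the file content, the `example.com` URLs, the `my-scraper` user agent, and the `is_allowed` helper are all assumptions for illustration, and nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-scraper"):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/public/page.html"))   # prints True
print(is_allowed("https://example.com/private/data.html"))  # prints False
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would download the file instead of parsing a string. Scrapy projects can also set `ROBOTSTXT_OBEY = True` in `settings.py` so that the framework respects these rules itself.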