In the age where machine learning and big data can make or break a company, there is one resource more valuable than any other – data.

Unfortunately, the contextualized and structured data that companies often need is hard to come by.

But there is a solution – web scraping.

Web scraping is the universal API, allowing you to use the publicly available data displayed on websites for your own data analysis efforts and products.

It gives you access to a practically infinite number of data sources that you can use to gain a competitive advantage in your market.

What is Web Scraping?

Web scraping, also called web harvesting, or web data extraction, is the process of extracting or “scraping” data from a website.

If you have ever copied information from a website and pasted it into an Excel spreadsheet or Word doc then technically you are scraping a website.

But as you probably know, repeatedly having to copy and paste data from a website to an Excel spreadsheet can quickly become mind-numbingly boring.

Web scraping takes the pain out of this experience by automating the whole process.

Instead of sitting at a computer hitting Ctrl+C…Ctrl+V for hours, you set a web scraping script running that will extract your desired data from the website and store it in your preferred file format in a matter of seconds.

But here’s the really cool part…

In the right hands, web scraping not only removes the need for soul-destroying data entry work, it also lets you build some seriously powerful applications and even whole businesses.

What Can Web Scraping Be Used For?

A web scraper in the hands of an experienced and imaginative software developer is a recipe for magic.

You can write a web scraper to automate all types of scenarios. Any content you see on a webpage can be scraped if you are resourceful enough. Once you have scraped the data, you can use it in your software applications to give your business an edge over the competition, to monitor your reputation online, or simply to make your life easier.

As more and more of our business activities and our lives move online, the uses for web scrapers are practically endless. All that is stopping you is your creativity.

Today, web scraping is used in everything from real estate to e-commerce, price comparison websites to SEO monitoring, lead generation to content aggregation.

Price Monitoring Websites – Companies like Skyscanner and Trivago use web scrapers (along with APIs) to compare the prices of flights and hotels so that users don’t need to painstakingly compare 20 different websites to find the best deal.

SEO Monitoring – SEO monitoring tools like SEMRush, Ahrefs, Moz, etc. use web scrapers to scrape Google and other search engines to see which pages are ranking for which keywords. This data allows them to determine how hard it is to rank for any given keyword and to audit the performance of your website.

Real Estate – Numerous companies and startups are using web scraping to extract real estate listings from websites such as Zillow.com to enable them to build more advanced property search and investment evaluation tools, etc.

With the rapid pace of progress in machine learning and artificial intelligence, web scraping is only going to increase in importance. Web scrapers are unique in that they can provide some of the highest-quality data for training classifiers and predictive algorithms.

So you probably get the idea – web scraping can be really powerful if you use it right.

Now let’s look at how web scraping works so you can use it in your business or life.

How Does Web Scraping Work?

As we've discussed, web scraping is the process of extracting data from a website through the use of an automated software program or script.

At its simplest, a web scraping program follows 3 steps.

Step 1 – Requests The Contents Of A Webpage

The first step any web scraping program (called a “scraper”) takes is to request the contents of a specific URL from the target website.

For example, my hypothetical scraper might send a “GET” request to IMDb for the web page that contains “IMDb’s Top 250” movies.

This information is usually returned to the scraper as HTML, the markup language used to structure and display the content of a web page.
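As a rough illustration of this first step, here is a minimal Python sketch that builds and sends a GET request using only the standard library. The URL, the `example-scraper/0.1` user agent string, and the function names `build_request` and `fetch_html` are placeholders made up for this example, not anything IMDb or a particular framework prescribes.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="example-scraper/0.1"):
    """Build a GET request that identifies the scraper (hypothetical user agent)."""
    return Request(url, headers={"User-Agent": user_agent}, method="GET")


def fetch_html(url, timeout=10):
    """Send the request and return the response body decoded as text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (placeholder URL, left commented out so nothing is fetched on import):
# html = fetch_html("https://example.com/top-250")
```

Real scrapers usually add retries and error handling on top of this, but the core of step 1 is just one HTTP request.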

Step 2 – Extracts The Desired Data

Once IMDb’s server returns the HTML file for the Top 250 movies web page, the scraper extracts the data we want from the HTML file.

In this case, we might want the scraper to extract the rank, title, IMDb rating, and image thumbnail for all 250 movies in the list.

This is accomplished through a process called "parsing", which is just programming speak for splitting up a block of data (a string, etc.) and converting the desired data into the format the script needs. For example, the script will "parse" the data string that contains the movie title "The Shawshank Redemption" and convert it into a data structure that can be used within your program.

When developers write a web scraping program, they specifically tell it which types of data it should parse and which data it should ignore.
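To make the parsing step a little more concrete, here is a small sketch using Python's built-in `html.parser`. The HTML snippet, its `rank`, `title` and `rating` class names, and the rating values are all invented for illustration; a real page will have different markup, so treat this as one possible approach rather than how any particular site must be parsed.

```python
from html.parser import HTMLParser

# Hypothetical markup with made-up values, standing in for a real page.
SAMPLE_HTML = """
<table>
  <tr><td class="rank">1</td><td class="title">The Shawshank Redemption</td><td class="rating">9.9</td></tr>
  <tr><td class="rank">2</td><td class="title">Example Movie Two</td><td class="rating">8.8</td></tr>
</table>
"""


class MovieTableParser(HTMLParser):
    """Collects the cells we care about and ignores everything else."""

    FIELDS = {"rank", "title", "rating"}

    def __init__(self):
        super().__init__()
        self.movies = []    # one dict per table row
        self._row = None    # row currently being read, if any
        self._field = None  # wanted cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td" and self._row is not None:
            css_class = dict(attrs).get("class")
            self._field = css_class if css_class in self.FIELDS else None

    def handle_data(self, data):
        # Only keep text that sits inside a cell we were told to parse.
        if self._row is not None and self._field:
            self._row[self._field] = self._row.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr":
            if self._row:
                self.movies.append(self._row)
            self._row = None


def parse_movies(html):
    """Turn raw HTML into a list of dictionaries the program can use."""
    parser = MovieTableParser()
    parser.feed(html)
    parser.close()
    return parser.movies
```

Calling `parse_movies(SAMPLE_HTML)` gives back a list with one dictionary per movie, which is the kind of data structure the rest of your program can work with.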

Step 3 – Stores The Scraped Data

The final step is for the program to store this data in a CSV file, a JSON file, or a database so it can be used manually or in another program.
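For the storage step, a sketch along these lines would cover the CSV and JSON cases with Python's standard library. The helper names `write_csv` and `write_json`, the file names in the usage comment, and the sample row are assumptions for this example; writing to a database would follow the same idea with a different writer.

```python
import csv
import io
import json


def write_csv(rows, fieldnames, fileobj):
    """Write a list of dictionaries as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def write_json(rows, fileobj):
    """Write the same rows as a JSON array."""
    json.dump(rows, fileobj, indent=2, ensure_ascii=False)


# Usage with real files (hypothetical file names and sample data):
# rows = [{"rank": "1", "title": "The Shawshank Redemption", "rating": "9.9"}]
# with open("movies.csv", "w", newline="", encoding="utf-8") as f:
#     write_csv(rows, ["rank", "title", "rating"], f)
# with open("movies.json", "w", encoding="utf-8") as f:
#     write_json(rows, f)
```

Both helpers accept any file-like object, so the same code can write to disk or to an in-memory buffer such as `io.StringIO` while you are testing.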

And that’s how a web scraper works at its simplest.

Granted, that is a very simplified example, and there are many other parts that can be added to this process to make it more powerful – like web crawling, proxy management, etc. – but every web scraper follows this basic process.

Before we go on any further there is one small thing that I’d like to touch on…

What Is the Difference Between Web Scraping and Web Crawling?

People often use the terms web scraping and web crawling interchangeably. Although web scrapers and web crawlers are both ultimately designed to extract data from the internet, they operate slightly differently.

A web crawler, sometimes called a “spider”, is an autonomous bot that systematically browses the internet to index and search for content by following the links on web pages. Crawlers are the backbone of search engines like Google, Bing, etc.

A web scraper, on the other hand, is a tool designed to accurately extract data from a particular web page. This data is then stored for other uses. 

When web crawlers and web scrapers are combined, that’s when the real fun starts.

Combined, web crawlers and web scrapers are able to roam an entire website (or the whole internet) for the specific type of data you are looking for. Once the web crawler finds that data, the web scraper parses and stores it, and then the crawler moves on.
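Here is one way the crawler and scraper combination could look in plain Python. This is a simplified sketch, not how Scrapy or any search engine does it: the `crawl` function, its `fetch` and `scrape` parameters, and the `max_pages` limit are names chosen for this example. The crawler follows links within the starting domain breadth-first and hands every page it finds to the scraper.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, scrape, max_pages=100):
    """Visit pages on the starting domain and scrape each one.

    fetch(url) returns the page HTML (or None if unavailable);
    scrape(html) returns whatever data you want to keep for that page.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        results[url] = scrape(html)  # the scraper's job
        parser = LinkParser()        # the crawler's job: find more pages
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # absolute URL, no fragment
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Passing `fetch` and `scrape` in as functions keeps the two roles separate: you can swap in a real HTTP fetcher or a different parser without touching the crawling logic.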

Scrapy, the open source Python framework that Scrapinghub created and helps maintain, is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages, which makes it a very powerful foundation for building your very own web scraper.

So that raises the question…

How Can I Build A Web Scraper?

In the olden days, you had to write your web scraper code from scratch. But luckily for us, some smart people have taken the hard work out of building a web scraper for your application.

Today, you can use a web scraping framework like Scrapy (our favourite, obviously), or libraries like BeautifulSoup or Selenium, that make writing your own web scraper super easy. (Selenium is a web automation tool used more for web testing; however, some people use it to render JavaScript whilst scraping a web page.) Check out our complete guide to web scraping frameworks (coming soon) to see which one is the best for you.

Or if coding isn’t your thing, then you can use one of the ever-growing list of visual web scraping tools that allow you to scrape a website without writing a line of code. Check out our complete guide to the best visual web scraping tools (coming soon) to see how you can design your own scraper simply by clicking in your web browser.

If you need a large scale bespoke web scraping solution and don’t have a developer that can build it for you, then don’t worry. You can have a custom web scraping program built for you in a matter of days.

So by this stage, you might be asking yourself…

“This all sounds great! Is there anything that I need to be aware of if I’m going to use a web scraper?”

Which brings us to our final topic…

Is Web Scraping Legal?

Ok, you’ve come up with a great software application that will allow you to generate valuable insights for your business by scraping public information on numerous websites. Then the thought crosses your mind…

Will this land me in hot water?

Answer: By themselves, web scraping and crawling aren’t illegal. However, how you scrape a website, the data you’re scraping, and how you plan to use the data must all be taken into account when determining the legality of your scraping.

Disclaimer: I am not a lawyer and these comments are solely based on our experience working at Scrapinghub. If you want assistance with your specific situation you should consult a lawyer. 

People crawl and scrape other businesses’ websites all the time, and these businesses are happy to let them do so. The best example is Google, which crawls websites so that they can be indexed in its search results.

However, if your crawling and scraping activities put too much load on a website’s servers, or if you are damaging someone’s business by scraping their data, they might take offense.

There have been a number of cases in recent years where companies have taken web scrapers to court; however, the results have been mixed.

You can check out the legal section of Wikipedia’s Web Scraping article to get a better understanding of the legal cases that have involved web scraping.

Many large websites do explicitly forbid scraping in their terms of service, but oftentimes these have been found to be non-enforceable.

We talk in other guides about how you can make your scraper more “polite” so that it doesn’t get you into trouble.
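One common reading of “polite” (an assumption here, since those guides aren't part of this article) is checking a site's robots.txt before fetching a page and respecting any crawl delay it asks for. The sketch below uses Python's built-in `urllib.robotparser`; the robots.txt content, the `example-scraper` user agent and the URLs are all made up for illustration, and following robots.txt is a courtesy, not legal advice.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a rules object we can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(rules, user_agent, urls):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


rules = load_rules(ROBOTS_TXT)
candidates = [
    "https://example.com/listings",
    "https://example.com/private/data",
]
to_fetch = allowed_urls(rules, "example-scraper", candidates)
delay_seconds = rules.crawl_delay("example-scraper")  # seconds to wait between requests, if declared
```

A scraper would then only request the URLs in `to_fetch` and pause for `delay_seconds` between requests so it doesn't overload the website's servers.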

Need Help With A Web Scraping Project?

Get a free consultation from the world's leading web scraping experts.