When scraping a website, you should always consider whether the web data you are planning to extract is copyrighted.
Copyright is the exclusive legal right over an original piece of work, such as an article, picture, or movie. It basically means that if you create it, you own it. To be copyrightable, the work needs to be original and fixed in a tangible form.
Common types of material on the web that might be copyrighted include articles, pictures, and videos.
As a result, copyright is very relevant to scraping, because much of the data on the internet consists of copyrighted works.
However, there are some situations in which an exception applies to all or part of the data, allowing it to be scraped legally without infringing the owner's copyright.
- Fair Use: Fair use is an exception that permits limited use of copyrighted material. Typically, fair use covers categories such as criticism, parody, comment, news reporting, teaching, scholarship, and research. One example is publishing short snippets of articles with links back to the source, which is generally okay under the fair use exception because the use is transformative and limited.
- The factors commonly used to determine whether the fair use exception applies are: (1) the purpose and character of your use (i.e., whether it is transformative in some way); (2) the nature of the work (e.g., fact vs. fiction, or published vs. unpublished); (3) the amount taken (the less you copy, the better); and (4) the effect on the potential market, meaning the extent to which your use may deprive the owner of income or a potential market opportunity.
- Transformative Use: One factor in determining fair use is whether the use is transformative. Instead of storing and distributing exact duplicates or lengthy portions of the crawled website, transform the content and the way you use it, so that your use is more likely to fall under the fair use exception.
- Facts: The facts within copyrighted material are often not covered by copyright law, so if you limit what you scrape to purely factual matters (e.g., product names and prices), scraping is generally acceptable.
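To make the last three points more concrete, here is a minimal Python sketch of how a scraper might keep only factual fields and store a short, linked snippet instead of the full text. The field names (`name`, `price`, `url`), the helper functions, and the 200-character limit are all hypothetical choices made for illustration. In particular, the limit is not a legal threshold, and whether a given use actually qualifies as fair use depends on the factors above and on your jurisdiction.

```python
# Hypothetical field names; adjust to whatever your scraper actually returns.
FACTUAL_FIELDS = ("name", "price", "url")

# Arbitrary illustrative limit, NOT a legal threshold for fair use.
MAX_SNIPPET_CHARS = 200


def keep_facts(record: dict) -> dict:
    """Return only the factual fields of a scraped record, dropping the rest."""
    return {key: record[key] for key in FACTUAL_FIELDS if key in record}


def make_snippet(text: str, url: str, limit: int = MAX_SNIPPET_CHARS) -> dict:
    """Keep a short excerpt plus a link to the source instead of the full text."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) > limit:
        # Cut at the last word boundary inside the limit and mark the truncation.
        text = text[:limit].rsplit(" ", 1)[0] + "..."
    return {"snippet": text, "source_url": url}


if __name__ == "__main__":
    scraped = {
        "name": "Example Widget",
        "price": "9.99",
        "url": "https://example.com/widget",
        "description": "A long, creatively written product description...",
    }
    print(keep_facts(scraped))
    print(make_snippet(scraped["description"], scraped["url"]))
```

Filtering at collection time, rather than storing everything and trimming later, means the copyrighted portions of a page are never kept in the first place.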
Note that different countries have different exceptions to copyright law, and you should always make sure that an exception applies in the jurisdiction where you’re operating.