If you are at all serious about web scraping, you’ll have quickly realised that proxy management is a critical component of any web scraping project.

When scraping the web at any reasonable scale, using proxies is an absolute must. However, it is common for managing and troubleshooting proxy issues to consume more time than building and maintaining the spiders themselves.

In this guide, we will break down the differences between the main proxy options and give you the information you need to consider when picking a proxy solution for your project or business.


What Are Proxies And Why Do We Need Them When Web Scraping?

Before we discuss what a proxy is, we first need to understand what an IP address is and how it works.

An IP address is a numerical address assigned to every device that connects to an Internet Protocol network like the internet, giving each device a unique identity. Most IP addresses look like this:

207.148.1.212

A proxy is a 3rd party server that enables you to route your requests through it, using its IP address in the process. When using a proxy, the website you are making the request to no longer sees your IP address but the IP address of the proxy, giving you the ability to scrape the web anonymously if you choose.
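
As a quick illustration, here is a minimal sketch of routing a single request through a proxy using Python's requests library. The proxy address is a placeholder from a documentation IP range, not a working proxy:

```python
import requests

# Placeholder proxy address - substitute a proxy you actually have access to.
PROXY = "http://203.0.113.10:8080"

proxies = {
    "http": PROXY,
    "https": PROXY,
}

# The target site now sees the proxy's IP address rather than your own.
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
print(response.json())  # httpbin echoes back the IP it observed
```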

Currently, the world is transitioning from IPv4 to a newer standard called IPv6, which allows for the creation of far more IP addresses. However, IPv6 has yet to gain much traction in the proxy business, so most proxies still use the IPv4 standard.

When scraping a website, we recommend that you use a 3rd party proxy and set your company name as the user agent so the website owner can contact you if your scraping is overburdening their servers or if they would like you to stop scraping the data displayed on their website.
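
Identifying yourself this way is a one-line change to your request headers. A minimal sketch, with a placeholder company name and contact URL:

```python
import requests

# Placeholder identification - use your real company name and a contact address.
headers = {"User-Agent": "ExampleCorp-Crawler/1.0 (+https://example.com/contact)"}

response = requests.get("https://example.com", headers=headers, timeout=30)
print(response.status_code)
```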

There are a number of reasons why proxies are important for web scraping:

  1. Using a proxy (especially a pool of proxies - more on this later) allows you to crawl a website much more reliably, significantly reducing the chances that your spider will get banned or blocked.
  2. Using a proxy enables you to make your request from a specific geographical region or device (mobile IPs for example), which enables you to see the specific content that the website displays for that given location or device. This is extremely valuable when scraping product data from online retailers.
  3. Using a proxy pool allows you to make a higher volume of requests to a target website without being banned.
  4. Using a proxy allows you to get around blanket IP bans some websites impose. Example: it is common for websites to block requests from AWS because there is a track record of some malicious actors overloading websites with large volumes of requests using AWS servers. 
  5. Using a proxy pool enables you to run a large number of concurrent sessions against the same or different websites.

Why Use A Proxy Pool?


Ok, we now know what proxies are, but how do you use them as part of your web scraping?

Just as relying solely on your own IP address limits you, using only one proxy to scrape a website will reduce your crawling reliability, geotargeting options, and the number of concurrent requests you can make.

As a result, you need to build a pool of proxies that you can route your requests through, splitting the traffic over a large number of IPs.
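
A minimal sketch of spreading requests across a small pool, assuming a hand-maintained list of placeholder proxy addresses:

```python
import random

import requests

# Placeholder pool - in practice these come from your provider or your own servers.
PROXY_POOL = [
    "http://198.51.100.1:8080",
    "http://198.51.100.2:8080",
    "http://198.51.100.3:8080",
]

def fetch_via_pool(url: str) -> requests.Response:
    """Pick a proxy at random so traffic is spread over the whole pool."""
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    print(url, fetch_via_pool(url).status_code)
```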

The size of your proxy pool will depend on a number of factors:

  1. The number of requests you will be making per hour.
  2. The target websites - larger websites with more sophisticated anti-bot countermeasures will require a larger proxy pool.
  3. The type of IPs you are using as proxies - datacenter, residential or mobile IPs.
  4. The quality of the IPs you are using as proxies - are they public proxies, shared proxies or private dedicated proxies? (Datacenter IPs are typically lower quality than residential and mobile IPs, but are often more stable due to the nature of the network.)
  5. The sophistication of your proxy management system - proxy rotation, throttling, session management, etc.

All five of these factors have a big impact on the effectiveness of your proxy pool. If you don’t properly configure your pool of proxies for your specific web scraping project you can often find that your proxies are being blocked and you’re no longer able to access the target website.

In the next section we will look at the different types of IPs you can use as proxies.

What Are Your Proxy Options?

If you’ve done any level of research into your proxy options you will have probably realised that this can be a confusing topic. Every proxy provider shouts from the rafters that they have the best proxy IPs on the web, with very little explanation as to why, making it very hard to assess which solution is best for your particular project.

So in this section of the guide we will break down the key differences between the available proxy solutions and help you decide which one is best for your needs. First, let’s talk about the fundamentals of proxies - the underlying IPs.

As mentioned already, a proxy is just a 3rd party IP address that you can route your requests through. There are three main types of IPs to choose from, each with its own pros and cons.


Datacenter IPs

Datacenter IPs are the most common type of proxy IP. They are the IPs of servers housed in data centers, and they are by far the cheapest to buy. With the right proxy management solution you can build a very robust web crawling solution for your business.


Residential IPs

Residential IPs are the IPs of private residences, enabling you to route your requests through a residential network. As residential IPs are harder to obtain, they are also much more expensive. In many situations they are overkill, as you could easily achieve the same results with cheaper datacenter IPs. They also raise legal and consent issues, because you are using a person’s personal network to scrape the web.


Mobile IPs

Mobile IPs are the IPs of private mobile devices. As you can imagine, acquiring the IPs of mobile devices is quite difficult, so they are very expensive. For most web scraping projects mobile IPs are overkill unless you specifically need to scrape the results shown to mobile users. More significantly, they raise even trickier legal and consent issues, as the device owner often isn’t fully aware that you are using their GSM network for web scraping.

Our recommendation is to go with data center IPs and put in place a robust proxy management solution. In the vast majority of cases, this approach will generate the best results for the lowest cost. With proper proxy management, data center IPs give similar results as residential or mobile IPs without the legal concerns and at a fraction of the cost.

Public, Shared or Dedicated Proxies

The other consideration we need to discuss is whether you should use public, shared or dedicated proxies.

As a general rule you should always steer well clear of public proxies, or "open proxies". Not only are these proxies of very low quality, they can also be very dangerous. Because they are open for anyone to use, they quickly get used to slam websites with huge volumes of dubious requests, and they inevitably end up blacklisted and blocked by websites very quickly. What makes them even worse is that these proxies are often infected with malware and other viruses. As a result, when using a public proxy you run the risk of spreading any malware that is present, infecting your own machines, and even exposing your web scraping activities if you haven’t properly configured your security (SSL certs, etc.).

The decision between shared and dedicated proxies is a bit more nuanced. Depending on the size of your project, your need for performance and your budget, paying for access to a shared pool of IPs might be the right option for you. However, if you have a larger budget and performance is a high priority, then paying for a dedicated pool of proxies might be the better option.

Ok, by now you should have a good idea of what proxies are and the pros and cons of the different types of IPs you can use in your proxy pool. However, picking the right type of proxy is only part of the battle; the really tricky part is managing your pool of proxies so they don’t get banned.

How to Manage Your Proxy Pool?

If you are planning on scraping at any reasonable scale, just purchasing a pool of proxies and routing your requests through them likely won’t be sustainable long term. Your proxies will inevitably get banned and stop returning high-quality data.

Here are some of the main challenges that you will face when managing your proxy pool (a minimal sketch illustrating a few of them follows the list):

  • Identify Bans - Your proxy solution needs to be able to detect numerous types of bans so that you can troubleshoot and fix the underlying problem - i.e. captchas, redirects, blocks, ghosting, etc.
  • Retry Errors - If a request runs into errors, bans, timeouts, etc., your solution needs to be able to retry it with a different proxy.
  • User Agents - Managing user agents is crucial to having a healthy crawl.
  • Control Proxies - Some scraping projects require you to keep a session with the same proxy, so you’ll need to configure your proxy pool to allow for this.
  • Add Delays - Randomize delays and apply good throttling to help cloak the fact that you are scraping.
  • Geographical Targeting - Sometimes you’ll need to be able to configure your pool so that only some proxies will be used on certain websites.
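
To make a few of these challenges concrete, here is a minimal sketch of a fetch helper that rotates proxies and user agents, retries on likely bans and errors, and adds a randomized delay between attempts. All the proxy addresses and user-agent strings are placeholders, and real ban detection is usually far more involved:

```python
import random
import time
from typing import Optional

import requests

# Placeholder pools - substitute your own proxies and user-agent strings.
PROXY_POOL = ["http://198.51.100.1:8080", "http://198.51.100.2:8080"]
USER_AGENTS = ["ExampleCorp-Crawler/1.0 (+https://example.com/contact)"]

BAN_STATUS_CODES = {403, 429, 503}  # Common signs of a block or rate limit

def fetch(url: str, max_retries: int = 3) -> Optional[requests.Response]:
    """Fetch a URL, retrying with a different proxy if a request fails or looks banned."""
    for _ in range(max_retries):
        proxy = random.choice(PROXY_POOL)
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            response = requests.get(
                url,
                headers=headers,
                proxies={"http": proxy, "https": proxy},
                timeout=30,
            )
            if response.status_code not in BAN_STATUS_CODES:
                return response
        except requests.RequestException:
            pass  # Timeouts and connection errors also trigger a retry
        # Randomized delay so retries don't hammer the site in a fixed pattern.
        time.sleep(random.uniform(2, 6))
    return None
```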

Managing a pool of 5-10 proxies is fine, but when you have hundreds or thousands of them it can get messy fast. To overcome these challenges you have three core options: Do It Yourself, Proxy Rotators, and Done For You solutions.


Do It Yourself

In this situation you purchase a pool of shared or dedicated proxies, then build and tweak a proxy management solution yourself to overcome all the challenges you run into. This can be the cheapest option, but also the most wasteful in terms of time and resources. It is often best to take this route only if you have a dedicated web scraping team with the bandwidth to manage your proxy pool, or if you have zero budget and can’t afford anything better.


Proxy Rotators

The middle-of-the-road solution is to purchase your proxies from a provider that also offers proxy rotation and geographical targeting. In this situation, the provider takes care of the more basic proxy management issues, leaving you to develop and manage session management, throttling, ban identification logic, etc.
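
With a rotator, your code typically points every request at a single gateway and the provider swaps the outgoing IP behind it. A minimal sketch, using an entirely hypothetical gateway address and credentials - real providers document their own host, port and authentication scheme:

```python
import requests

# Hypothetical rotator gateway - replace with your provider's documented endpoint.
ROTATING_GATEWAY = "http://username:password@gateway.example-provider.com:8000"
proxies = {"http": ROTATING_GATEWAY, "https": ROTATING_GATEWAY}

# Every request goes to the same gateway; the provider rotates the exit IP for you.
for page in range(1, 4):
    response = requests.get(
        f"https://example.com/category?page={page}",
        proxies=proxies,
        timeout=30,
    )
    print(page, response.status_code)
```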


Done for You

The final solution is to completely outsource your proxy management. Solutions such as Crawlera are designed as smart downloaders, where your spiders just make a request to the API and it returns the data you require, managing all the proxy rotation, throttling, blacklisting, session management, etc. under the hood so you don’t have to.

Each of these approaches has its own pros and cons, so the best solution will depend on your specific priorities and constraints.

How to Pick The Best Proxy Solution For Your Project?


Deciding on an approach to building and managing your proxy pool can be a headache. In this section we will outline some of the questions you need to be asking yourself when picking the best proxy solution for your needs:

  1. What’s your budget? If you have a very limited or virtually non-existent budget then managing your own proxy pool is going to be the cheapest option. However, if you have even a small budget of $20 per month then you should seriously consider outsourcing your proxy management to a dedicated solution that manages everything.
  2. What is your #1 priority? If learning about proxies and everything web scraping is your #1 priority then buying your own pool of proxies and managing them yourself is probably your best option. However, if your #1 priority is getting the web data you need and achieving maximum performance from your web scraping, as is the case for most companies, then it is nearly always better to outsource your proxy management to a done for you solution, or at the very least use a proxy rotator.
  3. What is your technical skill level and your available resources? To manage your own proxy pool for a reasonably sized web scraping project you will need at least a basic level of software development expertise and the bandwidth to build and maintain your spiders’ proxy management logic. If you don’t have this expertise or can’t devote engineering resources to it, then you are often better off either using a proxy rotator and building the rest of the management infrastructure yourself, or using a done for you proxy management solution.

Your answers to these questions will quickly help you decide which approach to proxy management best suits your needs.

Build In-house or Done For You Solutions?

As outlined above, if you are more focused on learning everything about web scraping from the ground up or have a very tight budget then buying access to a shared pool of IPs and managing the proxy management logic yourself is probably your best option.

However, if your focus is on getting the web data you need with little to no hassle, or on maximising your web scraping performance, then you should look into either using a proxy rotator and building the rest of the management infrastructure in-house, or using a done for you proxy management solution.


Proxy Rotator

As we discussed, if you want to go it alone then at the very least you should use a proxy provider that offers proxy rotation as a service. This removes the first layer of proxy management. However, you will still have to implement your own session management, request throttling, IP blacklisting and ban identification logic.
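
As a rough illustration of the kind of logic that remains on your plate, here is a minimal ban-identification sketch. The status codes and response markers are only examples; real sites signal bans in far more varied ways:

```python
BAN_STATUS_CODES = {403, 429, 503}
BAN_MARKERS = ("captcha", "access denied", "unusual traffic")  # Illustrative only

def looks_banned(status_code: int, body: str) -> bool:
    """Very rough heuristic for spotting a ban, block page or captcha wall."""
    if status_code in BAN_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BAN_MARKERS)

# Proxies that trip this check can then be blacklisted for a cool-off period
# before being rotated back into the pool.
```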


Done For You  

The other approach is to let intelligent algorithms manage your proxies for you automatically. With this approach, instead of having to rely on very expensive residential and mobile IPs to get clean data, purpose-built proxy management solutions handle the rotation, throttling, and selection of datacenter IPs so that they return consistently clean data, only falling back to more expensive IPs when there is no other option. Here your best option is a solution like Crawlera, the smart downloader developed by Scrapinghub.

With Crawlera, instead of having to manage a pool of IPs, your spiders just send a request to Crawlera’s single endpoint API to retrieve the desired data. Crawlera manages a massive pool of proxies, carefully rotating, throttling, blacklisting and selecting the optimal IPs to use for any individual request to give the best results at the lowest cost. This completely removes the hassle of managing IPs, so users can focus on the data, not the proxies.
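
For Scrapy users, this was typically a settings change rather than new spider code. A minimal sketch, assuming the scrapy-crawlera middleware and the settings names from Scrapinghub's documentation, with a placeholder API key:

```python
# settings.py of an existing Scrapy project (sketch - placeholder API key)
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your-crawlera-api-key>"

# Crawlera handles the proxy rotation, throttling and ban handling itself,
# so your spiders keep making ordinary requests and simply parse the responses.
```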

The huge advantage of this approach is that it is extremely scalable. Crawlera can scale from a few hundred requests per day to hundreds of thousands of requests per day without any additional workload on your part. Better yet, with Crawlera you only pay for successful requests that return your desired data, not for IPs or the amount of bandwidth you use.


Legal Considerations When Using Proxies


By this stage, you should have a good idea of what proxies are and how to choose the best option for your web scraping project. However, there is one consideration that many people overlook when it comes to web scraping and proxies: the legal considerations.

The act of using a proxy IP to visit a website is legal. However, there are a couple of things you need to keep in mind to make sure you don’t stray into a grey area.

Having a robust proxy solution is akin to having a superpower, but it can also make you sloppy. With the ability to make a huge volume of requests to a website without being easily identified, people can get greedy and overload a website’s servers with too many requests, which is never the right thing to do.

If you are a web scraper you should always be respectful to the websites you scrape. No matter the scale or sophistication of your web scraping operation you should always comply with web scraping best practices (Web Scraping Best Practices Guide Coming Soon) to ensure your spiders are polite and cause no harm to the websites you are scraping. Additionally, if the website informs you (or informs the proxy provider) that your scraping is burdening their site or is unwanted, you should limit your requests or cease scraping, depending on the complaint received. So long as you play nice, it's much less likely you will run into any legal issues.
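
In Scrapy, for example, a handful of standard settings go a long way towards polite crawling. A minimal sketch - the exact values always depend on the target site:

```python
# settings.py - conservative politeness defaults (tune per target site)
ROBOTSTXT_OBEY = True            # Respect the site's robots.txt rules
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 3               # Base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True  # Vary the delay (0.5x to 1.5x of DOWNLOAD_DELAY)
AUTOTHROTTLE_ENABLED = True      # Back off automatically when the site slows down
USER_AGENT = "ExampleCorp-Crawler/1.0 (+https://example.com/contact)"
```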

As mentioned in our Web Scrapers Guide to GDPR, the other legal consideration you need to make when using residential or mobile IPs is whether you have the IP owner’s explicit consent to use their IP for web scraping.

As GDPR defines IP addresses as personally identifiable information you need to ensure that any EU residential IPs you use as proxies are GDPR compliant. This means that you need to ensure that the owner of that residential IP has given their explicit consent for their home or mobile IP to be used as a web scraping proxy.

If you own your own residential IPs then you will need to handle this consent yourself. However, if you are obtaining residential proxies from a 3rd party provider then you need to ensure that they have obtained consent and are in compliance with GDPR prior to using the proxy for your web scraping project.

Need Help With A Web Scraping Project?

Get a free consultation from the world's leading web scraping experts.
