Is Data Scraping Naughty?

Whilst tinkering, I’ve been doing a bit of data scraping, i.e. automatically pulling bits of information out of web pages and re-using them. I’ve been a little concerned about this, because I’m not sure whether it has any impact on the system that provides the information I’m scraping. NB: I’m not going in and pulling out tons of information. I only run specific queries that pull out a handful of web pages, which I then manipulate, and I do it infrequently.

I’m not so much concerned about the ethics of using scraped data – I’m not branding it as my own, or making money from it. In fact, 99.99% of the time I’m the only person who sees it/uses it. I’m just presenting it in a way that is more useful to me. I am really just concerned about the impact my data-scraping has on the server that is hosting the web page I’m scraping.


[Figure: Data Scraping Guilt Complex Flowchart]

I can see that it might have an impact on my own host server if I’m pulling out lots of information. In fact, I’ve got into a bit of trouble doing this with RunBasic, as I stupidly hadn’t thought about the strain it was putting on my host server when I kept testing something online. (I’ve gone back to running my RunBasic scraping on my own PC now!) As well as RunBasic, I’ve been using Yahoo Pipes to scrape data.

Looking at how the information comes into these systems, it seems that each one just calls up the web page I want and caches it off-site, and the manipulation then goes on off-site too, so the manipulation itself can’t have any further impact on the server the data originally came from. It seems to be the same as if I called up a web page normally (via the address bar or a search), looked at the source code, copied it to Notepad and tweaked it there. Is this right or wrong? I’m happy to be re-educated in a way that doesn’t sound patronising or rude. 😉
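
To make that picture concrete, here is a minimal sketch in Python of what I think is going on: one ordinary request for the page, a copy saved locally, and all the tweaking done on the copy. It is only an illustration under my own assumptions, not how RunBasic or Yahoo Pipes actually work inside. The folder name `scrape_cache` and the function names `get_page` and `extract_title` are made up for the example.

```python
import hashlib
import re
import urllib.request
from pathlib import Path

CACHE_DIR = Path("scrape_cache")  # hypothetical local folder for saved copies


def fetch_from_web(url):
    """Make one ordinary HTTP GET request, much as a browser would."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def get_page(url, fetch=fetch_from_web, cache_dir=CACHE_DIR):
    """Return the page text, contacting the remote server only on the first call."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Use a hash of the URL as a safe file name for the saved copy.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_file = cache_dir / name
    if cache_file.exists():
        # Already saved: read the local copy, no network traffic at all.
        return cache_file.read_text(encoding="utf-8")
    text = fetch(url)
    cache_file.write_text(text, encoding="utf-8")
    return text


def extract_title(html):
    """The 'manipulation' step: it works on the local text only."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```

If this is roughly right, the remote server only ever sees the single request made inside `fetch_from_web`. Anything like `extract_title(get_page(url))` run a second time is served from the saved copy, so however much I fiddle with the text afterwards, the original host never hears about it.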

I’ve read around this a bit: some people suggest it does have more of an impact, and others say it doesn’t.

So, if anyone can say for definite and explain it in words of, ummm… 4 syllables or fewer, I’d appreciate it. Thanks.

3 thoughts on “Is Data Scraping Naughty?”

  1. I sometimes wonder the same, so you aren’t the only one! I keep thinking I should check whether the host has a robots.txt file and what it says – although I note that the Yahoo Pipes FAQ says “Because Pipes is not a web crawler (the service only retrieves URLs when requested to by a Pipe author or user) Pipes does not follow the robots exclusion protocol (when fetching feeds), and won’t check your robots.txt file.” – I guess it follows your reasoning that this isn’t so different to using a browser to access the page.

    However, the speed with which you can make multiple requests with Yahoo Pipes or any other scripting mechanism can be an issue – if you request a page, and then (for example) request 10 more pages based on the content of the first page, then you are doing something different to just a normal web browser. I often do this type of scripting when I want to pull together a set of data that is scattered across the original website and display it back in a single place.

    As with you, any of the stuff that I do is basically only for my own benefit – so I don’t really have a huge expectation that others are going to use the scripts I throw out there – and if they do, I’ll notice the impact on my host as well, so I’ll be motivated to ensure it doesn’t get out of hand! Also, the web really is designed for this type of stuff – any decent webserver setup will cache pages to re-serve them quickly where necessary.

    I think if you are going to be requesting the same thing many times then you may as well cache it at your end – it saves the network traffic and will make your app snappier anyway.

    However, all that said, from your description, if you are just requesting a single page at a time – then there is no reason for this to have any different impact to a normal web browser, so I definitely wouldn’t feel bad about that.

    • Gary Green

      Thanks Owen.

      Re. “if you request a page, and then (for example) request 10 more pages based on the content of the first page, then you are doing something different to just a normal web browser.” I do this on the odd occasion, but again, I’d see this as the same as going from one web page to another, just more quickly. I should probably think about caching the pages at my end when appropriate – don’t want to cause problems if I can avoid them.

      Thanks for the heads-up about the robots.txt files. I’ve seen them on some sites, but hadn’t realised they were common or a standard.
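
Following up on the two points raised in this thread (checking robots.txt, and the speed of follow-on requests), here is a rough sketch in Python of how a script might handle both. It uses `urllib.robotparser` from Python’s standard library to read the rules, and pauses between requests when fetching several pages. Everything else is an assumption for illustration: the user agent name `my-little-scraper`, the two-second pause, and the `fetch` function you pass in are all made up, and this is not something Yahoo Pipes or RunBasic does for you.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-little-scraper"  # hypothetical name for the script


def load_rules(robots_txt_text):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that the site's robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests so they do not arrive in a burst."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the very first request
        pages[url] = fetch(url)
    return pages
```

In use, you would download the site’s robots.txt once, pass its text to `load_rules`, filter your list of follow-on links with `allowed_urls`, and hand what is left to `fetch_politely`. The pause is the part that addresses the “10 more pages” case: the server sees the requests spaced out, closer to a person clicking through, instead of all at once.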
