Whilst tinkering, I’ve been doing a bit of data scraping, i.e. automatically pulling bits of information out of web pages and re-using them. I’ve been a little concerned about this, because I’m not sure whether it has any impact on the system that provides the information I’m scraping. NB: I’m not going in and pulling out tons of information. I only run specific queries that pull out a handful of web pages, which I then manipulate, and I do it infrequently.
I’m not so much concerned about the ethics of using scraped data – I’m not branding it as my own, or making money from it. In fact, 99.99% of the time I’m the only person who sees it/uses it. I’m just presenting it in a way that is more useful to me. I am really just concerned about the impact my data-scraping has on the server that is hosting the web page I’m scraping.
I can see that it might have an impact on my own host server if I’m pulling out lots of information. In fact, I’ve got into a bit of trouble doing this using RunBasic, as I stupidly hadn’t thought about the strain it was putting on my host server when I kept testing something online. (I’ve gone back to running the scraping via RunBasic on my own PC now!) As well as RunBasic, I’ve been using Yahoo Pipes to scrape data.
Looking at how the information comes into these systems, it seems that they just call up the web page I want and cache it off-site, and the manipulation then goes on off-site too, so it can’t have an impact on the server the data originally came from. It seems to be the same as if I called up a web page normally (via the address bar or a search), looked at the source code, copied it to Notepad and tweaked it there. Is this right or wrong? I’m happy to be re-educated in a way that doesn’t sound patronising or rude. 😉
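To make that picture concrete, here is a minimal Python sketch of the "call up the page once, keep a copy, tweak the copy" idea. It is only an illustration under my own assumptions, not how RunBasic or Yahoo Pipes actually work inside: the names CachedScraper, fetch_from_origin, extract_title and fake_fetch are all made up for this example, and the demo uses a stand-in fetcher so that nothing is really requested from any server.

```python
import urllib.request


def fetch_from_origin(url):
    """Make one real HTTP GET request to the server hosting the page."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class CachedScraper:
    """Hypothetical helper: fetch each URL once, then work on the local copy."""

    def __init__(self, fetch=fetch_from_origin):
        self.fetch = fetch          # function that actually contacts the server
        self.cache = {}             # url -> page source held locally
        self.origin_requests = 0    # how many times the origin server was hit

    def get(self, url):
        # Only contact the origin server if we have no local copy yet.
        if url not in self.cache:
            self.origin_requests += 1
            self.cache[url] = self.fetch(url)
        return self.cache[url]


def extract_title(html):
    """Pull the text between <title> and </title> out of the local copy."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        return None
    return html[start + len("<title>"):end].strip()


def fake_fetch(url):
    """Stand-in for a real request, so the demo runs without a network."""
    return "<html><head><title> Example page </title></head></html>"


scraper = CachedScraper(fetch=fake_fetch)
for _ in range(5):
    # Five rounds of "manipulation", but the page is only fetched once.
    title = extract_title(scraper.get("http://example.com/page"))

print(title, scraper.origin_requests)  # prints: Example page 1
```

In this sketch the counter only goes up when a page is actually requested, so if my understanding is right, what the original server notices is the number of requests I send, not how much tweaking I do to the copy afterwards.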
I’ve read around this a bit: some people suggest it does have more of an impact, and others say it doesn’t.
So, if anyone can say for definite and explain it in words of, ummm… four syllables or less, I’d appreciate it. Thanks.