Friday, 3 July 2015

Mobile app developers “duped” into distributing data-scraping malware: NICTA

The surge in mobile malware has led many to condemn developers' poor security practices, yet recent NICTA research suggests that – even though data-stealing is ubiquitous among both paid and free Android applications – many mobile application developers are in fact being “duped” into incorporating data-stealing routines into their applications.

A methodical analysis of Android applications and source code found that all of the top 100 paid and non-paid apps in Australia were collecting personal information, with 60 percent of the apps incorporating some sort of tracking library and 20 percent of the apps featuring more than three different tracking libraries.

While many have blamed developers for poor security, NICTA's Aruna Seneviratne, who leads the organisation's Networks Research Group, told CSO Australia that many tracking libraries are added inadvertently when developers incorporate third-party libraries into their mobile apps.

“In most cases app developers just use third-party libraries and don't know what's in them,” he said. “They're not being malicious for the sake of being malicious; they are just being duped into doing a thing that collects a lot of information.”

And collect they do. Apps analysed by the team – whose paper 'Early Detection of Spam Mobile Apps' was accepted for presentation at the recent WWW 2015 conference in Florence, Italy – were siphoning all kinds of personal information from users' mobile devices, often sending it to enlarge what have become massive databases of personal preferences and behavioural profiles.

“It's amazing how much information each of those apps collects,” he said, “and the scary thing is that most of them actually go to a small number of sources – which means these guys can actually infer a lot of information about you. They have a very good idea of who you are and what you're doing – and they are cross-matching the information they collect.”

Ever more-clever data-siphoning routines were making data collection richer all the time, with many Android apps now being designed with libraries that collect information about nearby Wi-Fi access points and can correctly extrapolate the user's location 90 percent of the time.

Seneviratne blamed Google's relatively lax app-approval process for the proliferation of such apps, which join the malware-laden apps that by the team's figures account for around 3 percent of all Google Play Store apps.

Recognising that developers are often as clueless as users about the extent of the data collection going on, the team has proposed an app-rating system that will give consumers a better idea of what they're enabling by downloading and installing a particular app.

A basic prototype has already been developed and a pilot site is expected to be up and running by the fourth quarter of this year. The service, which rates apps on criteria such as privacy and security, will be available to third parties as a Web service that Seneviratne hopes will eventually help it gain traction on app-rating and other sites.

“We've been working to come up with a scheme that is similar to the energy-ratings system that you have for electrical appliances,” he said, noting that the site will also seek to boost developers' security awareness by correlating app ratings “to let consumers know they can download an alternate app that has the same functionality but a higher security rating”.

Israeli developer-tools firm Checkmarx has taken its own approach to improving developers' security skills, recently learning extensive lessons as hackers worked to manipulate its Game of Hacks security application – which is now under development to be sold to large corporates for developer training and testing.

Source: http://www.cso.com.au/article/576533/mobile-app-developers-duped-into-distributing-data-scraping-malware-nicta/

Friday, 26 June 2015

Data Scraping - Enjoy the Appeal of the Hand Scraped Flooring

Hand scraped flooring is appreciated for the character it brings into the home. This style of flooring relies on hand-scraped planks of wood rather than precisely milled boards. The irregularities in the planks provide a certain degree of charm and help to create a more unique feature in the home.

Distressed vs. Hand scraped

There are two types of flooring on the market that have an aged and unique charm with an imperfect finish. However, there is a significant difference in the process used to manufacture the planks. The more standard distressed flooring is cut on a factory production line. The grooves, scratches, dents, or other irregularities in these planks are part of the manufacturing process and are achieved by rolling or pressing the wood onto a patterned surface.

Real hand scraped planks are made by craftsmen who work on each plank individually. With this technique, each plank is certain to be unique in appearance.

Scraping the planks

The hand scraping process on the highest-quality planks is completed by trained carpenters or craftsmen who take great care in their workmanship and produce a high-quality end product. It is worth asking the flooring supplier who completes the work.

Besides the well-scraped lumber, there are also planks bought from less desirable sources, a consequence of the increased demand for this type of flooring. At the lower end of the market unskilled workers are used and the end results aren't as impressive.

A high-quality plank has a distinctive look and feels and functions perfectly well as solid flooring, while low-quality work can appear quite ugly and cheap.

Even though it might cost a little more, it pays to seek out hardwood floor dealers that rely on skilled workers to complete the scraping process.

Buying the right lumber

Once a genuine supplier is found, it is necessary to determine the finer aspects of the wooden flooring. This hand scraped flooring is available in several hardwoods, such as oak, cherry, hickory, and walnut. Plus, it comes in many different sizes and widths. A further aspect relates to the finish with darker colored woods more effective at highlighting the character of the scraped boards. This makes the shadows and lines appear more prominent once the planks have been installed at home.

Why not visit Bellacerafloors.com for the latest collection of luxury floor materials, including the Handscraped Hardwood Flooring.

Source: http://ezinearticles.com/?Enjoy-the-Appeal-of-the-Hand-Scraped-Flooring&id=8995784

Saturday, 20 June 2015

Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples

My intrepid colleague (@jayjacobs) informed me of this (and didn’t gloat too much). I’ve got a “pirate day” post coming up this week that involves scraping content from the web and thought folks might benefit from another example that compares the “old way” and the “new way” (Hadley excels at making lots of “new ways” in R :-) I’ve left the output in with the code to show that you get the same results.

The following shows old/new methods for extracting a table from a web site, including how to use either XPath selectors or CSS selectors in rvest calls. To stave off some potential comments: due to the way this table is set up and the need to extract only certain components from the td blocks and elements from tags within the td blocks, a simple readHTMLTable would not suffice.

The old/new approaches are very similar, but I especially like the ability to chain output ala magrittr/dplyr and not having to mentally switch gears to XPath if I’m doing other work targeting the browser (i.e. prepping data for D3).

The code (sans output) is in this gist, and IMO the rvest package is going to make working with web site data so much easier.

library(XML)
library(httr)
library(rvest)
library(magrittr)

# setup connection & grab HTML the "old" way w/httr

freak_get <- GET("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

freak_html <- htmlParse(content(freak_get, as="text"))

# do the same the rvest way, using "html_session" since we may need connection info in some scripts

freak <- html_session("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

# extracting the "old" way with xpathSApply

xpathSApply(freak_html, "//*/td[3]", xmlValue)[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

xpathSApply(freak_html, "//*/td[1]", xmlValue)[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

xpathSApply(freak_html, "//*/td[4]", xmlValue)

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

xpathSApply(freak_html, "//*/td[4]/a[contains(@href,'imdb')]", xmlAttrs, "href")

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"  "http://www.imdb.com/title/tt0454876/"

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1024648/"  "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"  "http://www.imdb.com/title/tt0443272/"

##                                    href

## "http://www.imdb.com/title/tt1790885/?"


# extracting with rvest + XPath

freak %>% html_nodes(xpath="//*/td[3]") %>% html_text() %>% .[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

freak %>% html_nodes(xpath="//*/td[1]") %>% html_text() %>% .[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

freak %>% html_nodes(xpath="//*/td[4]") %>% html_text() %>% .[1:10]

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes(xpath="//*/td[4]/a[contains(@href,'imdb')]") %>% html_attr("href") %>% .[1:10]

##  [1] "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"

##  [3] "http://www.imdb.com/title/tt0454876/"  "http://www.imdb.com/title/tt1024648/"

##  [5] "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##  [7] "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"

##  [9] "http://www.imdb.com/title/tt0443272/"  "http://www.imdb.com/title/tt1790885/?"

# extracting with rvest + CSS selectors

freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10]

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10]

##  [1] "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"

##  [3] "http://www.imdb.com/title/tt0454876/"  "http://www.imdb.com/title/tt1024648/"

##  [5] "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##  [7] "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"

##  [9] "http://www.imdb.com/title/tt0443272/"  "http://www.imdb.com/title/tt1790885/?"

# building a data frame (which is kinda obvious, but hey)

data.frame(movie=freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10],

           rank=freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11],

           rating=freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10],

           imdb.url=freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10],

           stringsAsFactors=FALSE)

##                                 movie rank        rating                              imdb.url

## 1            Silver Linings Playbook     1 7.4 / trailer  http://www.imdb.com/title/tt1045658/

## 2  The Hobbit: An Unexpected Journey     2 8.2 / trailer  http://www.imdb.com/title/tt0903624/

## 3          Life of Pi (DVDscr/DVDrip)    3 8.3 / trailer  http://www.imdb.com/title/tt0454876/

## 4                       Argo (DVDscr)    4 8.2 / trailer  http://www.imdb.com/title/tt1024648/

## 5                     Identity Thief     5 8.2 / trailer  http://www.imdb.com/title/tt2024432/

## 6                           Red Dawn     6 5.3 / trailer  http://www.imdb.com/title/tt1234719/

## 7      Rise Of The Guardians (DVDscr)    7 7.5 / trailer  http://www.imdb.com/title/tt1446192/

## 8           Django Unchained (DVDscr)    8 8.8 / trailer  http://www.imdb.com/title/tt1853728/

## 9                    Lincoln (DVDscr)    9 8.2 / trailer  http://www.imdb.com/title/tt0443272/

## 10                  Zero Dark Thirty    10 7.6 / trailer http://www.imdb.com/title/tt1790885/?

Source: http://www.r-bloggers.com/migrating-table-oriented-web-scraping-code-to-rvest-wxpath-css-selector-examples/

Tuesday, 9 June 2015

Web Scraping Services : Data Discovery vs. Data Extraction

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the "details" links within the search results pages to get to the data you're actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.
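
To make that discovery flow concrete, here is a minimal Python sketch using the requests and lxml libraries. Every URL, form field and CSS class in it is a hypothetical placeholder for whatever the target site actually uses; a real site would need its own login fields, search parameters and pagination handling.

import requests
import lxml.html

session = requests.Session()  # keeps cookies across requests

# 1. Log in so the session carries the cookies the site expects
session.post("http://example.com/login", data={"user": "me", "password": "secret"})

# 2. Submit the search form as a POST request
results = session.post("http://example.com/search", data={"q": "widgets", "page": 1})

# 3. Follow each "details" link in the results page to reach the data pages
doc = lxml.html.fromstring(results.text)
doc.make_links_absolute("http://example.com/")
for link in doc.cssselect("a.details"):
    detail_page = session.get(link.get("href"))
    # detail_page.text is what the data extraction phase works on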

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has involved creating a series of regular expressions that match the pieces of the page you want (e.g., URLs and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.
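
As a small illustration of the regular-expression approach (not a recommendation over a proper parser), the sketch below pulls link URLs and link titles out of a snippet of HTML; the pattern is deliberately simple and would need hardening for real pages.

import re

html = '<a href="/post/1">First post</a> <a href="/post/2">Second post</a>'
link_pattern = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

for url, title in link_pattern.findall(html):
    print(url, "->", title)
# /post/1 -> First post
# /post/2 -> Second post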

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.
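
For that third phase, here is one possible ending to the pipeline: writing the extracted records to a CSV file with Python's standard csv module. The field names and records are made up for the example.

import csv

records = [
    {"url": "/post/1", "title": "First post"},
    {"url": "/post/2", "title": "Second post"},
]

with open("scraped.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(records)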

Source: http://ezinearticles.com/?Data-Discovery-vs.-Data-Extraction&id=165396

Tuesday, 2 June 2015

WordPress Titles: scraping with search url

I’ve blogged for a few years now, and I’ve used several tools along the way. zachbeauvais.com began as a Drupal site, until I worked out that it’s a bit overkill, and switched to WordPress. Recently, I’ve been toying with the idea of using a static site generator (à la Jekyll or Hyde), or even pulling together a kind of ebook of ramblings. I also want to be able to arrange the posts based on the keywords they contain, regardless of how they’re categorised or tagged.

Whatever I wanted to do, I ended up with a single point of messiness: individual blog posts, and how they’re formatted. When I started, I seem to remember using Drupal’s truly awful WYSIWYG editor, and tweaking the HTML soup it produced. Then, when I moved over to WordPress, it pulled all the posts and metadata through via RSS, and I tweaked with the visual and text tools which are baked into the engine.

A couple years ago, I started to write in Markdown, and completely apart from the blog (thanks to full-screen writing and loud music). This gives me a local .md file, and I copy/paste into WordPress using a plugin to get rid of the visual editor entirely.

So, I wrote a scraper to return a list of blog posts containing a specific term. What I hope is that this very simple scraper is useful to others—WordPress is pretty common, after all—and to get some ideas for improving it, and handle post content. If you haven’t used ScraperWiki before, you might not know that you can see the raw scraper by clicking “view source” from the scraper’s overview page (or going here if you’re lazy).

This scraper is based on WordPress’ built-in search, which can be used by passing the search terms to a url, then scraping the resulting page:

http://zachbeauvais.com/?s=search_term&submit=Search

The scraper uses three Python libraries:

    Requests
    ScraperWiki
    lxml.html

There are two variables which can be changed to search for other terms, or using a different WordPress site:

import requests      # the three libraries listed above
import scraperwiki
import lxml.html

term = "coffee"
site = "http://www.zachbeauvais.com"

The rest of the script is really simple: it creates a dictionary called “payload” containing the letter “s”, the keyword, and the instruction to search. The “s” is in there to make up the search url: /?s=coffee …

Requests then GETs the site, passing payload as URL parameters, and I use the response's .text attribute to get the page's HTML, which I then pass through lxml into the new variable “root”.

payload = {'s': str(term), 'submit': 'Search'}

r = requests.get(site, params=payload)  # This'll be the results page
html = r.text
root = lxml.html.fromstring(html)  # parsing the HTML into the var root

Now, my WordPress theme renders the titles of the retrieved posts in <h1> tags with the CSS class “entry-title”, so I loop through the html text, pulling out the links and text from all the resulting h1.entry-title items. This part of the script would need tweaking, depending on the CSS class and h-tag your theme uses.

for i in root.cssselect("h1.entry-title"):
    link = i.cssselect("a")     # the anchor tag inside the title heading
    text = i.text_content()     # the post title text
    if link:
        data = {
            'uri': link[0].attrib['href'],
            'post-title': str(text),
            'search-term': str(term)
        }
        print(data)
        scraperwiki.sqlite.save(unique_keys=['uri'], data=data)
    else:
        print("No results.")

These are saved into an SQLite database via the ScraperWiki library, and I end up with a database holding the title and link of every blog post containing the keyword.

So, this could, in theory, run on any WordPress instance which uses the same search pattern URL—just change the site variable to match.

Also, you can run this again and again, changing the term to any new keyword. These will be stored in the DB with the keyword in its own column to identify what you were looking for.

See? Pretty simple scraping.

So, what I’d like next is to have a local copy of every post in a single format.

Has anyone got any ideas how I could improve this? And, has anyone used WordPress’ JSON API? It might be a logical next step to call the API to get the posts directly from the MySQL DB… but that would be a new blog post!
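
For what it's worth, here is a rough sketch of that JSON route. It assumes the site runs a WordPress version that exposes the core REST API (or an equivalent plugin); if it does, posts can be searched without scraping HTML at all.

import requests

site = "http://www.zachbeauvais.com"
term = "coffee"

resp = requests.get(site + "/wp-json/wp/v2/posts", params={"search": term})
for post in resp.json():
    print(post["link"], "-", post["title"]["rendered"])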

Source: https://scraperwiki.wordpress.com/2013/03/11/wordpress-titles-scraping-with-search-url/

Friday, 29 May 2015

Data Scraping Services - Web Scraping Video Tutorial Collection for All Programming Language

Web scraping is a mechanism in which a request is made to a website URL to get the HTML document text, and that text is then parsed to extract data from the HTML code. Scraping a website for data is a general approach and can be implemented in almost any programming language, such as PHP, Java, C#, Python and many others.

There are many web scraping software packages on the market that let you extract data with no coding knowledge. In many cases, though, an off-the-shelf tool doesn't help because the data requires a custom crawling flow, and then you have to build your own web scraping application in a programming language you know. In this post I have collected scraping video tutorials for several programming languages.

I am mostly familiar with web scraping using PHP, C# and some other scraping tools, and I provide a web scraping service. If you have any scraping requirement, send me your requirements and I will get back to you with a sample data scrape and my best price.

 Web Scraping Using PHP

You can do web scraping in PHP using the cURL library and the Simple HTML DOM parsing library. The PHP function file_get_contents() can also be useful for making web requests. One drawback of scraping with PHP is that it can't execute JavaScript, so AJAX-based scraping isn't possible with plain PHP.

Web Scraping Using C#

There are many libraries available in .NET for HTML parsing and data scraping. I have used the WebBrowser control and the HTML Agility Pack for data extraction in .NET using C#.

I haven't done web scraping in Java, Perl or Python myself. I have learned web scraping in Node.js using the CasperJS and PhantomJS libraries. But I thought the tutorials below would be helpful for someone working in Java or Python.

Web Scraping Using Jsoup in Java

Scraping Stock Data Using Python

Develop Web Crawler Using PERL

Web Scraping Using Node.Js

If you find any other good web scraping video tutorial, you can share the link in the comments so other readers can benefit from it.

Source: http://webdata-scraping.com/web-scraping-video-tutorial-collection-programming-language/

Tuesday, 26 May 2015

Data Mining Services

Data Mining Services, through its data mining services, can mine the data you require from any of the available sources. Over the years, we have successfully catered to a wide variety of outsourced data mining requirements, which demonstrates our competency in dealing with your data mining needs.

Based on your requirements, we can mine data from your preferred data sources, or we will use our own reliable sources to mine the data required by you. We use automated as well as manual data mining strategies to deliver superior data mining services.

Types of data mining services delivered by us

With the extensive variety of data mining services we provide, you will definitely be able to find the perfect service package to cater to your requirements. Listed below are just some of the data mining services offered by us:

•    Web data mining
•    Data extraction
•    Data capture
•    Data gathering
•    Collection of required data
•    Validation of data

Outsource your data mining requirements to us, and we are sure that the data mining India unit of Hi-Tech BPO Services will be able to formulate the most appropriate and cost-effective solutions to cover your entire requirements.

Highlights of our data mining services:

•    Most affordable rates
•    Dedicated data mining India unit
•    Latest data mining technologies used to mine all required data
•    Data will be mined, gathered, processed and validated as per your requirements
•    Mined data can be directly included into your database

Competitive advantage of using our data mining services

To mine accurate and relevant data, some level of internet knowledge is essential. And it would also consume a lot of your valuable time. With our data mining services, we will take care of all your data mining tasks, while you look after your business and its core functions.

The affordably priced data mining services delivered by the data mining India unit will also help you save a considerable amount of money, which you can put to more productive purposes.

Source: http://www.hitechbposervices.com/data-mining.php

Monday, 25 May 2015

Improving performance for web scraping code

I have a website in which the code scrapes other websites to get accurate data. While the code works, there is a decent lag in performance because the code first downloads the HTML stream from various sites (sometimes 9 websites), extracts the relevant part and then renders the HTML page.

What should I do to get optimal performance? Should I change from shared hosting (GoDaddy) to my own server, or does it have nothing to do with my hosting and I need to make changes to my code?

1 Answer

API/CSV

Ask those websites if they provide an API, or, if you don't need up-to-date information or the information you need doesn't change frequently, if they can sell or give you the data itself for free (for example in a CSV file). Some small websites may have fancier ways to access data, like a CSV file for the older information and an RSS feed for the changes.

Those websites would probably be happy to help you, since providing you with an API would reduce the CPU and bandwidth usage you cause on their side.

Profile

Screen scraping is really ugly when it comes to performance and scaling. You may be limited by:

    your machine's performance, since parsing an HTML file (sometimes an invalid one) takes time,

    your network speed,

    their network usage limits, i.e. how fast you can access the pages of their website given the restrictions they set, like DoS protection and the number of requests per second allowed for screen scrapers and search engine crawlers,

    their machine's performance: if they spend 500 ms generating every page, you can't do anything to reduce this delay.

If, despite your requests to them, those websites cannot provide any convenient way to access their data, but they give you a written consent to screen scrape their website, then profile your code to determine the bottleneck. It may be the internet speed. It may be your database queries. It may be anything.

For example, you may discover that you spend too much time finding the relevant information in the received HTML with regular expressions. In that case, you would want to stop doing it wrong and use a parser instead of regular expressions, then see how this improves performance.
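
As a hedged example of that advice, here is the same kind of link extraction done with the lxml parser instead of regular expressions; the HTML snippet and the CSS class are placeholders. Parsing the document once and querying the tree is usually both faster and far less fragile than layering regular expressions.

import lxml.html

html = '<div class="item"><a href="/post/1">First post</a></div>'
doc = lxml.html.fromstring(html)

for a in doc.cssselect("div.item a"):
    print(a.get("href"), "->", a.text_content())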

You may also find that the bottleneck is the time the remote server spends generating every page. In this case, there is nothing you can do: you may have the fastest server, the fastest connection and the most optimized code, and the performance will be the same.

Do things in parallel:

Remember to use parallel computing wisely and to always profile what you're doing, instead of doing premature optimization, in hope that you're smarter than the profiler.

Especially when it comes to using network, you may be very surprised. For example, you may believe that making more requests in parallel will be faster, but as Steve Gibson explains in episode 345 of Security Now, this is not always the case.
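
A small sketch of measuring rather than assuming: fetch the same set of pages sequentially and with a small thread pool, and compare the wall-clock times. The URL list is a placeholder, and the right level of parallelism is whatever your measurements (and the target site's limits) say it is.

import time
import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["http://example.com/page/%d" % i for i in range(1, 11)]

start = time.time()
for url in urls:
    requests.get(url)
print("sequential: %.1f s" % (time.time() - start))

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(requests.get, urls))  # force all requests to complete
print("4 workers:  %.1f s" % (time.time() - start))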

Legal aspects

Also note that screen scraping is explicitly forbidden by the terms of use of many websites (IMDB, for example). And if nothing is said on the subject in the terms of use, it doesn't mean that you can screen scrape those websites.

The fact that the information is available publicly on the internet doesn't give you the right to copy and reuse it this way either.

Why? you may ask. For a few reasons:

    Most websites rely on advertising and marketing. When people use one of those websites directly, they consume some of the website's CPU/network bandwidth, but in return they may click on an ad or buy something sold on the website. When you screen scrape, your bot wastes their CPU/network bandwidth but will never click on an ad or buy something.

    Displaying the information you screen scraped on your website can have even worse effects. Example: in France, there are two major websites selling hardware. The first one is easy and fast to use, has a nice visual design and better SEO, and in general is very well done. The second one is crap, but the prices are lower. If you screen scrape them and give the raw results (prices with links) to your users, they will obviously click on the lower price every time, which means the website with the pretty design will have less chance of selling the products.

    People made an effort in collecting, processing and displaying some data. Sometimes they paid to get it. Why would they enjoy seeing you pulling this data conveniently and for free?

Source: http://programmers.stackexchange.com/questions/141403/improving-performance-for-web-scraping-code/141406#141406

Friday, 22 May 2015

How to prevent getting blacklisted while scraping

Crawlers can retrieve data much quicker and in greater depth than human searchers, so bad scraping practices can have some impact on the performance of the site.

Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, an underpowered server would have a hard time keeping up with requests from multiple crawlers.

Since spiders don’t bring direct organic traffic and seemingly affect the performance of the site, most site admins hate spiders and do their best to prevent them.

Let's go through how websites detect and block spiders, and the techniques to overcome those barriers.

Most websites don’t have anti-scraping mechanisms since they would affect the user experience, but some sites do not believe in open data access.

Before going through this article, always keep in mind that:

    A GOOD SPIDER MUST OBEY A WEBSITE’S CRAWLING POLICIES.

HOW DOES DETECTING ‘SPIDER ACTIVITY’ WORK?

A web server can use different mechanisms to detect a spider from a normal user. Here are some methods used by a site to detect a spider:

•    Unusual traffic or a high download rate, especially from a single client or IP address within a short time span, raises a bot alert.

•    Repetitive tasks performed on the website, based on the assumption that a human user won’t perform the same repetitive tasks all the time.

•    The site has honeypot traps inside its pages; these honeypots are usually links which aren’t visible to a normal user but only to a spider. When a scraper/spider tries to access the link, the alarm is tripped.

Spend some time investigating the anti-scraping mechanisms used by a site and build the spider accordingly; it will provide a better outcome in the long run and increase the longevity and robustness of your work.

EASIEST WAY TO FIND IF A SITE HATES BOTS

Check the robots.txt file: if it contains lines like these, it means the site doesn’t like bots. However, since most sites want to be on Google (arguably the largest scraper of websites globally ;-)), they do allow access to bots and spiders.

User-agent: *
Disallow: /

These lines only keep out well-behaved bots – the bots which respect robots.txt.

Another sign is the irritating presence of CAPTCHAs on pages other than the authentication page.

WHAT HAPPENS WHEN YOU GET BANNED

There are two ways to ban a web spider: either by banning all accesses from a particular IP, or by banning all accesses that use a specific ID to access the server. (Most browsers and web spiders identify themselves whenever they request a page via a user agent; the Chrome browser, for example, uses Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36.)

The banning can be temporary or permanent. Temporary blocks can last minutes or hours.

HOW DO WE KNOW A SITE HAS BLOCKED US?

If any of the following symptoms appear on the site that you are crawling, it is a sign of being blocked or banned.

•    Showing CAPTCHA pages
•    Unusual content delivery delay
•    Frequent responses with 404, 301 or 500 errors

Frequent appearance of the following status codes is also an indication of blocking (see the sketch after this list):

•    401 Unauthorized
•    403 Forbidden
•    404 Not Found
•    408 Request Timeout
•    429 Too Many Requests
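
As a rough sketch (assuming the Python requests library), a crawler can watch for the status codes listed above and back off when they start appearing; the 60-second pause is an arbitrary example value.

import time
import requests

BLOCK_CODES = {401, 403, 404, 408, 429}

def fetch(url, wait=60):
    resp = requests.get(url)
    if resp.status_code in BLOCK_CODES:
        # Treat it as a possible block or ban: pause before trying again
        print("Got %d for %s, sleeping %d s" % (resp.status_code, url, wait))
        time.sleep(wait)
        return None
    return resp.text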

WEB CRAWLING BEST PRACTICES

These are the best practices we can follow to overcome the detection.

1. MAKE CRAWLING SLOWER, DO NOT DDoS THE SERVER, TREAT THEM NICELY

Use auto-throttling mechanisms, which automatically throttle the crawling speed based on the load on both the spider and the website you are crawling, and adjust the spider to the optimum crawling speed. The faster you crawl, the worse it is for everyone.

Put some random sleeps in between requests, add some delays after a number of crawled pages, and choose the lowest number of concurrent requests possible. These techniques make the spider look more like a human being.
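
A minimal sketch of that slow-down advice in Python: a random pause between requests and a longer break after every batch of pages. The URLs and delay values are arbitrary examples, not recommendations for any particular site.

import time
import random
import requests

urls = ["http://example.com/page/%d" % i for i in range(1, 51)]  # placeholder URLs

for count, url in enumerate(urls, start=1):
    requests.get(url)
    time.sleep(random.uniform(2, 6))   # random sleep between requests
    if count % 20 == 0:
        time.sleep(60)                 # longer pause after a batch of crawled pages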

2. DISGUISE YOUR REQUESTS BY ROTATING IP/PROXY

A server can easily detect a bot by checking the requests coming from a single IP address, so we use different IPs for making requests to a server and the detection rate becomes lower. Make a pool of IPs that you can use, and use a random one for each request.

There are several methods that can be used to change the IP. Services like VPNs, shared proxies and Tor can help, and some third parties also provide services for IP rotation.
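
A sketch of rotating requests through a pool of proxies with the Python requests library; the proxy addresses below are placeholders for whatever proxy pool or service you actually have access to.

import random
import requests

proxy_pool = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def fetch_via_random_proxy(url):
    proxy = random.choice(proxy_pool)  # pick a different proxy for each request
    return requests.get(url, proxies={"http": proxy, "https": proxy})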

3. USER-AGENT SPOOFING

Since every request made from a client contains a user-agent header, using the same user agent multiple times leads to the detection of a bot. User-agent spoofing is the best solution for this: spoof the user agent by making a list of user agents and picking a random one for each request.

Websites do not want to block genuine users so you should try to look like one. Set your user-agent to a common web browser instead of using the library default (such as wget/version or urllib/version). You could even pretend to be the Google Bot: Googlebot/2.1; (http://www.google.com/bot.html)

You can check your user-agent string here:

http://www.whatsmyuseragent.com/

A good user-agent string list can be found here:

http://www.useragentstring.com/pages/useragentstring.php
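
A minimal sketch of that rotation with the Python requests library; the user-agent strings below are just examples, and a fuller list can be taken from the site linked above.

import random
import requests

user_agents = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0",
]

def fetch_with_random_agent(url):
    headers = {"User-Agent": random.choice(user_agents)}
    return requests.get(url, headers=headers)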

4. BE AWARE OF HONEYPOTS

Some site designers put honeypot traps inside websites to detect web spiders. These may be links that a normal user can’t see but a spider can.

When following links, always take care that the link has proper visibility and no nofollow tag. Some honeypot links set up to detect spiders will have the CSS style display:none or will be color-disguised to blend in with the page’s background color.
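
A rough sketch of that precaution: before following links, skip anything marked rel="nofollow" or hidden with an inline display:none style. Real honeypots can be disguised in many other ways (external stylesheets, background-matching colors), so this only covers the simple cases mentioned above.

import lxml.html

def visible_links(html):
    doc = lxml.html.fromstring(html)
    for a in doc.cssselect("a[href]"):
        style = (a.get("style") or "").replace(" ", "").lower()
        rel = (a.get("rel") or "").lower()
        if "display:none" in style or "nofollow" in rel:
            continue  # likely a honeypot, or a link the site asks bots not to follow
        yield a.get("href")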

5. DO NOT ALWAYS FOLLOW THE SAME CRAWLING PATTERN

Only robots follow the same crawling pattern; sites that have intelligent anti-crawling mechanisms can easily detect spiders by finding patterns in their actions. Humans won’t perform repetitive tasks over and over. Incorporate some random clicks on the page, mouse movements and random actions that will make a spider look like a human client.

6. ALWAYS RESPECT THE robots.txt

All web spiders are supposed to follow the rules that you place in a robots.txt file on a website, such as how frequently they are allowed to request pages and what directories they are allowed to crawl. They should also supply a consistent, valid User-Agent string that identifies the requests as bot requests.
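
Python's standard library includes a robots.txt parser, so honoring those rules takes only a few lines; the URLs and the crawler name below are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler/1.0", "http://example.com/some/page"):
    pass  # allowed: request the page, sending the same User-Agent string
else:
    print("Disallowed by robots.txt; skipping")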

Source: http://learn.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/

Wednesday, 20 May 2015

How Web Data Extraction Services Impact Startups

Starting a business has its fair share of ebbs and flows – it can be extremely challenging to get a new business off the blocks, and extremely rewarding when everything goes according to plan and yields desired results. For startups, it is important to get the nuances of running a business right from day one. To succeed in an immensely competitive space, startups need to perform above and beyond expectation right from the start, and one of the factors that can be of great help during the growing years of a startup is web data extraction.

Web data extraction through crawling and scraping, a highly efficient information gathering process, can be used in many creative ways to bring about major change in the performance graph of a startup. With effective web data extraction services acquired by outsourcing to a reputed company, and the business intelligence and numerous possibilities that come with them, web crawling and extraction can indeed become the difference maker for a startup, propelling it to the heights of success.

What drives the success of web data extraction?

When it comes to figuring out the perfect, balanced web data collection methodology for startups, there are a lot of crucial factors that come into play. Some of these are associated with the technical aspects of data collection, the approach used, the time invested, and the tools involved. Others have more to do with the processing and analysis of collected information and its judicious use in formulating strategies to take things forward.

Web Crawling Services & Web Scraping Services

With the advent of highly professional web data extraction service providers, massive amounts of structured, relevant data can be gathered and stored in real time and, in time, productively used to further the business interests of a startup. As a new business owner, it is important to have a high-level knowledge of the modern and highly functional web scraping tools available. This will help you utilize the prowess of competent data extraction services, which in turn can assist with both immediate and long-term revenue generation.

Web Data Extraction for Startups

From the very beginning, the dynamics of startups are different from those of older, well-established businesses. The time taken by the new business entity in proving its capabilities and market position needs to be used completely and effectively. Every day of growth and learning needs to add up to make a substantial difference. In this period, every plan and strategy, every execution effort, and every move needs to be properly thought out.

In such a trying situation where there is little margin for error, it pays to have accurate, reliable, relevant and actionable business intelligence. This can put you in firm control of things by allowing you to make informed business decisions and formulate targeted, relevant and growth oriented business strategies. With powerful web crawling, the volume of data gathered is varied, accurate and relevant. This data can then be studied minutely, analyzed in detail and arranged into meaningful clusters. With this weapon in your arsenal, you can take your startup a long way with smart decisions and clever implementations.

Web data extraction is a task best handled by professionals who have had rich experience in the field. Often, in-house web scraping teams are difficult to assemble and not economically viable to maintain, especially for startups. For a better solution, you can outsource your web scraping needs to a reliable web data extraction service for data collection. This way, you can get all the relevant intelligence you need without overstraining your workforce or having to employ additional personnel to handle web scraping. The company you outsource your work to can easily scrape data from multiple sources as per your requirements, and furnish you with actionable business intelligence that can help you take a lead in a competitive market.

Different Ways for Startups to use Web Data Extraction

Web scraping can be employed for many different purposes to yield different kinds of relevant data that generate actionable insights. For a startup, the important decision is how to use this powerful technique to provide valuable information that can make a difference for the future prospects of the company. Here are some interesting possibilities when it comes to impactful web data extraction for startups –

Fishing for Social Rankings and Backlinks

One of the most important business processes for a startup is competition analysis. This is one area where web data extraction can come across as an invaluable enabler. In the past, many startups have effectively used web scraping to fish for backlinks and social rankings related to competing companies.

Backlinks are important to reach a greater mass of better-targeted audiences, which can go on to increase customer base with minimal efforts. Social ranking is also an immensely important factor, as social actions on the internet are building blocks of opinion and reputation generation in this day and age. Keeping this in mind, you can use web data extraction to scrape for social rankings and backlinks related to content generated by your competing companies. After careful analysis, it is possible to arrive at concrete conclusions regarding what your competitors are doing well, and what sells the best.

This information is gold for marketers and sales personnel, and can be used to discern exactly what needs to be done to increase social buzz, generate favorable opinion, and win over customers from your competitors. You can also use this technique to develop high authority backlinks that help with SEO, targeted reach and organic traffic for your business website. For competition analysis, web scraping is a formidable tool.

Sourcing Contact Information

Another important aspect of business that startups can never ignore is good networking. Whether it is with customers, prospective customers, industry peers, partners, or competitors, excellent networking and open, transparent communication is essential for the success of your startup. For effective communication and networking, you need a large, solid list of contact information pertaining to your exact requirements.

Scraping data from multiple web sources gives you the perfect method of achieving this. With automated, fast web scraping, you can in a short time collect a wealth of important contact information that can be leveraged in many different ways. Whether it is the formation of lasting business relationships or making potential customers aware of what you have on offer, this information has the power to propel your startup to new levels of recognition.

For Ecommerce

If you sell your products and services online and want to stay on top of the competition when it comes to variety, pricing analysis, and special deals and offers, web scraping is the way to go. For many e-commerce startups, the problem of high CTR and low conversion is a stumbling block to higher bottom lines. To remedy problems like these and to ensure better sales, it is always a good idea to have a clear insight about your competition.

With web data extraction, you can always be aware of what competing companies are doing in terms of pricing strategies, product diversity and special customer offers. By considering that information while evaluating and cementing your own strategies, you can always ensure that you provide better value and a wider range of products and services than your competitors, and therefore stay ahead of the competition.

For Marketing, Brand Promotion and Advertisement

For startups, the first wave of promotion and marketing is the one that holds the key to your long-term business success. It is during this phase that the first and most important public perception of your company is formed, and the rudiments of public opinion start taking shape. For this reason, it is crucial to be on point with your marketing and promotion during the early, formative years of your business.

To achieve this, you need a clear, in-depth understanding of your target audience. You need to categorize your target audience on the basis of many factors like age, gender, demographics, income groups and tastes and preferences. Such detailed understanding can only be possible when you have a large wealth of social data pertaining to your target audience. There is no better way of achieving this than by web data extraction.

With the help of data extraction services, you can gather large chunks of relevant data regarding your target audience which can help you accurately evaluate the potential of each prospective customer as a possible addition to your business family. To ensure that you have a steady, early wave of customers to take your business off the blocks at a rapid pace, you need to devise marketing campaigns, promotional strategies and advertisements in accordance with the customer knowledge you derive through your web scraping efforts. This is a foolproof strategy to have marketing and promotional plans in place that achieve goals, bring in new business and provide your company with enough initial momentum to carry it through the later years of success.

To conclude, web data extraction can be a veritable tool in the hands of a startup. With the proper use and leveraging of this technique, your startup can gather the required business intelligence to shine in a competitive market and become a favorite with the customer base. Working with the right web data extraction company can be one of the most important business decisions you make as a startup owner.

Source: https://www.promptcloud.com/blog/web-data-extraction-services-for-startups/

Sunday, 17 May 2015

What is Blog Scraping Service?

Blog scraping is one of the best services in the SEO field for increasing the traffic of a site by commenting on blogs or writing reviews about them. Most blogs allow their readers to post reviews or write their own comments, suggestions, ideas or thoughts.

Nowadays on the internet we can find any number of blogs and sites related to various topics and products. The main concept of this service is to increase the traffic of a website by commenting on other blogs. This is a very simple method. But the main difficulty we face here is getting approval from the moderator of the site, which may take time, or the approval may never come.

Hence Web Scraping SEO plans to provide this blog scraping service without waiting for approval, as many moderators do not have the time to read and approve each and every comment written by visitors. We will find high-PR pages on blogs related to your website content, write our own comments about those blogs and provide the link to your website or anchor text. We don’t have a way to track whether a comment has been approved by the moderator or not. We will give you the links and the comments we have posted on the blogs as a report. This will increase backlinks and increase traffic.

What are the features of Blog scraping Service?

•    Comments or reviews are provided on blogs whose niche is related to your product.
•    Comments are written only on high-density or high-ranking blogs.
•    Fast and more accurate promotion compared to other services.
•    Blogs are read carefully and commented on accordingly.
•    The service is optimized and SEO friendly.

What are the benefits of Blog scraping Service?

•    Very little time needs to be spent on this service.
•    This service is the best method to increase your site traffic with minimal effort and cost.
•    Increases your website's rank in all search engines.
•    Reaches a larger audience for your site.
•    Increases your product sales.
•    Fast, better results.

What are the advantages of using this service in Web Scraping SEO?

•    Web Scraping SEO is one of the top SEO service providers in the SEO market.
•    Experts working on the blog commenting service always do analysis to find high-traffic blogs.
•    Web Scraping SEO can get approval from blog administrators easily.
•    Provides a high-quality service at a reasonable price.
•    Provides on-time delivery.
•    More flexible with clients.
•    Always meets client expectations and provides a quality service.

Frequently Asked Questions

Q: Will you obtain the blog site moderator's approval for each comment you post?

A: No. We are only responsible for creating comments for your website; we won’t wait for moderation approval, because approval is the moderator's responsibility and may take time, which is outside our scope. We will give you only the blog links and the comments as a report.

Q: Do you have any system or software to track the approval of comments?

A: We don’t have any system or software to track approval. We post comments on top blog sites according to the matching keywords; that is our only job, and approval is on the moderator's side.

Q: Why can’t you get approval for comments from the moderator?

A: I can answer this one clearly: nowadays everyone is busy, particularly the blog site moderators, and for that reason our comments get approved late. We are not going to wait for that because we have a lot of work to do, but I assure you that a final report in MS Excel format, listing the sites on which we have posted your comments, will reach you.

Q: How do you select the blogs for commenting?

A: We select top-ranking blog sites related to your keywords, and based on the benefits of your product we carefully write proper and attractive comments.

Source: http://www.Web Scrapingseo.com/blog-scraping-service.aspx

Thursday, 7 May 2015

Web Scraping: Startups, Services & Market

I recently got interested in startups using web scraping in one way or another, and since I find the topic very interesting I wanted to share some thoughts with you. [Note that I’m not an expert. To correct me / share your knowledge please use the comment section]

Web scraping is anything but a new technique. However, with more and more data shared on the internet (from user-generated content like social networks & review websites to public/government data and the growing number of online services), the amount of data collected and the possible use cases are increasing at an incredible pace.

We’ve entered the age of “Big Data” and web scraping is one of the sources to feed big data engines with fresh new data, be it for predictive analytics, competition monitoring or simply stealing data.

From what I could see the startups and services which are using “web scraping” at their core can be divided into three categories:

•    the shovel sellers (a.k.a we sell you the technology to do web scraping)

•    the shovel users (a.k.a we use web scraping to extract gold and sell it to our users)

•    the shovel police (a.k.a the security services which are here to protect website owners from these bots)

The shovel sellers

From a technology point of view, efficient web scraping is quite complicated. A number of open source projects (like Beautiful Soup) exist which enable anyone to get a web scraper up and running by himself. However, it’s a whole different story when it has to be the core of your business and you need not only to maintain your scrapers but also to scale them and extract the data you need smartly.

This is the reason why more and more services are selling “web scraping” as a service. Their job is to take care about the technical aspects so you can get the data you need without any technical knowledge. Here some examples of such services:

    Grepsr
    Krakio
    import.io
    promptcloud
    80legs
    Proxymesh (funny service: it provides a proxy rotator for web scraping. A shovel seller for shovel sellers, in a way)
    scrapingHub
    mozanda

The shovel users

It’s the layer above. Web scraping is the technical layer. What is interesting is to make sense of the data you collect. The number of business applications for web scraping is only increasing and some startups are really using it in a truly innovative way to provide a lot of value to their customers.

Basically these startups take care of collecting data then extract the value out of it to sell it to their customers. Here some examples:

Sales intelligence. The scrapers screen marketplaces, competitors, data from public markets, online directories (and more) to find leads. Datanyze, for example, tracks websites which add or drop JavaScript tags from your competitors so you can contact them as qualified leads.

Marketing. Web scraping can be used to monitor how your competitors are performing. From reviews they get on marketplaces to press coverage and financial published data you can learn a lot. Concerning marketing there is even a growth hacking class on udemy that teaches you how to leverage scraping for marketing purposes.

Price Intelligence. A very common use case is price monitoring. Whether it’s in the travel, e-commerce or real-estate industry monitoring your competitors’ prices and adjusting yours accordingly is often key. These services not only monitor prices but with their predictive algorithms they can give you advice on where the puck will be. Ex: WisePricer, Pricing Assistant.

Economic intelligence, finance intelligence, etc. With more and more economic, financial and political data available online, a new breed of services which collect and make sense of it is rising. Ex: connotate.

The shovel police

Web scraping lies in a gray area. Depending on the country or the terms of service of each website, automatically collecting data via robots can be illegal. Whatever the laws say it becomes crucial for some services to try to block these crawlers to protect themselves. The IT security industry has understood it and some startups are starting to tackle this problem. Here are 3 services which claim to provide solutions to stop bots from crawling your website:

•    Distil
•    ScrapeSentry
•    Fireblade

From a market point of view

A couple of points on the market to conclude:

•    It’s hard to assess how big the “web scraping economy” is since it is at the intersection of several big industries (billion dollars): IT security, sales, marketing & finance intelligence. This technique is of course a small component of these industries but is likely to grow in the years to come.

•    A whole underground economy also exists since a lot of web scraping is done through “botnets” (networks of infected computers)

•    It’s a safe bet to say that more and more SaaS (like Datanyze or Pricing Assistant) will find innovative applications for web scraping. And more and more startups will tackle web scraping from the security point of view.

•    Since these startups are often entering big markets through a niche product/approach (web scraping is not the solution to everything; it is more a feature), they are likely to be acquired by bigger players (in the security, marketing or sales tools industries). The technological barriers are there.

Source: http://clementvouillon.com/article/web-scraping-startups-services-market/

Thursday, 30 April 2015

Earn Money From Price Comparison Through Web Scraping

Many individuals have discovered a pot of gold within their reach: they have realized that there is money to be made on the web. Cyber technology has blessed mankind with so many benefits that making money is possible with just a few clicks of the mouse and keyboard. Building a price comparison website is an effective way of helping clients find their desired products while you, as the owner, earn money at the same time.

Building price comparison websites

There is indeed much money in building price comparison websites, but it is not an easy task, especially for a novice at maintaining a website of one’s own. Since this entails some serious programming and ample familiarity with data feeds, you have to have a good working plan. What you are venturing into is greater than the usual blog about anything you can think of; you are stepping into the vast field of electronic marketing, so you must be ready.

The first point of consideration is to identify which products or services you are going to include on your website. Choose a product or service that you and the majority of clients are most interested in. Suppose you choose sports as your theme; you can then include items and prices for sports gear, clothing such as uniforms, training videos, books, and other safety equipment. You need to do some research, and even a survey, to determine whether the goods and services you are promoting on your website are in demand and are what most people want to know about. It is at this stage that you may need the help of experts and veterans in the field of website building to be assured that you are on the right track.

In addition, be willing to change if your chosen category is not gaining readership or visitors. Then evaluate whether you need to expand or to be more specific in your description of the products and the comparison of their prices. Make your site prominent through search engine optimization (SEO), and keep in mind that not many people will visit a site that is not free.

Helping visitors choose the best product/services

A good marketing strategy starts with knowing who your target audience is. There is indeed a need to do a lot of planning and research in order to understand your clients’ needs and preferences, and knowing them thoroughly leads to consumer satisfaction. When you have provided everything they need to know about certain products, they will not need to look elsewhere, which will also gain you more regular visitors. Remember that your audience are members of communities and social networks, so there is a great possibility that they will spread the word about the good services you are offering.

If there is a need to conduct a survey in addition to research, do so. In this manner you can discover which goods and services have not yet been completely covered by other websites. Ample knowledge about your potential visitors and consumers will help you provide them with the statistics they actually need.

Your site will then look like a complete guidebook for them that will give them the best value for their money. Therefore, it must be thoroughly filled with product details, uses, options, and prices.

Making money as affiliate of eCommerce websites

Maintaining a price comparison website gives you less to worry about in terms of getting paid or having your products bought and sold, because income comes in through advertising and affiliate sales. Affiliate marketing is a way of earning money online by serving as a publisher that promotes the products, services or sites of businesses. The affiliate receives rewards from businesses for each visitor or client who comes to the business website or buys its product through the affiliate’s advertising and promotion. This is the online version of the agent or referral-fee sales channel. Aside from website owners, bloggers as well as members of community forums can also serve as affiliates. The affiliate earns money in three ways: pay per link, pay per sale, and pay per lead.

Trust in the reliability of the product - You should have a personal belief or confidence in the product you are promoting not only because it makes you sound more convincing, but also because you need to maintain your clients and establish credibility in your blog or website. In other words, don’t just pick any product. If you cannot use them personally, they should at least have several positive reviews and no negative ones.

Maintain credibility with readers and fellow bloggers - Befriend your readers and your co-bloggers by answering their queries sincerely and quickly. Your friendly attitude can win you their trust which is a very vital element of affiliate marketing.

Do reviews - In addition to publishing price comparisons, you can gain more visitors by writing about the products and doing proper SEO (search engine optimization). Then the expected happens: the more prominent the product becomes online, the higher your income will be.

Link with friends through social media - Your friends have friends, and their friends have friends too. Just think of how powerful your social media presence can be when you post your link on your Facebook, Twitter or MySpace account. Since trust is built on friendship, it is easy to win clients from among your friends and their friends.

Overall, you get all the pertinent information about certain products through web data mining or web scraping. All you need to do is stay attentive to the needs of your clients and use web content extraction efficiently.

Source: http://www.loginworks.com/earn-money-price-comparison-web-scraping/

Tuesday, 28 April 2015

Web Scraping – An Illegal Activity or Simple Data Collection?

Gone are the days when skillful extraction of real-estate information such as foreclosures, homes for sale, or mortgage records was considered difficult. Now it is not only easy to extract data from real estate websites but also to scrape real estate data on a consistent basis, whether to add more value to your portal or to ensure that updated data is available to your visitors at all times. From downloading actual scanned documents in the form of PDF files to scraping websites for deeds or mortgages, smartly designed data extraction tools can do it all.

However, the one question that still comes to the front of the minds of those who scrape real estate listings and other data is whether the act is illegal in nature or simply a way of collecting data.

Take a look.

Web Scraping—What is it?

Generally speaking, web scraping refers to programs that are designed to simulate human internet surfing and access websites on behalf of their users. These tools are effective in collecting large quantities of data that are otherwise difficult for end users to access. They process semi-structured or unstructured data pages of targeted websites and transform available data into a more structured format that can be extracted or manipulated by the user easily.

While web scraping is quite similar to the web indexing used by search engines, its end motivation is much different. Web indexing makes search engines more efficient, whereas web scraping is used for reasons like market research, change detection, data monitoring, or, in some cases, theft. But it is not always a bad thing; you just need to know whether a website allows web scraping before proceeding with the act.

Fine Line between Stealing and Collecting Information

Web scraping rides an extremely fine line between collecting relevant information and stealing it. Websites that have copyright disclosure statements in place to protect their information are offended by outsiders raiding their data without due permission. In other words, it amounts to trespassing on their portal, which is unacceptable both ethically and legally. So it is very important for you to read all disclosure statements carefully and follow them. As web scraping cases may turn into felony offenses, it is best to guard against any kind of unscrupulous activity and obtain permission before scraping data.

The Good News

However, all is not grey in data extraction. Reputed agencies are helping their clients scrape valuable data through legal means and carefully used tools. If you are looking for such services, get in touch with a reliable web scraping company of your choice and take your business to the next level of success.

Source: https://3idatascraping.wordpress.com/2015/03/11/web-scraping-an-illegal-activity-or-simple-data-collection/

Sunday, 26 April 2015

I Don’t Need No Stinking API: Web Scraping For Fun and Profit

If you’ve ever needed to pull data from a third party website, chances are you started by checking to see if they had an official API. But did you know that there’s a source of structured data that virtually every website on the internet supports automatically, by default?

That’s right, we’re talking about pulling our data straight out of HTML — otherwise known as web scraping. Here’s why web scraping is awesome:

Any content that can be viewed on a webpage can be scraped. Period.

If a website provides a way for a visitor’s browser to download content and render that content in a structured way, then almost by definition, that content can be accessed programmatically. In this article, I’ll show you how.

Over the past few years, I’ve scraped dozens of websites — from music blogs and fashion retailers to the USPTO and undocumented JSON endpoints I found by inspecting network traffic in my browser.

There are some tricks that site owners will use to thwart this type of access — which we’ll dive into later — but they almost all have simple work-arounds.

Why You Should Scrape

But first, let’s go over some great reasons why you should consider web scraping before you start looking for APIs or RSS feeds or other, more traditional forms of structured data.

Websites are More Important Than APIs

The biggest one is that site owners generally care way more about maintaining their public-facing visitor website than they do about their structured data feeds.

We’ve seen it very publicly with Twitter clamping down on their developer ecosystem, and I’ve seen it multiple times in my projects where APIs change or feeds move without warning.

Sometimes it’s deliberate, but most of the time these sorts of problems happen because no one at the organization really cares or maintains the structured data. If it goes offline or gets horribly mangled, no one really notices.

Whereas if the website goes down or is having issues, that’s more of an in-your-face, drop-everything-until-this-is-fixed kind of problem, and it gets dealt with quickly.

No Rate-Limiting

Another thing to think about is that the concept of rate-limiting is virtually non-existent for public websites.

Aside from the occasional captchas on sign-up pages, most businesses generally don’t build a lot of defenses against automated access. I’ve scraped a single site for over 4 hours at a time and not seen any issues.

Unless you’re making concurrent requests, you probably won’t be viewed as a DDOS attack, you’ll just show up as a super-avid visitor in the logs, in case anyone’s looking.

Anonymous Access

There are also fewer ways for the website’s administrators to track your behavior, which can be useful if you want to gather data more privately.

With APIs, you often have to register to get a key and then send along that key with every request. But with simple HTTP requests, you’re basically anonymous besides your IP address and cookies, which can be easily spoofed.

The Data’s Already in Your Face

Web scraping is also universally available, as I mentioned earlier. You don’t have to wait for a site to open up an API or even contact anyone at the organization. Just spend some time browsing the site until you find the data you need and figure out some basic access patterns — which we’ll talk about next.

Let’s Get to Scraping

So you’ve decided you want to dive in and start grabbing data like a true hacker. Awesome.

Just like reading API docs, it takes a bit of work up front to figure out how the data is structured and how you can access it. Unlike APIs however, there’s really no documentation so you have to be a little clever about it.

I’ll share some of the tips I’ve learned along the way.

Fetching the Data

So the first thing you’re going to need to do is fetch the data. You’ll need to start by finding your “endpoints” — the URL or URLs that return the data you need.

If you know you need your information organized in a certain way — or only need a specific subset of it — you can browse through the site using their navigation. Pay attention to the URLs and how they change as you click between sections and drill down into sub-sections.

The other option for getting started is to go straight to the site’s search functionality. Try typing in a few different terms and, again, pay attention to the URL and how it changes depending on what you search for. You’ll probably see a GET parameter like q= that always changes based on your search term.

Try removing other unnecessary GET parameters from the URL, until you’re left with only the ones you need to load your data. Make sure that there’s always a beginning ? to start the query string and a & between each key/value pair.
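As a rough illustration of this trimming process, here is a minimal Python sketch using the Requests library. The domain and parameter names (example.com, q, page) are hypothetical placeholders; substitute whatever your browser actually produces.

import requests

BASE_URL = "https://example.com/search"  # hypothetical endpoint

# Start from the full URL your browser produced, then drop parameters one by
# one until the response stops containing the data you need.
params = {
    "q": "lego",   # the search term you typed
    "page": 1,     # keep only the parameters that actually matter
}

response = requests.get(BASE_URL, params=params, timeout=10)
print(response.url)          # the reconstructed ?q=lego&page=1 query string
print(response.status_code)  # 200 if the trimmed URL still works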

Dealing with Pagination

At this point, you should be starting to see the data you want access to, but there’s usually some sort of pagination issue keeping you from seeing all of it at once. Most regular APIs do this as well, to keep single requests from slamming the database.

Usually, clicking to page 2 adds some sort of offset= parameter to the URL, which is usually either the page number or else the number of items displayed on the page. Try changing this to some really high number and see what response you get when you “fall off the end” of the data.

With this information, you can now iterate over every page of results, incrementing the offset parameter as necessary, until you hit that “end of data” condition.
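Here is one way that iteration might look in Python, using Requests and BeautifulSoup. Everything site-specific is an assumption: the endpoint, the offset parameter name, the div.result selector, and the "empty page means end of data" condition all need to be adapted to what you actually observe.

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/search"  # hypothetical endpoint
offset = 0
items = []

while True:
    resp = requests.get(BASE_URL, params={"q": "lego", "offset": offset}, timeout=10)
    if resp.status_code != 200:
        break                            # some sites simply error out past the last page
    page_items = BeautifulSoup(resp.text, "html.parser").select("div.result")
    if not page_items:
        break                            # an empty page is another common end-of-data signal
    items.extend(page_items)
    offset += len(page_items)            # or increment by a fixed page size, depending on the site

print(len(items), "items collected")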

The other thing you can try is changing the “Display X Per Page” setting, which most pagination UIs now have. Again, look for a new GET parameter appended to the URL that indicates how many items are on the page.

Try setting this to some arbitrarily large number to see if the server will return all the information you need in a single request. Sometimes there’ll be some limits enforced server-side that you can’t get around by tampering with this, but it’s still worth a shot since it can cut down on the number of pages you must paginate through to get all the data you need.

AJAX Isn’t That Bad!

Sometimes people see web pages with URL fragments # and AJAX content loading and think a site can’t be scraped. On the contrary! If a site is using AJAX to load the data, that probably makes it even easier to pull the information you need.

The AJAX response is probably coming back in some nicely-structured way (probably JSON!) in order to be rendered on the page with JavaScript.

All you have to do is pull up the network tab in Web Inspector or Firebug and look through the XHR requests for the ones that seem to be pulling in your data.

Once you find it, you can leave the crufty HTML behind and focus instead on this endpoint, which is essentially an undocumented API.
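A small sketch of calling such an XHR endpoint directly with Requests. The URL, the optional X-Requested-With header, and the shape of the JSON payload are all assumptions made for illustration; copy the real request you see in the network tab.

import requests

XHR_URL = "https://example.com/api/items?category=lego"  # hypothetical URL copied from the network tab

response = requests.get(
    XHR_URL,
    headers={"X-Requested-With": "XMLHttpRequest"},  # some backends check for this; harmless otherwise
    timeout=10,
)
data = response.json()  # AJAX responses are often already structured JSON

for item in data.get("items", []):   # adjust the keys to the real payload
    print(item.get("name"), item.get("price"))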

(Un)structured Data?

Now that you’ve figured out how to get the data you need from the server, the somewhat tricky part is getting the data you need out of the page’s markup.

Use CSS Hooks

In my experience, this is usually straightforward since most web designers litter the markup with tons of classes and ids to provide hooks for their CSS.

You can piggyback on these to jump to the parts of the markup that contain the data you need.

Just right click on a section of information you need and pull up the Web Inspector or Firebug to look at it. Zoom up and down through the DOM tree until you find the outermost <div> around the item you want.

This <div> should be the outer wrapper around a single item you want access to. It probably has some class attribute which you can use to easily pull out all of the other wrapper elements on the page. You can then iterate over these just as you would iterate over the items returned by an API response.
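A short sketch of that idea with BeautifulSoup in Python. The HTML snippet and the product-card / name / price class names are invented purely to show the pattern of selecting wrappers by class and iterating over them.

from bs4 import BeautifulSoup

html = """
<div class="product-card"><span class="name">Widget A</span><span class="price">$10</span></div>
<div class="product-card"><span class="name">Widget B</span><span class="price">$12</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each .product-card is the outer wrapper around one item; iterate over them
# the same way you would iterate over the items in an API response.
for card in soup.select("div.product-card"):
    name = card.select_one("span.name").get_text(strip=True)
    price = card.select_one("span.price").get_text(strip=True)
    print(name, price)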

A note here though: the DOM tree presented by the inspector isn’t always the same as the DOM tree represented by the HTML sent back by the website. It’s possible that the DOM you see in the inspector has been modified by JavaScript — or sometimes even by the browser, if it’s in quirks mode.

Once you find the right node in the DOM tree, you should always view the source of the page (“right click” > “View Source”) to make sure the elements you need are actually showing up in the raw HTML.

This issue has caused me a number of head-scratchers.

Get a Good HTML Parsing Library

It is probably a horrible idea to try parsing the HTML of the page as a long string (although there are times I’ve needed to fall back on that). Spend some time doing research for a good HTML parsing library in your language of choice.

Most of the code I write is in Python, and I love BeautifulSoup for its error handling and super-simple API. I also love its motto:

    You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. :)

You’re going to have a bad time if you try to use an XML parser since most websites out there don’t actually validate as properly formed XML (sorry XHTML!) and will give you a ton of errors.

A good library will read in the HTML that you pull in using some HTTP library (hat tip to the Requests library if you’re writing Python) and turn it into an object that you can traverse and iterate over to your heart’s content, similar to a JSON object.
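In Python that fetch-then-parse pipeline might look like the minimal sketch below, with Requests pulling down the raw HTML and BeautifulSoup turning it into a traversable tree. The URL is a placeholder for whatever "endpoint" you identified earlier.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/some-page", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Once parsed, the document can be walked like a nested object:
print(soup.title.get_text() if soup.title else "no <title> found")
for link in soup.find_all("a", href=True)[:10]:
    print(link["href"])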

Some Traps To Know About

I should mention that some websites explicitly prohibit the use of automated scraping, so it’s a good idea to read your target site’s Terms of Use to see if you’re going to make anyone upset by scraping.

For two-thirds of the websites I’ve scraped, the above steps are all you need. Just fire off a request to your “endpoint” and parse the returned data.

But sometimes, you’ll find that the response you get when scraping isn’t what you saw when you visited the site yourself.

When In Doubt, Spoof Headers

Some websites require that your User-Agent string be set to something they allow, or that you send certain cookies or other headers in order to get a proper response.

Depending on the HTTP library you’re using to make requests, this is usually pretty straightforward. I just browse the site in my web browser and then grab all of the headers that my browser is automatically sending. Then I put those in a dictionary and send them along with my request.

Note that this might mean grabbing some login or other session cookie, which might identify you and make your scraping less anonymous. It’s up to you how serious of a risk that is.
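A hedged sketch of replaying browser-like headers with Requests. The header values below are examples only; if a site insists on particular headers or cookies, copy the real ones your own browser sends, as described above.

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # a browser-like UA string
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    # "Cookie": "session=...",  # only if the site requires it; note this can identify you
}

response = requests.get("https://example.com/some-page", headers=headers, timeout=10)
print(response.status_code)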

Content Behind A Login

Sometimes you might need to create an account and log in to access the information you need. If you have a good HTTP library that handles logins and automatically sends session cookies (did I mention how awesome Requests is?), then you just need to have your scraper log in before it gets to work.

Note that this obviously makes you totally non-anonymous to the third party website so all of your scraping behavior is probably pretty easy to trace back to you if anyone on their side cared to look.
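One way to do that with requests.Session, which stores and resends session cookies automatically. The login URL and form field names here are hypothetical; inspect the real login form to find yours, and keep the anonymity caveat above in mind.

import requests

session = requests.Session()

# Submit the login form once; the session keeps any cookies the server sets.
session.post(
    "https://example.com/login",                     # hypothetical login endpoint
    data={"username": "me", "password": "secret"},   # field names vary per site
    timeout=10,
)

# Subsequent requests ride on the same authenticated session.
response = session.get("https://example.com/members-only/data", timeout=10)
print(response.status_code)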

Rate Limiting

I’ve never actually run into this issue myself, although I did have to plan for it one time. I was using a web service that had a strict rate limit that I knew I’d exceed fairly quickly.

Since the third party service conducted rate-limiting based on IP address (stated in their docs), my solution was to put the code that hit their service into some client-side Javascript, and then send the results back to my server from each of the clients.

This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit.

Depending on your application, this could work for you.

Poorly Formed Markup

Sadly, this is the one condition that there really is no cure for. If the markup doesn’t come close to validating, then the site is not only keeping you out, but also serving a degraded browsing experience to all of their visitors.

It’s worth digging into your HTML parsing library to see if there’s any setting for error tolerance. Sometimes this can help.

If not, you can always try falling back on treating the entire HTML document as a long string and do all of your parsing as string splitting or — God forbid — a giant regex.
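Before resorting to string hacks, it is worth trying a more forgiving parser. The sketch below assumes the optional html5lib package is installed alongside BeautifulSoup; html5lib parses the way a browser would, repairing unclosed tags as it goes.

from bs4 import BeautifulSoup

broken_html = "<div class='item'><span>Unclosed tags <b>everywhere"

# html5lib builds a browser-style tree even from badly formed markup.
soup = BeautifulSoup(broken_html, "html5lib")
print(soup.find("span").get_text())  # -> "Unclosed tags everywhere"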

Source: https://blog.hartleybrody.com/web-scraping/

Wednesday, 22 April 2015

Hand Scraped Flooring For a Natural and Unique Look

An option in hardwood flooring that is being increasingly adopted by those looking for something new, innovative and unique for their homes is hand scraped flooring. This type of wood flooring helps one achieve a distinct natural look on one's floor and also has a couple of advantages.

There are three types of scraping that you can get done on your wooden flooring: light, medium and hard. Preferably, if you have light-colored woodwork you should go for light scraping, and if your floor has a darker shade you should opt for hard scraping. But irrespective of the type of scraping you choose, you must ensure that the laborers doing the scraping are highly skilled and impeccable in their work, as hand scraping floors is an art that demands patience, time, talent and hard work.

Nowadays, many people tend to go for machine scraping, attracted by the lower investment involved in it. But such people are unable to achieve the requisite natural effect on their floors as machines create patterns on the floors that are easily detectable. These patterns do not emerge with hand scraping and the consequent look is as random and unique as it gets.

Though such scraped flooring is a costly flooring option, it demands little maintenance. With perfectly smooth surfaces you will always be on edge, making sure there are no scratches; with hand scraped floors you will not have to be concerned about this, as any new scratches will only add to the already distressed appearance of the flooring.

Prefinished hand scraped wood flooring is also available in the market nowadays. This eliminates the need for any on-site scraping, though the option is of course unsuitable for those who have already had their floors installed. If you get on-site scraping done, you will have more control over things, as you will be able to see the scraping as it develops and will therefore be in a better position to exercise your preferences.

Source: http://ezinearticles.com/?Hand-Scraped-Flooring-For-a-Natural-and-Unique-Look&id=4581623

Saturday, 18 April 2015

Data Mining and Predictive Analysis

Data collection and curation is the core foundation of most businesses. Database building is thus an important function and activity in which enterprises invest heavily. With information now available on the Internet and easily obtained, this raises the importance of having professionals who crawl data and offer web scraping services.

Once the data is accessed, though, it is important to filter out the relevant data based on the business need. Although many DaaS providers convert unstructured web data into meaningful structured data, it is recommended that you be internally equipped to use the data to its maximum.

This understanding has given rise to the field of Data Mining. Data Mining is designed to explore large amounts of data in search of consistent patterns and connections between the variables and validate the findings by applying the detected patterns to the new sets of the data. Once these connections are established and understood, the end goal is to be able to predict the possible outcomes using predictive analysis techniques.

Together, both Data Mining and predictive analysis aid in making marketing campaigns more efficient. While predictive analysis helps simulate and understand what may happen, data mining helps identify exciting data patterns and connections.

The process of Data Mining and predictive analysis consists of three steps:

Exploration


Once a database is compiled, it needs to be cleaned and analysed, and potential connections need to be built. This process involves filtering the relevant data and identifying the possible predictors. Data exploration also sets a premise for preliminary feature selection to manage the number of variables. The data is then prepared for statistical analysis using a wide variety of graphical and statistical parameters. This helps identify the most relevant variables and sets up the predictive models to be built.

Data mining process

Validation

Next comes building various models and choosing the most relevant ones. This decision is based on their likely predictive performance and their ability to produce stable results across all the samples. Simple as it sounds, to truly get results, all candidate models must be run against the data to simulate scenarios. The model with the most stable statistical features is validated.

Application

Once the relevant models are finalised, they are applied to new data to understand and predict the estimated outcomes. Applying data models is an ongoing and complex process, since every new dataset needs to be configured in the model.

Data Mining and predictive analysis essentially involve blending traditional statistical methodology with machine learning and complex algorithms. This greatly increases the need for efficient and skilled data handlers, which could include data analysts and data scientists.


Data crunchers actively use data mining and predictive analysis to get an edge in big data management. Database platforms like Hadoop assist in database management and large-scale distribution, but the costs involved in setting up data centres and big data management capacity are high. Budgets allocated within the enterprise are more project-focussed, and analytics budgets are usually limited; quite often, big data and analytics projects fail to launch because of this problem. The other problem is that running effective predictive models requires the data to be handled by experienced scientists. Finding and putting together a technologically advanced team is a daunting task for most enterprises outside the tech domain.

Predictive Analysis model

A predictive analysis model essentially predicts all possible outcomes from a given set of data. Here are a few steps that can be taken to help build and identify the “ideal” predictive analysis model. These steps more or less mirror the usual statistical methodology of building a test model.

Defining an objective

This is the first and most critical step. Unless the objective is identified and defined, there can be no concrete results, since there would be no clear baseline against which to compare the final outcome. It also helps define the scope of the project.

Preparing the data

This has more to do with data mining. The historic data used for training the model is scattered across multiple platforms and sources. To compound the problem, the data can be unstructured, with possible duplicate accounts and missing values. Data quality determines the quality of the model, so it is imperative that the data is healthy and relevant.

Data Sampling

Once mined, the data is essentially split into two parts. One set is for training and is used to build the model; the second is the ‘test’ set, used to verify the accuracy of the final output. This also helps identify and filter out the noise component.

Model Building


Sampling can result in a single algorithm or in parallel, connected algorithms. In the latter case the data goes through multiple rounds of testing, and a decision is based on the final output.
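As a toy illustration of the sampling, model building and validation loop described above, here is a minimal Python sketch. It assumes scikit-learn is available (the article does not name any particular tool) and uses a bundled toy dataset; a real project would prepare its own data and compare several candidate models rather than one.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Split into a training set (to build the model) and a held-out test set
# (to verify the accuracy of the final output).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))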

Execution

Once a model gets finalised, the other teams in the organization need to be involved to build a deployable model and understand its impact on the overall business.

The possibilities with data mining and predictive analysis are huge, and they leave plenty of room for learning and experimenting. There are several tools available in the industry to aid in all the steps of data mining and predictive analysis. The combination of human expertise and the available tools, together with cooperation across the multiple channels within the organization, essentially ensures a stronger grip on the ability to build a solid predictive model.

When used together, predictive analytics and data mining help marketing professionals anticipate and get ready for customer needs, rather than just reacting to them.

Source: https://www.promptcloud.com/blog/data-mining-and-predictive-analysis/

Tuesday, 7 April 2015

rvest: easy web scraping with R

rvest is a new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with:

install.packages("rvest")

rvest in action

To see rvest in action, imagine we’d like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the file with html():

library(rvest)

lego_movie <- html("http://www.imdb.com/title/tt1490017/")

To extract the rating, we start with selectorgadget to figure out which css selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget") – it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():

lego_movie %>%

  html_node("strong span") %>%

  html_text() %>%

  as.numeric()

#> [1] 7.9

We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:

lego_movie %>%

  html_nodes("#titleCast .itemprop span") %>%

  html_text()

#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"   

#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"

#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"

#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"   

#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

The titles and authors of recent message board postings are stored in the third table on the page. We can use html_node() and [[ to find it, then coerce it to a data frame with html_table():

lego_movie %>%

  html_nodes("table") %>%

  .[[3]] %>%

  html_table()

#>                                              X 1            NA

#> 1 this movie is very very deep and philosophical   mrdoctor524

#> 2 This got an 8.0 and Wizard of Oz got an 8.1...  marr-justinm

#> 3                         Discouraging Building?       Laestig

#> 4                              LEGO - the plural      neil-476

#> 5                                 Academy Awards   browncoatjw

#> 6                    what was the funniest part? actionjacksin

Other important functions

•    If you prefer, you can use xpath selectors instead of css: html_nodes(doc, xpath = "//table//td").

•    Extract the tag names with html_tag(), text with html_text(), a single attribute with html_attr() or all attributes with html_attrs().

•    Detect and repair text encoding problems with guess_encoding() and repair_encoding().

•    Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), and forward(). Extract, modify and submit forms with html_form(), set_values() and submit_form(). (This is still a work in progress, so I’d love your feedback.)

To see these functions in action, check out package demos with demo(package = "rvest").

Source: http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/