Wednesday, 27 August 2014

How do you scrape AJAX pages?

Overview:

All screen scraping starts with a manual review of the page you want to extract data from. When dealing with AJAX you usually need to analyze a bit more than just the HTML.

With AJAX, the value you want is not in the initial HTML document you requested; instead, JavaScript is executed which asks the server for the extra information you want.

You can therefore usually analyze the JavaScript, see which request it makes, and simply call that URL yourself from the start.

Example:

Take this as an example, assume the page you want to scrape from has the following script:

<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
  {
  // Firefox, Opera 8.0+, Safari
  xmlHttp=new XMLHttpRequest();
  }
catch (e)
  {
  // Internet Explorer
  try
    {
    xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
    }
  catch (e)
    {
    try
      {
      xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
      }
    catch (e)
      {
      alert("Your browser does not support AJAX!");
      return false;
      }
    }
  }
  xmlHttp.onreadystatechange=function()
    {
    if(xmlHttp.readyState==4)
      {
      document.myForm.time.value=xmlHttp.responseText;
      }
    }
  xmlHttp.open("GET","time.asp",true);
  xmlHttp.send(null);
  }
</script>

Then all you need to do is make an HTTP request for time.asp on the same server directly. (Example from w3schools.)
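
To make that concrete, here is a minimal Python 3 sketch using only the standard library; the host name is a placeholder, and time.asp is the endpoint taken from the script above:

import urllib.request

# The AJAX endpoint observed in the page's JavaScript (host is hypothetical).
url = "http://www.example.com/time.asp"

with urllib.request.urlopen(url) as response:
    # The endpoint returns the raw value the page would have injected
    # into the form field, so no HTML parsing is needed.
    print(response.read().decode("utf-8"))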


Source: http://stackoverflow.com/questions/260540/how-do-you-scrape-ajax-pages

Using Perl to scrape a website


I am interested in writing a perl script that goes to the following link and extracts the number 1975:

https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219

That page shows the number of white men born in the year 1923 who lived in San Diego County, California, in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.

In the file locations.txt I put the list of counties, such as San Diego County.

The current code runs, but instead of the number 1975 it displays "unknown". The number 1975 should end up in $val.

I would very much appreciate any help!

#!/usr/bin/perl

use strict;
use warnings;

use LWP::Simple;

open(L, "locations26.txt") or die "Cannot open locations26.txt: $!";

# %LOCATION% and %YEAR% are placeholders that get substituted below.
# NOTE: everything after the '#' in this URL is a fragment. Fragments are
# processed by the browser's JavaScript and are never sent to the server,
# so the page that get() fetches will not contain the AJAX-loaded count.
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';

open(O, ">out26.txt") or die "Cannot open out26.txt: $!";
my $oldh = select(O);
$| = 1;    # unbuffer the output file handle
select($oldh);

while (my $location = <L>) {
    chomp($location);
    $location =~ s/ /+/g;
    foreach my $year (1923..1923) {
        my $u = $url;
        $u =~ s/%LOCATION%/$location/g;
        # /g is needed here: %YEAR% appears twice (%YEAR%-%YEAR%).
        $u =~ s/%YEAR%/$year/g;
        #print "$u\n";
        my $content = get($u);
        my $val = 'unknown';
        if (defined $content and $content =~ / of .strong.([0-9,]+)..strong. /) {
            $val = $1;
        }
        $val =~ s/,//g;
        $location =~ s/\+/ /g;
        print "'$location',$year,$val\n";
        print O "'$location',$year,$val\n";
    }
}

Update: the API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage, hence any solution pertaining to JSON will not be applicable.



Source: http://stackoverflow.com/questions/14654288/using-perl-to-scrape-a-website

Tuesday, 26 August 2014

Data Scraping using php


Here is my code:

    <?php
    $ip = $_SERVER['REMOTE_ADDR'];

    // Note: despite the name, $url holds the page's HTML, not a URL.
    $url = file_get_contents("http://whatismyipaddress.com/ip/$ip");

    // With PREG_SET_ORDER, $output[$n] is the n-th <th>/<td> pair;
    // [1] is the header text and [2] is the cell value.
    preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s', $url, $output, PREG_SET_ORDER);

    $isp     = $output[1][2];
    $city    = $output[9][2];
    $state   = $output[8][2];
    $zipcode = $output[12][2];
    $country = $output[7][2];
    ?>
    <body>
    <table align="center">
    <tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
    <tr><td>City :</td><td><?php echo $city;?></td></tr>
    <tr><td>State :</td><td><?php echo $state;?></td></tr>
    <tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
    <tr><td>Country :</td><td><?php echo $country;?></td></tr>
    </table>
    </body>

How do I find out the ISP of a person viewing a PHP page?

Is it possible to use PHP to track or reveal it?

Error: http://i.imgur.com/LGWI8.png

cURL Scraping

<?php
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, true);
$url = 'http://www.whatismyipaddress.com/ip/132.123.23.23';
curl_setopt($curl_handle, CURLOPT_URL, $url);
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15"));
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Your application name');
$query = curl_exec($curl_handle);
curl_close($curl_handle);

// Bug fix: match against the fetched HTML ($query), not the URL string ($url).
preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s', $query, $output, PREG_SET_ORDER);
// echo $query; // debugging: dump the fetched page

$isp     = $output[1][2];
$city    = $output[9][2];
$state   = $output[8][2];
$zipcode = $output[12][2];
$country = $output[7][2];
?>
<body>
<table align="center">
<tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
<tr><td>City :</td><td><?php echo $city;?></td></tr>
<tr><td>State :</td><td><?php echo $state;?></td></tr>
<tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
<tr><td>Country :</td><td><?php echo $country;?></td></tr>
</table>
</body>

Error: http://i.imgur.com/FJIq6.png

What is wrong with my code here? Is there any alternative code that I can use?

I am not able to scrape that data as described here. http://i.imgur.com/FJIq6.png

P.S. Please post full code. It would be easier for me to understand.
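
For comparison, here is a rough Python 3 sketch of the same technique: fetch the page, then pull the <th>/<td> pairs with a regex into a header-to-value map. The URL comes from the question; the header labels in the final loop are guesses at what the page uses and may need adjusting:

import re
import urllib.request

url = "http://whatismyipaddress.com/ip/132.123.23.23"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")

# Build a {header: value} map instead of relying on fixed indexes,
# so a reordered table does not silently break the scrape.
pairs = re.findall(r"<th>(.*?)</th><td>(.*?)</td>", html, re.S)
info = {header.strip(): value.strip() for header, value in pairs}

for label in ("ISP:", "City:", "State/Region:", "Country:"):
    print(label, info.get(label, "unknown"))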



Source: http://stackoverflow.com/questions/10461088/data-scraping-using-php

PDF scraping using R


I have been using the XML package successfully for extracting HTML tables, but I want to extend to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered if there had been any recent developments.

Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs, so that I could finish the job off with the R XML package?

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext can be installed on any Linux system.
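
To bridge to Python, here is a small sketch (the file names are hypothetical) that shells out to pdftotext and reads the result, which you could then hand off to the R XML workflow:

import subprocess

# -layout tries to preserve the physical layout of the page, which often
# keeps table columns readable; input.pdf/output.txt are placeholders.
subprocess.run(["pdftotext", "-layout", "input.pdf", "output.txt"], check=True)

with open("output.txt", encoding="utf-8") as f:
    text = f.read()
print(text[:500])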



Source: http://stackoverflow.com/questions/7918718/pdf-scraping-using-r

Monday, 25 August 2014

Php Scraping data from a website

I am very new to programming and need a little help with getting data from a website and passing it into my PHP script.

The website is http://www.birthdatabase.com/.

I would like to plug in a name (First and Last) and retrieve the result. I know you can query the site by passing the name in the URL, but I am having problems scraping the results.

http://www.birthdatabase.com/cgi-bin/query.pl?textfield=FIRST&textfield2=LAST&age=&affid=

I am using the file_get_contents($URL) function to get the page but need help after that. Specifically, I would like to scrape only the results from a certain state if there are multiple results for that name.



You need the awesome simple_html_dom class.

With this class you can query the webpage's DOM in a similar way to jQuery.

First include the class in your page, then get the page content with this snippet:

$html = file_get_html('http://www.birthdatabase.com/cgi-bin/query.pl?textfield=' . $first . '&textfield2=' . $last . '&age=&affid=');

Then you can use CSS selectors to scrape your data (something like this):

$n = 0;
foreach ($html->find('table tbody tr td div font b table tbody') as $element) {
    // find() returns an array of nodes; find('tr', 0) returns the first
    // match, and ->plaintext gives its text content.
    $row[$n]['tr'] = $element->find('tr', 0)->plaintext;
    $n++;
}

// output your data
print_r($row);



Source: http://stackoverflow.com/questions/15601584/php-scraping-data-from-a-website

Obtaining reddit data


I am interested in obtaining data from different reddit subreddits. Does anyone know if there is a reddit API, similar to what Twitter provides, to crawl all the pages?


Yes, reddit has an API that can be used for a variety of purposes such as data collection, automatic commenting bots, or even to assist in subreddit moderation.

There are a few places to discover information on reddit's API:

    github reddit wiki -- provides the overview and rules for using reddit's API (follow the rules)
    automatically generated API docs -- provides information on the requests needed to access most of the API endpoints
    /r/redditdev -- the reddit community dedicated to answering questions both about reddit's source code and about reddit's API

If there is a particular programming language you are already familiar with, you should check out the existing set of API wrappers for various languages. Despite my bias (I am the package maintainer), I am quite certain that PRAW, for Python, supports the largest number of reddit API features.
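
For instance, a minimal PRAW sketch (assuming a current PRAW version and that you have registered a script app with reddit to obtain client credentials; all values are placeholders):

import praw

# Credentials come from the app you register in reddit's preferences;
# these values are placeholders.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="data-collection script by u/your_username",
)

# Fetch the 10 hottest submissions from a subreddit.
for submission in reddit.subreddit("learnpython").hot(limit=10):
    print(submission.score, submission.title)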



Source: http://stackoverflow.com/questions/14322834/obtaining-reddit-data

Tuesday, 19 August 2014

What is the right way of storing screen-scraping data?


I'm working on a web site that scrapes product details (names, features, prices, etc.) from various web sites, processes them, and displays them. I'm considering running an update script each day to keep the data fresh.

    scrape data
    process them
    store in database
    read (from db) and display them

I'm already storing all the data in a SQL schema, but I'm not sure about it: after each update, all the old records vanish, so if the newly scraped data comes back corrupted somehow, there is nothing to show.

So, is there any common way to archive the old data? Which is more convenient: separate SQL schemas or XML files? Or something else?
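
One common pattern (a sketch of a general approach, not the accepted answer to this question) is to keep each scrape run as a separate snapshot instead of overwriting, so a corrupted run can simply be ignored. A minimal sqlite3 illustration with hypothetical table and column names:

import sqlite3
import time

conn = sqlite3.connect("products.db")
conn.execute("""CREATE TABLE IF NOT EXISTS products (
    run_id     INTEGER,   -- identifies one scrape run
    name       TEXT,
    price      REAL,
    scraped_at INTEGER    -- unix timestamp of the run
)""")

def store_run(rows):
    """Insert a full scrape as a new snapshot; old runs stay intact."""
    run_id = int(time.time())
    conn.executemany(
        "INSERT INTO products VALUES (?, ?, ?, ?)",
        [(run_id, name, price, run_id) for name, price in rows],
    )
    conn.commit()
    return run_id

store_run([("Widget", 9.99), ("Gadget", 19.95)])

# Display always reads the latest run, so a bad run can simply be deleted.
latest = conn.execute("SELECT MAX(run_id) FROM products").fetchone()[0]
for row in conn.execute("SELECT name, price FROM products WHERE run_id = ?", (latest,)):
    print(row)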

Source: http://stackoverflow.com/questions/13686474/what-is-the-right-way-of-storing-screen-scraping-data

Scraping dynamic data


I am scraping profiles on ask.fm for a research question. The problem is that only the most recent questions are viewable, and I have to click "view more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of calling this 4 times before scraping? I want the most recent 60 posts on the site. Python is preferable.

You could probably use selenium to browse to the website and click on the button/link a few times (a sketch follows the link below). You can get it here:

    https://pypi.python.org/pypi/selenium
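
A rough sketch of that approach (current selenium API; the profile URL is a placeholder, the button selector comes from the markup in the question, and the fixed sleep is a crude stand-in for a proper wait):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://ask.fm/some_profile")  # placeholder profile URL

# Click "View more" four times to load additional batches of posts.
for _ in range(4):
    driver.find_element(By.CSS_SELECTOR, "input.submit-button-more").click()
    time.sleep(2)  # crude wait for the next batch to render

html = driver.page_source  # hand this HTML to your parser of choice
driver.quit()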

Or you might be able to do it with mechanize:

    http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

    http://twill.idyll.org/



Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Web Scraping data from different sites


I am looking for a few ideas on how I can solve a design problem I'm going to face when building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem; matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If I scrape this from two different sites, I will encounter the situation where I could have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person but with their name slightly different on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then, when I scrape site A or B, I can over time build up a list of failures and store them in a list for each scraper, so that it knows (if I find id=100 then I know that the firstname will be William, etc.). I can't help but feel this is a rubbish idea!
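
One way to make that reference-list idea concrete is a canonicalization step before matching. A small Python sketch (the nickname table and match key are illustrative assumptions, not a full entity-resolution solution):

# Map common nickname variants to a canonical first name.
NICKNAMES = {
    "bill": "william",
    "billy": "william",
    "will": "william",
    "bob": "robert",
}

def canonical_key(record):
    """Build a site-independent match key from a scraped record."""
    first = record["firstname"].strip().lower()
    first = NICKNAMES.get(first, first)
    surname = record["surname"].strip().lower()
    return (first, surname)

site_a = {"id": 100, "firstname": "William", "surname": "Doe"}
site_b = {"id": 1974, "firstname": "Bill", "surname": "Doe"}

# Both records reduce to ('william', 'doe'), so they can be merged.
print(canonical_key(site_a) == canonical_key(site_b))  # True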

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB


Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Scrape Data Point Using Python


I am looking to scrape a data point using Python from the URL http://www.cavirtex.com/orderbook.

The data point I am looking to scrape is the lowest bid offer, which at the current moment looks like this:

<tr>
 <td><b>Jan. 19, 2014, 2:37 a.m.</b></td>
 <td><b>0.0775/0.1146</b></td>
 <td><b>860.00000</b></td>
 <td><b>66.65 CAD</b></td>
</tr>

The relevant point is the 860.00000 value. I am looking to build this into a script which can send me an email to alert me of certain price differentials compared to other exchanges.

I'm quite a noob, so if you could explain your thought process on why you've done certain things, it would be very much appreciated.

Thank you in advance!

Edit: This is what I have so far. It returns the title correctly, but I'm having trouble grabbing the table data.

import urllib2, sys
from bs4 import BeautifulSoup

site= "http://cavirtex.com/orderbook"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup.title



Here is the code for scraping the lowest bid from the 'Buying BTC' table:

from selenium import webdriver

fp = webdriver.FirefoxProfile()
browser = webdriver.Firefox(firefox_profile=fp)
browser.get('http://www.cavirtex.com/orderbook')

lowest_bid = float('inf')
elements = browser.find_elements_by_xpath('//div[@id="orderbook_buy"]/table/tbody/tr/td')

for element in elements:
    # strip('<b>|</b>') strips the characters <, b, >, | and / from both
    # ends of the string, leaving the bare cell text.
    text = element.get_attribute('innerHTML').strip('<b>|</b>')
    try:
        bid = float(text)
        if lowest_bid > bid:
            lowest_bid = bid
    except ValueError:
        # Not every cell holds a number (dates, currency labels); skip those.
        pass

browser.quit()
print lowest_bid

In order to install Selenium for Python on your Windows-PC, run from a command line:

pip install selenium (or pip install selenium --upgrade if you already have it).

If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".

If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".

Note:

If you consider performance critical, you can implement the data scraping via a plain URL connection instead of Selenium and have your program run much faster. However, your code will probably end up a lot "messier", due to the tedious HTML parsing that you'll be obliged to apply...

Here is the code for sending the previous output in an email from yourself to yourself:

import smtplib, ssl

def SendMail(username, password, contents):
    server = Connect(username)
    try:
        server.login(username, password)
        server.sendmail(username, username, contents)
    except smtplib.SMTPException, error:
        print error
    Disconnect(server)

def Connect(username):
    # Derive the provider name from the address, e.g. "gmail" from
    # "someone@gmail.com", and look up its SMTP host in serverDict below.
    serverName = username[username.index("@")+1:username.index(".")]
    while True:
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException, error:
            print error
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException, ssl.SSLError), error:
            print error
            Disconnect(server)
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException, error:
        print error

serverDict = {
    "gmail"  : "smtp.gmail.com",
    "hotmail": "smtp.live.com",
    "yahoo"  : "smtp.mail.yahoo.com"
}

SendMail("your_username@your_provider.com", "your_password", str(lowest_bid))

The above code should work if your email provider is either gmail or hotmail or yahoo.

Please note that depending on your firewall configuration, it may ask your permission upon the first time you try it...



Source: http://stackoverflow.com/questions/21217034/scrape-data-point-using-python


What is Web Scraping and is Python the best language to use for this? [closed]


What is Web Scraping, and is Python the best language to use for this? If so, why is Python the best?



    What is Web Scraping

The process of making HTTP requests to websites and then extracting data from HTML documents (as opposed to using an official API)

    and is Python the best language to use for this?

Subjective. Argumentative. Insert favourite language (Perl Perl Perl) here.



Web scraping is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding certain full-fledged Web browsers. Web scraping is closely related to Web indexing, which indexes Web content using a bot and is a universal technique adopted by most search engines. In contrast, Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Uses of Web scraping include online price comparison, weather data monitoring, website change detection, Web research, Web content mashup and Web data integration. (Wikipedia)

Like any language, Python has certain advantages and disadvantages with it. For example, Perl is an easier language to do regular expression with. Then again, BeautifulSoup, a module for Python, makes web scraping really, really easy.




Web scraping is the process of automatically collecting Web information.

Python and Perl have vast libraries for web scraping. Even though I have a personal preference for Perl when it comes to data extraction, because of the ease with which you can use regexes, technically Python is no less capable. However, based on the size of the active community, Python would be the better choice.

Python:

BeautifulSoup -- a module which makes it easy to extract content from HTML.

Scrapy -- an open-source framework for web scraping in Python. It is a very elegant framework.

html5lib -- an HTML parser based on the HTML5 specification.
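
To give a feel for BeautifulSoup, a minimal sketch on a made-up HTML snippet:

from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="/page1">First</a>
  <a href="/page2">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a"):
    # Each link's visible text and its href attribute.
    print(link.get_text(), link["href"])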

Perl

WWW::Mechanize -- a very powerful module for fetching and parsing web pages, and easy to use.

Win32::IE::Mechanize -- if you are on Windows and want to scrape JavaScript-based pages.

Mozilla::Mechanize -- scrapes JavaScript-based pages from Linux, but you need to install GTK.

If you want to use Ruby for this task, Nokogiri and Mechanize are probably the right tools for the job.



If you are looking into web scraping with Python, you may also want to look at the lxml package. It has the same ElementTree API as the stdlib parser, but is much more powerful and fast.

Plus, the lxml package provides the brilliant lxml.cssselect module.

Depending on your needs, you may want to start with either bare lxml or some web-scraping library like Scrapy, mechanize or SmartHTTP.


Source: http://stackoverflow.com/questions/5926863/what-is-web-scraping-and-is-python-the-best-language-to-use-for-this

Scraping Data from Table Python


I'm trying to scrape data from a website's table using Python.

from bs4 import BeautifulSoup
from mechanize import Browser

BASE_URL = "http://www.ggp.com/properties/mall-directory"

def main():
    mech = Browser()
    url = "http://www.ggp.com/properties/mall-directory"
    page1 = mech.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)
    extract(soup1, 2007)


def extract(soup,year):
    table = soup.find("table")
    for row in table.findAll('option'):
        print row


main()

Row prints out:

<option value="184">Yakima, WA</option>
<option value="896">Yankton, SD</option>
<option value="851">Yazoo City, MS</option>
<option value="113">York-Hanover, PA</option>
<option value="87">Youngstown-Warren, OH-PA</option>
<option value="235">Yuba City, CA</option>
<option value="205">Yuma, AZ</option>
<option value="424">Zanesville, OH</option>

But what I need is

Yakima, WA
Yankton, SD
Yazoo City, MS
York-Hanover, PA
etc...

I've tried row.findAll('option value') but this doesn't work...
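
A hedged sketch of the missing step: the visible text of each option is its string content, which BeautifulSoup exposes via get_text(). Shown here with current bs4 on a made-up snippet matching the output above:

from bs4 import BeautifulSoup

html = """
<select>
  <option value="184">Yakima, WA</option>
  <option value="896">Yankton, SD</option>
</select>
"""

soup = BeautifulSoup(html, "html.parser")
for option in soup.find_all("option"):
    # get_text() returns just "Yakima, WA", without the tag or the value.
    print(option.get_text())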



Source: http://stackoverflow.com/questions/24124291/scraping-data-from-table-python

web scraping groupon


I want to scrape groupon.com. My problem is that such sites, when you load them for the first time, ask you to join their email service, but when you reload the page they show you the content of the page directly. How do I do it? I am using PHP for my scripting.

Also, if anyone could suggest a framework or library in PHP which makes scraping easy, it would be great.


I would investigate the cURL library for grabbing website content. I'm not sure of the exact information you want to scrape, or whether the refresh will cause an issue, but hopefully this launches your attempt.
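
The reload behavior suggests the site remembers first-time visitors, most likely via a cookie. In Python terms, here is a hedged sketch with the requests library (the same idea applies to PHP cURL with CURLOPT_COOKIEJAR): fetching twice within one cookie-persisting session often skips the interstitial.

import requests

session = requests.Session()  # persists cookies across requests

# First request: the site may serve the email-signup interstitial
# and set its "already seen it" cookie.
session.get("http://www.groupon.com/")

# Second request in the same session: with the cookie present,
# the real page content is typically served.
response = session.get("http://www.groupon.com/")
print(response.text[:500])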

We use iMacros. Pro: works in the browser, with any website. Con: not as fast as cURL. Of course, nothing stops you from using both.



Must you stick with PHP for the scraping? TestPlan makes this type of testing easy. You can either access the page again, or simply use TestPlan to sign up for their email list to gain extended access to their site.

Here's a rough example that takes you to the main page and closes the little popup:

GotoURL http://www.groupon.com/
Click id:step_one

SubmitForm with
    %Params:subscription[email_address]% somewhere@test.domain.xx
end


They have an API http://www.groupon.com/pages/api if that helps.


Source: http://stackoverflow.com/questions/3843733/web-scraping-groupon

Getting Different Results For Web Scraping


I was trying to do web scraping and was using the following code :

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br =  mechanize.Browser()
htmltext = br.open(url).read()

link_dictionary = {}
soup = BeautifulSoup(htmltext)

for tag_li in soup.findAll('li', attrs={"data-section":"Chennai"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]

        brnew =  mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()

        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text

        print articletext

I was unable to get any printed values by using this. But on using attrs={"data-section":"Business"} instead of attrs={"data-section":"Chennai"} I was able to get the desired output. Can someone help me?


READ THE TERMS OF SERVICE OF THE WEBSITE BEFORE SCRAPING

If you are using Firebug or Chrome's Inspect Element, you might see some content that will not be seen if you are using mechanize or urllib2.

For example, if you view the source code of the page (right click, View Source in Chrome) and search for the data-section attribute, you won't see any tags with "Chennai". I am not 100% sure, but I would say that content is populated by JavaScript, which requires the functionality of a browser.

If I were you, I would use selenium to open up the page and then get the page source from there; the HTML collected that way will be closer to what you see in a browser.

Cited here

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get("URL GOES HERE")
# I noticed there is an ad here; sleep until the page is fully loaded.
time.sleep(10)

soup = BeautifulSoup(driver.page_source)
# Using the selector from the question:
print len(soup.findAll('li', attrs={"data-section": "Chennai"}))
# or you can work directly in selenium
...

driver.close()

And the output for me is 8

Source: http://stackoverflow.com/questions/19918153/getting-different-results-for-web-scraping

Monday, 18 August 2014

How to prevent data-scraping a valuable data web service?

I have a great idea for a Windows Store app. I'd like to build this app; however, it requires a large and valuable database that I will need to create a service for, so that people cannot easily steal it. My thinking is to host a mobile service on Azure (which I've never tried) and create a .NET Web API project to take requests and dish out JSON like candy to a Windows 8 MVVM client. However, what I don't want is someone sniffing my traffic back and forth from app to service, figuring out how to get/post data using my app and service, and then setting up their own app/website to display this data, using my bandwidth to make themselves money.

How can I protect my app-to-db data access so it can't be reverse engineered on me?

Also is this the best setup for developing a high volume windows 8 app like this? Do you have a better suggestion?

EDIT: I know I can use SSL etc. to encrypt traffic to and from the service. What I am trying to prevent is someone using Firebug or Fiddler to figure out what parameters can be posted to get a particular record back, then creating their own site that simply uses my service as the endpoint, siphoning my data and leeching my bandwidth. For example, just using Firebug I know I can use https://www.google.com/search?q=dallas to search the word dallas on Google. Even if I encrypt the page, they can see that much in their browser; so if someone does the same get/post in their own application, they would get the same records back, thus using my stuff.

3 Answers

The most straightforward thing you can do is to set up authentication for your users using something like OAuth. This will allow you to ensure no communication happens with your service in an anonymous fashion.

Once you have authenticated your requests, you can place controls on those requests that won't impact a normal user. You could rate limit or throttle requests, or use any number of tactics to make it very expensive, time-wise, to siphon off large portions of your data set.

For instance, you can start blocking requests when you notice a large number of users clustering from a single IP address. You could place sensible limits on each user (like 10 API calls per minute with a result set limited to 50). You get the idea I'm sure.
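
As an illustration of that kind of throttling, here is a minimal in-memory sliding-window rate limiter in Python (the limits match the example above; a real service would back this with shared storage such as Redis rather than a per-process dict):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_CALLS = 10  # per user per window, as in the example above

_calls = defaultdict(deque)  # user id -> timestamps of recent calls

def allow_request(user_id):
    """Return True if this user is still under the rate limit."""
    now = time.time()
    window = _calls[user_id]
    # Drop timestamps that have fallen out of the window.
    while window and window[0] <= now - WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_CALLS:
        return False
    window.append(now)
    return True

for i in range(12):
    print(i, allow_request("user-42"))  # calls 10 and 11 are rejected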

I think we have the same concern. I'm developing a Windows 8 application which contacts a web service built on top of a Windows Azure Web Site. I don't want a bad guy to fire fake requests at my service after intercepting the traffic with a tool like Fiddler.

I asked this question in a mailing list and got a tip. I've never tried it, but just for your information: if your application requires user login, then the user's password is a good seed for data/traffic protection. You can use the password to generate a key pair, sign the request, and send it to the server along with the public key. The server side can then verify the signature with the public key.
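
To sketch that signing idea in Python (using a symmetric HMAC derived from the password instead of a public/private key pair; all names and values are illustrative):

import hashlib
import hmac

def derive_key(password, salt):
    """Derive a signing key from the user's password (shared with the server)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100000)

def sign_request(key, body):
    """Client side: attach an HMAC of the request body."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()

def verify_request(key, body, signature):
    """Server side: recompute the HMAC and compare in constant time."""
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

key = derive_key("user-password", b"per-user-salt")
body = b'{"query": "dallas"}'
sig = sign_request(key, body)
print(verify_request(key, body, sig))                # True
print(verify_request(key, b'{"query": "x"}', sig))   # False: tampered body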

Using HTTPS is another approach, but as you know, a bad guy can still see the actual data through Fiddler even with HTTPS.

Using a client certificate might be another solution, I think, but I didn't find the relevant documentation on how to install and pick a certificate from the client's machine.

HTH

just serve it over HTTPS, then they can't sniff it.

Source:http://stackoverflow.com/questions/14350298/how-to-prevent-data-scraping-a-valuable-data-web-service

Monday, 11 August 2014

How Web Data Extraction Services Will Save Your Time and Money by Automatic Data Collection

Data scraping is the process of extracting data from the web using a software program, from trusted websites only. Anyone can use the extracted data for any purpose they desire, in various industries, as the web holds every important piece of data in the world. We provide the best web data extraction software. We have the expertise and one-of-a-kind knowledge in web data extraction, image scraping, screen scraping, email extraction services, data mining, and web grabbing.

Who can use Data Scraping Services?


Data scraping and extraction services can be used by any organization, company, or firm that would like to have data from a particular industry, data on targeted customers, data on a particular company, or anything else which is available on the net: email ids, website names, search terms, and so on. Most of the time, a marketing company will use data scraping and data extraction services to market a particular product in a certain industry and to reach the targeted customers. For example, if company X would like to contact restaurants in a California city, our software can extract the data on those restaurants, and the marketing company can use this data to market its restaurant-related product. MLM and network marketing companies also use data extraction and data scraping services to find new customers, by extracting data on prospective customers whom they can then contact by telephone, postcard, or email marketing; this is how they build their huge networks and large groups for their own products and companies.

We have helped many companies find the particular data they need; some examples follow.

Web Data Extraction

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, toolkits that scrape web content were created. A web scraper is an API to extract data from a web site. We help you create this kind of API, which lets you scrape data as per your need. We provide a quality and affordable web data extraction application.

Data Collection

Normally, data transfer between programs is accomplished using info structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all. That's why the key element that distinguishes data scraping from regular parsing is that the output being scraped was intended for display to an end-user.

Email Extractor

A tool which automatically extracts email ids from any reliable source is called an email extractor. It basically serves the function of collecting business contacts from various web pages, HTML files, text files, or other formats, without duplicate email ids.

Screen scraping

Screen scraping originally referred to the practice of reading text information from a computer display terminal's screen, collecting visual data from a source instead of parsing data as in web scraping.

Data Mining Services

Data mining is the process of extracting patterns from information, and it is becoming an increasingly important tool for transforming data into information. We can deliver results in any format, including MS Excel, CSV, and HTML, according to your requirements.

Web spider

A Web spider is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Many sites, in particular search engines, use spidering as a means of providing up-to-date data.

Web Grabber

Web grabber is just another name for data scraping or data extraction.

Web Bot

Web Bot is a software program that is claimed to be able to predict future events by tracking keywords entered on the Internet. Web bot software is also a good program for pulling out articles, blogs, relevant website content, and other website-related data. We have worked with many clients on data extraction, data scraping, and data mining; they are really happy with our services, and we provide high-quality service that makes your data work easy and automatic.

Source:http://ezinearticles.com/?How-Web-Data-Extraction-Services-Will-Save-Your-Time-and-Money-by-Automatic-Data-Collection&id=5159023

How Your Online Information is Stolen - The Art of Web Scraping and Data Harvesting

Web scraping, also known as web/internet harvesting, involves the use of a computer program which is able to extract data from another program's display output. The main difference between standard parsing and web scraping is that in web scraping, the output being scraped is meant for display to human viewers rather than simply as input to another program.

Therefore, it isn't generally documented or structured for convenient parsing. Web scraping will generally require that binary data be ignored (this usually means multimedia data or images) and that formatting be stripped from the pieces that would confuse the desired goal: the text data. This means that, in actuality, optical character recognition software is a form of visual web scraper.

Usually a transfer of data occurring between two programs would utilize data structures designed to be processed automatically by computers, saving people from having to do this tedious job themselves. This usually involves formats and protocols with rigid structures that are therefore easy to parse, well documented, compact, and function to minimize duplication and ambiguity. In fact, they are so "computer-based" that they are generally not even readable by humans.

If human readability is desired, then the only automated way to accomplish this kind of a data transfer is by way of web scraping. At first, this was practiced in order to read the text data from the display screen of a computer. It was usually accomplished by reading the memory of the terminal via its auxiliary port, or through a connection between one computer's output port and another computer's input port.

It has therefore become a kind of way to parse the HTML text of web pages. The web scraping program is designed to process the text data that is of interest to the human reader, while identifying and removing any unwanted data, images, and formatting for the web design.

Though web scraping is often done for ethical reasons, it is frequently performed in order to swipe the data of "value" from another person or organization's website in order to apply it to someone else's - or to sabotage the original text altogether. Many efforts are now being put into place by webmasters in order to prevent this form of theft and vandalism.

Source:http://ezinearticles.com/?How-Your-Online-Information-is-Stolen---The-Art-of-Web-Scraping-and-Data-Harvesting&id=923976