Thursday 30 May 2013

Choose FMiner to Do Screen Scraping

When Operation Is the Only Criterion

There are certain things you need to know about a screen scraper. In the ever-changing world of commerce, efficiency in handling daily operational tasks has become the lifeblood of any business. In the grand scheme of things, it is these operations that business owners rely on most heavily when the pressure of business begins to mount. Sometimes, however, we take these operations for granted because of the other business-related, time-consuming tasks placed on our shoulders during the course of a business day.
There is Help for Businesses Everywhere

When the work just seems to keep piling up, it is of the utmost importance that business owners can rely on computer software that minimizes the labor involved in performing routine, ministerial duties. Fortunately for business owners worldwide, screen scraper software serves just that function.
What is Screen Scraper Software?

Screen scraper software gives users the ability to extract and store specific data from a wide range of online sources. The scraped data is available to the user immediately and can then be put to use for a variety of functions.
How do I use Screen Scraper?

A screen scraper is designed to be extremely user friendly. The software lets a user point and click on the data available on various web pages. The user can then open the extracted data in Excel or another text format and use it to carry out their work-related tasks.
Where Can I Find Screen Scraper?

Screen scraper software can be downloaded instantly onto your computer at fminer.com. This website contains all the information an interested person needs to understand the value of screen scraper software.

Source: http://www.fminer.com/screen-scraping/

Monday 27 May 2013

Are You Screen Scraping or Data Mining?

Many of us use these terms interchangeably, so let's be clear about what makes each of these approaches different from the other.

Basically, screen scraping is a process where you use a computer program or software to extract information from a website. This is different from crawling, searching or mining a site because you are not indexing everything on the page – a screen scraper simply extracts precise information selected by the user. Screen scraping is a useful application when you want to do real-time price and product comparisons, archive web pages, or acquire data sets that you want to evaluate or filter.

When you perform screen scraping, you can target data directly, and you can automate the process if you are using the right solution. Different screen scraping services and solutions offer different ways of obtaining information. Some look directly at the HTML code of the web page to grab the data, while others use more advanced, visual abstraction techniques that can often avoid "breakage" errors when the web source undergoes a programming or code change.
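
To make the first style concrete, here is a minimal sketch in Ruby using the Nokogiri gem, the kind of HTML-inspection scraping described above. The URL and the CSS selector are placeholders rather than a real target site.

    require 'open-uri'
    require 'nokogiri'

    # Fetch the page and parse its HTML. The URL and the '.product-price'
    # selector are hypothetical and would be adjusted for the real page.
    doc = Nokogiri::HTML(URI.open('https://example.com/product/123'))

    # Extract only the precise field the user cares about, rather than
    # indexing the whole page the way a crawler would.
    price = doc.at_css('.product-price')&.text&.strip
    puts "Current price: #{price}"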

On the other hand, data mining is basically the process of automatically searching large amounts of information and data for patterns. This means that you already have the information, and what you really need to do is analyze the contents to find what is useful. This is very different from screen scraping, which requires you to look for the data and collect it before you can analyze it.

Data mining also involves a lot of complicated algorithms, often based on statistical methods. The process has nothing to do with how you obtain the data; all it cares about is analyzing what is available for evaluation.
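
To illustrate the contrast, a mining step assumes the data is already in hand and only asks questions of it. The Ruby sketch below uses made-up product records and a deliberately simple analysis (average price per category); real data mining would use far more sophisticated statistical methods.

    # Hypothetical records that have already been collected, e.g. by a scraper.
    records = [
      { category: 'laptops', price: 899.0 },
      { category: 'laptops', price: 1099.0 },
      { category: 'tablets', price: 399.0 }
    ]

    # The "analysis" step: average price per category, no fetching involved.
    averages = records.group_by { |r| r[:category] }.map do |category, rows|
      [category, rows.sum { |r| r[:price] } / rows.size]
    end.to_h

    puts averages.inspect  # => {"laptops"=>999.0, "tablets"=>399.0}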

Screen scraping is often mistaken for data mining when, in fact, these are two different things. Today, there are online services that offer screen scraping. Depending on what you need, you can have it custom tailored to meet your specific needs and perform precisely the tasks you want. But screen scraping does not guarantee any kind of analysis of the data.


Source: http://www.connotate.com/company/blog/138-are_you_screen_scraping_or_data_mining

Friday 24 May 2013

Web Page Screen Scraping

The DataLayer.Web Scraper element "scrapes", or retrieves, content from within web pages and organizes it into the row/column datasets usually found in datalayers. This provides an easy way for developers to automate the collection and use of data published on other web sites or in HTML files. This document discusses techniques for using the datalayer; topics include:

    About Web Page "Scraping"
    Using "Get XPaths" and Web Scraper Table
    Using Web Scraper Rows and Web Scraper Column


About "Web Page Scraping"

"Screen scraping" has been around for a long time; it's the process of extracting data from the text on a display screen, usually based on fields or screen regions. A key issue is that the data is intended to be viewed and is therefore neither documented nor structured for convenient parsing. In the age of the Internet, the term has also come to apply to the same process with web pages. In many cases, complicated parsing code is required.

However, the Logi approach makes the process fairly easy. Generally, in a Logi application, the target HTML web page is read (the entire page, not just the portion displayed), transformed into XML, and then parsed based on XPath identifiers. Data formatted into tables, and even in less-structured blocks of text, can readily be parsed and served up to the report developer in the standard Logi datalayer row/column structure. The data can then be further processed (filtered, grouped, aggregated, etc.) using all the usual datalayer child elements. This is all done using the DataLayer.Web Scraper element and its child elements.
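
Outside of a Logi application, the same read-transform-parse idea can be sketched in a few lines of Ruby with Nokogiri. This is only a generic illustration of the XPath approach, not the DataLayer.Web Scraper element itself, and the URL and XPath expressions are placeholders.

    require 'open-uri'
    require 'nokogiri'

    # Read the whole page, then use XPath to turn an HTML table into
    # row/column data, roughly the shape a datalayer would hold.
    doc = Nokogiri::HTML(URI.open('https://example.com/report.html'))

    rows = doc.xpath('//table[@id="results"]//tr').map do |tr|
      tr.xpath('./td | ./th').map { |cell| cell.text.strip }
    end

    rows.each { |columns| puts columns.join(' | ') }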

The DataLayer.Web Scraper element is not available in Logi Java applications.

For convenience, this document refers to "HTML files", but file extensions are not really important; if the file will render to a web page, the datalayer will use it. Pages with .htm, .html, .xhtml, .mht and other extensions are all valid.

One of the keys to this process is specifying the XPath parsing identifiers, and DataLayer.Web Scraper provides a mechanism to assist developers in selecting them, as discussed in the next section.

Note that this document illustrates techniques by referring to public web sites. While we make an effort to stay aware of them, these sites may change their coding from time to time without our knowledge, invalidating the actual data and values depicted, so you may not be able to achieve the identical results. The techniques themselves, however, remain valid.

Source: http://devnet.logianalytics.com/rdPage.aspx?rdReport=Article&dnDocID=1139

Friday 17 May 2013

Screen Scraping Tools for Rails Developers

Screen scraping is defined in Wikipedia as "a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox".  There are a number of tools available to the Rubyist to accomplish almost any scenario imagined.

I recently had a project where almost the entirety of the application relied on screen scraping, due to the unavailability of a public API for the data. So, I dug into RubyGems to see what I could find. There are two main types of gems available: browser emulators and web crawlers (for lack of a better term).

Emulators generally follow the Capybara/choose-your-driver pattern (Selenium, Watir, etc.). Most, if not all, of these were developed with the end goal of automating unit/functional/integration tests. They can also be used in conjunction with the headless gem in the event that you don't want to see the 'browser' on your screen during testing. One variation on this theme is the use of the poltergeist gem in conjunction with PhantomJS: poltergeist is a Capybara driver that drives navigation via PhantomJS. The advantage of this combination is that part of the gtk libraries have been incorporated into the PhantomJS library, so you get headless browsing without the xvfb library and headless gem.
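
A rough sketch of that combination follows, assuming the capybara and poltergeist gems plus a PhantomJS binary on the PATH; the URL and selector are placeholders.

    require 'capybara'
    require 'capybara/poltergeist'

    # Register Poltergeist as a Capybara driver; it drives PhantomJS,
    # so no visible browser or xvfb/headless setup is needed.
    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app, js_errors: false)
    end

    session = Capybara::Session.new(:poltergeist)
    session.visit 'https://example.com/listings'

    # Because PhantomJS executes JavaScript, content rendered by AJAX
    # is available here just as it would be in a real browser.
    session.all('.listing-title').each { |node| puts node.text }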

Web crawlers, on the other hand, implement 'browsing' via an HTTP stack and an XML/HTML parser. The good ones use the excellent nokogiri gem to handle parsing. The one obvious piece of missing functionality in this stack is that no access is provided to the DOM of the navigated page, which makes these tools very difficult to use on an AJAX-laden site. The best of these tools is Mechanize.
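
A bare-bones Mechanize example of that crawler style might look like the following; the URL and selectors are placeholders.

    require 'mechanize'

    agent = Mechanize.new
    page  = agent.get('https://example.com/listings')

    # Mechanize pages are backed by Nokogiri, so CSS/XPath search works directly.
    page.search('.listing .title').each { |node| puts node.text.strip }

    # Follow a link by its visible text, the way a browser user would;
    # note that anything injected by JavaScript after page load will be missing.
    next_page = page.link_with(text: 'Next')
    page = next_page.click if next_page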

I have to say, unequivocally, that you should try everything possible to make Mechanize your chosen tool. Every facet of scraping data with this tool seemed to be about an order of magnitude faster than using the Capybara/xxx combination. Part of the reason for this is by design: the emulated browsers (designed as testing tools) were not necessarily optimized for speed.

The biggest challenge I encountered in building my application was the requirement to download some documents stored by the site I was scraping and archive them to an Amazon S3 store. I thought this would be a fairly straightforward evolution, as Mechanize provides what it calls 'pluggable parsers'. This is just a mapping of which Mechanize parser handles a given MIME type encountered on a page. The code block below tells Mechanize that whenever the application/pdf MIME type is encountered, the Mechanize::Download class should be used to download the PDF.

    agent = Mechanize.new
    agent.get 'http://samplepdf.com'
    # handle application/pdf responses with the download parser
    agent.pluggable_parsers.pdf = Mechanize::Download

The pdf document can now be saved using:

    doc = agent.get 'http://samplepdf.com/sample.pdf'
    doc.save(mylocalfilename)

Great, I thought.  I'll still be able to use the fast tool.  As I delved further into the site I was scraping, though, I discovered that I was not provided a direct link to the PDFs that were to be downloaded.  Slogging through the JavaScript library revealed that, upon clicking the download button, a form was constructed by a function using the required parameters (document number, case number, preferences, etc.).  Mechanize::Form to the rescue...


    params = @agent.page.at('form').attributes["onsubmit"].value.match(/\((.*?)?\)/)[1].split(",").each { |e| e.gsub!(/'/, "") }

    # rebuild the form that the site's JavaScript would have constructed
    builder = Nokogiri::HTML::Builder.new do |doc|
      doc.form_ :enctype => "multipart/form-data", :method => "POST", :action => params[0], :id => "id" do
        doc.input :type => "hidden", :name => "caseid", :value => params[1]
        doc.input :type => "hidden", :name => "de_seq_num", :value => params[2]
        doc.input :type => "hidden", :name => "got_receipt", :value => params[3]
      end
    end

    node = Nokogiri::HTML(builder.to_html)
    f2 = Mechanize::Form.new(node.at('form'), @agent, @agent.page)
    doc = f2.submit #submit the newly built form

    #sometimes the document was being loaded into an iframe
    if doc.is_a?(Mechanize::Page)
      src = doc.at('iframe')['src']
      doc = @agent.get src
    end
    doc.save(document_name)

 where:

    params - the params submitted on the actual form are parsed for use by the new form
    builder - a Nokogiri class used to build XML/HTML documents
    node - the form elements contained in the builder's output
    f2 - the new form.  See how it is attached to the current location of the Mechanize agent during initialization
    doc - the pdf to be downloaded

Using these techniques, I was able to switch from using the watir-webdriver for Capybara to Mechanize and achieve a 12x increase in performance on this application.  Now, if I can just figure out what to do with the Ajax interactions...


Source: http://www.techhui.com/profiles/blogs/screen-scraping-tools-for-rails-developers

Monday 6 May 2013

What is Web Scraping?

Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting etc) is a technique employed to extract large amounts of data from websites.

Data from third-party websites on the Internet can normally be viewed only using a web browser. Examples include data listings at yellow pages directories, real estate sites, social networks, industrial inventories, online shopping sites, contact databases, etc. Most websites do not offer the ability to save a copy of the data they display to your local storage. The only option then is to manually copy and paste the data displayed by the website in your browser to a local file on your computer - a very tedious job which can take many hours, or sometimes days, to complete.

Web Scraping is the technique of automating this process, so that instead of manually copying the data from websites, the Web Scraping software will perform the same task within a fraction of the time.

A Web Scraping software will interact with websites in the same way as your web browser. But instead of displaying the data served by the website on screen, the Web Scraping software saves the required data from the web page to a local file or database.
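
As a generic sketch of that "save to a local file" step, the Ruby snippet below writes a couple of scraped fields to a CSV file; the URL, selectors and output path are placeholders, and WebHarvy itself requires no code at all.

    require 'csv'
    require 'open-uri'
    require 'nokogiri'

    doc = Nokogiri::HTML(URI.open('https://example.com/listings'))

    # Write the selected fields to a local CSV file instead of copying
    # them out of the browser by hand.
    CSV.open('listings.csv', 'w') do |csv|
      csv << ['name', 'phone']
      doc.css('.listing').each do |listing|
        csv << [listing.at_css('.name')&.text&.strip,
                listing.at_css('.phone')&.text&.strip]
      end
    end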

WebHarvy is a point-and-click web scraper (visual web scraper) which lets you scrape data from websites with ease. Unlike most other web scraper software, WebHarvy can be configured to extract the required data from websites with mouse clicks. You just need to select the data to be extracted by pointing the mouse. Yes, it is that easy! We recommend that you try the evaluation version of WebHarvy or see the video demo.

Source: http://www.webharvy.com/articles/what-is-web-scraping.html

Wednesday 1 May 2013

Google may drop screen scraping of TripAdvisor and Yelp content to get government off its back

The US Federal Trade Commission (FTC) may end its antitrust investigation of Google’s search business by letting the company make voluntary changes, “such as limiting use of restaurant and travel reviews from other websites,” according to sources who spoke with Politico and Bloomberg News.

The official decision is expected to be made public as soon as this week, according to the New York Times Bits blog.

Bye-bye, screen scraping

Two sources told Politico that the company had agreed to abstain from screen scraping synopses of reviews from other sites, such as TripAdvisor and Yelp, and stop incorporating them in its search results. (The company has already made some concessions. Last year, it scaled back using “snippets” of text from other sites in its results related to local businesses.)

Google would also commit to simplify the process for advertisers to port their rates and bidding data to rival advertising networks, such as Microsoft’s Bing. That means the company will make it easier for users of Google AdWords “to compare data from ad campaigns with their performance on other Internet search engines,” says Bloomberg.

The company will apparently also stop signing exclusive agreements with sites to use and promote Google’s search service.

A potential big win for Google

The probe is going Google’s way, if the reports of voluntary concessions are true. The company will have dodged a consent decree—a judge-approved deal that would have empowered the FTC to supervise specific Google search practices for a set period.

On the other hand, the deal could translate into a major concession by Google to content companies, such as TripAdvisor and Yelp. For years, Google has insisted that it has the right to copy and paste summaries from other sites, such as ratings and review excerpts.

Critics say the company uses these summaries in its search results to tip the scales in favor of its own products, an issue that’s become of more concern with Google having purchased and begun displaying travel content from ITA Software, Frommers, Zagat and other companies.

Apparently federal investigators are afraid they can’t prove that consumers are harmed by how the company ranks its search engine result pages (SERPs), according to an earlier report by Bloomberg. The company seems to have been effective in arguing that the cost for consumers to switch to rival sources of information, such as Expedia or TripAdvisor, is nothing.

Critics, including Fairsearch.org, had complained that Google puts its own reviews, maps and services at the top of its results pages, a prime spot that draws most of users’ clicks.

There’s also apparently a lack of damning e-mails or other evidence that Google executives are deliberately trying to use its leverage to harm competitors. This may be the more crucial point. In the words of The New York Times, “the legal issue is the tactics the dominant company employs to expand its empire.”

As Tnooz has noted, Google hired 13 communications and lobbying firms to help fend off antitrust challenges. Critics insist there have been irregularities in how the FTC has handled its probe.

As Tnooz has reported, Google’s search practices are also being probed by European authorities, who obviously aren’t bound by any US government decision. Google accounts for 79% of searches in Europe, compared with 63% in the US. The greater volume combined with tighter regulation may make it easier for Europeans to make a case that Google has monopolistic power.

Source: http://www.tnooz.com/2012/12/17/news/report-google-to-drop-screen-scraping-of-tripadvisor-and-yelp-content-to-get-government-off-its-back/

Note:

Jazz Martin is an experienced web scraping consultant who writes articles on screen scraping services, website scrapers, Yellow Pages data scraping, Amazon data scraping, and product information scraping.

Screen-scraping Addresses from the USPS

As I mentioned in my posts about Web Tools, I wasn’t entirely sure I’d get approval from the Post Office to use the Web Tools servers, so I came up with a plan B. That plan B was screen-scraping the USPS’ website. Like the others, I wrote the code originally in 2010. As it happened, a site redesign recently made me revisit the code…and rewrite it almost entirely.

PHP is a good candidate for handling this, as it has capabilities that make screen-scraping very simple. To start, you need to assemble a URL from which the contents will be retrieved, and for that you need an address to be validated. In my case, I put the addresses in a MySQL table and put the whole thing in a loop. So, I will make something of an assumption and dip right into the screen-scraping itself.

The URL can be assembled like so:

$haxUrl = 'https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=0&companyName=&';
$haxMsg = 'address1=' . trim($row['street']) . '&address2=' . trim($row['suite']) . '&city=' . trim($row['city']) . '&state=' . trim($row['state']) . '&urbanCode=&postalCode=&zip=';

I’ve already returned the address from MySQL in an associative array. I’ve learned to use associative arrays where possible – it makes rewriting code to add or remove fields a very simple matter. The next thing to be done here involves reworking the query string you’ve just assembled, as the USPS’ site doesn’t like a fully URL-encoded URL; the separators need to be restored after encoding. That can be done something like so:

$haxMsg = urlencode($haxMsg);
// restore the = and & separators that urlencode() escaped
$haxMsg = str_replace('%3D','=',$haxMsg);
$haxMsg = str_replace('%26','&',$haxMsg);
// convert any percent-encoded spaces to +
$haxMsg = str_replace('%20','+',$haxMsg);

Next, off to the USPS with it:

$newurl = $haxUrl . $haxMsg;
// retrieves the response from the server
$raw = file_get_contents($newurl);
if ($raw) { // gets around non-responsive server error
  // strip line breaks and extra whitespace
  $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
  $content = str_replace($newlines, "", html_entity_decode($raw));
  $errorNotFound = strpos($content, '<li class="error">');
  $errorFoundNotFound = strpos($content, 'class="error clearfix">');
  $errorMultipleAdd = strpos($content, 'Several addresses matched the information you provided');
  $start = strpos($content, '<span class="address1 range">');
  $end = strpos($content, '<p class="show-details">');
  // chop down the result to just what's between start and end
  $valid = substr($content, $start, $end - $start);

The block continues a bit, so apologies for the open bracket without closing it. What you’re doing here is chopping the returned data down to the relevant part – the address that’s been returned from the Post Office. From there, it’s just finding the address parts by their <span> tags.

if ((!$errorNotFound) && (!$errorFoundNotFound) && (!$errorMultipleAdd)) {
  $validLen = strlen($valid);
  $cityStart = strpos($valid, '<span class="city range">');
  $stateStart = strpos($valid, '<span class="state range">');
  $zip5Start = strpos($valid, '<span class="zip" style="">');
  $zip5End = strpos($valid, '<span class="hyphen">');
  $zip4Start = strpos($valid, '<span class="zip4">');

And so forth. Check for address1 and address2, add handling to split suite/room/floor/apt so it will always be on the suite line of the address, etc. Following that, in my case I write it back to another MySQL table. For my purposes, the addresses I’m validating belong to commercial accounts, and so I use the account number and department number where applicable as a ‘keys’ field, for a unique identifier. I grab it in the original query and write it back with the validated address.

In the case that an address comes back as either unrecognized or with a (frustratingly common) ‘multiple addresses exist’ error, I simply write back ‘address error’ or ‘multiple address’ in the street column of the validated table. Once I’ve pulled the data down into Excel, I can filter down to those and correct manually.

I run this script on the command line, so as to not have any time limitations. I typically put a short sleep on the loop, so as to be a little more kind to the Post Office’s servers. I used a rand(4, 30) to make it slightly less obvious that it was a script running, hitting their servers for more than 72 hours straight. When I wrote a little variation of this script without a sleep, I found that I could kick it off and have back an Excel workbook with 25 addresses in roughly 15 seconds. With a sleep of between 1 and 4 seconds, I ran over 27k addresses in just over 24 hours.

So, you can take advantage of PHP to ease the hideous tediousness that is validating addresses. However, be a little kind to the Post Office’s servers. After all, they’re not in the best sort of position now to add equipment or manpower. It could be I’ll be rewriting this code again in a couple years to scrape the site of FedEx or UPS.


Source: http://bendustries.co/wp/?p=39


Comparable Sales — and Widgets, APIs and Screen Scraping

Recently someone contacted me and inquired about pulling comparable sales info from Zillow, Trulia, Eppraisal, Yahoo and Cyberhomes. He also wanted to sort the details before storing them in Excel for further analysis. If it worked well, he wanted to package it and sell it to the realtor community.

I told him that storing comparable sales data (or any other data, for that matter) is against their terms and conditions. Most of these providers (except Zillow and Eppraisal) do not provide APIs for comparable sales. Regardless, he found someone to screen scrape these sites and create an Excel spreadsheet for a cheap price.

Screen scraping and stealing information from web sites carries serious ramifications. It will create legal issues once the web site owners trace the activity to his IP address. His site could also be blacklisted. Even worse, screen scraping will stop functioning once a site makes minor changes to its HTML, which is not uncommon in today’s world dominated by screen scrapers.

Screen scraping is simple parsing of web pages using a programming language (like PHP, ColdFusion, Java, ASP, Perl or Python), looking for specific patterns in the HTML code and extracting certain key details. It requires only basic programming skills, and most of these languages make it easy with powerful parsing capabilities. It amounts to piracy, can have legal ramifications, and is best avoided. The temptation is high given that there are many freelancers on the internet offering cheap screen scraping solutions.

This leads to the basic question – How can one access comparable sales information to attract traffic to his site?

The answer depends on your needs and capabilities. If you don’t want to get your hands dirty with programming, or you have a low budget, your best bet will be the readily available widgets provided by most of these sites. You will only need some basic HTML skills to make sure the widget is placed properly on your web site without distorting the layout. There may also be plug-ins available (like the WordPress local market explorer) if you want to add these to your blog.

For advanced users with programming skills, you can try the API (Application Programming Interface) offered by these providers. You can also hire programmers to do this for you. These APIs are mostly web services based on REST (vs. SOAP). Amazon was the pioneer in this area, later embraced by most major players. There are very good frameworks and libraries available for using these APIs. This gives you maximum flexibility, and you can combine them with other APIs like Google Maps, Facebook, Twitter, Walkscore and Yelp (to name a few) to create very interesting end results, known as mashups. A word of caution – make sure that you read and follow each provider’s Terms & Conditions when doing this.
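
For illustration only, a REST call of this kind in Ruby might look like the sketch below. The endpoint, parameters and response fields are hypothetical, so check the provider's actual API documentation and terms before relying on any of it.

    require 'net/http'
    require 'json'
    require 'uri'

    # Hypothetical comparable-sales endpoint and API key.
    uri = URI('https://api.example-provider.com/v1/comps')
    uri.query = URI.encode_www_form(address: '123 Main St', zip: '94105', key: 'YOUR_API_KEY')

    response = Net::HTTP.get_response(uri)

    if response.is_a?(Net::HTTPSuccess)
      comps = JSON.parse(response.body)
      # The 'results', 'address' and 'sale_price' fields are assumptions
      # about the response shape, not a documented schema.
      comps.fetch('results', []).each do |comp|
        puts "#{comp['address']}: #{comp['sale_price']}"
      end
    end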

API offerings may be very limited in many cases, and you may end up using the widgets in those situations. One also has to be aware of API changes, which may break your code and require fixes to keep it running; something Google, Amazon, Twitter and Facebook have done frequently. I’d recommend making sure you have an ongoing relationship with the programmer when you hire for this kind of job. Don’t go only by the cost, since most of them may not be around when your code needs a fix.

Source: http://geekestateblog.com/comparable-sales-and-widgets-apis-and-screen-scraping/
