Friday 24 May 2013

Web Page Screen Scraping

The DataLayer.Web Scraper element "scrapes", or retrieves, content from within web pages and organizes it into the row/column datasets usually found in datalayers. This provides an easy way for developers to automate the collection and use of data published on other web sites or in HTML files. This document discusses techniques for using the datalayer; topics include:

    About "Web Page Scraping"
    Using "Get XPaths" and Web Scraper Table
    Using Web Scraper Rows and Web Scraper Column


About "Web Page Scraping"

"Screen scraping" has been around for a long time; it's the process of extracting data from the text on a display screen, usually based on fields or screen regions. A key difficulty is that the data is intended to be viewed by people, so it is neither documented nor structured for convenient parsing. In the age of the Internet, the term has also come to apply to the same process with web pages. In many cases, complicated parsing code is required.

However, the Logi approach makes the process fairly easy. Generally, in a Logi application, the target HTML web page is read (the entire page, not just the portion displayed), transformed into XML, and then parsed based on XPath identifiers. Data formatted into tables, and even in less-structured blocks of text, can readily be parsed and served up to the report developer in the standard Logi datalayer row/column structure. The data can then be further processed (filtered, grouped, aggregated, etc.) using all the usual datalayer child elements. This is all done using the DataLayer.Web Scraper element and its child elements.
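The read, transform, and parse sequence described above can be illustrated outside of Logi with a minimal Python sketch. This is not how DataLayer.Web Scraper is implemented or configured; it only mirrors the same three steps using the Python standard library. The names HtmlToXml, scrape_table, and SAMPLE_HTML, as well as the sample page itself, are hypothetical, and the converter assumes reasonably tidy HTML with matching end tags.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have an end tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Hypothetical helper: turns tidy HTML into an XML element tree."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # close void elements immediately

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def scrape_table(html, table_xpath):
    """Return the table selected by table_xpath as a list of rows."""
    parser = HtmlToXml()
    parser.feed(html)            # step 1: read the entire page
    parser.close()
    root = parser.builder.close()  # step 2: the page as an XML tree
    table = root.find(table_xpath)  # step 3: locate data by XPath
    if table is None:
        raise ValueError("no element matches " + table_xpath)
    rows = []
    for tr in table.findall(".//tr"):
        cells = ["".join(c.itertext()).strip()
                 for c in tr if c.tag in ("td", "th")]
        if cells:
            rows.append(cells)
    return rows


# Made-up sample page, used only for illustration.
SAMPLE_HTML = """<html><body>
<h1>Quotes</h1>
<table id="prices">
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>ABC</td><td>10.50</td></tr>
  <tr><td>XYZ</td><td>7.25<br></td></tr>
</table>
</body></html>"""

if __name__ == "__main__":
    for row in scrape_table(SAMPLE_HTML, ".//table[@id='prices']"):
        print(row)
```

The XPath expression plays the same role here as the XPath identifiers mentioned above: it selects which part of the transformed page becomes rows and columns. Real-world pages are often not well formed, so a production scraper would need a more forgiving HTML-to-XML step than this sketch provides.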

The DataLayer.Web Scraper element is not available in Logi Java applications.

For convenience, this document refers to "HTML files", but file extensions are not really important; if the file will render to a web page, the datalayer will use it. Pages with .htm, .html, .xhtml, .mht and other extensions are all valid.

One of the keys to this process is specifying the XPath parsing identifiers, and DataLayer.Web Scraper provides a mechanism to help developers select them, as discussed in the next section.

Note that this document illustrates techniques by referring to public web sites. While we make an effort to stay aware of them, these sites may change their coding from time to time without our knowledge, invalidating the actual data and values depicted, so you may not be able to achieve the identical results. The techniques themselves, however, remain valid.

Source: http://devnet.logianalytics.com/rdPage.aspx?rdReport=Article&dnDocID=1139
