Tuesday, May 26, 2009

Web Data Extraction with C++ Web Macro

Web data extraction, or web scraping, can be implemented in various ways. Today I will use the Twebst Web Automation Library to extract search results from Google using DOM parsing and Internet Explorer automation (you need to install the Twebst Library first).

Here are the steps that the C++ web macro performs in order to extract results from a Google search:
  • Open an Internet Explorer browser and navigate to the Google site.
  • Find the search edit box and type in the search term.
  • Find the submit button and click it.
  • Wait until the page is loaded and find the DIV with id=res.
  • Find the collection of all H3 elements inside the DIV element.
  • For each H3, extract the text and URL and display them.

Enough talk! Let the code speak for itself.

// pCore is assumed to be an already created Twebst core object.
// Start a new Internet Explorer instance and navigate to a given URL.
IBrowserPtr pBrowser = pCore->StartBrowser("http://www.google.com/");

// Find search edit box in page and type some text into it.
IElementPtr pSearchEdit = pBrowser->FindElement("input text", SearchCondition("name=q"));
pSearchEdit->InputText("codecentrix");

// Find search button and click it.
IElementPtr pSearchBtn = pBrowser->FindElement("input submit", SearchCondition("text=Google Search"));
pSearchBtn->Click();

// Find the DIV element where the results are displayed.
IElementPtr pResultDiv = pBrowser->FindElement("div", SearchCondition("id=res"));

// Get all found results and print them in console.
IElementListPtr pResultList = pResultDiv->FindAllElements("h3", SearchCondition());

// Display only the header of each result (text and URL).
for (int i = 0; i < pResultList->length; ++i)
{
    // Get current H3 in the list.
    IElementPtr pCrntResult = pResultList->Getitem(i);

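    // Find the first and only anchor inside the H3.
    IElementPtr pCrntAnchor = pCrntResult->FindElement("a", SearchCondition());
    CComQIPtr<IHTMLAnchorElement> spCrntAnchor = pCrntAnchor->nativeElement;
    if (!spCrntAnchor)
    {
        continue; // Not an anchor element; skip this result.
    }

    // Get URL from IHTMLAnchorElement. The CComBSTR starts out empty (NULL):
    // passing the address of an initialized CComBSTR as an out-parameter
    // would leak its previous value.
    CComBSTR bstrURL;
    spCrntAnchor->get_href(&bstrURL);

    // Display results.
    wcout << pCrntResult->text << L"\n" << bstrURL.m_str << L"\n\n";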
}


Tuesday, May 19, 2009

IE Web Login Automation

One highly repetitive web task is logging on to a web site. This is a common scenario where the Twebst Web Automation Library really shines. Here is a short web macro written in JScript that automatically logs you on to the Yahoo Mail site. All you have to do is replace "UUUUUUUUUU" and "PPPPPPPPPP" with your user name and password in the code below.

// Open a browser and navigate to the Yahoo Mail login page.
var core = new ActiveXObject("Twebst.Core");
var browser = core.StartBrowser("https://login.yahoo.com/config/mail?.intl=us");

// Find login fields.
var u = browser.FindElement("input text");
var p = browser.FindElement("input password");
var s = browser.FindElement("input submit");

// Log on to the site by filling in the user name and password fields and then clicking the submit button.
u.InputText("UUUUUUUUUU");
p.InputText("PPPPPPPPPP");
s.Click();

FindElement searches through the whole frame/iframe hierarchy for the first input element of the requested type (text, password or submit). Additional search conditions can be specified, such as matching an element by id, name or any other HTML attribute. Search conditions can make use of regular expressions if needed.
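
To make the search behaviour described above more concrete, here is a small stand-alone JavaScript model of it. This is only a sketch under assumptions: plain objects stand in for frames and elements, and the names matchesCondition, findFirst and page are invented for illustration. None of it is Twebst API or Twebst's actual implementation.

```javascript
// Illustrative model only: plain objects stand in for frames and elements.
// None of these names come from the Twebst API.

// Returns true when the element's attribute matches the expected value.
// `expected` may be a string (exact match) or a RegExp.
function matchesCondition(element, attrName, expected) {
    var actual = element.attrs[attrName];
    if (actual === undefined) {
        return false;
    }
    if (expected instanceof RegExp) {
        return expected.test(actual);
    }
    return actual === expected;
}

// Depth-first search: elements of the current frame first, then child frames.
// Returns the first matching element, or null when nothing matches.
function findFirst(frame, tagName, attrName, expected) {
    var i;
    for (i = 0; i < frame.elements.length; ++i) {
        var element = frame.elements[i];
        if (element.tag === tagName && matchesCondition(element, attrName, expected)) {
            return element;
        }
    }
    for (i = 0; i < frame.frames.length; ++i) {
        var found = findFirst(frame.frames[i], tagName, attrName, expected);
        if (found !== null) {
            return found;
        }
    }
    return null;
}

// A page whose login form lives inside an iframe.
var page = {
    elements: [{ tag: "input", attrs: { type: "text", name: "search" } }],
    frames: [{
        elements: [
            { tag: "input", attrs: { type: "text", name: "username" } },
            { tag: "input", attrs: { type: "password", name: "passwd" } }
        ],
        frames: []
    }]
};

var byName = findFirst(page, "input", "name", "username"); // exact match
var byRegex = findFirst(page, "input", "name", /^pass/);   // regular expression
```

In this model both lookups reach into the nested frame: byName is the user name field and byRegex is the password field, while a search for the first text input returns the field in the top-level page because the top-level frame is scanned first.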

One more important point: the FindElement method waits for the web page to load completely before searching for the element (the timeout can be set through the core.loadTimeout property). Read more about Twebst Library...
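
The wait-then-search behaviour can be pictured as a polling loop with a deadline. The JavaScript below is a minimal sketch of that general pattern, not Twebst code: findWithTimeout, makeFakeClock and tryFindButton are hypothetical names, the 100 ms polling interval is an arbitrary choice, and the clock is injected so the example runs instantly and deterministically.

```javascript
// Illustrative model only: not Twebst code.

// Calls tryFind() until it returns a non-null value or timeoutMs elapses.
// Throws an Error when the deadline passes without a result.
function findWithTimeout(tryFind, timeoutMs, clock) {
    var deadline = clock.now() + timeoutMs;
    for (;;) {
        var found = tryFind();
        if (found !== null) {
            return found;
        }
        if (clock.now() >= deadline) {
            throw new Error("element not found within " + timeoutMs + " ms");
        }
        clock.sleep(100); // pause before polling again
    }
}

// Fake clock: sleep() simply advances the current time.
function makeFakeClock() {
    var t = 0;
    return {
        now: function () { return t; },
        sleep: function (ms) { t += ms; }
    };
}

// Simulated page that finishes loading after 300 ms.
var clock = makeFakeClock();
function tryFindButton() {
    return clock.now() >= 300 ? "submit button" : null;
}

var result = findWithTimeout(tryFindButton, 1000, clock);
```

With the simulated page above, the loop polls at 0, 100, 200 and 300 ms and returns on the fourth attempt. If the element never appears, the function throws once the deadline has passed, which is the role a load timeout plays.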


Monday, May 18, 2009

Twebst Web Automation Library v1.40 released

Twebst version 1.40 is launched!
The main changes include IE8 compatibility, better support for working with the embedded IE browser control, support for modal and modeless HTML dialogs, and functions for clipboard access.

Here is the list of new features and enhancements:
- NEW: IE8 is now supported
- ENH: core.AttachToNative* methods now work with the hosted IE browser control
- BUG: various fixes
- NEW: core.foregroundBrowser property
- NEW: core.productName property
- NEW: core.productVersion property
- NEW: core.GetClipboardText method
- NEW: core.SetClipboardText method
- NEW: core.AttachToWnd method
- NEW: core.NativeWindowToNativeBrowser method
- NEW: core.NativeWindowToNativeDocument method
- NEW: browser.FindModalHtmlDialog method
- NEW: browser.FindModelessHtmlDialog method
- NEW: element.GetAttribute method
- NEW: element.SetAttribute method
- NEW: element.RemoveAttribute method
- NEW: element.tagName property
- NEW: element.FindParentElement method
- NEW: core.RightClick method
- Find more ...

Free Download Twebst Library 1.40