Tuesday, May 26, 2009

Web Data Extraction with C++ Web Macro

Web data extraction or web scraping can be implemented in various ways. Today I will use Twebst Web Automation Library to extract search results from Google using DOM parsing method and Internet Explorer automation (you need to install Twebst Library first).

Here are the steps that C++ web macro will perform in order to extract results from Google search:
  • Open an Internet Explorer browser and navigate to Google site.
  • Find the search edit box and fill out the word to search.
  • Find the submit button and click it.
  • Wait until the page is loaded and find a DIV with id=res
  • Find the collection of all H3 elements inside the DIV element.
  • Extract the text and URL and display it.

Enough talk! Let the code speak for itself.

// Start a new Internet Explorer instance and navigate to a given URL.
IBrowserPtr pBrowser = pCore->StartBrowser("http://www.google.com/");

// Find search edit box in page and type some text into it.
IElementPtr pSearchEdit = pBrowser->FindElement("input text", SearchCondition("name=q"));
pSearchEdit->InputText("codecentrix");

// Find search button and click it.
IElementPtr pSearchBtn = pBrowser->FindElement("input submit", SearchCondition("text=Google Search"));
pSearchBtn->Click();

// Find the DIV element where the result are displayed.
IElementPtr pResultDiv = pBrowser->FindElement("div", SearchCondition("id=res"));

// Get all found results and print them in console.
IElementListPtr pResultList = pResultDiv->FindAllElements("h3", SearchCondition());

// Display only the header result (text and url).
for (int i = 0; i < pResultList->length; ++i)
{
    // Get current H3 in the list.
    IElementPtr pCrntResult = pResultList->Getitem(i);

    // Find first and only anchor inside H3
    IElementPtr pCrntAnchor = pCrntResult->FindElement("a", SearchCondition());
    CComQIPtr<IHTMLAnchorElement> spCrntAnchor = pCrntAnchor->nativeElement;

    // Get URL from IHTMLAnchorElement.
    CComBSTR bstrURL = "";
    spCrntAnchor->get_href(&bstrURL);

    // Display results.
    wcout << pCrntResult->text << L"\n" << bstrURL.m_str << L"\n\n";
}

Download:

4 comments:

Anonymous said...

Good Day!!! codecentrix.blogspot.com is one of the best informational websites of its kind. I enjoy reading it every day. All the best.

CodeCentrix said...

Thanks for your kind words sir!

git said...

Hi Codecentrix, kind of a stupid question, isn't this only a snippet of the code? Can you please provide the header you used for this?

Thank you

Adrian Dorache said...

Hi,

Download the zip file and you'll have all the code.

You don't need any header as the VC++ knows how to import COM type libraries.

#import "progid:Twebst.Core"
using namespace TwebstLib;

Adrian.