Showing posts with label HTML. Show all posts

Tuesday, May 26, 2009

Web Data Extraction with C++ Web Macro

Web data extraction, or web scraping, can be implemented in various ways. Today I will use the Twebst Web Automation Library to extract search results from Google using DOM parsing and Internet Explorer automation (you need to install the Twebst Library first).

Here are the steps that the C++ web macro performs in order to extract results from a Google search:
  • Open an Internet Explorer browser and navigate to the Google site.
  • Find the search edit box and type in the word to search for.
  • Find the submit button and click it.
  • Wait until the page is loaded and find the DIV with id=res.
  • Find the collection of all H3 elements inside that DIV element.
  • Extract the text and URL of each result and display them.

Enough talk! Let the code speak for itself.

// Start a new Internet Explorer instance and navigate to a given URL.
IBrowserPtr pBrowser = pCore->StartBrowser("http://www.google.com/");

// Find search edit box in page and type some text into it.
IElementPtr pSearchEdit = pBrowser->FindElement("input text", SearchCondition("name=q"));
pSearchEdit->InputText("codecentrix");

// Find search button and click it.
IElementPtr pSearchBtn = pBrowser->FindElement("input submit", SearchCondition("text=Google Search"));
pSearchBtn->Click();

// Find the DIV element where the results are displayed.
IElementPtr pResultDiv = pBrowser->FindElement("div", SearchCondition("id=res"));

// Get all found results and print them in console.
IElementListPtr pResultList = pResultDiv->FindAllElements("h3", SearchCondition());

// Display only the header result (text and url).
for (int i = 0; i < pResultList->length; ++i)
{
    // Get current H3 in the list.
    IElementPtr pCrntResult = pResultList->Getitem(i);

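    // Find the first and only anchor inside the H3.
    IElementPtr pCrntAnchor = pCrntResult->FindElement("a", SearchCondition());
    CComQIPtr<IHTMLAnchorElement> spCrntAnchor = pCrntAnchor->nativeElement;
    if (!spCrntAnchor)
    {
        // Skip this result if the element does not expose IHTMLAnchorElement.
        continue;
    }

    // Get the URL from IHTMLAnchorElement. The CComBSTR starts out empty
    // so that get_href does not overwrite (and leak) an existing string.
    CComBSTR bstrURL;
    spCrntAnchor->get_href(&bstrURL);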

    // Display results.
    wcout << pCrntResult->text << L"\n" << bstrURL.m_str << L"\n\n";
}

Download:

Monday, May 18, 2009

Twebst Web Automation Library v1.40 released

Twebst version 1.40 has been released!
Main changes include IE8 compatibility, better support for working with the embedded IE browser control, support for modal and modeless HTML dialogs, and functions for clipboard access.

Here is the list of new features and enhancements:
- NEW: IE8 is now supported
- ENH: core.AttachToNative* methods work now with hosted IE browser control
- BUG: various fixes
- NEW: core.foregroundBrowser property
- NEW: core.productName property
- NEW: core.productVersion property
- NEW: core.GetClipboardText method
- NEW: core.SetClipboardText method
- NEW: core.AttachToWnd method
- NEW: core.NativeWindowToNativeBrowser method
- NEW: core.NativeWindowToNativeDocument method
- NEW: browser.FindModalHtmlDialog method
- NEW: browser.FindModelessHtmlDialog method
- NEW: element.GetAttribute method
- NEW: element.SetAttribute method
- NEW: element.RemoveAttribute method
- NEW: element.tagName property
- NEW: element.FindParentElement method
- NEW: core.RightClick method
- Find more ...

Free Download Twebst Library 1.40

Wednesday, April 29, 2009

Homemade Handcrafted Help System

Good documentation is very important for any serious project. There comes a moment in life when programmers find themselves working on the help system. That is what happened to me during the later stages of the Twebst Web Automation Library project. Even though this project is not open source, I try to make as many parts of it public as possible. Today I will present the Twebst Help System and how the tedious and annoying task of creating it was automated.

Twebst is a library of COM objects used to automate the Internet Explorer browser. The objects and their supported properties and methods have to be documented. The page structure is the same for every object/method/property, and each page also contains code samples that need syntax highlighting. This is good news because it leaves a lot of room for templates and automation.

Here is the solution:
  • The template is an XML document. When documenting an object, method, or property, the focus is on the content rather than on formatting the text. There is one XML file for each object/method/property.
  • A WSH script written in JScript parses the XML document and adds syntax highlighting to the sample code in the documentation page. Regular expressions are used for parsing.
  • Cross references are added automatically by the same script.
  • Then an XSL transformation is applied to convert the XML source to an HTML document, which is eventually written to disk.
  • The whole process is optimized by skipping unnecessary operations, such as generating the HTML when it already exists and is newer than its XML source.
  • Finally, the HTML documents reference a CSS style sheet, so the look is easy to change.

It goes like this:
XML + JScript -> XML with syntax coloring and cross references + XSL -> HTML + CSS -> CHM
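The syntax-highlighting step can be sketched roughly as follows. This is a minimal illustration in modern JavaScript, not the WSH JScript from the actual build script; the keyword list, the "keyword" CSS class and the function names are assumptions made for the example.

```javascript
// Hypothetical keyword list; the real script's rules are not shown in this post.
const KEYWORDS = ["var", "function", "if", "else", "for", "while", "return", "new"];

// Escape characters that have a special meaning in HTML.
function escapeHtml(text) {
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

// Wrap whole-word keyword matches in a span that a CSS rule can color.
function highlightKeywords(code) {
    const pattern = new RegExp("\\b(" + KEYWORDS.join("|") + ")\\b", "g");
    return escapeHtml(code).replace(pattern, '<span class="keyword">$1</span>');
}

console.log(highlightKeywords("var n = 1; if (n < 2) return n;"));
```

A single regular expression like this also colors keywords that appear inside string literals and comments, so a real highlighter needs extra rules for those cases.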

For local help, the CHM compiler is invoked as a final step and a CHM help file is generated. All you have to do is launch the Build.js script, which you will find in the archive below.
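The optimization of skipping pages whose HTML is already newer than the XML source comes down to a timestamp comparison. Here is a minimal sketch of that check in modern JavaScript; the function name, file names and timestamps are made up for illustration and are not taken from Build.js.

```javascript
// Decide whether an output file must be regenerated from its source.
// sourceTime and outputTime are last-modified times in milliseconds;
// outputTime is null when the output file does not exist yet.
function needsRebuild(sourceTime, outputTime) {
    if (outputTime === null) {
        return true; // the HTML was never generated
    }
    return outputTime <= sourceTime; // the HTML is not newer than its XML source
}

// Hypothetical pages with hypothetical timestamps.
const pages = [
    { name: "core.xml",    sourceTime: 2000, outputTime: 3000 },
    { name: "browser.xml", sourceTime: 5000, outputTime: 4000 },
    { name: "element.xml", sourceTime: 1000, outputTime: null }
];

const stale = pages
    .filter(page => needsRebuild(page.sourceTime, page.outputTime))
    .map(page => page.name);

console.log(stale.join(", "));
```

In a real build script the two timestamps would come from the file system, and only the pages reported as stale would go through the highlighting and XSL steps.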


Downloads: TwebstHelp.zip

Prerequisites: In order to build the CHM file you'll need HTML Help Workshop from Microsoft.

Saturday, August 09, 2008

focus vs fireEvent("onfocus")

While working on the Twebst web automation library I encountered this problem: how do you simulate setting the focus on HTML edit controls in Internet Explorer? There are two ways to do this.

  1. Call the IHTMLElement2::focus() method on the target element, which "causes the element to receive the focus and executes the code specified by the onfocus event".
  2. Raise the onfocus event on the target element by calling the IHTMLElement3::fireEvent() method.

The two approaches are quite similar but there are some interesting differences.

  1. fireEvent("onfocus") does not actually set the focus on the element; it just executes the code of the onfocus event handler.
  2. Calling the focus method sets the focus on the target element and calls the onfocus event handler, but not immediately. The onfocus event seems to be inserted in a queue, and its handler is executed asynchronously after the current handler has finished.
  3. If the focus method is called from inside the onfocus handler, nothing happens when the control already has the focus (that prevents infinite recursion).

Example:


<html>
<script type="text/javascript" language="javascript">
function BtnFocusClick()
{
     document.getElementById('editTest').focus();
     window.status += "b";
}

function BtnOnFocusClick()
{
     document.getElementById('editTest').fireEvent('onfocus');
     window.status += "c";
}

function EditOnFocus()
{
     window.status += "a";
}
</script>

<body>
     <input type="text" onfocus="EditOnFocus();" id="editTest"/><br/>
     <input type="button" value="focus" id="btnFocus" onclick="BtnFocusClick();"/>
     <input type="button" value="fire onfocus" id="btnOnFocus" onclick="BtnOnFocusClick();"/>
</body>
</html>

If you press the "fire onfocus" button, the message in the Internet Explorer status bar is the expected one: "ac". If you press the "focus" button, the message is in the reverse of the expected order: "ba". That suggests that the EditOnFocus handler is called after BtnFocusClick exits.
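The observed ordering can be reproduced with a toy model in plain JavaScript. This is only a sketch of the queueing hypothesis described above, not Internet Explorer's actual implementation; the pendingEvents queue and the dispatchClick helper are invented for the illustration.

```javascript
// A toy model of the observed ordering, not Internet Explorer's real event code.
let statusText = "";
const pendingEvents = []; // handlers waiting to run after the current one

function editOnFocus() { statusText += "a"; }

// Models element.focus(): the onfocus handler is queued, not run immediately.
function focusEdit() { pendingEvents.push(editOnFocus); }

// Models element.fireEvent("onfocus"): the handler runs synchronously.
function fireOnFocus() { editOnFocus(); }

function btnFocusClick()   { focusEdit();   statusText += "b"; }
function btnOnFocusClick() { fireOnFocus(); statusText += "c"; }

// Run a click handler to completion, then drain whatever it queued.
function dispatchClick(handler) {
    statusText = "";
    handler();
    while (pendingEvents.length > 0) {
        pendingEvents.shift()();
    }
    return statusText;
}

console.log(dispatchClick(btnFocusClick));   // prints "ba"
console.log(dispatchClick(btnOnFocusClick)); // prints "ac"
```

The only difference between the two paths is whether the handler is pushed onto the queue or called directly, which is enough to explain the "ba" versus "ac" output.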