Showing posts with label COM. Show all posts

Tuesday, May 26, 2009

Web Data Extraction with C++ Web Macro

Web data extraction, or web scraping, can be implemented in various ways. Today I will use the Twebst Web Automation Library to extract search results from Google using DOM parsing and Internet Explorer automation (you need to install the Twebst Library first).

Here are the steps that the C++ web macro performs to extract results from a Google search:
  • Open an Internet Explorer browser and navigate to the Google site.
  • Find the search edit box and type in the search term.
  • Find the submit button and click it.
  • Wait until the page is loaded and find a DIV with id=res.
  • Find the collection of all H3 elements inside the DIV element.
  • Extract the text and URL of each result and display them.

Enough talk! Let the code speak for itself.

// Start a new Internet Explorer instance and navigate to a given URL.
IBrowserPtr pBrowser = pCore->StartBrowser("http://www.google.com/");

// Find search edit box in page and type some text into it.
IElementPtr pSearchEdit = pBrowser->FindElement("input text", SearchCondition("name=q"));
pSearchEdit->InputText("codecentrix");

// Find search button and click it.
IElementPtr pSearchBtn = pBrowser->FindElement("input submit", SearchCondition("text=Google Search"));
pSearchBtn->Click();

// Find the DIV element where the results are displayed.
IElementPtr pResultDiv = pBrowser->FindElement("div", SearchCondition("id=res"));

// Get all found results and print them in console.
IElementListPtr pResultList = pResultDiv->FindAllElements("h3", SearchCondition());

// Display only the header result (text and url).
for (int i = 0; i < pResultList->length; ++i)
{
    // Get current H3 in the list.
    IElementPtr pCrntResult = pResultList->Getitem(i);

    // Find first and only anchor inside H3
    IElementPtr pCrntAnchor = pCrntResult->FindElement("a", SearchCondition());
    CComQIPtr<IHTMLAnchorElement> spCrntAnchor = pCrntAnchor->nativeElement;

    // Get URL from IHTMLAnchorElement.
    CComBSTR bstrURL = "";
    spCrntAnchor->get_href(&bstrURL);

    // Display results.
    wcout << pCrntResult->text << L"\n" << bstrURL.m_str << L"\n\n";
}


Monday, December 15, 2008

Free Web Macros for Internet Explorer

As I presented in my previous post, automating Internet Explorer can be a difficult task.
Twebst Web Automation Library can make things easier.

It gives full programmatic control over the Internet Explorer browser. Twebst is a library of COM objects that can be used within any environment that supports COM, from scripting languages (JScript, VBScript) to high-level programming languages (C#, C++). For more information, see the Twebst Library Online Documentation. And yes, it's free!


What can Twebst do?

  • increase productivity by automating repetitive web tasks
  • automate regression testing of web applications
  • automate web actions and data-entry
  • automatically log in to different web sites
  • fill out web-forms automatically
  • extract data from web pages (web scraping).
  • monitor web pages

Twebst features

  • Start new browsers and navigate to a specified URL.
  • Connect to existing browsers.
  • Search and access HTML elements and frames inside browsers.
  • Intuitive names for HTML elements using the text that appears on the screen.
  • Advanced search of browsers and HTML elements using regular expressions.
  • Perform actions on all HTML controls (button, combo-box, list-box, edit-box etc).
  • Simulate user behavior by generating hardware or browser events.
  • Get access to native interfaces exposed by Internet Explorer so you don't need to learn new things if you already know IE web programming.
  • Synchronize web actions and navigation by waiting for the page to complete within a specified timeout.
  • Available from any programming or scripting language that supports COM.
  • Optimized search methods and collections.

Wednesday, December 10, 2008

What's wrong with Internet Explorer Automation?

The Microsoft Office products (Word, Excel, PowerPoint, Access, Outlook) allow their users to manipulate Office documents from Visual Basic or Visual Basic for Applications (VBA) code. For instance, it is possible to write a VBA macro in Excel that initializes a series of cells and uses them to display a chart.

Automation is the process of controlling one product from another product with the result that the client product can use the objects, methods, and properties of the server product. The client has access to the object model of the server.

Though the Internet Explorer browser is not part of the Office suite, it supports automation too. Here is a short sample:

// Create an IE automation object.
var ie = new ActiveXObject("InternetExplorer.Application");

// Make it visible and navigate to a given URL.
ie.Visible = true;
ie.Navigate("http://www.google.com/");

// Give it some time to load the page and then get the document.
WScript.Sleep(3000);
var doc = ie.Document;

// Fill out search field.
var edit = doc.getElementsByName("q").item(0);
edit.value = "codecentrix";

// ... and press the submit button.
var submit = doc.getElementsByName("btnG").item(0);
submit.click();


Here is the ie_auto.js file for download.
However, there are problems with Internet Explorer automation:
  • it may not work at all on Windows Vista unless the script runs at the same integrity level as the iexplore.exe process. Simply double-clicking the .js file won't do: the script runs at medium integrity level while Internet Explorer runs at low integrity level, and as a result the script fails. If you run the script at high integrity level, the newly started IE instance gets the same high integrity level and the script works (but this is not the best option from a security point of view). Changing the integrity level of the running script (or application) is not always desirable or easy.
  • no support for "connecting" to already existing IE documents.
  • difficult search of elements across all sub-documents inside frames/iframes (and sometimes impossible, see the point above).
  • difficult and time-consuming search of HTML elements on attributes other than id or name (getElementById and getElementsByName are the only methods I know that search elements directly, without browsing element collections, which can be very slow when performed out of process).
  • no direct support for synchronizing input actions (clicks, keys) with the loading of the HTML document (it could be implemented by registering for IE events such as document complete, or by looping until the browser is ready to accept input).
  • no advanced search criteria like regular expressions or searching on multiple attributes.
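The synchronization gap in the list above is usually closed with a polling loop. Here is a minimal sketch in plain standard C++ (it is not IE or Twebst code): WaitUntil and its parameters are hypothetical names, and the isReady predicate stands in for a real check such as the browser's ready state.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Polls isReady until it returns true or the timeout expires.
// Returns true if the condition was met in time, false on timeout.
// WaitUntil is a hypothetical helper; isReady stands in for a real readiness check.
bool WaitUntil(const std::function<bool()>& isReady,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds pollInterval = std::chrono::milliseconds(10))
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    for (;;)
    {
        if (isReady())
        {
            return true;
        }

        if (std::chrono::steady_clock::now() >= deadline)
        {
            return false;
        }

        // Give the page some time before checking again.
        std::this_thread::sleep_for(pollInterval);
    }
}
```

An input action such as a click would then be performed only when WaitUntil returns true, which replaces the fixed WScript.Sleep(3000) guess used in the script.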
If you are interested in solving the issues above, let me introduce a project I've been working on for some time now. Here's Twebst, a web automation library for Internet Explorer!



(to be continued)

Monday, March 31, 2008

.Net and COM interop story

.Net allows programmers to reuse COM components in their managed code. To make this possible, a managed wrapper object around the native object is needed. Apart from that, the COM object can be used like any other managed object. Even if it sounds simple, you have to be aware of the differences between the CLR's object lifetime management and COM's object lifetime management.

COM programmers have to call Release on every interface pointer that has been AddRef'ed. For C# programmers using COM objects, AddRef is called when:
- a COM object is created.
- a COM object is returned by calling a method or a property.
- a COM object is cast to another COM interface type.

To release a COM object in C# there are two options:
- let the GC collect the managed wrappers; their finalizers call Release on the native COM object.
- manually call Marshal.ReleaseComObject on every interface used in the code.
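To make the AddRef/Release bookkeeping concrete, here is a toy model in standard C++. It is only a sketch of the COM counting rule, not real COM code: RefCounted and RefPtr are hypothetical names, and RefPtr merely imitates what a holder such as CComPtr does.

```cpp
// Minimal stand-in for a COM object: the count starts at 1 for the creator.
class RefCounted
{
public:
    explicit RefCounted(bool* destroyedFlag) : m_refs(1), m_destroyed(destroyedFlag) {}

    unsigned long AddRef() { return ++m_refs; }

    // Destroys the object when the last reference goes away.
    unsigned long Release()
    {
        const unsigned long remaining = --m_refs;
        if (remaining == 0)
        {
            *m_destroyed = true;
            delete this;
        }
        return remaining;
    }

    unsigned long RefCount() const { return m_refs; }

private:
    ~RefCounted() {}   // Only Release may destroy the object.

    unsigned long m_refs;
    bool*         m_destroyed;
};

// Tiny holder: AddRef when it takes or copies a pointer, Release when it goes out of scope.
class RefPtr
{
public:
    explicit RefPtr(RefCounted* p) : m_p(p) { if (m_p) m_p->AddRef(); }
    RefPtr(const RefPtr& other) : m_p(other.m_p) { if (m_p) m_p->AddRef(); }
    RefPtr& operator=(const RefPtr&) = delete;
    ~RefPtr() { if (m_p) m_p->Release(); }

    RefCounted* get() const { return m_p; }

private:
    RefCounted* m_p;
};
```

Every AddRef has to be balanced by exactly one Release; in managed code the wrapper hides this pairing, which is why the question of who calls Release, and when, comes up at all.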

Let's see a short example using COM objects exposed by IE. The code below changes the color of every link in an HTML document.

// IHTMLDocument2 doc;
foreach (IHTMLElement elem in doc.all)
{
    IHTMLAnchorElement anchor = elem as IHTMLAnchorElement;
    if (anchor != null)
    {
        elem.style.color = "red";
    }
}

This first approach leaves the task of releasing COM objects to the garbage collector. Let's manually release COM objects now:

// IHTMLDocument2 doc;
IHTMLElementCollection allCollection = doc.all;
foreach (IHTMLElement crntElem in allCollection)
{
    IHTMLAnchorElement anchor = crntElem as IHTMLAnchorElement;
    if (anchor != null)
    {
        IHTMLStyle style = crntElem.style;
        style.color = "red";

        Marshal.ReleaseComObject(style);
        Marshal.ReleaseComObject(anchor);
    }

    Marshal.ReleaseComObject(crntElem);
}

Marshal.ReleaseComObject(allCollection);

As you can see, the number of code lines doubles! I personally prefer to leave the task of releasing COM objects to the GC, even though they are released only some time later, when the GC runs.

Some might be tempted to call GC.Collect after a large chunk of code that works with COM objects, but this could be even worse: other managed objects that survive the collection are promoted to the next GC generation, so they live longer than necessary.

In theory it is possible to create many large COM objects that exhaust the native heap while the managed heap still has plenty of available memory, because the managed wrappers are small. The GC is not triggered in this scenario, so the native memory is not freed.

If your application suffers from this kind of memory allocation problem, maybe using COM objects from managed code is not the best approach for you.

Saturday, February 02, 2008

When IHTMLWindow2.document throws UnauthorizedAccessException

This is basically a C# translation of one of my older articles, "When IHTMLWindow2::get_document returns E_ACCESSDENIED". Some .Net people had difficulty using it, so I decided to make their life easier.

The main problem is the confusion created by the System.IServiceProvider .Net interface, which has the same name as the COM interface. Once this issue is out of the way, the code translation is straightforward. Here's the interop code that declares the COM interface IServiceProvider.
// This is the COM IServiceProvider interface, not System.IServiceProvider .Net interface!
[ComImport(), ComVisible(true), Guid("6D5140C1-7436-11CE-8034-00AA006009FA"),
 InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)]
public interface IServiceProvider
{
    [return: MarshalAs(UnmanagedType.I4)][PreserveSig]
    int QueryService(ref Guid guidService, ref Guid riid, [MarshalAs(UnmanagedType.Interface)] out object ppvObject);
}
Here you can find the full source code of the sample assembly.

This technique was successfully implemented and tested in Twebst web automation library.

Wednesday, January 09, 2008

Nunit and STAThread story

I use the NUnit unit-testing framework to test my pet project, Twebst. Being a collection of COM objects, Twebst can be used within any environment that supports COM. That means it can be used from .Net languages like C#.

I started by creating an assembly to be used from the NUnit GUI. Some tests failed for no obvious reason. After some research I understood that the test thread must run in a single-threaded apartment (STA). The apartment model must be set before the thread is started, but I don't have access to the NUnit GUI main thread from my assembly.

One possible solution to this problem is to turn the assembly into an EXE application that uses the NUnit framework like this:

[STAThread]
public static void Main(string[] args)
{
    NUnit.ConsoleRunner.Runner.Main(
        new string[] { System.AppDomain.CurrentDomain.BaseDirectory + "MyExe.exe", "/nothread" });
}

When the /nothread command line flag is used, the tests are executed by the main thread, which already has the right COM apartment set.

Wednesday, November 14, 2007

How to get a handle to current TabWindowClass tab in IE7

The IWebBrowser2::get_HWND method gets the handle of the Internet Explorer 7 main window. Sometimes the tab window handle is needed instead. Here's some sample code that I recently found on MSDN. It shows how to get the handle of the tab window starting from an IWebBrowser2 object.
#include <shlguid.h>

HWND GetTabWnd(CComQIPtr<IWebBrowser2> spBrowser)
{
    HWND hwndTab = NULL;
    CComQIPtr<IServiceProvider> spServiceProvider = spBrowser;

    if (spServiceProvider != NULL)
    {
        CComQIPtr<IOleWindow> spWindow;
        if (SUCCEEDED(spServiceProvider->QueryService(
                SID_SShellBrowser,
                IID_IOleWindow,
                (void**)&spWindow)))
        {
            // Ask the tab's IOleWindow for its window handle.
            spWindow->GetWindow(&hwndTab);
        }
    }

    return hwndTab;
}
I think the code is supposed to work on top-level IWebBrowser2 objects. You can read more about top browser objects in my previous article.

This technique was successfully implemented and tested in Twebst web automation library.

Monday, November 12, 2007

When IWebBrowser2::get_HWND returns E_FAIL

IWebBrowser2::get_HWND "gets the handle of the Microsoft Internet Explorer main window". Like any COM method, get_HWND returns an HRESULT value. According to MSDN, the method "returns S_OK if successful, or an error value otherwise".

It was hard for me to imagine how this method could fail, but I still got an E_FAIL return value. This happened because the IWebBrowser2 object was not the top-level browser. A web page containing frames/iframes is represented by a hierarchy of IHTMLWindow2 objects. Each window has an associated IHTMLDocument2 object exposed by IHTMLWindow2::get_document. An IHTMLWindow2 can also be converted to an IWebBrowser2 object. Here's my solution for getting the main window handle starting from a non-top-level browser object (this is a common scenario when adding your custom menu item to the IE context menu).

// IHTMLWindow2 to IWebBrowser2
CComQIPtr<IWebBrowser2> IHTMLWindow2ToIWebBrowser2(CComQIPtr<IHTMLWindow2> spHTMLWindow)
{
    ATLASSERT(spHTMLWindow != NULL);

    // Query for a service provider.
    CComQIPtr<IWebBrowser2> spBrowser;
    CComQIPtr<IServiceProvider> spServiceProvider = spHTMLWindow;

    if (spServiceProvider != NULL)
    {
        // Ask the service provider for a IWebBrowser2 object.
        spServiceProvider->QueryService(IID_IWebBrowserApp, IID_IWebBrowser2, (void**)&spBrowser);
    }

    return spBrowser;
}

// IWebBrowser2 to IHTMLWindow2
CComQIPtr<IHTMLWindow2> IWebBrowserToIHTMLWindow(CComQIPtr<IWebBrowser2> spBrowser)
{
    ATLASSERT(spBrowser != NULL);
    CComQIPtr<IHTMLWindow2> spWindow;

    // Get the document of the browser.
    CComQIPtr<IDispatch> spDisp;
    spBrowser->get_Document(&spDisp);

    // Get the window of the document.
    CComQIPtr<IHTMLDocument2> spDoc = spDisp;
    if (spDoc != NULL)
    {
        spDoc->get_parentWindow(&spWindow);
    }

    return spWindow;
}


CComQIPtr<IWebBrowser2> TopBrowser(CComQIPtr<IWebBrowser2> spBrowser)
{
    ATLASSERT(spBrowser != NULL);

    // Retrieve IHTMLWindow2 from browser.
    CComQIPtr<IHTMLWindow2> spHTMLWnd = IWebBrowserToIHTMLWindow(spBrowser);
    if (spHTMLWnd != NULL)
    {
        // Find top window.
        CComQIPtr<IHTMLWindow2> spTopWindow;
        HRESULT hResult = spHTMLWnd->get_top(&spTopWindow);

        if (SUCCEEDED(hResult) && (spTopWindow != NULL))
        {
            // Convert the top window to a browser object.
            return IHTMLWindow2ToIWebBrowser2(spTopWindow);
        }
    }

    return CComQIPtr<IWebBrowser2>();
}

This technique was successfully implemented and tested in my web automation library.
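As a toy illustration of the hierarchy walk that get_top performs in TopBrowser, here is a sketch in plain standard C++. FrameWindow and TopWindow are hypothetical names, and the model assumes that a top-level window has no parent, which is a simplification of the real IE object model.

```cpp
#include <string>

// Hypothetical stand-in for a frame window; parent is nullptr for the top-level window.
struct FrameWindow
{
    std::string        name;
    const FrameWindow* parent;
};

// Follows the parent links until the top-level window is reached.
// Returns nullptr only when no window was passed in.
const FrameWindow* TopWindow(const FrameWindow* window)
{
    if (window == nullptr)
    {
        return nullptr;
    }

    while (window->parent != nullptr)
    {
        window = window->parent;
    }

    return window;
}
```

Starting from a window nested two levels deep, TopWindow returns the outermost one; that is the window whose browser object can report the main window handle.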

Wednesday, October 17, 2007

How to properly catch RBN_CHEVRONPUSHED notification?

This is actually a thread I started on MSDN forum but unfortunately it remained unanswered:

Virtually any IE toolbar needs a chevron to happily live along with other toolbars in the same re-bar. So does my toolbar. To implement chevron functionality in IE toolbars I need to handle RBN_CHEVRONPUSHED. According to MSDN, when the chevron button is pushed, the notification is sent by the rebar in the form of WM_NOTIFY message to its parent. Here is the windows hierarchy in IE7:

WorkerW <- ReBarWindow32 <- ToolbarWindow32

where the last toolbar window is my toolbar. So I need to catch notifications from ReBarWindow32 that are sent to the WorkerW window. The first idea that came to my mind was to subclass the WorkerW window. I don’t like this idea because:
  • I subclass a window that does not belong to me; it was created by IE.
  • I don’t know the best time to subclass it: in IObjectWithSite.SetSite or in IDockingWindow.ShowDW? (Those functions are implemented by my toolbar component.)
  • I don’t know the best time to un-subclass it.
  • I don’t know when another toolbar might subclass/un-subclass the same window (I actually got a conflict with another toolbar that crashed IE with a stack overflow because of the order of subclassing/un-subclassing).
My second approach uses RB_SETPARENT to make one of my own windows the parent of the ReBarWindow32 window. I process the RBN_CHEVRONPUSHED notification for my chevron button and forward the other notifications to the original parent window (that is, WorkerW). I change the parent on toolbar initialization/un-init (IObjectWithSite.SetSite). It seems a safer approach, but I’m still worried about other toolbars using the same technique and the possibility of conflicts.

Watch out for the standard IE "Links" toolbar, which also sends WM_COMMAND, WM_DRAWITEM and WM_MEASUREITEM messages to the WorkerW window (and now you'll get those messages too). On IE6, the "Go" button does the same.
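The routing idea behind the second approach can be sketched without any Win32 code. The model below is a simplification under stated assumptions: Notification and NotificationProxy are hypothetical names, the real notifications are Win32 messages, and the real code also has to check which band the chevron belongs to.

```cpp
#include <functional>
#include <utility>

// Hypothetical notification codes; the real rebar notifications are Win32 constants.
enum class Notification { ChevronPushed, HeightChange, Other };

// Sits where the original parent window used to be: handles the chevron
// notification locally and forwards everything else untouched.
class NotificationProxy
{
public:
    using Handler = std::function<void(Notification)>;

    NotificationProxy(Handler onChevron, Handler originalParent)
        : m_onChevron(std::move(onChevron)),
          m_originalParent(std::move(originalParent))
    {
    }

    void Dispatch(Notification notification) const
    {
        if (notification == Notification::ChevronPushed)
        {
            m_onChevron(notification);
        }
        else
        {
            // Anything not handled here must still reach the original parent.
            m_originalParent(notification);
        }
    }

private:
    Handler m_onChevron;
    Handler m_originalParent;
};
```

The important property is that nothing is swallowed: every message the proxy does not handle itself still reaches the window that originally expected it.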

So the question remains: what is the best way to catch RBN_CHEVRONPUSHED notification when creating an IE toolbar extension?

Wednesday, October 10, 2007

When IHTMLWindow2::get_document returns E_ACCESSDENIED

Internet Explorer extensions usually need to access HTML elements. When extensions are initialized they get an IWebBrowser2 pointer representing the browser. Starting from this pointer one can reach any HTML element in the web page, but to do that we need to walk a hierarchy of frames first. The simplest web pages have only one frame and one document. Web pages containing <frame> or <iframe> have a hierarchy of frames, each frame having its own document.

Here are the objects involved and the corresponding interfaces:
browser      - IWebBrowser2
frame/iframe - IHTMLWindow2
document     - IHTMLDocument2
element      - IHTMLElement


The list below shows which method to call to get one object from another:
browser  -> document         IWebBrowser2::get_Document
document -> frame            IHTMLDocument2::get_parentWindow
frame    -> document         IHTMLWindow2::get_document
frame    -> parent frame     IHTMLWindow2::get_parent
frame    -> children frames  IHTMLWindow2::get_frames
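Taken together, these navigation methods amount to a tree walk. Here is a minimal sketch in plain standard C++ of a depth-first search through a frame hierarchy. Frame and FindFrameWithElement are hypothetical names, and elements are modeled as plain id strings rather than IHTMLElement objects.

```cpp
#include <string>
#include <vector>

// Hypothetical model: each frame owns one document (reduced to a list of
// element ids) and any number of child frames.
struct Frame
{
    std::vector<std::string> elements;
    std::vector<Frame>       children;
};

// Depth-first search: returns the frame whose document contains the id,
// or nullptr if no frame in the hierarchy has it.
const Frame* FindFrameWithElement(const Frame& frame, const std::string& id)
{
    for (const std::string& element : frame.elements)
    {
        if (element == id)
        {
            return &frame;
        }
    }

    for (const Frame& child : frame.children)
    {
        if (const Frame* found = FindFrameWithElement(child, id))
        {
            return found;
        }
    }

    return nullptr;
}
```

In the real object model, each step from a frame to its document is a call that can fail, so a production version would have to check every step rather than assume the child is reachable.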


A normal call chain to get a HTML element is:
browser -> document -> frame -> child frame -> ... -> child frame -> document -> element

This will work almost all the time. Problems arise when different frames contain documents loaded from different internet domains. In this case IHTMLWindow2::get_document returns E_ACCESSDENIED when trying to get the document from the frame object. I think this happens to prevent cross-frame scripting attacks.

Here is the HtmlWindowToHtmlDocument function I wrote to be used instead of IHTMLWindow2::get_document to bypass the restriction:



// Forward declaration: defined below, used by HtmlWindowToHtmlDocument.
CComQIPtr<IWebBrowser2> HtmlWindowToHtmlWebBrowser(CComQIPtr<IHTMLWindow2> spWindow);

// Converts a IHTMLWindow2 object to a IHTMLDocument2. Returns NULL in case of failure.
// It takes into account accessing the DOM across frames loaded from different domains.
CComQIPtr<IHTMLDocument2> HtmlWindowToHtmlDocument(CComQIPtr<IHTMLWindow2> spWindow)
{
    ATLASSERT(spWindow != NULL);

    CComQIPtr<IHTMLDocument2> spDocument;
    HRESULT hRes = spWindow->get_document(&spDocument);

    if ((S_OK == hRes) && (spDocument != NULL))
    {
        // The html document was properly retrieved.
        return spDocument;
    }

    // hRes could be E_ACCESSDENIED, which means a security restriction that
    // prevents scripting across frames that load documents from different internet domains.
    CComQIPtr<IWebBrowser2> spBrws = HtmlWindowToHtmlWebBrowser(spWindow);
    if (spBrws == NULL)
    {
        return CComQIPtr<IHTMLDocument2>();
    }

    // Get the document object from the IWebBrowser2 object.
    CComQIPtr<IDispatch> spDisp;
    hRes = spBrws->get_Document(&spDisp);
    spDocument = spDisp;

    return spDocument;
}


// Converts a IHTMLWindow2 object to a IWebBrowser2. Returns NULL in case of failure.
CComQIPtr<IWebBrowser2> HtmlWindowToHtmlWebBrowser(CComQIPtr<IHTMLWindow2> spWindow)
{
    ATLASSERT(spWindow != NULL);

    CComQIPtr<IServiceProvider> spServiceProvider = spWindow;
    if (spServiceProvider == NULL)
    {
        return CComQIPtr<IWebBrowser2>();
    }

    CComQIPtr<IWebBrowser2> spWebBrws;
    HRESULT hRes = spServiceProvider->QueryService(IID_IWebBrowserApp, IID_IWebBrowser2, (void**)&spWebBrws);
    if (hRes != S_OK)
    {
        return CComQIPtr<IWebBrowser2>();
    }

    return spWebBrws;
}

Here is the C# version of the code: "When IHTMLWindow2.document throws UnauthorizedAccessException".

This technique was successfully implemented and tested in my web automation library.

Wednesday, October 03, 2007

From IAccessible to IHTMLElement and back

When developing Internet Explorer plug-ins you might want to take advantage of the dual nature of the browser objects. Internet Explorer exposes the DOM (document object model) using IHTMLElement interface as the building block of the hierarchy. It also offers accessible objects through Active Accessibility IAccessible interface.


// From IAccessible to IHTMLElement.
CComQIPtr<IHTMLElement> AccessibleToHTMLElement(IAccessible* pAccessible)
{
    ATLASSERT(pAccessible != NULL);

    // Query for IServiceProvider interface.
    CComQIPtr<IServiceProvider> spServProvider = pAccessible;
    if (spServProvider != NULL)
    {
        // Ask the service for a IHTMLElement object.
        CComQIPtr<IHTMLElement> spHtmlElement;
        HRESULT hRes = spServProvider->QueryService(IID_IHTMLElement, IID_IHTMLElement,
                                                    (void**)&spHtmlElement);

        return spHtmlElement;
    }

    return CComQIPtr<IHTMLElement>();
}

// From IHTMLElement to IAccessible.
CComQIPtr<IAccessible> HTMLElementToAccessible(IHTMLElement* pHtmlElement)
{
    ATLASSERT(pHtmlElement != NULL);

    // Query for IServiceProvider interface.
    CComQIPtr<IServiceProvider> spServProvider = pHtmlElement;
    if (spServProvider != NULL)
    {
        // Ask the service for a IAccessible object.
        CComQIPtr<IAccessible> spAccessible;
        HRESULT hRes = spServProvider->QueryService(IID_IAccessible, IID_IAccessible,
                                                    (void**)&spAccessible);

        return spAccessible;
    }

    return CComQIPtr<IAccessible>();
}


Not all IHTMLElement objects support Active Accessibility. Here is the list of the HTML elements that are also accessible elements:
A, AREA, BUTTON, INPUT type=BUTTON, INPUT type=RESET, INPUT type=SUBMIT, FRAME, IMG, INPUT type=checkbox, INPUT type=image, INPUT type=password, INPUT type=radio, MARQUEE, OBJECT, APPLET, EMBED, SELECT, TABLE, TD, TH, TEXTAREA, INPUT type=TEXT.
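Before asking an element for IAccessible, it can be useful to check whether its tag is in the list above. Here is a minimal sketch in plain standard C++; IsAccessibleElement and ToUpper are hypothetical helpers that only encode the list as written, and an INPUT without an explicit type is treated as not listed.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Upper-cases a tag or type name so the lookup is case-insensitive.
std::string ToUpper(std::string text)
{
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return text;
}

// Returns true if the tag (and, for INPUT, its type attribute) appears in the list above.
// Hypothetical helper: it encodes the list only and does not query the browser.
bool IsAccessibleElement(const std::string& tagName, const std::string& inputType = "")
{
    static const std::set<std::string> tags = {
        "A", "AREA", "BUTTON", "FRAME", "IMG", "MARQUEE", "OBJECT",
        "APPLET", "EMBED", "SELECT", "TABLE", "TD", "TH", "TEXTAREA" };

    static const std::set<std::string> inputTypes = {
        "BUTTON", "RESET", "SUBMIT", "CHECKBOX", "IMAGE", "PASSWORD", "RADIO", "TEXT" };

    const std::string tag = ToUpper(tagName);
    if (tag == "INPUT")
    {
        // For INPUT elements the type attribute decides.
        return inputTypes.count(ToUpper(inputType)) > 0;
    }

    return tags.count(tag) > 0;
}
```

A check like this only avoids pointless queries; the QueryService call itself still decides whether a given element really exposes an accessible object.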

This technique was successfully implemented and tested in my web automation library.

Tuesday, July 17, 2007

COM number speller

Some time ago, I was asked to provide a way to print numbers as text in Romanian. Interestingly, I couldn't find an easy way to do this in Excel (Office 2003, as far as I remember). Searching the internet, I found other people with the same problem. Microsoft provides a solution for English, but not for Romanian.

I decided to create a reusable piece of code that can be easily used in as many environments as possible, and I chose COM and ATL for the job. This is also a good programming exercise for my rusty COM / C++ skills, and I plan to use it as a template for all my future COM objects.

Points of interest:
  • error info support by implementing IErrorInfo interface.
  • Help in CHM format and context identifiers specified in MIDL source file.
  • BSTR manipulation using CComBSTR class provided by ATL.
  • IDispatch support, so the component can be used from scripting environments.
  • Spelling implementation itself and support for multiple languages (only Romanian and English for now).

To install the COM object, simply run Install.bat.
Here is the Excel macro that makes use of the NumberSpeller COM object:

Function Spell(n As Currency) As String
    'Create the speller object.
    Dim s As NumberSpellerLib.speller
    Set s = New NumberSpellerLib.speller

    Dim o As Object
    Set o = s
    o.Language = "ro"
    Spell = o.Translate(n)
End Function
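To give an idea of what a speller has to do, here is a minimal sketch in plain standard C++ for English and whole numbers up to 999999. It is not the NumberSpeller implementation: SpellBelowThousand and SpellEnglish are hypothetical names, and Romanian, currency and fractions are left out.

```cpp
#include <string>

// Spells 0..999 as English words (hypothetical helper).
std::string SpellBelowThousand(unsigned n)
{
    static const char* const ones[] = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen" };
    static const char* const tens[] = {
        "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety" };

    std::string text;
    if (n >= 100)
    {
        text = std::string(ones[n / 100]) + " hundred";
        n %= 100;
        if (n == 0)
        {
            return text;
        }
        text += " ";
    }

    if (n < 20)
    {
        return text + ones[n];
    }

    text += tens[n / 10];
    if (n % 10 != 0)
    {
        text += std::string("-") + ones[n % 10];
    }
    return text;
}

// Spells 0..999999; returns an empty string for larger values,
// which are out of range for this sketch.
std::string SpellEnglish(unsigned n)
{
    if (n > 999999)
    {
        return std::string();
    }
    if (n < 1000)
    {
        return SpellBelowThousand(n);
    }

    std::string text = SpellBelowThousand(n / 1000) + " thousand";
    if (n % 1000 != 0)
    {
        text += " " + SpellBelowThousand(n % 1000);
    }
    return text;
}
```

For example, SpellEnglish(2008) yields "two thousand eight". A language such as Romanian needs extra rules (gender agreement, for one), which is the reason for treating the language as a property of the component.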

