In this week’s installment, we’ll look at some basic HTML parsing methods and also how to fill out forms and submit them via code. I still see a lot of people asking how to get the text from a specific hyperlink or set the value of an input box on a web page. In this post, I’ll try to cover the method I use most when working with HTML parsing. I’ll show you how to get the link text from a hyperlink, set the text of an input box or textarea field, and click form buttons to submit forms.
Every HTML element, whether it is an anchor, div, img, or input, has what are called “attributes.” Here is an example of some general HTML code that shows the use of attributes:
<input type="text" name="log" id="user_login" class="input" value="" size="20" tabindex="10" />
<input type="password" name="pwd" id="user_pass" class="input" value="" size="20" tabindex="20" />
<input type="submit" name="wp-submit" id="wp-submit" class="button-primary" value="Log In" tabindex="100" />
In each of those lines above, every word that comes before an equals sign (=) is considered an attribute. Each HTML element has specific attributes, some of which are common among all of them, but I do not want to go into that with this post. A basic understanding of what they are and how they’ll be used in our VB world is all that is needed for this post.
So now that we understand attributes and are familiar with their syntax, placement, and function, let’s look at how we can set them and retrieve them using VB. In VB, there are two methods that will be your “go to” tactics for doing this: .SetAttribute and .GetAttribute (can you guess which one gets and which one sets? *wink*)
Set the value of an input box or textarea:
There are 2 ways to do this. One option is to use the .GetElementById method of the HTML Document. If you’re lucky, the web page you’re working with will use the ID attribute of every HTML element in the HTML code. This makes it a lot easier to parse it with VB. Here is an example of setting the value of an input box with the ID of “id”:
WebBrowser1.Document.GetElementById("id").SetAttribute("value", "New Value")
What we’ve done there is fetched the HTML element whose ID is “id” and set its “value” attribute to “New Value.” For input boxes, the value is what is shown inside the input box.
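One thing the one-liner above glosses over: GetElementById returns Nothing when no element has that ID, and WebBrowser1.Document is Nothing until a page has loaded, so the chained call can throw. Here is a minimal sketch of a safer version. The module and function names (FormHelpers, TrySetValueById) are my own invention for illustration, not something the framework provides:

```vb
Imports System.Windows.Forms

Module FormHelpers
    ' Hypothetical helper: sets the "value" attribute of the element with the given ID.
    ' Returns False instead of throwing when the page or the element is not available.
    Public Function TrySetValueById(browser As WebBrowser, elementId As String, newValue As String) As Boolean
        ' Document is Nothing until a page has been loaded
        If browser.Document Is Nothing Then Return False

        ' GetElementById returns Nothing when no element has that ID
        Dim target As HtmlElement = browser.Document.GetElementById(elementId)
        If target Is Nothing Then Return False

        target.SetAttribute("value", newValue)
        Return True
    End Function
End Module
```

With the sample HTML from earlier, you would call it as TrySetValueById(WebBrowser1, "user_login", "myname"), ideally from the browser’s DocumentCompleted event handler so the page has finished loading first.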
The other way to set the value of an input box with VB is to loop through the HTML collection of inputs and find the one you need based on an attribute value. The following code chunk should be put in your black book of code tricks as you’ll be using it a lot if HTML parsing is something you do often:
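Dim theElementCollection As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("input")
For Each curElement As HtmlElement In theElementCollection
    ' Only touch the input we are after; here, the one named "log" from the sample HTML above
    If curElement.GetAttribute("name") = "log" Then
        curElement.SetAttribute("value", "New Value")
    End If
Next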
Without getting into the details, the above code merely gets all the elements with the tag “input” and stores them in an “HTML Element Collection”. This allows us to then loop through this collection of “inputs” and do what we’d like with each one. Here are a couple of ways to get different tags:
To get all hyperlinks:
To get all inputs:
To get all divs:
To get all spans:
To get all images:
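The snippets for the list above are all the same call with a different tag name. As a rough sketch (the module and Sub names are made up for this example, and the counts are only written out to show the collections were filled):

```vb
Imports System.Diagnostics
Imports System.Windows.Forms

Module TagCollections
    ' Hypothetical helper: one GetElementsByTagName call per tag in the list above
    Public Sub CountCommonTags(browser As WebBrowser)
        Dim doc As HtmlDocument = browser.Document
        If doc Is Nothing Then Return ' no page loaded yet

        Dim links As HtmlElementCollection = doc.GetElementsByTagName("a")       ' hyperlinks
        Dim inputs As HtmlElementCollection = doc.GetElementsByTagName("input")  ' inputs
        Dim divs As HtmlElementCollection = doc.GetElementsByTagName("div")      ' divs
        Dim spans As HtmlElementCollection = doc.GetElementsByTagName("span")    ' spans
        Dim images As HtmlElementCollection = doc.GetElementsByTagName("img")    ' images

        Debug.WriteLine(String.Format("a={0} input={1} div={2} span={3} img={4}",
                                      links.Count, inputs.Count, divs.Count, spans.Count, images.Count))
    End Sub
End Module
```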
The For Each loop then walks through the collection, and for each element (curElement) you have the aforementioned methods available to get or do what you need. Using .SetAttribute allows you to set the value of any attribute for that element, while .GetAttribute allows you to retrieve the value of any attribute. In addition to retrieving attribute values, VB also allows you to fetch other things like the .InnerHTML (the HTML between the element’s tags), the .InnerText (the text between the element’s tags), .OuterHTML (the element’s HTML including its own opening and closing tags), and .OuterText (the element’s text; when read it matches .InnerText, and the difference only shows when you assign to it, which replaces the whole element).
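Since getting the link text from a hyperlink is one of the most common requests, here is a small sketch that combines the loop with .GetAttribute and .InnerText. It collects the visible text of every hyperlink whose href contains a given fragment. The names LinkReader, GetLinkTexts, and hrefFragment are hypothetical, and matching on the href is just one way you might pick out the links you care about:

```vb
Imports System
Imports System.Collections.Generic
Imports System.Windows.Forms

Module LinkReader
    ' Hypothetical helper: returns the text of each hyperlink whose href contains hrefFragment
    Public Function GetLinkTexts(browser As WebBrowser, hrefFragment As String) As List(Of String)
        Dim results As New List(Of String)
        If browser.Document Is Nothing Then Return results ' no page loaded yet

        For Each link As HtmlElement In browser.Document.GetElementsByTagName("a")
            Dim href As String = link.GetAttribute("href")
            Dim text As String = link.InnerText

            ' Skip anchors without an href or without any text (guarding against Nothing)
            If String.IsNullOrEmpty(href) OrElse String.IsNullOrEmpty(text) Then Continue For

            ' Case-insensitive match on the link target
            If href.IndexOf(hrefFragment, StringComparison.OrdinalIgnoreCase) >= 0 Then
                results.Add(text.Trim())
            End If
        Next

        Return results
    End Function
End Module
```

For example, GetLinkTexts(WebBrowser1, "/category/") would give you the text of every link pointing at a URL containing "/category/" (that path is only an illustration).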
Clicking an HTML element such as a button or hyperlink:
The HTML element we’ll most commonly be using this for is the “input” button, which will usually have an attribute of “type”. When looking to click a button, the attribute “type” will usually have a value of “submit”. That is the one we want!
Pop Quiz: Question: How many ways are there to do this? Answer: 2!
We can address the input button by ID if it is provided in the HTML code, or we can loop through the collection of input elements. If we have to take the loop route, what we would do is test the .GetAttribute("type") value to see if it is equal to “submit”. If it is, then we’ll “click” it. Here’s how that would look:
Dim theElementCollection As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("input")
For Each curElement As HtmlElement In theElementCollection
    If curElement.GetAttribute("type").ToLower = "submit" Then
        curElement.InvokeMember("click")
    End If
Next
We call the .InvokeMember method on the HTML element which basically translates to “perform the following action on this element”. In our case, the action is to “click” it. This works for input buttons, hyperlinks, images, or anything else that you would be able to click normally with a mouse!
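For the other route, clicking by ID, a minimal sketch might look like the following. As before, ClickHelpers and TryClickById are names I made up for the example, and the checks for Nothing are there because the element may not exist on the page you are working with:

```vb
Imports System.Windows.Forms

Module ClickHelpers
    ' Hypothetical helper: clicks the element with the given ID, if it exists
    Public Function TryClickById(browser As WebBrowser, elementId As String) As Boolean
        If browser.Document Is Nothing Then Return False ' no page loaded yet

        Dim target As HtmlElement = browser.Document.GetElementById(elementId)
        If target Is Nothing Then Return False ' no element with that ID

        ' Same call as in the loop version, just on a single known element
        target.InvokeMember("click")
        Return True
    End Function
End Module
```

With the sample login form from the start of this post, that would be TryClickById(WebBrowser1, "wp-submit").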
While this isn’t the most in-depth look at HTML automation, hopefully it will give you a rough idea of the procedures used most commonly to set an HTML field’s value, or retrieve a particular value from the HTML. I make use of this “go to” HTML loop in my Scraper class to make it even easier to use!