HTML Parsing in C# using HTML Agility Pack

Today I will introduce a very useful library for parsing HTML in C#. Its main strengths are that it works in almost the same way as the standard XML DOM parser in .NET and that it is quite tolerant of the malformed HTML found in real-world web pages.

Download the latest version of HTML Agility Pack from the following location:

HTML Agility Pack

Before using HTML Agility Pack you should have some knowledge of XPath. XPath is used to navigate to and select nodes within an XML document. The XPath specification defines functions and expressions that help you access different kinds of XML nodes. HTML Agility Pack uses XPath to access any node within an HTML document.

Expression   Description
nodename     Selects all child elements of the current node that have the given name
/            Selects from the root node
//           Selects matching nodes at any depth, no matter where they are in the document
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes

These are some of the operators you can use to reach a specific node.
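To make the operators in the table concrete, here is a minimal sketch that evaluates one expression per operator. It deliberately uses the standard System.Xml XmlDocument class rather than HTML Agility Pack, so it runs without any extra package. The sample markup and the names XPathOperatorsDemo, SampleXml and Describe are made up for this illustration. The element-selecting expressions should behave the same way when passed to HTML Agility Pack's SelectNodes, although attribute handling may differ there.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathOperatorsDemo
{
    // Hypothetical, well-formed sample markup used only for this illustration.
    public const string SampleXml =
        "<html><head><title>Test</title></head>" +
        "<body><div id=\"main\"><span>one</span><span>two</span></div>" +
        "<a href=\"http://example.com\">link</a></body></html>";

    // Evaluates one XPath expression per operator and returns the results as text.
    public static List<string> Describe(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var results = new List<string>();

        // "/" : absolute path starting at the root node.
        results.Add(doc.SelectSingleNode("/html/head/title").InnerText);

        // "//" : matching nodes anywhere in the document.
        results.Add(doc.SelectNodes("//span").Count.ToString());

        // nodename : child elements of the current node (here the div).
        XmlNode div = doc.SelectSingleNode("//div");
        results.Add(div.SelectNodes("span").Count.ToString());

        // "." : the current node itself.
        results.Add(div.SelectSingleNode(".").Name);

        // ".." : the parent of the current node.
        results.Add(div.SelectSingleNode("..").Name);

        // "@" : an attribute of the selected element.
        results.Add(doc.SelectSingleNode("//a/@href").Value);

        return results;
    }

    public static void Main()
    {
        foreach (string line in Describe(SampleXml))
        {
            Console.WriteLine(line);
        }
    }
}
```

Each expression is evaluated relative to the node it is called on, which is why "span", "." and ".." are run against the div node rather than the document.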

You can find more about XPath at the following link:

XPath Syntax

As an example, consider the following HTML:

<html>
<head>
<title>Test Html</title>
</head>
<body>
<span> This is span 1 </span>
<span> This is span 2 </span>
<span> This is span 3 </span>
</body>
</html>

The following simple code gives you access to all the span elements in the above HTML.

// Requires: using System; using System.IO; using HtmlAgilityPack;
try
{
      HtmlDocument HD = new HtmlDocument();
      // rawHTML contains the above HTML.
      HD.Load(new StringReader(rawHTML));
      // See the XPath table above: the // operator returns all nodes
      // matching the name that follows it, here every span node.
      var SpanNodes = HD.DocumentNode.SelectNodes("//span");
      // SelectNodes returns null when nothing matches.
      if (SpanNodes != null)
      {
           foreach (HtmlNode SN in SpanNodes)
           {
                // InnerText is read from the span itself, so an empty span
                // (with no child node) does not cause a null reference.
                string text = SN.InnerText.Trim();
                Console.WriteLine(text); // Outputs each span's text
           }
      }
}
catch (Exception e)
{
      // Write your exception handling code here
}
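A common variation is to keep the extracted text instead of printing it. The following is a minimal sketch, assuming the HtmlAgilityPack package is referenced by the project. The names SpanExtractor and GetSpanTexts are hypothetical and chosen only for this illustration. It uses LoadHtml to parse directly from a string, and it collects the results in a list that can be converted to an array if needed.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SpanExtractor
{
    // Returns the trimmed text of every span element in the given HTML.
    public static List<string> GetSpanTexts(string rawHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(rawHtml); // Parse directly from a string.

        var texts = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");
        if (spans == null)
        {
            return texts;
        }

        foreach (HtmlNode span in spans)
        {
            texts.Add(span.InnerText.Trim());
        }
        return texts;
    }

    public static void Main()
    {
        // Hypothetical sample input, shortened from the HTML shown earlier.
        string html =
            "<html><body>" +
            "<span> This is span 1 </span>" +
            "<span> This is span 2 </span>" +
            "</body></html>";

        string[] asArray = GetSpanTexts(html).ToArray();
        Console.WriteLine(string.Join(", ", asArray));
    }
}
```

A list is used while collecting because the number of matches is not known up front; ToArray then produces a plain string array.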

I hope you will find this library very useful in your .NET applications. Do leave a comment if you want to add something or need any help.

Comments

0 #6 Hard Code 2013-10-24 07:33
Quoting Ndlovu:
Hi, if the website has a list that goes on for more than one page, can you automate it so that it continues until no more pages are found?

I think you need to write a crawler; on each page fetch, you can use the above code to parse the HTML.
 
0 #5 Hard Code 2013-10-24 07:31
Quoting scarecrow12w:
Instead of displaying it, can we store it in an array?

Yes, I think so. Declare a string array and, inside the foreach loop, store the InnerText of each span (DOM element) in it.
 
0 #4 scarecrow12w 2013-09-09 12:45
Instead of displaying it, can we store it in an array?
 
0 #3 Ndlovu 2013-03-16 21:10
Hi, if the website has a list that goes on for more than one page, can you automate it so that it continues until no more pages are found?
 
+10 #2 Hamad 2012-10-17 10:56
I am getting an error on HeadingNodes. It says "HeadingNodes doesn't exist".
 
0 #1 nam 2012-06-05 15:35
I have a small project and would like your paid help with it.
Please contact me on Skype: nam.truongthanh