Harvest Links From Within HREF Tags Using .NET Regular Expressions

12.02.2009
Use the first regex to parse the source of a page and retrieve all the A tags. The second extracts the actual contents of the HREF attribute from each tag. Below both of these is some sample VB.NET code as an example of usage.

This will work on pretty much any HREF value, regardless of how badly formed it is. Examples of what it will work with:

<a href="wibble" onclick="dothis()"> - A nicely formed A tag
<a href=wibble onclick=dothis()> - Missing quotes, but will still work!
<a href='wibble' class="test"> - Single quotes, no problem
<a href=this has spaces id=test> - No quotes, spaces in url, still not a problem \O/

The only thing it will not work on is an unquoted HREF value followed by an attribute that is not in the regex's list.

<a href=wibble for=myvalue> - Custom attributes will break it, IF quotes are not used: the match runs on to the closing > and returns "wibble for=myvalue". You can easily change the regex to include custom attributes, as and when you encounter them.
<a href="wibble" for=myvalue> - For example, this will work just fine
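One way to make adding custom attributes a one-line change is to build the terminator part of the pattern from a list of attribute names. The following is a minimal sketch, not part of the original snippet: BuildHrefPattern and ExtractHref are hypothetical helper names, and the attribute lists are shortened for illustration.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module CustomAttributeDemo
    ' Hypothetical helper: builds the href pattern so that an unquoted value
    ' stops at any of the given attribute names (or at the closing >).
    Function BuildHrefPattern(ByVal attributeNames As String()) As String
        Dim terminators As String() = attributeNames.Select(Function(attr) " " & Regex.Escape(attr) & " ?=").ToArray()
        Return "href ?=+ ?(?:(?:(?:""|')(.+?)(?:""|'))|(.+?)(?:" & String.Join("|", terminators) & "|>))"
    End Function

    ' Hypothetical helper: group 1 holds a quoted value, group 2 an unquoted one.
    Function ExtractHref(ByVal aTag As String, ByVal pattern As String) As String
        Dim m As Match = Regex.Match(aTag, pattern, RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        Return If(m.Groups(1).Success, m.Groups(1).Value, m.Groups(2).Value)
    End Function

    Sub Main()
        Dim withoutFor As String = BuildHrefPattern(New String() {"class", "onclick", "id"})
        Dim withFor As String = BuildHrefPattern(New String() {"class", "onclick", "id", "for"})

        ' "for" is unknown to the first pattern, so the value runs on to the closing >
        Console.WriteLine(ExtractHref("<a href=wibble for=myvalue>", withoutFor)) ' prints: wibble for=myvalue
        ' Once "for" is in the list, the value stops where it should
        Console.WriteLine(ExtractHref("<a href=wibble for=myvalue>", withFor))    ' prints: wibble
    End Sub
End Module
```

Keeping the names in a list means each newly encountered attribute is added in one place instead of by hand-editing the long alternation.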

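Please note the regex patterns are in a neutral format. Make the relevant changes to the quotes for VB (write each " inside the string literal as "") and C# (escape each as \", or double it inside a verbatim @"..." string).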

findATags = <A.*?>

ExtractHrefs = (?:href) ?=+ ?(?:(?:(?:"|')(.+?)(?:"|'))|(.+?)(?: class ?=| onclick ?=| id ?=| accesskey ?=| dir ?=| ltr ?=| lang ?=| style ?=| tabindex ?=| title ?=| onblur ?=| ondblclick ?=| onfocus ?=| onmousedown ?=| onmousemove ?=| onmouseout ?=| onmouseover ?=| onmouseup ?=| onkeydown ?=| onkeypress ?=| onkeyup ?=|>))
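As a quick check of the ExtractHrefs pattern, the sketch below writes it as a VB string literal (with each " doubled) and runs it against the four example tags listed earlier. The module name and the HrefOf helper are hypothetical, introduced only for this illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ExtractHrefsDemo
    ' The ExtractHrefs pattern from the post, split across lines for readability.
    Const ExtractHrefs As String = _
        "(?:href) ?=+ ?(?:(?:(?:""|')(.+?)(?:""|'))|(.+?)(?:" & _
        " class ?=| onclick ?=| id ?=| accesskey ?=| dir ?=| ltr ?=| lang ?=| style ?=|" & _
        " tabindex ?=| title ?=| onblur ?=| ondblclick ?=| onfocus ?=| onmousedown ?=|" & _
        " onmousemove ?=| onmouseout ?=| onmouseover ?=| onmouseup ?=| onkeydown ?=|" & _
        " onkeypress ?=| onkeyup ?=|>))"

    ' Hypothetical helper: returns the href value of a single A tag, or "" if none is found.
    Function HrefOf(ByVal aTag As String) As String
        Dim m As Match = Regex.Match(aTag, ExtractHrefs, RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        ' Group 1 is filled for quoted values, group 2 for unquoted ones
        Return If(m.Groups(1).Success, m.Groups(1).Value, m.Groups(2).Value)
    End Function

    Sub Main()
        Console.WriteLine(HrefOf("<a href=""wibble"" onclick=""dothis()"">")) ' prints: wibble
        Console.WriteLine(HrefOf("<a href=wibble onclick=dothis()>"))         ' prints: wibble
        Console.WriteLine(HrefOf("<a href='wibble' class=""test"">"))         ' prints: wibble
        Console.WriteLine(HrefOf("<a href=this has spaces id=test>"))         ' prints: this has spaces
    End Sub
End Module
```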

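'Example code

Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Module LinkScanner
     'Hrefs collected by the most recent call to checkPage
     Public ReadOnly foundHrefs As New List(Of String)

     'Regex.Replace is used only to call scanLinks once per A tag; the returned text is discarded.
     'IgnoreCase replaces the original UCase() call, which would also have upper-cased the URLs.
     Public Sub checkPage(ByVal pagecontents As String)
          foundHrefs.Clear()
          Regex.Replace(pagecontents, "<A.*?>", AddressOf scanLinks, RegexOptions.Singleline Or RegexOptions.IgnoreCase)
     End Sub

     'MatchEvaluator callback: receives each matched A tag and returns it unchanged
     Private Function scanLinks(ByVal aTag As Match) As String
          Dim href As String = parseATag(aTag.Value)
          If href <> "" Then foundHrefs.Add(href)
          Return aTag.Value
     End Function

     'Returns the HREF value of a single A tag, or "" if the tag has none
     Private Function parseATag(ByVal aTag As String) As String
          Dim result As String
          Dim m2 As Match = Regex.Match(aTag, "(?:href) ?=+ ?(?:(?:(?:""|')(.+?)(?:""|'))|(.+?)(?: class ?=| onclick ?=| id ?=| accesskey ?=| dir ?=| ltr ?=| lang ?=| style ?=| tabindex ?=| title ?=| onblur ?=| ondblclick ?=| onfocus ?=| onmousedown ?=| onmousemove ?=| onmouseout ?=| onmouseover ?=| onmouseup ?=| onkeydown ?=| onkeypress ?=| onkeyup ?=|>))", RegexOptions.Singleline Or RegexOptions.IgnoreCase)
          If m2.Groups(1).Value = "" Then
                result = m2.Groups(2).Value
          Else
                result = m2.Groups(1).Value
          End If
          Return result
     End Function
End Module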
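
If all you need is the list of links, the two stages can also be run with Regex.Matches instead of Regex.Replace. This is only a sketch of that alternative: HarvestLinks, the sample page and the shortened terminator list are assumptions made for illustration, not part of the original snippet.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module HarvestLinksDemo
    ' Shortened terminator list, for illustration only; use the full pattern in practice.
    Const HrefPattern As String = "href ?=+ ?(?:(?:(?:""|')(.+?)(?:""|'))|(.+?)(?: class ?=| id ?=|>))"

    ' Hypothetical helper: returns every href value found in the page source.
    Function HarvestLinks(ByVal pageContents As String) As List(Of String)
        Dim links As New List(Of String)
        Dim options As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase
        ' Stage 1: find every A tag (this also picks up tags such as <abbr>)
        For Each tag As Match In Regex.Matches(pageContents, "<A.*?>", options)
            ' Stage 2: pull the href value out of the tag; tags without one are skipped
            Dim m As Match = Regex.Match(tag.Value, HrefPattern, options)
            If m.Success Then
                links.Add(If(m.Groups(1).Success, m.Groups(1).Value, m.Groups(2).Value))
            End If
        Next
        Return links
    End Function

    Sub Main()
        Dim page As String = "<p><a href=""/Home"">Home</a> <A HREF=about.html id=x>About</A> <abbr>n/a</abbr></p>"
        For Each link As String In HarvestLinks(page)
            Console.WriteLine(link)
        Next
        ' prints:
        ' /Home
        ' about.html
    End Sub
End Module
```

Matching with RegexOptions.IgnoreCase rather than upper-casing the page first keeps the original case of each URL, which matters on case-sensitive servers.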