Harvest Links From Within HREF Tags Using .NET Regular Expressions
Use the first regex to pull every A tag out of a page's source. The second extracts the HREF value from each tag. Below both is some sample VB.NET code showing how to use them.
This will work on pretty much any HREF value, regardless of how badly formed it is. Examples of what it will work with:
<a href="wibble" onclick="dothis()"> - A nicely formed A tag
<a href=wibble onclick=dothis()> - Missing quotes, but will still work!
<a href='wibble' class="test"> - Single quotes, no problem
<a href=this has spaces id=test> - No quotes, spaces in url, still not a problem \O/
The only thing it will not work on, which is pretty out there, is an unquoted HREF value followed by an attribute that the pattern does not know about.
<a href=wibble for=myvalue> - Custom attributes will break it, IF quotes are not used: the match runs on to the closing > and returns "wibble for=myvalue". You can easily change the regex to include custom attributes as and when you encounter them.
<a href="wibble" for=myvalue> - For example, this will work just fine, because the value is quoted
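To illustrate the custom attribute case, here is a minimal sketch that rebuilds the ExtractHrefs pattern from a list of attribute names, so that an extra name such as "for" can be added as a terminator. The module name and the BuildHrefPattern and ExtractHref helpers are hypothetical names made up for this example; the attribute list and the regex are the ones from this snippet.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module HrefPatternDemo
    ' Attribute names that may follow an unquoted HREF value
    ' (the same list the ExtractHrefs pattern uses).
    Private ReadOnly KnownAttributes As String() = { _
        "class", "onclick", "id", "accesskey", "dir", "ltr", "lang", "style", _
        "tabindex", "title", "onblur", "ondblclick", "onfocus", "onmousedown", _
        "onmousemove", "onmouseout", "onmouseover", "onmouseup", "onkeydown", _
        "onkeypress", "onkeyup"}

    ' Hypothetical helper: rebuilds the pattern, optionally with extra attribute names.
    Public Function BuildHrefPattern(ByVal ParamArray extraAttributes As String()) As String
        Dim terminators As New List(Of String)
        For Each name As String In KnownAttributes
            terminators.Add(" " & name & " ?=")
        Next
        For Each name As String In extraAttributes
            terminators.Add(" " & Regex.Escape(name) & " ?=")
        Next
        terminators.Add(">")
        Return "(?:href) ?=+ ?(?:(?:(?:""|')(.+?)(?:""|'))|(.+?)(?:" & _
               String.Join("|", terminators.ToArray()) & "))"
    End Function

    ' Returns the quoted value (group 1) if there is one, otherwise the unquoted value (group 2).
    Public Function ExtractHref(ByVal aTag As String, ByVal pattern As String) As String
        Dim m As Match = Regex.Match(aTag, pattern, RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        If m.Groups(1).Success Then Return m.Groups(1).Value
        Return m.Groups(2).Value
    End Function

    Sub Main()
        Dim tag As String = "<a href=wibble for=myvalue>"
        ' With the original attribute list the unquoted value runs on to the closing bracket.
        Console.WriteLine(ExtractHref(tag, BuildHrefPattern()))       ' wibble for=myvalue
        ' Adding "for" as a terminator stops the capture at the custom attribute.
        Console.WriteLine(ExtractHref(tag, BuildHrefPattern("for")))  ' wibble
    End Sub
End Module
```

Called with no arguments, BuildHrefPattern produces the same pattern as ExtractHrefs, so the well-formed examples above behave the same either way.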
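Please note the regex patterns below are written in a neutral format, with no language-specific escaping. Make the relevant changes to the quotes for your language: in VB.NET, double each double quote inside the string literal (""); in C#, escape it as \" or double it inside a verbatim @"..." string.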
findATags = <A.*?>
ExtractHrefs = (?:href) ?=+ ?(?:(?:(?:"|')(.+?)(?:"|'))|(.+?)(?: class ?=| onclick ?=| id ?=| accesskey ?=| dir ?=| ltr ?=| lang ?=| style ?=| tabindex ?=| title ?=| onblur ?=| ondblclick ?=| onfocus ?=| onmousedown ?=| onmousemove ?=| onmouseout ?=| onmouseover ?=| onmouseup ?=| onkeydown ?=| onkeypress ?=| onkeyup ?=|>))
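'Example code
Imports System.Text.RegularExpressions

Public Sub checkPage(ByVal pageContents As String)
    'IgnoreCase is used instead of UCase, which would also upper-case every URL
    Dim myText As String = Regex.Replace(pageContents, "<A.*?>", AddressOf scanLinks, RegexOptions.Singleline Or RegexOptions.IgnoreCase)
End Sub
'MatchEvaluator callback: called once for each A tag found by checkPage
Public Function scanLinks(ByVal aTag As Match) As String
    Dim href As String = parseATag(aTag.Value)
    'Use href here (store it, log it, and so on)
    Return aTag.Value 'Leave the tag unchanged in the output
End Function
Private Function parseATag(ByVal aTag As String) As String
    Dim result As String
    Dim m2 As Match = Regex.Match(aTag, "(?:href) ?=+ ?(?:(?:(?:""|')(.+?)(?:""|'))|(.+?)(?: class ?=| onclick ?=| id ?=| accesskey ?=| dir ?=| ltr ?=| lang ?=| style ?=| tabindex ?=| title ?=| onblur ?=| ondblclick ?=| onfocus ?=| onmousedown ?=| onmousemove ?=| onmouseout ?=| onmouseover ?=| onmouseup ?=| onkeydown ?=| onkeypress ?=| onkeyup ?=|>))", RegexOptions.Singleline Or RegexOptions.IgnoreCase)
    'Group 1 holds a quoted value, group 2 an unquoted one; both are empty if the tag has no HREF
    If m2.Groups(1).Value = "" Then
        result = m2.Groups(2).Value
    Else
        result = m2.Groups(1).Value
    End If
    Return result
End Function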
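If all you want is the list of links, one possible alternative is to loop over Regex.Matches rather than going through a Replace callback. The following is only a sketch under that assumption: the LinkHarvester module, the HarvestLinks function and the sample HTML are made up for illustration, while the two patterns are the ones given above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkHarvester
    ' The two patterns from this snippet, with quotes doubled for VB string literals.
    Private Const FindATags As String = "<A.*?>"
    Private Const ExtractHrefs As String = _
        "(?:href) ?=+ ?(?:(?:(?:""|')(.+?)(?:""|'))|(.+?)(?: class ?=| onclick ?=| id ?=" & _
        "| accesskey ?=| dir ?=| ltr ?=| lang ?=| style ?=| tabindex ?=| title ?=| onblur ?=" & _
        "| ondblclick ?=| onfocus ?=| onmousedown ?=| onmousemove ?=| onmouseout ?=" & _
        "| onmouseover ?=| onmouseup ?=| onkeydown ?=| onkeypress ?=| onkeyup ?=|>))"

    ' Hypothetical helper: returns every HREF value found in the page source, in document order.
    Public Function HarvestLinks(ByVal pageContents As String) As List(Of String)
        Dim links As New List(Of String)
        Dim options As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase
        For Each tag As Match In Regex.Matches(pageContents, FindATags, options)
            Dim m As Match = Regex.Match(tag.Value, ExtractHrefs, options)
            ' Skip tags without an HREF, such as named anchors.
            If Not m.Success Then Continue For
            ' Group 1 is a quoted value, group 2 an unquoted one.
            links.Add(If(m.Groups(1).Success, m.Groups(1).Value, m.Groups(2).Value))
        Next
        Return links
    End Function

    Sub Main()
        Dim html As String = "<p><a href=""one.html"">One</a> <A HREF=two.html id=x>Two</A> <a name=top></a></p>"
        ' Prints one.html, then two.html; the named anchor has no HREF and is skipped.
        For Each link As String In HarvestLinks(html)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

Because the page is matched with RegexOptions.IgnoreCase instead of being upper-cased first, the harvested URLs keep their original case.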





