Snippets has posted 5883 posts at DZone.

HTML Parser - Grabs The Link URLs + Link Texts From A Web Page And Put Them Into An Array

04.10.2012
<?php
function parse_links($document) {

  # Zero or more whitespace characters
  $S0 = '\s*';

  # One or more whitespace characters
  $S1 = '\s+';

  # Anchor tag start
  $anch1 = '<a' . $S1;

  # href= pattern
  $href1 = 'href' . $S0 . '=' . $S0;

  # quoted strings, with selection
  $q1 = "'[^']*'";
  $q2 = '"[^"]*"';
  $q = "($q1|$q2)";

  # full link pattern
  $link_RE = "$anch1$href1$q$S0>\s*(.*?)</a>";


  preg_match_all("#$link_RE#i", $document, $matches);
  // $matches[0] holds the full anchor tags, $matches[1] the quoted URLs,
  // and $matches[2] the link texts
  return $matches;

} // end function parse_links()

//
// DEMO OF HOW TO USE THE FUNCTION

// grab a webpage
$str = file_get_contents('http://del.icio.us');

// call the parse_links function
$linkarray=parse_links($str);

// loop through the link array, outputting the URL + link text
// ($linkarray[1] holds the quoted URLs, $linkarray[2] the link texts)
for ($i = 0; $i < sizeof($linkarray[0]); $i++)
    echo ($linkarray[1][$i] . ' ' . $linkarray[2][$i] . "<br>");

?>
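
To see what the function returns without fetching a live page, here is a minimal sketch that writes the assembled pattern out in one piece and runs it on a small sample string. The sample document and the quote-stripping step are illustrative assumptions, not part of the original snippet, and the single-quoted branch is assumed to be meant as '[^']*'.

```php
<?php
// The pattern parse_links() builds, written as one literal
// (assuming the single-quoted branch is '[^']*').
$link_RE = '<a\s+href\s*=\s*(\'[^\']*\'|"[^"]*")\s*>\s*(.*?)</a>';

// Hypothetical sample document, used instead of a fetched web page.
$document = '<p><a href="http://example.com/">Example</a> and '
          . "<A HREF='http://example.org/'>Other <b>site</b></A></p>";

preg_match_all("#$link_RE#i", $document, $matches);

// $matches[1] holds the URLs including their quotes, $matches[2] the link texts.
// Stripping the surrounding quotes is an extra step added for this sketch.
$urls  = array_map(function ($u) { return trim($u, '\'"'); }, $matches[1]);
$texts = $matches[2];

// Print the links as a URL => link text array.
print_r(array_combine($urls, $texts));
```

Note that the pattern only matches anchors where href is the first and only attribute, so tags such as <a class="x" href="..."> are skipped.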
    

Comments

Snippets Manager replied on Mon, 2007/05/14 - 11:57am

What do the #'s and the trailing i mean in this expression: #$link_RE#i? I can't find any documentation on how or why it is written like that and not just $link_RE.
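
As a short sketch of what that expression does: PHP's preg_* functions expect the pattern to be wrapped in delimiters, and here # is the delimiter, presumably chosen over the more common / so that the slash in </a> needs no escaping. The i after the closing delimiter is the case-insensitive modifier. The sample string below is made up for illustration.

```php
<?php
// Hypothetical sample input, not taken from the original snippet.
$html = '<A HREF="http://example.com">Example</A>';

// '#' opens and closes the pattern; the 'i' after the closing '#'
// makes the match case-insensitive.
$with_i    = preg_match('#<a\s+href#i', $html); // matches despite the upper case
$without_i = preg_match('#<a\s+href#', $html);  // case-sensitive, so no match

// With '/' as the delimiter, the slash in '</a>' has to be escaped.
$hash_delim  = preg_match('#</a>#i', $html);
$slash_delim = preg_match('/<\/a>/i', $html);

var_dump($with_i, $without_i, $hash_delim, $slash_delim);
// prints int(1), int(0), int(1), int(1)
```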