Space-Separated Tag Parser

03.03.2006
Here is a function that accepts a string containing tags and returns an array of extracted tags. (Updated to ignore duplicates.)
/**
 * Parses a String of Tags
 *
 * Tags are space delimited. Either single or double quotes mark a phrase.
 * An unmatched quote causes everything to its right to be treated as one
 * single tag or phrase. All white-space within a phrase is converted to
 * single space characters. Quotes buried within tags are ignored! Tags are
 * converted to lower case, and duplicate tags are ignored, even duplicate
 * phrases that are equivalent.
 *
 * Returns an array of tags.
 */
function ParseTagString($sTagString)
{
	$arTags = array();		// Array of Output
	$cPhraseQuote = null;	// Record of the quote that opened the current phrase
	$sPhrase = null;		// Temp storage for the current phrase we are building
	
	// Define some constants
	static $sTokens = " \r\n\t";	// Space, Return, Newline, Tab
	static $sQuotes = "'\"";		// Single and Double Quotes
	
	// Start the State Machine
	do
	{
		// Get the next token, which may be the first
		$sToken = isset($sToken)? strtok($sTokens) : strtok($sTagString, $sTokens);
		
		// Are there more tokens?
		if ($sToken === false)
		{
			// Ensure that the last phrase is marked as ended
			$cPhraseQuote = null;
		}
		else
		{		
			// Are we within a phrase or not?
			if ($cPhraseQuote !== null)
			{
				// Will the current token end the phrase?
				if (substr($sToken, -1, 1) === $cPhraseQuote)
				{
					// Trim the last character and add to the current phrase, with a single leading space if necessary
					if (strlen($sToken) > 1) $sPhrase .= ((strlen($sPhrase) > 0)? ' ' : null) . substr($sToken, 0, -1);
					$cPhraseQuote = null;
				}
				else
				{
					// If not, add the token to the phrase, with a single leading space if necessary
					$sPhrase .= ((strlen($sPhrase) > 0)? ' ' : null) . $sToken;
				}
			}
			else
			{
				// Will the current token start a phrase?
				if (strpos($sQuotes, $sToken[0]) !== false)
				{
					// Will the current token end the phrase?
					if ((strlen($sToken) > 1) && ($sToken[0] === substr($sToken, -1, 1)))
					{
						// The current token begins AND ends the phrase, trim the quotes
						$sPhrase = substr($sToken, 1, -1);
					}
					else
					{
						// Remove the leading quote
						$sPhrase = substr($sToken, 1);
						$cPhraseQuote = $sToken[0];
					}
				}
				else
					$sPhrase = $sToken;
			}
		}
		
		// If, at this point, we are not within a phrase, the prepared phrase is complete and can be added to the array
		if (($cPhraseQuote === null) && ($sPhrase != null))
		{
			$sPhrase = strtolower($sPhrase);
			if (!in_array($sPhrase, $arTags)) $arTags[] = $sPhrase;
			$sPhrase = null;
		}
	}
	while ($sToken !== false);	// Stop when we receive FALSE from strtok()
	return $arTags;
}
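
The parser relies on strtok() keeping its position between calls: the first call receives the string, and later calls receive only the delimiter list. The sketch below isolates that step to show how runs of white-space collapse before any quote handling takes place. SplitOnWhitespace() is a hypothetical helper written for this illustration, not part of the snippet above, and the scalar type declarations assume PHP 7 or later.

```php
<?php
// SplitOnWhitespace() is a hypothetical helper for illustration only.
function SplitOnWhitespace(string $sInput): array
{
    $sTokens = " \r\n\t";   // Same delimiter set as ParseTagString()
    $arTokens = array();

    // The first call passes the string; later calls pass only the delimiters
    $sToken = strtok($sInput, $sTokens);
    while ($sToken !== false)
    {
        $arTokens[] = $sToken;
        $sToken = strtok($sTokens);
    }
    return $arTokens;
}

// Consecutive delimiters never produce empty tokens
print_r(SplitOnWhitespace("redundant   white-space is\tremoved"));
```

Because strtok() keeps a single internal position, a second tokenization started inside the loop would disturb the first one, so it seems safest to finish one string before starting another.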

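A tag string can be rebuilt from the array with this reverse function. The result is normalized (lower case, duplicates removed, white-space collapsed), so it may differ from the original input: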
/**
 * Reverses ParseTagString()
 */
function CreateTagString($arTags)
{
	// Prepare each tag to be imploded
	for ($i = 0; $i < sizeof($arTags); $i++)
	{
		// Record findings
		$bContainsWhitespace = false;	// Was whitespace found?
		$cRequiredQuote = '"';			// Use double-quote by default
		$cLastChar = null;
	
		// Search the tag
		for ($j = 0; $j < strlen($arTags[$i]); $j++)
		{
			$c = $arTags[$i][$j];
			
			// If the current character is a space
			if ($c === ' ')
			{
				$bContainsWhitespace = true;
				
				// If the previous char was a double quote, we require single quotes round our phrase
				if ($cLastChar === '"')
				{
					$cRequiredQuote = "'";
					break;	// There is no point in continuing our search, we can't handle double-mixed quotes
				}
			}
			
			// Record this char as the last char
			$cLastChar = $c;
		}
		
		// Quote if necessary
		if ($bContainsWhitespace) $arTags[$i] = $cRequiredQuote . $arTags[$i] . $cRequiredQuote;
	}
	return implode(' ', $arTags);
}
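
The quote-selection rule in the inner loop can be stated more compactly: a tag is quoted only if it contains a space, and single quotes are chosen only if a double quote sits directly before a space, since that sequence would end a double-quoted phrase early when parsed again. The sketch below expresses the same decision for one tag with strpos(). QuoteTag() is a hypothetical helper, not part of the snippet above, and it assumes PHP 7 or later for the type declarations.

```php
<?php
// QuoteTag() is a hypothetical helper for illustration only.
function QuoteTag(string $sTag): string
{
    // Tags without a space need no quoting
    if (strpos($sTag, ' ') === false)
    {
        return $sTag;
    }

    // A double quote directly before a space would close a double-quoted
    // phrase too early on re-parsing, so fall back to single quotes
    $cQuote = (strpos($sTag, '" ') !== false) ? "'" : '"';
    return $cQuote . $sTag . $cQuote;
}

echo QuoteTag('php'), "\n";                 // php
echo QuoteTag('multi-word phrases'), "\n";  // "multi-word phrases"
echo QuoteTag('say "hi" loudly'), "\n";     // 'say "hi" loudly'
```

Like the loop above, this sketch does not handle a tag that mixes both problem cases (a double quote before a space and a single quote before a space); such a tag cannot be quoted safely with either character.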

To test the whole system, use the following array of test cases:
$arTestInputs = array(
	"this test ensures that words are correctly split",
	"in this test \"phrases\" and \"multi-word phrases\" are tested",
	"this test shows the behaviour if an \"odd quote is detected",
	"this test shows that 'different quotes' work too",
	"but mixed quotes fail: \"test phrase' does not stop on the quote",
	"which can be usefull in some cases where \"the systems' requirements\" state that it is necessary",
	"quotes need not be attached to \" their phrase \"",
	"embedded\"quotes are ignored!",
	"this is also usefull and demonstrates the system's coolness",
	"redundant   white-space is   removed from \"  tags    and phrases\"",
	"\"\"double quotes\"\" will result in single quotes!",
	"remember that 'double-quotes\" may be nested within single quotes'",
	"TaGs ArE NOT case SENsITiVE!",
	"a duplicate tag will be removed from the tag list",
	"even a \" complex phrase\" that is equivalent to another 'compleX   PHrASe   '"
);

foreach ($arTestInputs as $sTest)
{
	print ("<pre>$sTest</pre>");
	print "<pre>";
	print_r (ParseTagString($sTest));
	print "</pre>";
	print "<pre>";
	print CreateTagString(ParseTagString($sTest));
	print "</pre>";
	print "<hr />";
}

2006-03-09 0.1.0 - 0.2.0 Duplicate phrases are now ignored.

-- 
Version 0.2.0 - 2006-03-09
STEM: The STEM Cells of PHP
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License
http://creativecommons.org/licenses/by-sa/2.5/    

Comments

Snippets Manager replied on Sun, 2008/04/06 - 8:42pm

This is great -- Thanks for posting.

Snippets Manager replied on Wed, 2006/02/22 - 9:58am

nice job!