

Cleaning Strings With Regular Expressions

11.25.2008
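Firstly, to get rid of all non-ASCII characters. Note that the range \x20-\x7E covers printable ASCII only, so tabs and newlines are removed as well.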

=> text = "Normal ©®»λαβstring"
"Normal ©®»λαβstring"
=> stripped = text.gsub(/[^\x20-\x7E]/, '')
"Normal string"
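
If tabs and line breaks should survive the cleanup, the character class can be widened. The following is a minimal sketch, not part of the original snippet; the method name ascii_only and its keep_whitespace option are hypothetical.

```ruby
# Hypothetical helper: removes everything outside printable ASCII.
# With keep_whitespace (an assumed option), tab, LF and CR are left in place.
def ascii_only(text, keep_whitespace: true)
  pattern = keep_whitespace ? /[^\x20-\x7E\t\n\r]/ : /[^\x20-\x7E]/
  text.gsub(pattern, '')
end

puts ascii_only("Normal \u00A9\u00AE\u00BB\u03BB\u03B1\u03B2string")  # prints "Normal string"
```

Calling ascii_only(text, keep_whitespace: false) behaves like the one-liner above.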

Now let's get rid of HTML tags, optionally preserving a few of them.

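# strip html tags, keeping the tags named in preserve_tags
def strip_html(str, preserve_tags = ['p'])
  return '' unless str.is_a?(String)

  # Use a negative lookahead on whole tag names. A character class such as
  # [^(p|\/)] excludes single characters, so 'p' would also preserve <pre>.
  keep = preserve_tags.map { |tag| Regexp.escape(tag) }.join('|')
  skip = preserve_tags.empty? ? '' : "(?!/?\\s*(?:#{keep})\\b)"
  str.strip.gsub(/<#{skip}[^>]*>/, '')
end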

=> text = "<p>This is a <a href=\"http://www.example.com\">link</a> and a <span>span</span></p>"
"<p>This is a <a href=\"http://www.example.com\">link</a> and a <span>span</span></p>"
=> stripped = strip_html(text)
"<p>This is a link and a span</p>"
=> stripped = strip_html(text, [])
"This is a link and a span"
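
Tag names in real markup may be uppercase or carry attributes, and preserving p should not also preserve pre. The following is a minimal sketch of one way to cover those cases with a case-insensitive negative lookahead; the method name strip_tags_except is hypothetical, and a regex like this is only a rough filter, not an HTML parser.

```ruby
# Hypothetical helper: removes every tag except those named in keep.
def strip_tags_except(html, keep = [])
  return html.gsub(/<[^>]*>/, '') if keep.empty?

  names = keep.map { |tag| Regexp.escape(tag) }.join('|')
  # The lookahead rejects a match when the tag name (optionally preceded
  # by "/") is one of the kept names; \b stops "p" from matching "pre".
  html.gsub(/<(?!\/?\s*(?:#{names})\b)[^>]*>/i, '')
end

html = '<P class="x">Keep <pre>code</pre> and <b>bold</b></P>'
puts strip_tags_except(html, ['p'])  # prints <P class="x">Keep code and bold</P>
puts strip_tags_except(html)         # prints Keep code and bold
```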

Finally, let's compact whitespace so that at most one space remains between two words. Each run of whitespace is replaced with a single space, then the ends are trimmed.
=> text = " This is   some  text with strange           spacing patterns   "
" This is   some  text with strange           spacing patterns   "
=> stripped = text.gsub(/\s{2,}/, ' ').strip
"This is some text with strange spacing patterns"
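
Text scraped from pages often mixes tabs and line breaks with spaces. The following is a minimal sketch that collapses any run of whitespace to one space; the method name compact_whitespace is hypothetical. Ruby's built-in String#squeeze(' ') is shown for comparison, since it only collapses repeated space characters.

```ruby
# Hypothetical helper: collapses every run of whitespace (spaces, tabs,
# newlines) into a single space and trims both ends.
def compact_whitespace(text)
  text.gsub(/\s+/, ' ').strip
end

puts compact_whitespace(" This is   some  text with strange           spacing patterns   ")
# prints "This is some text with strange spacing patterns"

# squeeze(' ') leaves tabs and newlines untouched:
p "a  b\t\tc".squeeze(' ')  # prints "a b\t\tc"
```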