DZone Snippets is a public source code repository.

Snippets has posted 5883 posts at DZone.

Decode Html Entities

09.24.2007
Use like this:

   print decode_htmlentities("l&#39;eau")

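from htmlentitydefs import name2codepoint as n2cp
import re

# Compiled once at import time; the raw string keeps \d and \w intact.
entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

def substitute_entity(match):
    ent = match.group(2)
    if match.group(1) == "#":
        # Numeric reference such as &#39;. Leave the text untouched if it
        # is not a valid decimal code point (for example &#abc;).
        try:
            return unichr(int(ent))
        except ValueError:
            return match.group()
    # Named entity such as &amp;. Unknown names are left untouched.
    cp = n2cp.get(ent)
    if cp is None:
        return match.group()
    return unichr(cp)

def decode_htmlentities(string):
    return entity_re.sub(substitute_entity, string)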
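
For readers on Python 3, the snippet can be sketched roughly as follows. This is a port under stated assumptions, not the original author's code: `html.entities.name2codepoint` stands in for `htmlentitydefs.name2codepoint`, `chr` stands in for `unichr`, and malformed numeric references are left unchanged rather than raising. The sample string in the last line is illustrative.

```python
import re
from html.entities import name2codepoint as n2cp

# Same pattern as the snippet: an optional '#', then up to five digits
# or an entity name of up to eight word characters.
ENTITY_RE = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")


def substitute_entity(match):
    """Return the character for one matched entity, or the match unchanged."""
    ent = match.group(2)
    if match.group(1) == "#":
        try:
            return chr(int(ent))
        except ValueError:
            # Assumption: malformed references such as &#abc; stay as-is.
            return match.group()
    cp = n2cp.get(ent)
    return chr(cp) if cp is not None else match.group()


def decode_htmlentities(text):
    """Decode decimal and named entities in text."""
    return ENTITY_RE.sub(substitute_entity, text)


if __name__ == "__main__":
    print(decode_htmlentities("l&#39;eau &amp; caf&eacute;"))
```

Like the snippet, this sketch handles only decimal and named entities; hex references such as `&#xE0;` pass through unchanged.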
    

Comments

Snippets Manager replied on Wed, 2009/04/01 - 6:27pm

http://github.com/sku/python-twitter-ircbot/blob/321d94e0e40d0acc92f5bf57d126b57369da70de/html_decode.py Code complete with doctests.

Snippets Manager replied on Wed, 2009/04/01 - 6:27pm

&#x\w+?; rather, but yea. While we're at it we can simplify r'&(#?)(x?)(\d{1,5}|\w{1,8});' to r'&(#?)(x?)(\w+);' Here's my test:

    >>> from html_decode import decode_htmlentities
    >>> u = u'E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)'
    >>> print decode_htmlentities(u).encode('UTF-8')
    E tu vivrai nel terrore - L'aldilà (1981)
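
To see what the two patterns quoted in this comment actually capture, here is a small Python 3 sketch. The helper name `groups` and the sample inputs are illustrative assumptions, not part of the comment.

```python
import re

# The two patterns quoted in the comment.
LONG = re.compile(r"&(#?)(x?)(\d{1,5}|\w{1,8});")
SHORT = re.compile(r"&(#?)(x?)(\w+);")


def groups(pattern, text):
    """Return the captured groups if the whole text is one entity, else None."""
    m = pattern.fullmatch(text)
    return m.groups() if m else None


if __name__ == "__main__":
    for sample in ["&#224;", "&#xE0;", "&agrave;", "&xi;"]:
        print(sample, groups(LONG, sample), groups(SHORT, sample))
```

Both patterns capture the same groups for these inputs. One thing worth noticing: for a named entity that begins with x, such as `&xi;`, the optional `(x?)` group consumes the leading letter, so group 3 holds `i` rather than `xi`. A decoder built on either pattern would need to account for that case.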

Snippets Manager replied on Wed, 2009/04/01 - 6:27pm

This doesn't take into account hex encoded html entities: &x\w+?; Here is a revision:

from htmlentitydefs import name2codepoint as n2cp
import re

def substitute_entity(match):
    ent = match.group(3)
    if match.group(1) == "#":
        if match.group(2) == '':
            return unichr(int(ent))
        elif match.group(2) == 'x':
            return unichr(int('0x'+ent, 16))
    else:
        cp = n2cp.get(ent)
        if cp:
            return unichr(cp)
        else:
            return match.group()

def decode_htmlentities(string):
    entity_re = re.compile(r'&(#?)(x?)(\d{1,5}|\w{1,8});')
    return entity_re.subn(substitute_entity, string)[0]
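
The idea in this revision (decimal, hex, and named entities in one pass) might look like the following in Python 3. It is a sketch, not the commenter's code: it assumes `html.entities.name2codepoint` and `chr` in place of `htmlentitydefs` and `unichr`, and it swaps the `(#?)(x?)` groups for three explicit alternatives so that each kind of entity is matched separately. The helper name `_replace` is hypothetical.

```python
import re
from html.entities import name2codepoint

# Three alternatives: hex reference, decimal reference, named entity.
ENTITY_RE = re.compile(r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|(\w+));")


def _replace(match):
    """Decode one matched entity, or return it unchanged if it is invalid."""
    hex_digits, dec_digits, name = match.groups()
    try:
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if dec_digits is not None:
            return chr(int(dec_digits))
    except (ValueError, OverflowError):
        # Assumption: out-of-range code points are left as written.
        return match.group()
    cp = name2codepoint.get(name)
    return chr(cp) if cp is not None else match.group()


def decode_htmlentities(text):
    """Decode decimal, hex, and named HTML entities in text."""
    return ENTITY_RE.sub(_replace, text)


if __name__ == "__main__":
    print(decode_htmlentities("L&#x27;aldil&#xE0; &amp; &xi;"))
```

Keeping the alternatives separate means a named entity that starts with x, such as `&xi;`, is looked up under its full name. On Python 3.4 and later, the standard library's `html.unescape` is also available for this kind of decoding.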

Snippets Manager replied on Wed, 2007/10/31 - 11:42am

this works great man :) thanks a bunch!