Python Unescape Function: Unescape HTML Entities in Python

I must have spent about an hour hunting for this. I finally found it.

The unescape script further down is the one to use. ^^

Time to go home!

I'm not a programmer, at least not a real one. I do hack code sometimes, mostly Python, to process files for recurring tasks. In this case I had to read XHTML files and pass them to an XML parser (ElementTree).

This piece of code might be useful to someone, and there will certainly be people who think it was programmed with my feet and suggest fixes. I usually work in UTF-8, but my files sometimes contain numeric character references and named entities, so I wanted to convert everything to plain UTF-8 characters.
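To make the terminology concrete, here is a small sketch of the three spellings that can stand for the same character. It assumes Python 3 and uses the standard library's html.unescape, which is not part of the original post (the post's own code targets Python 2).

```python
import html

# Three spellings of the same character, U+00E9 (é):
# a decimal reference, a hexadecimal reference, and a named entity.
forms = ["&#233;", "&#xE9;", "&eacute;"]

decoded = [html.unescape(form) for form in forms]
print(decoded)  # ['é', 'é', 'é']
```

Note that html.unescape converts every reference it knows, including &amp;, &gt; and &lt;.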

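# Python 2 code
import sys
import xml.etree.ElementTree as ET

def parsefile(path):
   """Read a UTF-8 file, unescape its entities, and parse it as XML."""
   try:
      # "with" closes the file even if reading fails
      with open(path, "r") as infile:
         fileread = infile.read()
      # decode to unicode, convert the entities, encode back to UTF-8
      fileread = unescape(fileread.decode('utf-8')).encode('utf-8')
   except (IOError, UnicodeError):
      print "Reading File Bug"
      sys.exit(1)
   return ET.fromstring(fileread)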
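For readers on Python 3, the same reading step might look roughly like the sketch below. The name parse_file and its unescape parameter are made up for this example; the default argument leaves the text unchanged, so the block runs on its own, and any entity-converting function can be passed in instead.

```python
import sys
import xml.etree.ElementTree as ET


def parse_file(path, unescape=lambda text: text):
    """Read a UTF-8 file, run `unescape` over its text, and parse it as XML.

    `unescape` is a placeholder parameter (an assumption of this sketch);
    by default the text is passed through unchanged.
    """
    try:
        # Text mode with an explicit encoding decodes while reading.
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
    except (OSError, UnicodeDecodeError) as exc:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("Reading File Bug: {}".format(exc))
    # Encode back to UTF-8 bytes before parsing, as the original does.
    return ET.fromstring(unescape(text).encode("utf-8"))
```

Catching only OSError and UnicodeDecodeError, rather than everything, means a malformed document still surfaces as an ElementTree parse error instead of being reported as a read failure.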

The Unescape HTML Entities routine comes from Fredrik Lundh's website. As published, it did too much for my purposes, because it also converted &amp;, &gt; and &lt; into &, > and <. I wanted to keep those escaped in URLs and in my escaped code sections, so I slightly modified it for my own needs.

# Python 2 code
import re
import htmlentitydefs

def unescape(text):
   """Removes HTML or XML character references
      and entities from a text string.
      keep &amp;, &gt;, &lt; in the source code.
   from Fredrik Lundh
   http://effbot.org/zone/re-sub.htm#unescape-html
   """
   def fixup(m):
      text = m.group(0)
      if text[:2] == "&#":
         # numeric character reference, hexadecimal or decimal
         try:
            if text[:3] == "&#x":
               return unichr(int(text[3:-1], 16))
            else:
               return unichr(int(text[2:-1]))
         except ValueError:
            pass # not a valid code point
      else:
         # named entity
         name = text[1:-1]
         # &amp;, &gt; and &lt; stay escaped so the result is still valid XML
         if name not in ("amp", "gt", "lt"):
            try:
               text = unichr(htmlentitydefs.name2codepoint[name])
            except KeyError:
               pass # unknown entity name
      return text # leave as is
   return re.sub(r"&#?\w+;", fixup, text)
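The routine above is Python 2 (unichr, htmlentitydefs). A rough Python 3 equivalent could look like the sketch below. The name unescape_keep_markup and the KEEP set are invented for this example, and html.entities.name2codepoint is the Python 3 home of the same lookup table. This is a port under those assumptions, not the author's code.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that must stay escaped so the result is still well-formed XML.
KEEP = {"amp", "gt", "lt"}


def unescape_keep_markup(text):
    """Convert character references and named entities to characters,
    leaving &amp;, &gt; and &lt; untouched."""
    def fixup(match):
        ref = match.group(0)
        if ref.startswith("&#"):
            # Numeric character reference, hexadecimal or decimal.
            try:
                if ref[2] in "xX":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # not a valid code point: leave as is
        name = ref[1:-1]
        if name in KEEP or name not in name2codepoint:
            return ref  # protected or unknown entity: leave as is
        return chr(name2codepoint[name])
    return re.sub(r"&#?\w+;", fixup, text)


sample = "<p>caf&eacute; &amp; cr&#232;me &lt;3</p>"
cleaned = unescape_keep_markup(sample)
print(cleaned)    # <p>café &amp; crème &lt;3</p>

# Because &amp; and &lt; are still escaped, the result parses as XML.
root = ET.fromstring(cleaned)
print(root.text)  # café & crème <3
```

Like the original, this sketch only protects the named forms. A numeric reference such as &#38; is still turned into a bare &, which an XML parser would reject.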

Hope it helps.
