Format Express

How to unescape XML/HTML character entities

Inside an XML document, characters can be referenced by their Unicode code point using the following syntax :

  • &#nnnn; : decimal notation, such as &#65; for A.
  • &#xhhhh; : hexadecimal notation, such as &#x41; for A.
  • There are also a few predefined entities such as &gt; for >, &lt; for <, &amp; for &, ...
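
To make the two numeric notations concrete, here is a small sketch that goes in the opposite direction and builds both references for a character. The helper name numeric_references is made up for this illustration; it only relies on String#ord and format from core Ruby.

```ruby
# Hypothetical helper: build the decimal and hexadecimal references
# for a single character from its Unicode code point.
def numeric_references(char)
  code_point = char.ord
  [format('&#%d;', code_point), format('&#x%X;', code_point)]
end

p numeric_references('A')       # => ["&#65;", "&#x41;"]
p numeric_references("\u03A9")  # => ["&#937;", "&#x3A9;"]
```

Both references of a pair designate the same character; only the base of the number differs.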

How to unescape these notations in a Ruby application ?

Standard way : using libraries

Either using the Ruby CGI module :

require 'cgi'
CGI::unescapeHTML("&#65; &#x03A9; &lt;") # => "A Ω <"

Or using the Nokogiri gem :

require 'nokogiri'
Nokogiri::XML.fragment("&#65; &#x03A9; &lt;").text # => "A Ω <"

Custom way : I'll do it myself

One should stick with an existing library when one is available. But if I'm dealing with a slightly different notation, or I want to support additional entities, let's see how to implement a custom solution.

Identify the entities

First, I need regular expressions to identify each entity :

/&#[0-9]{2,4};/              # decimal notation
/&#[xX][0-9a-fA-F]{2,4};/    # hexadecimal notation
/&(quot|apos|lt|gt|amp);/    # predefined entities

Let's merge these 3 regular expressions into a single one :

/&(#[0-9]{2,4}|#[xX][0-9a-fA-F]{2,4}|quot|apos|lt|gt|amp);/
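
Before going further, it can help to check what the merged expression accepts and what it ignores. The sketch below is only an illustration: the constant ENTITY, the helper matched_entities and the sample string are names chosen for this example.

```ruby
# The merged regular expression from above, stored in a constant.
ENTITY = /&(#[0-9]{2,4}|#[xX][0-9a-fA-F]{2,4}|quot|apos|lt|gt|amp);/

# Hypothetical helper: list what the capture group holds for each match.
# String#scan returns one array per match when the regex has a group.
def matched_entities(text)
  text.scan(ENTITY).flatten
end

sample = '&#65; &#x03A9; &lt; &nbsp; &#x1F642; &#9;'
p matched_entities(sample)  # => ["#65", "#x03A9", "lt"]
```

Note what is left out: &nbsp; is not in the list of names, and the {2,4} quantifiers skip references with a single digit (&#9;) or with more than four digits (&#x1F642;). Widening the quantifiers is one of the tweaks a custom solution allows.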

Convert number to letter

At some point, I'll need to convert a decimal or hexadecimal number to a UTF-8 character. Ruby has the built-in method Integer#chr to convert an integer to the corresponding character :

65.chr # => "A"

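Used with no parameter, chr only handles single-byte values : it returns a String with "US-ASCII" encoding up to 127 and a binary (ASCII-8BIT) String from 128 to 255. With a number greater than 255, it raises a RangeError (unless Encoding.default_internal is set). To prevent this, the encoding for the output character must be specified.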

0x1F642.chr('UTF-8') # => "🙂"
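
Even with an explicit encoding, chr can still raise a RangeError when the number is not a valid code point, for instance a surrogate such as 0xD800 or a value above 0x10FFFF. The following sketch wraps the call in a helper that returns nil in that case; the name codepoint_to_utf8 is an assumption made for this example, not something from a library.

```ruby
# Without an argument, chr only covers single-byte values.
p 65.chr.encoding == Encoding::US_ASCII      # => true
p 233.chr.encoding == Encoding::ASCII_8BIT   # => true

# Hypothetical helper: convert a code point to a UTF-8 character,
# or return nil when the number is not a valid code point.
def codepoint_to_utf8(number)
  number.chr(Encoding::UTF_8)
rescue RangeError
  nil
end

p codepoint_to_utf8(0x3A9)     # => "Ω"
p codepoint_to_utf8(0xD800)    # => nil (surrogate, not a character)
p codepoint_to_utf8(0x110000)  # => nil (beyond the Unicode range)
```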

Implementation

Finally, I use the block form of gsub with my regular expression to replace each match (I added some parentheses in the regex to capture the code point numbers) :

def unescape_xml(input)
  input.gsub(/&(#([0-9]{2,4})|#[xX]([0-9a-fA-F]{2,4})|quot|apos|lt|gt|amp);/) do |s|
    case s
    when '&quot;' then '"'   # replace &quot; with "
    when '&apos;' then "'"   # replace &apos; with '
    when '&lt;'   then '<'   # replace &lt; with <
    when '&gt;'   then '>'   # replace &gt; with >
    when '&amp;'  then '&'   # replace &amp; with &
    else
      # convert Unicode code point (decimal or hexadecimal) to a char;
      # the regex accepts both &#x...; and &#X...;, so check both prefixes
      hexa_flag = $1.start_with?('#x', '#X')
      unicode_number = hexa_flag ? $3.to_i(16) : $2.to_i
      unicode_number.chr(Encoding::UTF_8)
    end
  end
end

A quick check :

unescape_xml("&#65; &#x03A9; &lt;") # => "A Ω <"

Now I have a custom method to replace XML/HTML entities that I can tweak to add entities. If you're curious, here is the actual implementation of CGI::unescapeHTML.
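
As an illustration of such a tweak, here is a sketch of a variant that reads named entities from a Hash, accepts shorter and longer numeric references, and leaves anything it cannot decode untouched. The names NAMED_ENTITIES, ENTITY_PATTERN and unescape_entities are made up for this example, and the three extra entities (nbsp, copy, euro) are arbitrary picks rather than a complete list.

```ruby
# Named entities to decode; nbsp, copy and euro are illustrative additions.
NAMED_ENTITIES = {
  'quot' => '"', 'apos' => "'", 'lt' => '<', 'gt' => '>', 'amp' => '&',
  'nbsp' => "\u00A0", 'copy' => "\u00A9", 'euro' => "\u20AC"
}.freeze

# Group 1: decimal digits, group 2: hexadecimal digits, group 3: a name.
ENTITY_PATTERN = /&(?:#(\d{1,7})|#[xX](\h{1,6})|(\w+));/

def unescape_entities(input)
  input.gsub(ENTITY_PATTERN) do |match|
    decimal, hexa, name = $1, $2, $3
    if name
      # unknown names are left as they are
      NAMED_ENTITIES.fetch(name, match)
    else
      code_point = decimal ? decimal.to_i : hexa.to_i(16)
      begin
        code_point.chr(Encoding::UTF_8)
      rescue RangeError
        match # not a valid code point: keep the original text
      end
    end
  end
end

p unescape_entities('&#65; &#x03A9; &lt;')  # => "A Ω <"
p unescape_entities('&bogus; &#xD800;')     # => "&bogus; &#xD800;"
p unescape_entities('&amp;lt;')             # => "&lt;"
```

The last call shows a property worth keeping in any variant: because everything happens in a single gsub pass, the & produced from &amp; is not decoded a second time.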