Inside an XML document, characters can be referenced by their Unicode code point using the following syntax:
&#nnnn; : decimal notation, such as &#65; for A.
&#xhhhh; : hexadecimal notation, such as &#x41; for A.
There are also a few predefined entities, like &gt; for >, &lt; for <, &amp; for &, ...
How can these notations be unescaped in a Ruby application?
Standard way : using libraries
Either using the Ruby CGI module:
require 'cgi'
CGI::unescapeHTML("&#65; &#x3A9; &lt;")
# => "A Ω <"
Or using the Nokogiri gem:
require 'nokogiri'
Nokogiri::XML.fragment("&#65; &#x3A9; &lt;").text
# => "A Ω <"
Custom way : I'll do it myself
It's best to stick with an existing library when one is available. But when dealing with a slightly different notation, or when other entities need to be supported, a custom solution can be useful. Let's see how to implement one.
Identify the entities
First, I need regular expressions to identify each entity:
/&#[0-9]{2,4};/ # decimal notation
/&#[xX][0-9a-fA-F]{2,4};/ # hexadecimal notation
/&(quot|apos|lt|gt|amp);/ # predefined entities
Let's merge these three regular expressions into a single one:
/&(#[0-9]{2,4}|#[xX][0-9a-fA-F]{2,4}|quot|apos|lt|gt|amp);/
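To get a feel for what the merged expression matches, here is a minimal sketch using String#scan. The constant name ENTITY and the sample string are only assumptions made for this illustration.

```ruby
# ENTITY is a hypothetical name; the pattern is the merged regex from the text.
ENTITY = /&(#[0-9]{2,4}|#[xX][0-9a-fA-F]{2,4}|quot|apos|lt|gt|amp);/

# Sample input: three supported notations, one unknown named entity (&nbsp;)
# and one decimal reference with six digits.
sample = "&#65; &#x3A9; &lt; &nbsp; &#128578;"

# With a capture group, scan returns the captured text of each match.
matched = sample.scan(ENTITY).flatten
p matched # => ["#65", "#x3A9", "lt"]
```

Note that &nbsp; is ignored because it is not in the list of names, and &#128578; is ignored because the {2,4} quantifier only accepts two to four digits.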
Convert number to letter
At some point, I'll need to convert a decimal/hexadecimal number to a UTF-8 character. Ruby has the built-in method Integer#chr to convert an integer to the corresponding ASCII character:
65.chr # => "A"
Called with no argument, chr returns a String with "US-ASCII" encoding for values up to 127 (and a binary string for 128 to 255). With a number greater than 255, it raises a RangeError unless Encoding.default_internal is set.
To prevent this, the encoding for the output character must be specified.
0x1F642.chr('UTF-8') # => "🙂"
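Passing an encoding does not make every integer valid: numbers that are not legal code points in that encoding still raise a RangeError. The sketch below wraps the call in a small helper. The name codepoint_to_utf8 is hypothetical, and returning nil for invalid numbers is just one possible choice.

```ruby
# Hypothetical helper: converts a code point to a UTF-8 string,
# or returns nil when the number is not a valid code point.
def codepoint_to_utf8(number)
  number.chr(Encoding::UTF_8)
rescue RangeError
  # Raised for surrogates (0xD800..0xDFFF) and values above 0x10FFFF.
  nil
end

codepoint_to_utf8(65)       # => "A"
codepoint_to_utf8(0x3A9)    # => "Ω"
codepoint_to_utf8(0x1F642)  # => "🙂"
codepoint_to_utf8(0xD800)   # => nil
```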
Implementation
Finally, I use the block form of gsub with my regular expression to replace each match (I added some parentheses in the regex to capture the Unicode numbers):
def unescape_xml(input)
  input.gsub(/&(#([0-9]{2,4})|#[xX]([0-9a-fA-F]{2,4})|quot|apos|lt|gt|amp);/) do |s|
    case s
    when '&quot;'; '"' # replace &quot; with "
    when '&apos;'; "'" # replace &apos; with '
    when '&lt;'  ; '<' # replace &lt; with <
    when '&gt;'  ; '>' # replace &gt; with >
    when '&amp;' ; '&' # replace &amp; with &
    else
      # convert unicode code point (decimal or hexadecimal) to a char
      # ($1 is the whole reference, $2 the decimal digits, $3 the hex digits)
      hexa_flag = $1.start_with?('#x', '#X')
      unicode_number = hexa_flag ? $3.to_i(16) : $2.to_i
      unicode_number.chr(Encoding::UTF_8)
    end
  end
end
A quick check:
unescape_xml("&#65; &#x3A9; &lt;")
# => "A Ω <"
Now I have a custom method to replace XML/HTML entities that I can tweak to add entities. If you're curious, here is the actual implementation of CGI::unescapeHTML.
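As an illustration of such a tweak, here is one possible variant, a sketch rather than a definitive implementation. The names NAMED_ENTITIES and unescape_entities are hypothetical, the nbsp entry is only an example of an added entity, and the wider quantifiers ({1,7} decimal digits, {1,6} hexadecimal digits) are my assumption to cover code points up to 0x10FFFF.

```ruby
# Hypothetical lookup table: the five predefined XML entities plus one extra.
NAMED_ENTITIES = {
  'quot' => '"', 'apos' => "'", 'lt' => '<', 'gt' => '>', 'amp' => '&',
  'nbsp' => "\u00A0" # extra entity, added for illustration only
}.freeze

def unescape_entities(input, entities = NAMED_ENTITIES)
  # Build the alternation of names from the table keys.
  names = Regexp.union(entities.keys)
  pattern = /&(?:#([0-9]{1,7})|#[xX]([0-9a-fA-F]{1,6})|(#{names}));/

  input.gsub(pattern) do |match|
    dec, hex, name = $1, $2, $3
    next entities[name] if name

    number = dec ? dec.to_i : hex.to_i(16)
    begin
      number.chr(Encoding::UTF_8)
    rescue RangeError
      match # leave references to invalid code points untouched
    end
  end
end

unescape_entities("&#65; &#X3A9; &lt; &#128578;")
# => "A Ω < 🙂"
```

Keeping the names in a Hash means that supporting another entity only requires a new entry, and the single gsub pass ensures that a sequence such as &amp;lt; becomes &lt; rather than <.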