Unofficial Konfabulator Wiki

This function converts HTML entities (such as &amp;nbsp;, &amp;quot;, etc.) into the Unicode characters they represent. If you're scraping HTML or XML, run this function on your strings after you've pulled out the text you want.

This function converts entities in decimal, hex, and keyword formats. It also replaces Windows Latin-1 (CP 1252) characters with their correct Unicode equivalents. If you are having problems with some characters, you may want to try disabling CP 1252 conversion. Just replace the lines defining the commonASCII object with const commonASCII = new Object();.
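To make the three entity formats and the CP 1252 step concrete, here is a minimal sketch of one way such a conversion could work. It is not the deEnt source from this page: the names deEntSketch, namedEntities, cp1252 and codePointToString are made up for illustration, and both lookup tables are deliberately tiny (a real table would cover the full HTML entity list and all of the 128-159 range).

```javascript
// Hypothetical sketch, NOT the original deEnt implementation.

// A handful of keyword entities mapped to their code points (partial list).
var namedEntities = {
  amp: 38, lt: 60, gt: 62, quot: 34, apos: 39,
  nbsp: 160, copy: 169, eacute: 233
};

// Partial CP 1252 table: numeric references in the 128-159 range that
// Windows-authored pages use for punctuation, mapped to real Unicode.
var cp1252 = {
  128: 0x20AC, 133: 0x2026, 145: 0x2018, 146: 0x2019,
  147: 0x201C, 148: 0x201D, 150: 0x2013, 151: 0x2014, 153: 0x2122
};

// Build a string from a code point, using a surrogate pair above U+FFFF.
function codePointToString(cp) {
  if (cp > 0xFFFF) {
    cp -= 0x10000;
    return String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
  }
  return String.fromCharCode(cp);
}

function deEntSketch(s) {
  // One pass handles hex (&#xE9;), decimal (&#233;) and keyword (&eacute;).
  var pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g;
  return s.replace(pattern, function (match, body) {
    var cp;
    if (body.charAt(0) === "#") {
      var isHex = body.charAt(1) === "x" || body.charAt(1) === "X";
      cp = isHex ? parseInt(body.substring(2), 16)
                 : parseInt(body.substring(1), 10);
      // CP 1252 fix-up; emptying the cp1252 object would disable it.
      if (cp1252.hasOwnProperty(cp)) {
        cp = cp1252[cp];
      }
    } else if (namedEntities.hasOwnProperty(body)) {
      cp = namedEntities[body];
    } else {
      return match; // unknown keyword: leave the text untouched
    }
    if (cp > 0x10FFFF) {
      return match; // outside the Unicode range: leave untouched
    }
    return codePointToString(cp);
  });
}
```

In this sketch, deEntSketch("&lt;b&gt; &amp; &#169; &#xE9; &#146;") returns "<b> & © é ’", and an unrecognised keyword such as "&bogus;" is passed through unchanged.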


Usage

Call deEnt(myString). The value returned is the de-entified string.



Reverse

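Call reEnt(myString). The value returned is the re-entified string: every character other than letters, digits, spaces, and hyphens is replaced with a decimal numeric entity. To convert every character without exception, pass true as the optional second argument: reEnt(myString, true).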


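// Convert the characters of s into decimal numeric HTML entities (&#NNN;).
// If heavy is true, every character is converted; otherwise characters
// matching safeChars are copied through unchanged.
function reEnt(s, heavy)
{
  var safeChars;
  var r = "";
  if (heavy) {
    // Heavy - convert EVERY character into an HTML entity
    // (an empty character class never matches)
    safeChars = /[]/;
  } else {
    // Edit the line below to add more "safe" (non-entified) characters
    safeChars = /[a-zA-Z0-9 \-]/;
  }
  for (var i = 0; i < s.length; i++) {
    var c = s.charAt(i);
    if (safeChars.test(c)) {
      r += c;
      continue;
    }
    var code = s.charCodeAt(i);
    // Combine a surrogate pair into one code point, so a character above
    // U+FFFF becomes a single valid entity instead of two invalid ones
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < s.length) {
      var low = s.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        code = (code - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
        i++;
      }
    }
    r += "&#" + code + ";";
  }
  return r;
}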
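If you would rather not edit the safeChars line every time, one possible variation is to pass the "safe" pattern in as an argument. The sketch below is an illustration only: reEntWith and the two pattern variables are hypothetical names that do not exist in the function above, and it applies the same per-character rule (safe characters are copied, everything else becomes a decimal entity).

```javascript
// Hypothetical variant of reEnt that takes the "safe" pattern as a parameter.
// The pattern must not use the g flag, because test() on a global regex
// remembers its position between calls.
function reEntWith(s, safeChars) {
  var r = "";
  for (var i = 0; i < s.length; i++) {
    var c = s.charAt(i);
    r += safeChars.test(c) ? c : "&#" + s.charCodeAt(i) + ";";
  }
  return r;
}

// Same default as reEnt: letters, digits, space and hyphen stay as-is.
var defaultSafe = /[a-zA-Z0-9 \-]/;
// Example of a wider set that also keeps common punctuation.
var withPunctuation = /[a-zA-Z0-9 \-.,!?]/;

var example1 = reEntWith('Tom & "Jerry" <3', defaultSafe);
// example1 is 'Tom &#38; &#34;Jerry&#34; &#60;3'

var example2 = reEntWith("Hi, Bob!", defaultSafe);
// example2 is "Hi&#44; Bob&#33;"

var example3 = reEntWith("Hi, Bob!", withPunctuation);
// example3 is "Hi, Bob!"

var example4 = reEntWith("Hi", /[]/);
// example4 is "&#72;&#105;" (empty class = the "heavy" behaviour)
```

Keeping the pattern outside the function means the default and "heavy" behaviours are just two different arguments, at the cost of one more thing for the caller to supply.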