Brianary

Stuff I code.

Encoding Markup in .NET

There are at least four ways to escape HTML or XML characters to entities in .NET, but they work quite differently.

TL;DR: In #PowerShell, the best options to encode to HTML or XML are [Security.SecurityElement]::Escape() (minimal) or [Text.Encodings.Web.HtmlEncoder]::Default.Encode() (comprehensive).

Escape

System.Security.SecurityElement.Escape() is the simplest encoder, only escaping & < > " and ' and passing through all other characters unchanged.

This is fine if you want something lightweight (though not as lightweight as doing the five search-and-replace operations yourself), and you don’t want or need any special encoding for any other characters.
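
As a rough sketch of that comparison, here is Escape() next to a hand-rolled version that does the five replacements itself. The function name ConvertTo-MinimalEntity and the sample text are made up for illustration. They are not part of .NET or PowerShell.

```powershell
# Sample text with all five special characters, plus U+00E9 (e with acute)
# built from its codepoint so the script does not depend on file encoding.
$text = '<a href="x">Tom & Jerry''s caf' + [char]0xE9 + '</a>'

# The built-in minimal encoder.
$escaped = [Security.SecurityElement]::Escape($text)

# Hypothetical hand-rolled equivalent: five literal search-and-replace steps.
function ConvertTo-MinimalEntity
{
    param([string] $Value)
    # The ampersand goes first so the entities added below are not re-escaped.
    $Value = $Value.Replace('&', '&amp;')
    $Value = $Value.Replace('<', '&lt;')
    $Value = $Value.Replace('>', '&gt;')
    $Value = $Value.Replace('"', '&quot;')
    $Value = $Value.Replace("'", '&apos;')
    $Value
}

$manual = ConvertTo-MinimalEntity $text

$escaped   # the five characters become entities, U+00E9 passes through
$manual    # same string as $escaped
```

Both calls should produce the same string, with the non-ASCII character left as it was.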

Escape Effect

| codepoint(s) | name | encoded? | format |
| --- | --- | --- | --- |
| U+0022 | QUOTATION MARK | ✔️ | `&quot;` |
| U+0026 | AMPERSAND | ✔️ | `&amp;` |
| U+0027 | APOSTROPHE | ✔️ | `&apos;` |
| U+003C | LESS-THAN SIGN | ✔️ | `&lt;` |
| U+003E | GREATER-THAN SIGN | ✔️ | `&gt;` |
| all others | Basic Multilingual Plane (remaining) | | (unescaped) |

Encode

System.Text.Encodings.Web.HtmlEncoder.Default.Encode() is a more comprehensive encoder. Besides the bare minimum characters, it also encodes the control characters, the plus sign, and anything outside 7-bit ASCII for compatibility. Apart from the four named entities &quot; &amp; &lt; and &gt;, it uses hex codepoint entities instead of named entities, which work for both HTML and XML.
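
A minimal sketch of that behavior, assuming PowerShell 7 or later, where the System.Text.Encodings.Web types resolve without extra setup (Windows PowerShell 5.1 may need the assembly loaded first). The variable names are illustrative only.

```powershell
# Assumption: PowerShell 7+, so the HtmlEncoder type is available as-is.
$encoder = [Text.Encodings.Web.HtmlEncoder]::Default

# The minimal markup characters use named entities.
$markup = $encoder.Encode('<b>R&D</b>')

# Apostrophe and plus sign use hex codepoint entities.
$punctuation = $encoder.Encode("it's 1+1")

# Anything outside 7-bit ASCII is also hex-encoded (U+00E9 here).
$accented = $encoder.Encode([string][char]0xE9)

$markup        # &lt;b&gt;R&amp;D&lt;/b&gt;
$punctuation   # it&#x27;s 1&#x2B;1
$accented      # &#xE9;
```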

Encode Effect

| codepoint(s) | name | encoded? | format |
| --- | --- | --- | --- |
| U+0000–U+001F | C0 Controls | ✔️ | `&#x0;` – `&#x1F;` |
| U+0020 | SPACE | | (space) |
| U+0021 | EXCLAMATION MARK | | ! |
| U+0022 | QUOTATION MARK | ✔️ | `&quot;` |
| U+0023 | NUMBER SIGN | | # |
| U+0024 | DOLLAR SIGN | | $ |
| U+0025 | PERCENT SIGN | | % |
| U+0026 | AMPERSAND | ✔️ | `&amp;` |
| U+0027 | APOSTROPHE | ✔️ | `&#x27;` |
| U+0028 | LEFT PARENTHESIS | | ( |
| U+0029 | RIGHT PARENTHESIS | | ) |
| U+002A | ASTERISK | | * |
| U+002B | PLUS SIGN | ✔️ | `&#x2B;` |
| U+002C–U+003B | Basic Latin (partial) | | , – ; |
| U+003C | LESS-THAN SIGN | ✔️ | `&lt;` |
| U+003D | EQUALS SIGN | | = |
| U+003E | GREATER-THAN SIGN | ✔️ | `&gt;` |
| U+003F–U+007E | Basic Latin (remaining printable) | | ? – ~ |
| U+007F–U+FFFF | Basic Multilingual Plane (remaining) | ✔️ | `&#x7F;` – `&#xFFFF;` |

HtmlEncode

System.Web.HttpUtility.HtmlEncode() only encodes the minimal symbols as named entities (except decimal for apostrophe, for maximum HTML compatibility with extremely old browsers and HTML parsers), and the printable Latin-1 Supplement (U+00A0–U+00FF) as decimal entities. It doesn’t encode any control characters or any characters beyond U+00FF.
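
A short sketch of those three cases. The Add-Type line is there for Windows PowerShell 5.1. The assumption is that it is harmless on PowerShell 7, where the type should already resolve. The variable names are illustrative.

```powershell
# Load System.Web explicitly (needed on Windows PowerShell 5.1;
# assumed to be a harmless no-op on PowerShell 7+).
Add-Type -AssemblyName System.Web

# Minimal symbols: named entities, except a decimal entity for the apostrophe.
$minimal = [Web.HttpUtility]::HtmlEncode("it's <b>")

# Latin-1 Supplement: decimal entity (U+00E9 is codepoint 233).
$latin1 = [Web.HttpUtility]::HtmlEncode([string][char]0xE9)

# Beyond U+00FF: passed through unchanged (U+0101 here).
$beyond = [Web.HttpUtility]::HtmlEncode([string][char]0x101)

$minimal   # it&#39;s &lt;b&gt;
$latin1    # &#233;
$beyond    # the original U+0101 character, unencoded
```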

HtmlEncode Effect

| codepoint(s) | name | encoded? | format |
| --- | --- | --- | --- |
| U+0000–U+001F | C0 Controls | | (unchanged) |
| U+0020 | SPACE | | (space) |
| U+0021 | EXCLAMATION MARK | | ! |
| U+0022 | QUOTATION MARK | ✔️ | `&quot;` |
| U+0023 | NUMBER SIGN | | # |
| U+0024 | DOLLAR SIGN | | $ |
| U+0025 | PERCENT SIGN | | % |
| U+0026 | AMPERSAND | ✔️ | `&amp;` |
| U+0027 | APOSTROPHE | ✔️ | `&#39;` |
| U+0028–U+003B | Basic Latin (partial) | | ( – ; |
| U+003C | LESS-THAN SIGN | ✔️ | `&lt;` |
| U+003D | EQUALS SIGN | | = |
| U+003E | GREATER-THAN SIGN | ✔️ | `&gt;` |
| U+003F–U+007E | Basic Latin (remaining) | | ? – ~ |
| U+0080–U+009F | C1 Controls | | PAD – APC |
| U+00A0–U+00FF | Latin-1 Supplement | ✔️ | `&#160;` – `&#255;` |
| U+0100–U+FFFF | Basic Multilingual Plane (remaining) | | Ā – U+FFFF † |

† not a valid codepoint

🏚️ AntiXssEncoder

System.Web.Security.AntiXss.AntiXssEncoder offered a variety of encoding choices, but was discontinued after .NET Framework 4.8.1.

Encoding is accomplished with any of three methods: HtmlEncode(), XmlAttributeEncode(), or XmlEncode().

These methods encode codepoints only up through U+00A0, and then U+0370 and above, as decimal entities, and they mangle a bunch of the characters in the ranges they don’t encode, so this encoder is a bad choice for a number of reasons.

AntiXssEncoder Effect

| codepoint(s) | name | encoded? | format | notes |
| --- | --- | --- | --- | --- |
| U+0000–U+001F | C0 Controls | ✔️ | `&#0;` – `&#31;` | |
| U+0020 | SPACE | ✔️ | `&#32;` | XmlAttributeEncode() |
| U+0020 | SPACE | | (space) | HtmlEncode() & XmlEncode() |
| U+0021 | EXCLAMATION MARK | | ! | |
| U+0022 | QUOTATION MARK | ✔️ | `&quot;` | |
| U+0023 | NUMBER SIGN | | # | |
| U+0024 | DOLLAR SIGN | | $ | |
| U+0025 | PERCENT SIGN | | % | |
| U+0026 | AMPERSAND | ✔️ | `&amp;` | |
| U+0027 | APOSTROPHE | ✔️ | `&apos;` | XmlAttributeEncode() & XmlEncode() |
| U+0027 | APOSTROPHE | ✔️ | `&#39;` | HtmlEncode() |
| U+0028–U+003B | Basic Latin (partial) | | ( – ; | some printable 7-bit ASCII |
| U+003C | LESS-THAN SIGN | ✔️ | `&lt;` | |
| U+003D | EQUALS SIGN | | = | |
| U+003E | GREATER-THAN SIGN | ✔️ | `&gt;` | |
| U+003F–U+007E | Basic Latin (remaining printable) | | ? – ~ | |
| U+007F | DELETE | ✔️ | `&#127;` | |
| U+0080–U+009F | C1 Controls | ✔️ | `&#128;` – `&#159;` | |
| U+00A0 | NO-BREAK SPACE | ✔️ | `&#160;` | |
| U+00A1–U+00AC | Latin-1 Supplement (partial) | ❌️ | ¡ – ¬ | |
| U+00AD | SOFT HYPHEN | ✔️ | `&#173;` | |
| U+00AE–U+036F | Latin (remaining), various extensions | ❌️ | ® – ͯ | see † |
| U+0370–U+FFFF | Basic Multilingual Plane (remaining) | ✔️ | `&#880;` – `&#65535;` | |

† these blocks: