Brianary

Stuff I code.

Encoding Markup in .NET

There are at least four ways to escape HTML or XML characters to entities in .NET, but they work quite differently.

TL;DR: In #PowerShell, the best options to encode to HTML or XML are [Security.SecurityElement]::Escape() (minimal) or [Text.Encodings.Web.HtmlEncoder]::Default.Encode() (comprehensive).

Escape

System.Security.SecurityElement.Escape() is the simplest encoder, only escaping & < > " and ' and passing through all other characters unchanged.

This is fine if you want something lightweight (though not as lightweight as doing the five search-and-replace operations yourself), and you don't want or need any special encoding for any other characters.
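As a rough illustration of the two approaches just described, here is a minimal PowerShell sketch comparing Escape() with the five hand-written replacements. The sample string and variable names are made up for the example.

```powershell
# Hypothetical sample text containing all five special characters.
$text = '<a href="x">Tom & Jerry''s</a>'

# Built-in minimal escaping.
$escaped = [Security.SecurityElement]::Escape($text)

# Hand-rolled equivalent: five replacements, ampersand first so the
# entities produced by the later replacements are not escaped again.
$manual = $text -replace '&', '&amp;' -replace '<', '&lt;' -replace '>', '&gt;' `
    -replace '"', '&quot;' -replace "'", '&apos;'

$escaped              # &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&apos;s&lt;/a&gt;
$escaped -ceq $manual # True
```

The order matters in the manual version: replacing the ampersand last would double-escape every entity created by the earlier replacements.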

Escape Effect

codepoint(s)	name	encoded?	format
U+0022	QUOTATION MARK	✔️	`&quot;`
U+0026	AMPERSAND	✔️	`&amp;`
U+0027	APOSTROPHE	✔️	`&apos;`
U+003C	LESS-THAN SIGN	✔️	`&lt;`
U+003E	GREATER-THAN SIGN	✔️	`&gt;`
all others	Basic Multilingual Plane (remaining)	❌	(unescaped)

Encode

System.Text.Encodings.Web.HtmlEncoder.Default.Encode() is a more comprehensive encoder: it encodes not only the bare minimum characters, but also anything outside 7-bit ASCII, for compatibility. Other than `&lt;`, `&gt;`, `&amp;`, and `&quot;`, it uses hex codepoint entities instead of named entities, so the output works for both HTML and XML.
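A short sketch of what that looks like in practice, assuming PowerShell 7+ (where the System.Text.Encodings.Web assembly should already be available). The sample string is made up, and the é is built from its codepoint so the script does not depend on file encoding.

```powershell
# Assumes PowerShell 7+; Windows PowerShell 5.1 does not load this assembly by default.
$encoder = [Text.Encodings.Web.HtmlEncoder]::Default

# Hypothetical sample: a non-ASCII letter (U+00E9), markup characters, and a plus sign.
$text = "caf$([char]0xE9) <b> 1+1"

$encoded = $encoder.Encode($text)
$encoded  # caf&#xE9; &lt;b&gt; 1&#x2B;1
```

The plus sign and the non-ASCII letter both come back as hex entities, while the angle brackets use the familiar named entities.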

Encode Effect

codepoint(s)	name	encoded?	format
U+0000–U+001F	C0 Controls	✔️	`&#x0;` – `&#x1F;`
U+0020	SPACE	❌
U+0021	EXCLAMATION MARK	❌	!
U+0022	QUOTATION MARK	✔️	`&quot;`
U+0023	NUMBER SIGN	❌	#
U+0024	DOLLAR SIGN	❌	$
U+0025	PERCENT SIGN	❌	%
U+0026	AMPERSAND	✔️	`&amp;`
U+0027	APOSTROPHE	✔️	`&#x27;`
U+0028	LEFT PARENTHESIS	❌	(
U+0029	RIGHT PARENTHESIS	❌	)
U+002A	ASTERISK	❌	*
U+002B	PLUS SIGN	✔️	`&#x2B;`
U+002C–U+003B	Basic Latin (partial)	❌	, – ;
U+003C	LESS-THAN SIGN	✔️	`&lt;`
U+003D	EQUALS SIGN	❌	=
U+003E	GREATER-THAN SIGN	✔️	`&gt;`
U+003F–U+007E	Basic Latin (remaining printable)	❌	? – ~
U+007F–U+FFFF	Basic Multilingual Plane (remaining)	✔️	`&#x7F;` – `&#xFFFF;`

HtmlEncode

System.Web.HttpUtility.HtmlEncode() only encodes the minimal symbols as named entities (except decimal for apostrophe, for maximum HTML compatibility with extremely old browsers and HTML parsers), and the Latin-1 Supplement as decimal entities. It doesn't encode any control characters or any characters outside the Latin blocks.
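A minimal sketch of that behavior. The sample strings are made up; the Add-Type line is needed in Windows PowerShell and is assumed to be harmless in PowerShell 7+.

```powershell
# Load System.Web (required in Windows PowerShell 5.1; assumed to be a no-op facade in 7+).
Add-Type -AssemblyName System.Web

# Hypothetical sample: a Latin-1 letter (U+00E9), an apostrophe, and markup characters.
$text = "caf$([char]0xE9)'s <b>"
$encoded = [Web.HttpUtility]::HtmlEncode($text)
$encoded  # caf&#233;&#39;s &lt;b&gt;

# A letter just past Latin-1 (U+0100) is passed through unchanged.
$beyond = [string][char]0x100
$unchanged = [Web.HttpUtility]::HtmlEncode($beyond) -ceq $beyond
$unchanged  # True
```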

HtmlEncode Effect

codepoint(s)	name	encoded?	format
U+0000–U+001F	C0 Controls	❌	␀ – ␟
U+0020	SPACE	❌
U+0021	EXCLAMATION MARK	❌	!
U+0022	QUOTATION MARK	✔️	`&quot;`
U+0023	NUMBER SIGN	❌	#
U+0024	DOLLAR SIGN	❌	$
U+0025	PERCENT SIGN	❌	%
U+0026	AMPERSAND	✔️	`&amp;`
U+0027	APOSTROPHE	✔️	`&#39;`
U+0028–U+003B	Basic Latin (partial)	❌	( – ;
U+003C	LESS-THAN SIGN	✔️	`&lt;`
U+003D	EQUALS SIGN	❌	=
U+003E	GREATER-THAN SIGN	✔️	`&gt;`
U+003F–U+007F	Basic Latin (remaining)	❌	? – ␡
U+0080–U+009F	C1 Controls	❌	PAD – APC
U+00A0–U+00FF	Latin-1 Supplement	✔️	`&#160;` – `&#255;`
U+0100–U+FFFF	Basic Multilingual Plane (remaining)	❌	Ā – U+FFFF†

† U+FFFF is a noncharacter, so there is no glyph to show

🏚️ AntiXssEncoder

System.Web.Security.AntiXss.AntiXssEncoder offered a variety of encoding choices, but was discontinued after .NET Framework 4.8.1.

Encoding is accomplished with any of three methods: HtmlEncode(), XmlAttributeEncode(), or XmlEncode().

These methods only encode codepoints up through U+00A0, and then U+0370 and above, as decimal entities, and they mangle a number of the characters in the ranges they don't encode, so this class is a bad choice for several reasons.
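For completeness, a hedged sketch of calling the three methods. It assumes Windows PowerShell 5.1 on .NET Framework, since the class is not available elsewhere; the sample text and variable names are made up.

```powershell
# Hypothetical sample text with an apostrophe, spaces, and markup characters.
$text = "it's a <tag>"

if ($PSVersionTable.PSEdition -eq 'Desktop') {
    # The class lives in System.Web, which only .NET Framework provides in full.
    Add-Type -AssemblyName System.Web
    $antiXss = [Web.Security.AntiXss.AntiXssEncoder]

    $html = $antiXss::HtmlEncode($text, $false)  # second argument is useNamedEntities
    $xml  = $antiXss::XmlEncode($text)
    $attr = $antiXss::XmlAttributeEncode($text)
    $html, $xml, $attr
} else {
    Write-Warning 'AntiXssEncoder is not available in this edition of PowerShell.'
}
```

Comparing the three results against each other is a quick way to see the space and apostrophe differences between the methods.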

AntiXssEncoder Effect

codepoint(s)	name	encoded?	format	notes
U+0000–U+001F	C0 Controls	✔️	`&#0;` – `&#31;`
U+0020	SPACE	✔️	`&#32;`	`XmlAttributeEncode()`
U+0020	SPACE	❌		`HtmlEncode()` & `XmlEncode()`
U+0021	EXCLAMATION MARK	❌	!
U+0022	QUOTATION MARK	✔️	`&quot;`
U+0023	NUMBER SIGN	❌	#
U+0024	DOLLAR SIGN	❌	$
U+0025	PERCENT SIGN	❌	%
U+0026	AMPERSAND	✔️	`&amp;`
U+0027	APOSTROPHE	✔️	`&apos;`	`XmlAttributeEncode()` & `XmlEncode()`
U+0027	APOSTROPHE	✔️	`&#39;`	`HtmlEncode()`
U+0028–U+003B	Basic Latin (partial)	❌	( – ;	some printable 7-bit ASCII
U+003C	LESS-THAN SIGN	✔️	`&lt;`
U+003D	EQUALS SIGN	❌	=
U+003E	GREATER-THAN SIGN	✔️	`&gt;`
U+003F–U+007E	Basic Latin (remaining printable)	❌	? – ~
U+007F	DELETE	✔️	`&#127;`
U+0080–U+009F	C1 Controls	✔️	`&#128;` – `&#159;`
U+00A0	NO-BREAK SPACE	✔️	`&#160;`
U+00A1–U+00AC	Latin-1 Supplement (partial)	❌️	¡ – ¬
U+00AD	SOFT HYPHEN	✔️	`&#173;`
U+00AE–U+036F	Latin (remaining), various extensions	❌️	® – ͯ	see †
U+0370–U+FFFF	Basic Multilingual Plane (remaining)	✔️	`&#880;` – `&#65535;`

† these blocks:

Latin-1 Supplement (remaining)
Latin Extended-A
Latin Extended-B
IPA Extensions
Spacing Modifier Letters
Combining Diacritical Marks