Strip HTML tags with ASP

This code quickly strips out any HTML tags in a string. It does NOT require a regular expression and so runs quite a bit faster, especially on shorter strings.

It works in four steps. First, it replaces every "<" (assumed to appear only at the start of an HTML tag) with "><". This gives a single, consistent character to split the string on, while still leaving the "<" in place to identify the sections that are HTML. Second, it splits the string on ">", which cuts it just before each HTML tag (as indicated by the "<") and at the end of each tag (as indicated by the closing ">"). Third, it filters the resulting array, removing any element that contains a "<"; these are exactly the elements that were HTML tags. Finally, it re-joins the remaining elements, which are the text.

Example: "<i>this is <b>a <a href='test.html'>test</b></a>" (Note that the HTML need not be correct or have matching closing tags.)

><i>this is ><b>a ><a href='test.html'>test></b>></a> (after replacing all "<" with "><")

|<i|this is |<b|a |<a href='test.html'|test|</b||</a| (after splitting on ">"; the | character is used to show the elements of the array)

|this is |a |test|| (after filtering out all the elements with a "<")

this is a test (after joining the remaining elements)

Function StripHTML(ByRef asHTML)
	' Join uses a space when no delimiter is given, so pass "" to rejoin the text unchanged.
	StripHTML = Join(Filter(Split(Replace(asHTML, "<", "><"), ">"), "<", False), "")
End Function
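
For debugging, it can help to see the same four steps written out one call at a time. The following is a minimal sketch, not a replacement for the function above; the name StripHTMLSteps and its local variable names are made up for illustration. It passes an empty string as the delimiter to Join, because VBScript's Join inserts a space between elements when the delimiter is omitted.

```vbscript
' Unrolled version of the one-liner, for illustration only.
' StripHTMLSteps is a hypothetical name, not part of the original snippet.
Function StripHTMLSteps(ByVal asHTML)
	Dim sMarked, aParts, aText
	' 1. Put a ">" in front of every "<" so that ">" marks both ends of a tag.
	sMarked = Replace(asHTML, "<", "><")
	' 2. Split on ">"; every piece that came from a tag now starts with "<".
	aParts = Split(sMarked, ">")
	' 3. Keep only the elements that do NOT contain "<" (Include = False).
	aText = Filter(aParts, "<", False)
	' 4. Join the text pieces with no delimiter between them.
	StripHTMLSteps = Join(aText, "")
End Function
```

On an ASP page, Response.Write StripHTMLSteps("<i>this is <b>a <a href='test.html'>test</b></a>") writes "this is a test". Because the string is split on every ">", a literal ">" in the text itself is also dropped.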

You may also want to remove excessive whitespace with:

	Set regex = New RegExp
	regex.Pattern = "\s+"     ' One or more whitespace characters.
	regex.Global = True       ' Replace every match, not just the first.
	asHTML = regex.Replace(asHTML, " ")
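
If this cleanup is needed in more than one place, it could be wrapped in a helper. The sketch below assumes a hypothetical function name, CollapseWhitespace, and additionally trims leading and trailing spaces, which the lines above do not do.

```vbscript
' Hypothetical helper: collapse each run of whitespace to one space, then trim.
Function CollapseWhitespace(ByVal asText)
	Dim oRegex
	Set oRegex = New RegExp
	oRegex.Pattern = "\s+"    ' Spaces, tabs, and line breaks.
	oRegex.Global = True      ' Replace every run, not just the first.
	CollapseWhitespace = Trim(oRegex.Replace(asText, " "))
	Set oRegex = Nothing
End Function
```

For example, CollapseWhitespace("  a " & vbCrLf & vbTab & " b ") returns "a b".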

You may also want to replace common HTML entities, for example:

	asHTML = Replace(asHTML, "&nbsp;", " ")
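
The same idea extends to a handful of other common entities. This is a minimal sketch under the assumption that only these five entities matter; DecodeCommonEntities is a hypothetical name, and the list is far from a complete entity table.

```vbscript
' Hypothetical helper: decode a few common HTML entities.
' Call it AFTER stripping tags, otherwise a decoded "<" would be treated as a tag.
Function DecodeCommonEntities(ByVal asText)
	asText = Replace(asText, "&nbsp;", " ")
	asText = Replace(asText, "&lt;", "<")
	asText = Replace(asText, "&gt;", ">")
	asText = Replace(asText, "&quot;", """")
	' "&amp;" goes last so that "&amp;lt;" becomes "&lt;" rather than "<".
	asText = Replace(asText, "&amp;", "&")
	DecodeCommonEntities = asText
End Function
```

Replace compares case-sensitively by default, so an upper-case variant such as "&NBSP;" would be left untouched by this sketch.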
