Skip to main content

Stripping HTML from text in SQL Server–Version 3

 

I’ve used the HTML stripping function for SQL Server available in lazycoders.blogspot.com, which is the second version of the originally published in blog.sqlauthority.com. But neither one removes the comments in this case:

<!-- <b>hello world</b> --> Hello

which is more or less the code that MS Word generates.

Well, the function with that fixed is this (changes are in bold):

ALTER FUNCTION [dbo].[DeHtmlize]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start  int
DECLARE @End    int
DECLARE @Length int

-- Replace the HTML entity &amp; with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&amp;amp;')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &lt; with the '<' character
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &gt; with the '>' character
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &amp; with the '&' character
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &nbsp; with the ' ' character
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Remove anything between comments <!-- -->
SET @Start = CHARINDEX('<!--', @HTMLText)
SET @End = CHARINDEX('-->', @HTMLText, CHARINDEX('<!--', @HTMLText))
SET @Length = (@End - @Start) + 3

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<!--', @HTMLText)
SET @End = CHARINDEX('-->', @HTMLText, CHARINDEX('<!--', @HTMLText))
SET @Length = (@End - @Start) + 3
END

-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END

RETURN LTRIM(RTRIM(@HTMLText))

END

GO

Hope this helps. Enjoy!

Comments

  1. Hello, great post, It still leaves the following in the field:

    CHAR(13) + CHAR(10).

    Can you update the function to remove this as well?

    Many Thanks.

    Patrick

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. Hi There,
    This function is working fine for converting output to text.
    But we need some tags allowed in the text to format it like
    <b>,<li>, <font>, <a> with there closing matching tags.
    Please help.

    Regards

    ReplyDelete

Post a Comment

Popular posts from this blog

Making Celery 4 work in Django 1.10 and Elastic Beanstalk

Finally after many many days of trying to make it work and reading thousand of pages, I got Celery working with django 1.10 in Amazon AWS Elastic Beanstalk with SQS (Simple Queue Services) – including Celery Beat!. First, the files I ended up with, then the explanation of what I understand (some of those things still remain being a mystery) STEP 0: Install using the following: pip install -U celery[sqs] pip install django-celery-beat I’m using the following versions of the apps: boto (2.45.0) botocore (1.4.63) celery (4.0.2) Django (1.10.1) django-celery-beat (1.0.1) kombu (4.0.2) pip (9.0.1) pycurl (7.43.0) FILE: /src/PROJECT_NAME/celery.py from __future__ import absolute_import , unicode_literals import os from celery import Celery # set the default Django settings module for the 'celery' program. # DONE IN __init__.py os . environ . setdefault ( "DJANGO_SETTINGS_MODULE" , "PROJECT_NAME.settings.production" ) app = Celery ( 'PR...

NHibernate: Using multiple database schemas with SchemaExport

Let's say you want to use NHibernate, and have tables on multiple schemas on your database. I'm using SQL Server, and my quick solution is not supported by all the databases NH supports, but it shouldn't be hard to change. If you run the SchemaExport, you will receive an error telling you that the DB schema does not exist or you have no access to it. It makes sense since the utility does not handle multiple schemas. Well, here's a simple code snipped I've written after debugging and digging into the NH source code: private void CreateSchema() { var provider = ConnectionProviderFactory.NewConnectionProvider(NHConfigs.Properties); using (var connection = provider.GetConnection()) { using (var cmd = connection.CreateCommand()) { cmd.CommandText = "IF (SCHEMA_ID('your_schema_name') IS NULL) EXEC('CREATE SCHEMA your_schema_name')" ; cmd.ExecuteNonQuery(); } provider.CloseConnect...