HTML sanitization

HTML sanitization is performed by the OWASP AntiSamy third-party library. The goal is to filter out potential threats such as cross-site scripting (XSS), layout breakage, and phishing attempts. The sanitization rules are based on a white list that contains all the allowed tags, attributes, and styles. As browsers improve, new tags, attributes, or styles may need to be added to the white list.
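The white list approach can be illustrated with a small sketch. It is written in Python with only the standard library and is not how AntiSamy is implemented or configured. The ALLOWED_TAGS and ALLOWED_ATTRS tables and the sanitize function are hypothetical names made up for this illustration. The sketch does not validate attribute values, so it is not a safe sanitizer on its own.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical white list for illustration only; the real rules live in
# the AntiSamy policy file and are far more detailed.
ALLOWED_TAGS = {"p", "b", "i", "a"}
ALLOWED_ATTRS = {"a": {"href"}}


class WhiteListFilter(HTMLParser):
    """Re-emit only white-listed tags and attributes; keep text content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # tag is not on the white list: drop the tag itself
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in allowed
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is kept, but markup-significant characters are re-encoded.
        self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhiteListFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, sanitize('<p id="x" onclick="evil()">Hi <blink>there</blink></p>') returns '<p>Hi there</p>': the id and onclick attributes and the unknown <blink> tag are not on the white list, so they are dropped while the text is kept.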

For more information about AntiSamy, see https://github.com/nahsra/antisamy, https://github.com/nahsra/antisamy/wiki, and https://code.google.com/archive/p/owaspantisamy/downloads. For more information about cross-site scripting, see https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html.

Refer to the release notes for the version of the library currently in use.

AntiSamy encodes the characters <, >, and & as the entities &lt;, &gt;, and &amp;, and converts the &nbsp; entity to the Unicode character \u00a0 (non-breaking space).

AntiSamy normalizes line endings \r, \n, and \r\n to \n.
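The effect of the entity and line-ending handling on text content can be approximated as follows. This is a Python sketch of the observable behavior only, not AntiSamy code, and normalize_text is a hypothetical helper name.

```python
import html
import re


def normalize_text(text):
    """Approximate the entity and line-ending handling for text content."""
    # Decode entities first, so that &nbsp; becomes U+00A0.
    decoded = html.unescape(text)
    # Re-encode only the markup-significant characters <, >, and &.
    encoded = html.escape(decoded, quote=False)
    # Collapse \r\n and bare \r into \n.
    return re.sub(r"\r\n|\r", "\n", encoded)
```

With this sketch, the input "a < b &amp; c&nbsp;d\r\ne\rf" becomes "a &lt; b &amp; c\u00a0d\ne\nf".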

Blocked markup

Below is a list of intentionally blocked markup and the reasoning behind it.

JavaScript

All forms of JavaScript are filtered out. This includes, but is not limited to, JavaScript in <script> tags, external scripts, on* event attributes, and JavaScript in CSS. Allowing JavaScript to execute should be avoided at all costs, because it opens the application to XSS attacks, which pose the greatest threat.
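To make the vectors listed above concrete, the following Python sketch reports where JavaScript can hide in a fragment of markup. It is an illustration only: the actual filtering is driven by the white list rather than by detecting known bad patterns, and find_script_vectors is a hypothetical name. The checks shown are examples and are not exhaustive.

```python
from html.parser import HTMLParser


class ScriptVectorFinder(HTMLParser):
    """Collect a description of each JavaScript vector found in the markup."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.findings.append("script tag")
        for name, value in attrs:
            value = (value or "").strip().lower()
            if name.startswith("on"):
                # onclick, onload, onerror, and so on
                self.findings.append(f"event handler: {name}")
            elif name in ("href", "src") and value.startswith("javascript:"):
                self.findings.append(f"javascript: URL in {name}")
            elif name == "style" and (
                "expression(" in value or "javascript:" in value
            ):
                self.findings.append("script in CSS")


def find_script_vectors(html_text):
    finder = ScriptVectorFinder()
    finder.feed(html_text)
    finder.close()
    return finder.findings
```

For example, find_script_vectors('<a href="javascript:alert(1)" onclick="go()">x</a>') reports both the javascript: URL and the onclick handler.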

Forms and form elements

Forms and form elements (for example, input boxes, submit buttons, checkboxes, and so on) are filtered to prevent phishing attacks. If a third party were able to inject its own forms, users could be tricked into entering their information, believing that the form belongs to the application.

<iframe> tags

The <iframe> tags are blocked for two reasons. First, the HTML inside the iframe cannot be filtered unless the application requests the framed page to obtain that HTML, which is not practical.

In addition, someone could insert an iframe whose source is a website that breaks out of the frame and hijacks the application by redirecting every user to a website of the attacker's choice.

<html>, <head>, <body>, <title>, and <meta> tags, and the <!DOCTYPE> declaration

These are all blocked, because there should be only one each of the <html>, <head>, <body>, and <title> tags, and these are set by the application. The <meta> tags are filtered because they can cause an HTML redirect (for example, through http-equiv="refresh"). The <!DOCTYPE> declaration is filtered because the application sets the DOCTYPE and only one can be defined.

External style sheets

Although the AntiSamy library can retrieve external style sheets and validate them, allowing them would still be unsafe. For example, an attacker could identify the user agent of the crawler that validates the style sheet and always serve it clean CSS, while serving malicious CSS to every other user agent.

id and name attributes

The id and name attributes are blocked because JavaScript relies on them. If HTML with duplicate ids or names were inserted into the page, it could cause unexpected behavior.

Absolute, relative, and fixed positioning

These CSS values are blocked because they would allow someone to position inserted HTML elsewhere in the application and make it appear to come from the application. They could also be used to cover page elements and prevent users from interacting with them (for example, by absolutely positioning a div over the navigation).

Bad HTML and CSS

Made-up elements, styles, attributes, and values are all filtered out. Invalid styles such as "width:5;" (no units) or "madeupstyle: 10px;" are blocked. Unclosed tags are automatically closed, if possible. Tags that try to break out of the container (for example, markup starting with </div>) are also filtered out.
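The style checks described above can be pictured as matching each declaration against a per-property pattern. The following Python sketch assumes a tiny, hypothetical white list (ALLOWED_STYLES) with simplified patterns; the real rules are defined in the AntiSamy white list file and are more complete. Splitting on semicolons is also a simplification.

```python
import re

# Simplified, hypothetical value patterns for illustration only.
LENGTH = re.compile(r"0|\d+(\.\d+)?(px|em|rem|%|pt)")
COLOR = re.compile(r"#[0-9a-f]{3}|#[0-9a-f]{6}|[a-z]+")

ALLOWED_STYLES = {"width": LENGTH, "height": LENGTH, "color": COLOR}


def filter_style(style):
    """Keep only declarations whose property and value are both allowed."""
    kept = []
    for declaration in style.split(";"):
        prop, sep, value = declaration.partition(":")
        prop, value = prop.strip().lower(), value.strip().lower()
        pattern = ALLOWED_STYLES.get(prop)  # None for made-up properties
        if sep and pattern and pattern.fullmatch(value):
            kept.append(f"{prop}: {value}")
    return "; ".join(kept)
```

For example, filter_style("width:5; madeupstyle: 10px; height: 10px; color: red") returns "height: 10px; color: red": the width has no unit and madeupstyle is not a known property, so both are dropped.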

Images

The <img> tags and url() functions in CSS are filtered out. However, because they may be required, their rules are only commented out in the white list file and can be restored. They are blocked because images can be used to track users each time the images are viewed. Images also raise the concern of mixed content warnings. For example, if users are on an HTTPS page (as most would be) and the page contains an image link that points to an HTTP image, the browser displays a mixed content warning.
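If images are re-enabled, the mixed content case described above can be checked before an image URL is accepted. The following Python sketch shows the comparison. The is_mixed_content helper and the example URLs are hypothetical and are not part of AntiSamy.

```python
from urllib.parse import urlparse


def is_mixed_content(page_url, image_url):
    """Return True when an HTTPS page references an image over plain HTTP."""
    page_scheme = urlparse(page_url).scheme
    image_scheme = urlparse(image_url).scheme
    # Relative URLs have an empty scheme and inherit the page's scheme,
    # so they are not mixed content.
    return page_scheme == "https" and image_scheme == "http"
```

For example, is_mixed_content("https://app.example.com/page", "http://cdn.example.com/a.png") returns True, while the same image served over HTTPS, or referenced by a relative path, returns False.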

<style> tag

All <style> tags are removed because <style> tags should not be defined outside the <head> element. Also, if a <style> tag were inserted into the page with CSS such as td { color: red; border: 1px solid black; }, then every table cell in the iframe would have red text and a border.