On Jun 16, 2020, a security advisory was released for the Ruby Sanitize library, describing an issue that could lead to a complete bypass of the library in its RELAXED config. I found this bug during a penetration test conducted by Securitum, and in this post I'll explain how I came up with the idea for the bypass and show how it worked.
This section contains basic information about the idea of HTML sanitizers and how they work. If you're familiar with this topic, you can safely skip to the next section.
Sanitize is a Ruby library that acts as an HTML sanitizer. Its goal is to take untrusted HTML markup, remove potentially unsafe elements and attributes, and produce new, "safe" HTML. Sanitize is allowlist-based, which means that it removes all elements and attributes that are not contained in a list. You can define your own allowlist (for instance, you may want to allow only <i> tags), but there are a few predefined ones. In this post I'll specifically talk about the config named RELAXED, with the following list of allowed elements:
a abbr address article aside bdi bdo blockquote body br caption cite code col colgroup data dd del dfn div dl dt figcaption figure footer h1 h2 h3 h4 h5 h6 head header hgroup hr html img ins kbd li main mark nav ol p pre q rp rt ruby s samp section small span strike style sub summary sup table tbody td tfoot th thead time title tr ul var wbr
Usually the work of a sanitizer can be summarized in three steps:

1. Parse the untrusted HTML markup into a DOM tree.
2. Walk the tree and remove all elements and attributes that are not in the allowlist.
3. Serialize the tree back to HTML markup.
As an example, let's see how a sanitizer would process the following markup:
ABC<script>alert(1)</script><img src=1 onerror=alert(2)>
It is parsed into the following DOM tree:

#document-fragment
├─ #text "ABC"
├─ script
│  └─ #text "alert(1)"
└─ img src="1" onerror="alert(2)"

The script element and the onerror attribute are not in the allowlist and would get deleted. This results in the following DOM tree:

#document-fragment
├─ #text "ABC"
└─ img src="1"

And after serialization:

ABC<img src="1">
To summarize, thanks to sanitization the initial unsafe HTML was rewritten to a safe form.
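The three steps above can be sketched with a toy allowlist sanitizer. This is only an illustration, not how Sanitize is implemented: it uses Python's standard-library parser instead of Gumbo, and the names ALLOWED_TAGS, ALLOWED_ATTRS and toy_sanitize are made up for this sketch, with an allowlist far smaller than RELAXED.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"div", "b", "img"}   # toy allowlist, not Sanitize's real one
ALLOWED_ATTRS = {"src", "alt"}
VOID = {"img", "br", "hr"}           # elements that have no closing tag

class ToySanitizer(HTMLParser):
    """Step 1: parse. Step 2: drop disallowed nodes. Step 3: re-serialize."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = 0  # nesting depth inside disallowed elements

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            if tag not in VOID:
                self.skipping += 1  # drop the element together with its content
            return
        if self.skipping:
            return
        parts = [tag]
        for name, value in attrs:
            if name in ALLOWED_ATTRS:  # attributes like onerror are dropped
                parts.append(f'{name}="{escape(value or "", quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag not in ALLOWED_TAGS:
            if tag not in VOID and self.skipping:
                self.skipping -= 1
            return
        if not self.skipping and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data))  # text is escaped on serialization

def toy_sanitize(markup):
    p = ToySanitizer()
    p.feed(markup)
    p.close()
    return "".join(p.out)

print(toy_sanitize("ABC<script>alert(1)</script><img src=1 onerror=alert(2)>"))
# -> ABC<img src="1">
```

Note that the script element disappears together with its content, while the img element survives with only its allowed attribute.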
The allowlist of the Sanitize library contains the style element; this looked promising because <style> exhibits some specific behavior. First of all, the HTML parser doesn't decode HTML entities within <style>. Consider the following HTML:

<div>I &lt;3 XSS</div> <style>I &lt;3 XSS</style>
It gets parsed to the following DOM tree:

#document-fragment
├─ div
│  └─ #text "I <3 XSS"
└─ style
   └─ #text "I &lt;3 XSS"

Note that the &lt; entity is decoded to < within <div>, but it is not decoded at all within <style>.
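The same distinction can be observed with Python's standard-library parser, which (like browsers) treats <style> content as raw text and leaves character references in it untouched; this is just a stand-in for the parser Sanitize actually uses, and the TextChunks class is invented for the demo.

```python
from html.parser import HTMLParser

class TextChunks(HTMLParser):
    """Collect the text chunks the parser reports for element content."""

    def __init__(self):
        # convert_charrefs decodes entities in normal text,
        # but deliberately not inside <script>/<style>
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

p = TextChunks()
p.feed("<div>I &lt;3 XSS</div><style>I &lt;3 XSS</style>")
p.close()
print(p.chunks)  # the <div> text arrives decoded, the <style> text does not
```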
Similarly, serialization of the text content of most elements yields escaped text (that is, &, ", < and > are substituted with &amp;, &quot;, &lt; and &gt; respectively), but for some elements, <style> included, the text is written out literally.
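The substitution applied to normal text nodes is the same one performed by Python's html.escape, shown here purely as an illustration of the encoding table:

```python
from html import escape

# & is replaced first so the other replacements aren't double-escaped
print(escape('&"<>'))  # -> &amp;&quot;&lt;&gt;
```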
Consider the following DOM tree:

#document-fragment
├─ div
│  └─ #text "I <3 XSS"
└─ style
   └─ #text "I <3 XSS"

It is serialized to:

<div>I &lt;3 XSS</div> <style>I <3 XSS</style>
Note that < was escaped to &lt; within <div> but not within <style>.
The fact that text is not escaped in <style> has an important effect: if the text content of <style> is equal to </style><img src onerror=alert(1)>, as in the following DOM tree:

#document-fragment
└─ style
   └─ #text "</style><img src onerror=alert(1)>"
Then it is serialized to:
<style></style><img src onerror=alert(1)></style>
And when the serialized form is parsed again, an XSS is introduced. This risk is directly mentioned in the HTML specification, which warns that serializing a tree and parsing the result may not produce the original tree.
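The round-trip problem is easy to reproduce: feeding the serialized string back into a parser (again Python's stdlib parser, standing in for a browser) yields a brand-new img element carrying the onerror attribute; the TagCollector class is made up for the demo.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record every start tag the parser emits, with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

p = TagCollector()
# re-parse the serialized form of the <style> element from above
p.feed("<style></style><img src onerror=alert(1)>")
p.close()
print(p.tags)
# the element's former "text content" now parses as a real <img>
# with an onerror handler
```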
The question that remains is: how do we get the text </style><img src onerror=alert(1)> into the text content of a <style> element in the first place?
The answer is...
The HTML specification has a Pandora's box of security issues called foreign content. In a nutshell, foreign content is introduced when an <svg> or <math> tag is opened in the markup, and the parsing changes drastically. The two characteristics of <style> I mentioned in the previous section are no longer true: if <style> is a descendant of <svg> or <math>, then HTML entities are decoded on parsing, and the text is HTML-encoded on serialization.
To show this with a simple example, we start with the following markup:

<svg><style>I &lt;3 XSS
It is parsed to the following DOM tree:

#document-fragment
└─ svg
   └─ style
      └─ #text "I <3 XSS"

Note that &lt; was decoded to <, which was not the case when <style> was outside of <svg>. The DOM tree is serialized back to:

<svg><style>I &lt;3 XSS</style></svg>
This time < was encoded to &lt;. While text is usually written literally within <style>, that rule no longer applies when <style> is a descendant of <svg>.
Going back to the point, let's see how the Sanitize library could be bypassed. The RELAXED config allowed the <style> element, but neither <svg> nor <math>. Sanitize uses the Google Gumbo parser, which is an HTML5-compliant parser, so it is fully aware of all the specifics I mentioned above. Sanitize sanitized CSS rules too, but the easy trick to smuggle arbitrary content into <style> was simply to put it inside a CSS comment.
So the bypass was as follows:
<svg><style>/*&lt;/style&gt;&lt;img src onerror=alert(1)&gt;*/
Let's go step-by-step. First, Sanitize parses the markup to the following DOM tree (because of the foreign content rules, everything after <style> becomes a single text node):

#document-fragment
└─ svg
   └─ style
      └─ #text "/*</style><img src onerror=alert(1)>*/"
<svg> is not in the allowlist and gets deleted. However, the content of <svg> is left behind. So at this point the DOM tree is as follows:

#document-fragment
└─ style
   └─ #text "/*</style><img src onerror=alert(1)>*/"
There is nothing more to sanitize (to the CSS sanitizer, the text content is just a comment), hence the tree is serialized to:

<style>/*</style><img src onerror=alert(1)>*/</style>
And this introduces the XSS.
The Ruby Sanitize library had a security issue that made it possible to bypass it completely, should a website use the RELAXED config. After my report, the default behavior of removing <svg> and <math> elements was altered so that their content is also removed.
I'd like to thank Ryan Grove (the maintainer of the library) for his quick reaction to my report; the fix was released 24 hours after my initial report. Great work!