Description
This issue discusses a performance problem that stems from DOMPurify's dependence on the DOM.
Background & Context
I work on a note-taking app that essentially renders user-provided HTML documents. Those documents must be sanitized, and my goal is to do that as efficiently as possible. The problem is that, given how DOMPurify works, I can't implement this well enough.
Initially I was just doing the following:
- Turn the HTML into a DOM with DOMParser.
- Sanitize the DOM with DOMPurify.
But there are multiple problems with that approach:
- First, sanitization time scales with the size of the input, and sanitization is synchronous, so it blocks the thread. It follows that I can't run it on the main thread if I want to provide great performance to my users under pretty much all scenarios.
- Second, when the user is editing the HTML document I probably don't need to sanitize it fully again. I can split the HTML into top-level tags and sanitize each of them individually. With some caching, only the top-level tags that changed will be sanitized again, which should be a massive speedup.
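The caching idea in the second point could look roughly like the sketch below. It assumes the document has already been split into top-level chunks by a real HTML parser, and that each chunk can be sanitized independently of its siblings. `createChunkSanitizer` and the `sanitize` callback are hypothetical names used for illustration, not part of DOMPurify.

```typescript
// Hypothetical helper: caches sanitized output per top-level chunk.
// `sanitize` stands in for whatever sanitizer is used (e.g. a wrapper around DOMPurify).
type Sanitize = (html: string) => string;

function createChunkSanitizer(sanitize: Sanitize) {
  // Maps raw chunk source to its sanitized output from the previous run.
  let cache = new Map<string, string>();

  return function sanitizeChunks(chunks: string[]): string[] {
    const next = new Map<string, string>();
    const output = chunks.map((chunk) => {
      // Reuse a result from this run or the previous one if the chunk is unchanged.
      let clean = next.get(chunk) ?? cache.get(chunk);
      if (clean === undefined) {
        clean = sanitize(chunk);
      }
      next.set(chunk, clean);
      return clean;
    });
    // Keep only chunks that are still in the document, so the cache cannot grow unbounded.
    cache = next;
    return output;
  };
}
```

With this shape, an edit that touches one top-level tag triggers one `sanitize` call on the next run; every other chunk is served from the cache.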
I can address the second problem pretty easily, but the first one can't be addressed, because DOMPurify relies on the DOM and the DOM APIs aren't available in a worker context. Userland implementations of the DOM APIs, like jsdom, come with their own major issues (jsdom in particular is slow and massive, in my experience). As for alternative non-DOM-based HTML sanitization libraries like js-xss, I just don't trust them, if they work at all.
So assuming it's in DOMPurify's interest to be able to be used efficiently, what should be done to fix this?
Feature
I think the best way to address the issue is the following. DOMPurify already accepts a raw Node as input. If it "only" accepted a relatively simple NodeLike object too, that is, an object implementing a very restricted set of DOM-like APIs, then I as a user could parse my HTML string with a third-party HTML parser and provide a simple adapter from its output to DOMPurify. DOMPurify would work largely the same way, with potentially little change to its code, and users could run it in workers with relatively little work.
Essentially, I think it's fine that DOMPurify needs a DOM-like API, but requiring the entirety of the DOM APIs is a problem from a performance perspective, because it means DOMPurify can't be run in a worker.
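To make the NodeLike idea more concrete, here is a minimal sketch of what such a restricted surface might look like. Everything in it is an assumption for illustration: the `NodeLike` shape, the `pruneTree` function, and the allowlists are hypothetical, and the traversal is a toy, not DOMPurify's actual algorithm or API.

```typescript
// Hypothetical minimal node surface; DOMPurify does not currently define this.
interface NodeLike {
  nodeType: number;                   // 1 = element, 3 = text, mirroring the DOM constants
  nodeName: string;                   // lower-case tag name, or "#text"
  attributes: Record<string, string>; // attribute name -> value
  childNodes: NodeLike[];
}

const ELEMENT_NODE = 1;

// Toy allowlist walk over the descendants of `node`, only to show that a
// traversal needs very little of the DOM. The root itself is not checked.
function pruneTree(
  node: NodeLike,
  allowedTags: Set<string>,
  allowedAttrs: Set<string>
): void {
  // Drop disallowed element children; keep text and other non-element nodes.
  node.childNodes = node.childNodes.filter(
    (child) => child.nodeType !== ELEMENT_NODE || allowedTags.has(child.nodeName)
  );
  for (const child of node.childNodes) {
    // Remove attributes that are not on the allowlist.
    for (const name of Object.keys(child.attributes)) {
      if (!allowedAttrs.has(name)) {
        delete child.attributes[name];
      }
    }
    pruneTree(child, allowedTags, allowedAttrs);
  }
}
```

An adapter for a third-party parser would only need to expose these four fields, which plain objects can do in a worker without any DOM implementation.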
Additionally, it may be a good idea to provide an asynchronous version of the API that yields to the event loop every 5ms or so, so that no matter how big the input string is, the thread never freezes for an extended period.
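A rough sketch of that time-sliced pattern, written as a generic helper rather than as a DOMPurify API: `mapWithYield` is a hypothetical name, and the 5ms default budget simply mirrors the figure above. It assumes the work can be broken into independent items (for example, top-level chunks).

```typescript
// Hypothetical helper: applies `fn` to each item, yielding to the event loop
// whenever the current slice has run for at least `budgetMs` milliseconds.
async function mapWithYield<T, R>(
  items: T[],
  fn: (item: T) => R,
  budgetMs: number = 5
): Promise<R[]> {
  const results: R[] = [];
  let sliceStart = Date.now();
  for (const item of items) {
    results.push(fn(item));
    if (Date.now() - sliceStart >= budgetMs) {
      // A zero-delay timeout lets pending events and rendering run before we continue.
      await new Promise<void>((resolve) => setTimeout(resolve, 0));
      sliceStart = Date.now();
    }
  }
  return results;
}
```

Note that the budget is only checked between items, so a single very large item can still block for longer than the budget. Finer-grained yielding would have to happen inside the sanitizer's own traversal.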
I hope these potential improvements will be implemented, as currently I don't see a way to use DOMPurify with predictable and acceptable performance.