I noticed that there are websites that include newlines and leading or trailing whitespace in canonical links. Linkedom does not automatically trim the result. Here is an example:
import { parseHTML } from 'linkedom';
const html = `<!DOCTYPE html><html><head><title>Test</title><link rel='canonical' href=" https://example.com/
"></head><body></body></html>`;
const { document } = parseHTML(html);
console.log(`canonical link: (not trimmed) length=${document.querySelector('html > head > link[rel="canonical"]').href.length}`);
console.log(`canonical link: trimmed length=${document.querySelector('html > head > link[rel="canonical"]').href.trim().length}`);
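
For what it's worth, consumers can work around this today by trimming the value themselves; a minimal sketch (the trim here is just a caller-side workaround, not a proposal for the fix):

import { parseHTML } from 'linkedom';

const html = `<!DOCTYPE html><html><head><title>Test</title><link rel='canonical' href=" https://example.com/
"></head><body></body></html>`;
const { document } = parseHTML(html);

const link = document.querySelector('html > head > link[rel="canonical"]');
// Trim on the consumer side until/unless linkedom trims per spec.
const canonical = link.href.trim();
console.log(canonical); // https://example.com/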
My understanding is that, according to the spec, it should be trimmed. The relevant section at https://url.spec.whatwg.org/#dom-url-href is quite long, but the setter steps boil down to running the basic URL parser on the given value; for example, it states:
The protocol setter steps are to basic URL parse the given value, followed by U+003A (:), with this’s URL as url and scheme start state as state override.
And the referenced "basic URL parser" begins with:
Remove any leading and trailing C0 control or space from input.
Firefox and Chrome also trim.
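
If trimming is the desired direction, the fix would essentially mirror that first step of the basic URL parser. A rough sketch of just that removal step (plain JavaScript, not the actual linkedom implementation):

// Remove leading and trailing C0 controls (U+0000-U+001F) and U+0020 SPACE,
// per the first step of the WHATWG basic URL parser.
function stripLeadingTrailingC0ControlOrSpace(input) {
  let start = 0;
  let end = input.length;
  while (start < end && input.charCodeAt(start) <= 0x20) start++;
  while (end > start && input.charCodeAt(end - 1) <= 0x20) end--;
  return input.slice(start, end);
}

console.log(JSON.stringify(stripLeadingTrailingC0ControlOrSpace(' https://example.com/\n')));
// -> "https://example.com/"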
I can take a look and create a pull request, but I wanted to confirm first that trimming is the desired result in linkedom for this example.