It is incorrect to "normalize" // in HTTP URL paths

62 points by pabs3 19 hours ago on hackernews | 60 comments

WesolyKubeczek | 17 hours ago

It is probably “incorrect”, but given the established actual usage over the decades, it’s most likely what you need to do nevertheless.

Not doing it is like punishing people for not using Oxford commas, or entering an hour long debate each time someone writes “would of” instead of “would have”. It grinds my gears too, but I have different hills to die on.

Etheryte | 17 hours ago

Not sure I agree. The correct thing is to not mess with the URL at all if you're unsure about what to be doing to it. Doing nothing is the easiest thing of them all, why not do that?

j16sdiz | 16 hours ago

Because you need some consistency or normalisation before applying ACLs or doing routing?
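
To illustrate why the ACL step and the routing step need to agree, here is a minimal Python sketch. The `/admin` prefix rule and both function names are hypothetical, invented for this example; it only shows that an access check on the raw path and a router that collapses slashes can disagree about the same request.

```python
import re

def acl_allows(raw_path: str) -> bool:
    """Hypothetical ACL: deny anything under /admin, checked on the raw path."""
    return not raw_path.startswith("/admin")

def route(raw_path: str) -> str:
    """Hypothetical router that collapses runs of slashes before matching."""
    return re.sub(r"/{2,}", "/", raw_path)

request_path = "//admin/users"
allowed = acl_allows(request_path)  # True: the raw path does not start with "/admin"
routed_to = route(request_path)     # "/admin/users": the router sees the protected path
```

Either both layers look at the raw path or both look at the same rewritten path; which of the two is a deployment decision, not something this sketch settles.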

jeroenhd | 16 hours ago

URL normalization is defined and it doesn't include collapsing slashes.

Note that you can include custom normalization rules (like collapsing slashes, tolower()ing the entire path, removing the query part of the URL), but that's not part of the standard. If you're doing anything extra, the risk of breaking stuff is on you.
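
As a rough illustration of the difference, the Python sketch below splits a URL with the standard library's `urlsplit`, which leaves the empty path segment in place, and then applies a slash-collapsing rule as a separate, explicitly custom step. The helper names `path_segments` and `collapse_slashes` are made up for this example.

```python
import re
from urllib.parse import urlsplit

def path_segments(url: str) -> list[str]:
    """Split the path of a URL into segments, keeping empty ones."""
    path = urlsplit(url).path
    return path.split("/")[1:]  # drop the empty string before the leading "/"

def collapse_slashes(path: str) -> str:
    """A custom, non-standard rule: merge runs of "/" into one."""
    return re.sub(r"/{2,}", "/", path)

print(path_segments("https://example.com/a//c/d"))  # ['a', '', 'c', 'd']
print(collapse_slashes("/a//c/d"))                  # /a/c/d
```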

Etheryte | 2 hours ago

If someone gives you a nonsense URL, the correct response is 404, not to try and guess what they could've maybe meant.

bazoom42 | 16 hours ago

If different clients do it differently, you have incompatibilities. This punishes everybody. Since normalizing // to / removes information which may be significant, the obviously correct choice is following the spec.

PunchyHamster | 16 hours ago

if it is significant, you coded your app wrong, plain and simple

jeroenhd | 16 hours ago

Of course not. It's an explicit feature of every specification.

Plenty of websites rewrite paths like /a/b/c/d into a backend service call like /?w=a&x=b&y=c&z=d. In that scheme, /a//c/d would rewrite to /?w=a&x=&y=c&z=d, something entirely distinct from /a/c/d working out to /?w=a&x=c&y=d.

It's not the application's fault that the people attempting to configure web server URLs don't know how web server URLs work.
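
A minimal Python sketch of such a positional rewrite, assuming the parameter names `w`, `x`, `y`, `z` from the example and a hypothetical helper `path_to_query`. It is not any particular web server's rewrite engine, only a way to see that an empty segment occupies a position.

```python
from urllib.parse import urlencode

PARAM_NAMES = ["w", "x", "y", "z"]  # names taken from the example above

def path_to_query(path: str) -> str:
    """Map positional path segments onto query parameters, keeping empty ones."""
    segments = path.split("/")[1:]  # drop the empty string before the leading "/"
    pairs = list(zip(PARAM_NAMES, segments))
    return "/?" + urlencode(pairs)

print(path_to_query("/a/b/c/d"))  # /?w=a&x=b&y=c&z=d
print(path_to_query("/a//c/d"))   # /?w=a&x=&y=c&z=d
print(path_to_query("/a/c/d"))    # /?w=a&x=c&y=d
```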

WesolyKubeczek | 3 hours ago

A sane configuration, of course, would collapse the slashes first, so it would be /?w=a&x=c&y=d&z=.

See, it’s because when we do these acrobatics with turning path elements into query parameters, we do it for humans, so they are more readable. Humans can make typos, and accidentally entering two slashes instead of one is not exactly unheard of.

If we do it for some other code, we shouldn’t be rewriting anything at all, and just use query parameters.

bazoom42 | 16 hours ago

Why?

MattJ100 | 17 hours ago

URL parsing/normalisation/escaping/unescaping is a minefield. There are many edge cases where every implementation does things differently. This is a perfect example.

It gets worse if you are mapping URLs to a filesystem (e.g. for serving files). Even though they look similar, URL paths have different capabilities and rules than filesystems, and different filesystems also vary. This is also an example of that (I don't think most filesystems support empty directory names).

mjs01 | 17 hours ago

// is useful if the server needs to serve both static files in the filesystem, and embedded files like a webpage. // can be used for embedded files' URL because they will never conflict with filesystem paths.

PunchyHamster | 16 hours ago

....just serve it from other paths

sfeng | 16 hours ago

What I’ve learned in doing this type of normalization is that whatever the specification says, you will always find some website that uses some insane url tweak to decide what content it should show.

PunchyHamster | 16 hours ago

We cut those and few others coz historically there were exploits relying on it

Nothing on web is "correct", deal with it

bensyverson | 11 hours ago

Yeah, I don’t get the point of these RFC “gotcha” posts. For instance, an email address can include the local (username) part in quotes, the domain can be a bracketed IP, the domain can include comments in parentheses, etc.

In practice, NO one uses weird forms like this, because it would be impossible to use most online services. Supporting pathological edge cases has literally no upside, but plenty of downside.

> I don’t get the point of these RFC “gotcha” posts

The author is pretty clearly trying to rely on something that is guaranteed by the spec (zero-length path components in a URI) but frustrated by poorly behaving implementations that take it for granted that it's okay to assume after spotting runs of consecutive slashes that they can be "normalized" into a single U+002F, even though it's not okay to assume that.

It's not a contrived, academic, "gotcha post". This person is frustrated, and it's not hard to make that out.

> In practice, NO one uses weird forms like this, because it would be impossible to use most online services.

That's not true. It's not as if buggy middleware is but one thing standing in their way of obtaining what they want among a sea of other obstacles that will loom in front immediately after the previous was overcome—and even if it were, they'd still be right to call them out; it really is the one thing causing them issues in pursuit of their use case. Web browsers cope with these URIs just fine (doing as they're supposed to).

bensyverson | 7 hours ago

> It's not a contrived, academic, "gotcha post". This person is frustrated, and it's not hard to make that out.

Where in the article do you get that impression? The closest I can see is "Sometimes it’s useful to have a separator between different parts of a path."

I get the author's point that the zero-width behavior should be supported, and I don't begrudge them getting the word out. But in the end, if a technically correct syntax is not widely supported, you have to choose whether that syntax is actually something you can depend on.

For example, RFC 3986 (URI) does not define a max length for the fragment (hash). It can be 2MB, it can be 16TB. I ran into the actual limits when I tried to store image data in the fragment and promptly crashed Safari (CVE-2013-0983). What did I do next? I abandoned that half-baked idea, because the amount of storage available for the fragment was completely undefined.

dale_glass | 16 hours ago

But maybe you should anyway.

Because maybe you use S3, which treats `foo/bar.txt` and `foo//bar.txt` as entirely separate things. Because to S3, directories don't exist and those are literally the exact names of the keys under which data is stored.

So you have script A concatenate "foo" + "/bar" and script B concatenate "foo/" + "/bar", and suddenly you have a weird problem.

I can't imagine a real use case where you'd think this is desirable.
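
The two-script mismatch can be sketched in Python as follows. A plain dict stands in for the object store here; this is not the S3 API, and `put` is a hypothetical helper.

```python
# A plain dict stands in for a key-value object store; this is not the S3 API.
store: dict[str, bytes] = {}

def put(prefix: str, name: str, data: bytes) -> str:
    """Store data under prefix + name with no normalization, returning the key."""
    key = prefix + name
    store[key] = data
    return key

key_a = put("foo", "/bar", b"written by script A")   # key is "foo/bar"
key_b = put("foo/", "/bar", b"written by script B")  # key is "foo//bar"
print(sorted(store))  # ['foo//bar', 'foo/bar']
```

Both writes succeed and the store ends up with two separate objects, which is where the out-of-sync data comes from.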

secondcoming | 15 hours ago

If a user of S3 knows that directories aren't real why would they expect directory-related normalisation to happen?

dale_glass | 5 hours ago

Precisely because of it. On Linux, /bin/bash, //bin/bash and /bin//bash are the exact same file, the same inode. They look somewhat off to people, but they're entirely harmless, so cleaning that up is an aesthetic choice, not something important.

On S3 they're different. Using the wrong paths causes weird issues, like not finding things you expect to find, or storing multiple versions of the same data out of sync.

Normalizing // to / means making S3 behave more like people expect.
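
For comparison, the filesystem side of that claim can be checked with a short Python sketch. It assumes a POSIX-style filesystem and uses a temporary file rather than /bin/bash; the function name is made up for this example.

```python
import os
import tempfile

def same_file_with_doubled_slash() -> bool:
    """Create a temp file and compare it against a path with a doubled slash."""
    with tempfile.TemporaryDirectory() as directory:
        plain = os.path.join(directory, "file.txt")
        with open(plain, "w") as handle:
            handle.write("x")
        doubled = directory + "//file.txt"
        # samefile compares device and inode numbers, not the path strings
        return os.path.samefile(plain, doubled)

print(same_file_with_doubled_slash())
```

On Linux this is expected to print True, because the kernel treats the interior run of slashes as a single separator during path lookup.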

Mordisquitos | 14 hours ago

> I can't imagine a real use case where you'd think this is desirable.

Not S3, but here's a literal real use case: the entry for the Iraqw word /ameeni (woman) in Wiktionary.

https://en.wiktionary.org/wiki//ameeni

If for whatever reason your S3 keys contained English words and their translations separated by a slash, you would have a real problem if one of your scripts were to concatenate woman, / and /ameeni as woman/ameeni instead of woman//ameeni in the English/Iraqw case.

kstrauser | 14 hours ago

If you’re working with a use case where that’s even possible, you need to URL-encode it like

  woman/%2Fameeni

Consider what happens if the language allowed trailing slashes. What would this path mean if ameeni/ happened to be a valid word?

  ameeni//ameeni

One of those would get the slash but it’s not clear which.

W3C says:

> The slash ("/", ASCII 2F hex) character is reserved for the delimiting of substrings whose relationship is hierarchical.
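
A small Python sketch of that encoding step, using `urllib.parse.quote` with `safe=""` so that a slash inside a component is percent-encoded. The helper `join_key` is hypothetical.

```python
from urllib.parse import quote, unquote

def join_key(*parts: str) -> str:
    """Join parts with "/", percent-encoding any "/" inside a part."""
    return "/".join(quote(part, safe="") for part in parts)

key = join_key("woman", "/ameeni")
print(key)                                   # woman/%2Fameeni
print([unquote(p) for p in key.split("/")])  # ['woman', '/ameeni']
print(join_key("ameeni/", "/ameeni"))        # ameeni%2F/%2Fameeni
```

With the data slashes encoded, the only literal "/" left is the separator, so the ambiguous case decodes back to its two original parts.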

zarzavat | 13 hours ago

Sounds like a Unicode problem. U+002F is not a letter codepoint and it's not appropriate to use as a letter given its history of being used for path separation. Iraqw slash should have its own code point.

Can they not just use a 3 like in Arabic?

realitylabs | 13 hours ago

This exact issue has derailed our main document store for the past several years. We have written a couple supporting applications specifically to address the fallout from this issue.

jgworks | 36 minutes ago

Another weird thing is that `foo/` is a valid object name in s3, so you can have both `foo/` and `foo` be different files.

leni536 | 16 hours ago

Wait until you try http:/example.com and http://////example.com in your browser.

stanac | 15 hours ago

In both cases I get https://example.com/ in FF.

tremon | 13 hours ago

Your first example is a valid uri but not a valid http url, because it's missing a host part. Your second example is not a valid uri, as the spec requires that [scheme]:// is followed by a host indicator.

Neither has much to do with / normalization, which applies to the path part of a valid uri.
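
For reference, this is how Python's `urllib.parse.urlsplit` splits those two strings. It is only one generic-syntax parser, not what a browser does; browsers follow the WHATWG URL Standard and repair both inputs, as the Firefox result above shows. The helper name is made up for this example.

```python
from urllib.parse import urlsplit

def authority_and_path(url: str) -> tuple[str, str]:
    """Return (netloc, path) as split by Python's standard library."""
    parts = urlsplit(url)
    return parts.netloc, parts.path

print(authority_and_path("http:/example.com"))       # ('', '/example.com')
print(authority_and_path("http://////example.com"))  # ('', '////example.com')
```

In both cases the host ends up empty and "example.com" lands in the path, so neither string names a host to this parser.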

janmarsal | 15 hours ago

i'm gonna do it anyway

renewiltord | 15 hours ago

I’m going to keep doing it.

echoangle | 15 hours ago

> Wait, are there any implementations that wrongly collapse double-slashes?

> nginx with merge_slashes

How can it be wrong if it is server-side? If the server wants to treat those paths equally, it can if it wants to.

It would only be wrong if a client does it and requests a different URL than the user entered, right?

leni536 | 15 hours ago

It can't be. It's the same confusion as "email address normalization" being wrong (for example when gmail ignores dots when mapping an address to an inbox).

It matters where the normalization happens, and server-side behavior is out-of-scope of these identifier RFCs.

OoooooooO | 15 hours ago

Yeah I would say that falls under the origin defining both paths as equivalent.

> Therefore, collapsing // to / in HTTP URL path segments is not correct normalization. It produces a different, non-equivalent identifier unless the origin explicitly defines those two paths as equivalent.

nginx is frequently used as a reverse proxy and not "the server" (or only to the extent that it's the client-facing server). Its defaults assume that it's fine to do a "normalization" pass to remove double slash, etc., even though that's potentially out of step with how the actual content/application server wishes to deal with those requests.

echoangle | 13 hours ago

That’s purely a server side configuration issue and has nothing to do with web standards though. There’s nothing that says that the internal communication on the server needs to follow the standards for user agents.

And at least according to this, the default setting is off so nginx actually is compliant unless you manually make it not be:

https://www.oreilly.com/library/view/nginx-http-server/97817...

EDIT: Actually it seems to be on by default:

https://nginx.org/en/docs/http/ngx_http_core_module.html#mer...

> That’s purely a server side configuration issue

When it's the default, it's not a case of someone having configured nginx to do the thing described, as is their prerogative. It's nginx's defaulting to doing the wrong thing and requiring specific configuration to do the right thing. The author's position is that this violates the RFCs.

> and has nothing to do with web standards though

Yes it does. Prescriptions for how intermediate servers are or are not to munge data before passing it to the origin server are written directly into the HTTP RFCs. They're filled with references to this.

> There’s nothing that says that the internal communication on the server needs to follow the standards for user agents.

And is there anyone arguing that that's the case here?

echoangle | 12 hours ago

> When it's the default, it's not a matter of someone configuring nginx to do the wrong thing. It's nginx's defaulting to doing the wrong thing and requiring specific configuration to do the right thing.

This assumes that „the reverse proxy requests a different URL upstream from what it got as a request“ is wrong. Who says that it is?

And as I said, it doesn’t seem to be the default. But I can also continue defend it being the default because I think even as a default on it wouldn’t be wrong.

EDIT: Actually it seems to be on by default:

https://nginx.org/en/docs/http/ngx_http_core_module.html#mer...

> Yes it is. Prescriptions for how intermediate servers are or are not to munge data before passing it to the origin server is written directly into the HTTP RFCs. It's filled with references to them.

Which RFC forbids a reverse proxy from rewriting the request URL?

If I have a legacy PHP app that expects values as query strings and I use a reverse proxy to map the URL path to those query strings, is that wrong too? Would it be wrong if my reverse proxy did that by default?

> This assumes that „the reverse proxy requests a different URL upstream from what it got as a request“ is wrong. Who says that it is?

For this case (double/multiple slash "normalization"), the author of this post is saying that—and they're saying RFC 3986 says so, too.

> Which RFC forbids a reverse proxy from rewriting the request URL?

Ibid.

> If I have a legacy PHP app that expects values as query strings and I use a reverse proxy to map the URL path to those query strings, is that wrong too? Would it be wrong if my reverse proxy did that by default?

Clearly, it's not wrong if you selected and/or configured a software package specifically for the purpose of providing that functionality. And clearly it is wrong if it were to do that when not configured to do anything other than act as generic middleware, with that software's creator(s) operating under the assumption that it's safe to do so all while arguing that it's standards-compliant.

echoangle | 12 hours ago

> For this case (double/multiple slash "normalization"), the author of this post is saying that—and they're saying RFC 3986 says so, too.

No. The RFC says that the rewritten URL is not considered the same URL. But nothing says that the reverse proxy has to request the same URL.

The rewrite is not a normalization, but nothing says that the reverse proxy is only allowed to do normalization.

> Clearly, it's not wrong if you selected and/or configured a software package specifically for the purpose of providing that functionality. And clearly it is wrong if it were to do that when not configured to do anything other than act as generic middleware, with that software's creator(s) operating under the assumption that it's safe to do so all while arguing that it's standards-compliant.

It’s not wrong and it is standards-compliant, because no standard says that the default has to be „pass the original URL on without rewriting it“.

mjmas | 12 hours ago

> And at least according to this, the default setting is off

It appears to not default to off on my install (AlmaLinux 10).

I just tested now. Cloudflare normalises ../ and ./ paths and then the nginx proxy appears to normalise // to /:

nginx log:

  1234:: - - [18/Apr/2026:12:59:05 +0000] "GET //test//doubleslash/url HTTP/1.1" 404 158 "-" "curl/8.19.0" "1234::"

lighttpd log:

  1234:: - - [18/Apr/2026:12:59:04 +0000] "GET /test/doubleslash/url HTTP/1.0" 404 158 "-" "curl/8.19.0"

echoangle | 12 hours ago

Actually I think you’re right, here it also says the default is on:

https://nginx.org/en/docs/http/ngx_http_core_module.html#mer...

Thanks for trying!

leni536 | 15 hours ago

I don't think it's incorrect for distinct paths to point to the same resource.

Of course you shouldn't assume that in a client. If you are implementing against an API don't deviate regarding // and trailing / from the API documentation.

domenicd | 14 hours ago

As some others have indirectly pointed out, this article conflates two things:

- URL parsing/normalization; and

- Mapping URLs to resources (e.g. file paths or database entries) to be served from the server, and whether you ever map two distinct URLs to the same resource (either via redirects or just serving the same content).

The former has a good spec these days: https://url.spec.whatwg.org/ tells you precisely how to turn a string (e.g., sent over the network via HTTP requests) into a normalized data structure [1] of (scheme, username, password, host, port, path, query, fragment). The article is correct insofar as the spec's path (which is a list of strings, for HTTP URLs) can contain empty string segments.

But the latter is much more wild-west, and I don't know of any attempt being made to standardize it. There are tons of possible choices you can make here:

- Should `https://example.com/foo//bar` serve the same resource as `https://example.com/foo/bar`? (What the article focuses on.)

- `https://example.com/foo/` vs. `https://example.com/foo`

- `https://example.com/foo/` vs. `https://example.com/FOO`

- `https://example.com/foo` vs. `https://example.com/fo%6f%` vs. `https://example.com/fo%6F%`

- `https://example.com/foo%2Fbar` vs. `https://example.com/foo/bar`

- `https://example.com/foo/` vs. `https://example.com/foo.html`

Note that some things are normalized during parsing, e.g. `/foo\bar` -> `/foo/bar`, and `/foo/baz/../bar` -> `/foo/bar`. But for paths, very few.

Relatedly:

- For hosts, many more things are normalized during parsing. (This makes some sense, for security reasons.)

- For query, very little is normalized during parsing. But unlike for pathname, there is a standardized format and parser, application/x-www-form-urlencoded [2], that can be used to go further and canonicalize from the raw query string into a list of (name, value) string pairs.
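
As a rough illustration of that second step, Python's `urllib.parse.parse_qsl` turns a raw query string into (name, value) pairs. It is Python's own form-urlencoded parser, not an implementation of the WHATWG algorithm, and the query string below is just an example value.

```python
from urllib.parse import parse_qsl

raw_query = "w=a&x=&y=c&z=d"

# keep_blank_values=True preserves the empty value for x
pairs = parse_qsl(raw_query, keep_blank_values=True)
print(pairs)  # [('w', 'a'), ('x', ''), ('y', 'c'), ('z', 'd')]

# The default drops blank values, which is itself a form of lossy normalization
lossy_pairs = parse_qsl(raw_query)
print(lossy_pairs)  # [('w', 'a'), ('y', 'c'), ('z', 'd')]
```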

Some discussions on the topic of path normalization, especially in terms of mapping the filesystem, in the URL Standard repo:

- https://github.com/whatwg/url/issues/552

- https://github.com/whatwg/url/issues/606

- https://github.com/whatwg/url/issues/565

- https://github.com/whatwg/url/issues/729

-----

[1]: https://url.spec.whatwg.org/#url-representation

[2]: https://url.spec.whatwg.org/#application/x-www-form-urlencod...

bryden_cruz | 14 hours ago

This exact ambiguity causes massive headaches when putting Nginx in front of a Spring Boot backend. Nginx defaults to merge_slashes on, so it silently 'fixes' the path. But Spring Security's strict firewall explicitly rejects URLs with // as a potential directory traversal vector and throws an error. It forces you to explicitly decide which layer in your infrastructure owns path normalization, because if Nginx passes it raw, the Java backend completely panics.
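
The ownership question can be sketched in Python with two stand-in layers. Neither function is the nginx or Spring Security API; `proxy_forward`, `strict_backend`, and `RejectedRequest` are hypothetical names, and the `merge_slashes` flag only mimics the idea of the nginx directive.

```python
import re

class RejectedRequest(Exception):
    """Raised by the hypothetical strict backend check."""

def proxy_forward(path: str, merge_slashes: bool) -> str:
    """Hypothetical proxy step: optionally collapse slash runs before forwarding."""
    return re.sub(r"/{2,}", "/", path) if merge_slashes else path

def strict_backend(path: str) -> str:
    """Hypothetical backend check that refuses any path containing "//"."""
    if "//" in path:
        raise RejectedRequest(path)
    return "handled " + path

print(strict_backend(proxy_forward("/api//users", merge_slashes=True)))  # handled /api/users
```

Calling `strict_backend(proxy_forward("/api//users", merge_slashes=False))` raises `RejectedRequest` instead, so the same client request succeeds or fails depending on which layer was told to own normalization.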

jeroenhd | 14 hours ago

What I don't understand about this setup is why a double slash could ever be a directory traversal attack in Spring Boot.

If you're proxying to another server that just assumes relative paths and doesn't do any kind of validation, I guess an extra / might cause reading files outside of the expected area? That'd be an extremely weird and awful setup that I don't think makes any sense in the context of Spring Boot.

Bender | 13 hours ago

NGinx, Kube-NGINX, Apache, Traefik all default to normalizing request paths per reference of RFC 3986 [1]. This behavior can be disabled when requests are proxied to resources on the back-end that require double-slashes. I only reference the RFC to describe what they are talking about, not why they default to merging. They all agreed on a decision as one was not made for them.

To generalize by saying "incorrect" is incorrect. The correct answer is that it depends on the requirements in the given implementation. Making such generalizations will just lead to endless arguing. If there is still any debate then a group must vote to deprecate and replace the existing RFC with a new RFC that requires that merging slashes MUST be either always enabled or always disabled using verbiage per RFC 2119 [2] and optionally RFC 6919 [3]. Even then one may violate an RFC if there is a need to do so and everyone has verified, documented and signed off that doing so has not introduced any security or other risks in the given implementation and if such a risk is identified that it will be remediated or mitigated in a timely manner.

[Edit] For clarification the reason I am linking to RFC 3986 is that it only defines path characteristics and does not explicitly say what to do or not to do. Arguments will persist until a new RFC is created rather than blog and stack overflow posts. Even then people may violate the RFC if they feel it is safe to do so. I do not know how to reword this to make it less confusing.

[1] - https://datatracker.ietf.org/doc/html/rfc3986

[2] - https://datatracker.ietf.org/doc/html/rfc2119

[3] - https://datatracker.ietf.org/doc/html/rfc6919

embedding-shape | 13 hours ago

> Making such generalizations will just lead to endless arguing

But 80% of all programming blog posts on the internet rely on being able to make sweeping generalizations across the ecosystem! Without this, we basically have nothing left to argue about.

Caring about tradeoffs, contexts, nuance and not just cargoculting our way into a distributed architecture for a app with 10 users just sounds so 90s and early 00s. We're now in the future and we're all outputting the same ̶t̶o̶k̶e̶n̶s̶ code, so obviously what is the solution in my case, surely must be the solution in your case too.

Bender | 12 hours ago

> Without this, we basically have nothing left to argue about.

My theory is that the codex [1] was created not to stop arguments but rather to shorten them so that we can find a path forward, get back to work and accomplish some mission.

[1] - https://www.youtube.com/watch?v=nfKFHTaGzuU

Both you and the original author cite the same RFC to support your arguments. Passages from RFC 3986 comprise the bulk of the original post.

The difference between the support for your argument and theirs is that they call out the specific sections in the RFC that they claim are relevant to the issue at hand and your comment only broadly references the RFC by name. In any case, even if they, too, merely gestured to its existence, claiming that it supports their position, then appearing here with a bare claim that RFC 3986 supports the opposing side without further elaboration is not exactly a strong candidate for a path to a fruitful resolution.

mjmas | 12 hours ago

Agreed. Reading through the RFC it certainly appears to support the blog article.

And looking around I found this SO answer noting nothing in the RFC:

https://stackoverflow.com/a/24661288

Bender | 12 hours ago

> In any case, even if they, too, merely gestured to its existence

That is entirely my point. If the author wants to disable merge slashes then they need to replace the RFC I linked to with one that explicitly says what to do or not do using strong verbiage that is explicit as I explained. Blog articles and Stack Overflow threads will not set a standard.

If people interpret the RFC differently than I in that they feel it is explicit vs vague then please contact all of the web daemon maintainers to have them correct their default behavior. Just know ahead of time that two of them are quite challenging to have these discussions with.

> That is entirely my point. If the author wants to disable merge slashes then they need to replace the RFC I linked to with one that explicitly says what to do or not do using strong verbiage that is explicit as I explained.

That doesn't seem to be the case. You said, "NGinx, Kube-NGINX, Apache, Traefik all default to normalizing request paths per reference of RFC 3986". That's a strong claim, not an appeal to ambiguity.

> Blog articles[…] will not set a standard.

Blog posts absolutely have the power to influence future developments. That's historically how it has worked. "RFC" stands for "Request For Comments".

Bender | 8 hours ago

> to influence future developments

This development work is already completed. New web daemons would likely just follow the precedent that has been set by the popular daemons so as not to cause confusion, unexpected behavior and even more arguments.

If a notable sized group of developers would like to contact all the web daemon maintainers I can list all their contact information. In my experience these developers and F5 are not very open to making sweeping changes but there is mostly no harm in trying. The representative should be someone thick skinned.

> This development work is already completed.

You're prevaricating. Earlier:

> they need to replace the RFC I linked to with one that explicitly says what to do or not do

Do they need to work on getting the RFC to be more explicit about the correct behavior or not?

Bender | 6 hours ago

It really isn't up to me. If enough developers find this to be an important issue then the first step would be to replace the RFC with a new one and then work with the existing web daemon developers to change their defaults. There should also be an effort to communicate these changes to all the internet companies world wide long in advance as this will be a breaking change for many people. Perhaps I am just jaded but I think this will break a lot of stuff and cause a lot of really bad maintenance windows and other fallout. Who among the developers is willing to take the lead on this? You appear to be very articulate and astute. Are you taking the lead?

bigbadfeline | 5 hours ago

If cxr takes the lead, I'll be happy to help too. I don't have much time but I can provide some support in this monumental struggle.

> It really isn't up to me. If enough developers find this to be an important issue then the first step would be to[…]

That's not what I asked. In one breath, you've said they need to take up that effort. In the next breath, you've said that it's a done deal. I'm asking: which is it?

Making up your mind (instead of perpetually moving the goalposts) is up to you.

nottorp | 12 hours ago

There are still email forms that refuse pluses in email addresses too...

mjmas | 12 hours ago

And there are different rules for the email in the envelope and the message. One allows the user part of the email to contain spaces and the other doesn't.