Gerhard Fiedler wrote:
> Not really a problem, I think. 
> 
> <basePath>/./file
> <basePath>/../file
> <basePath>/../../path/path/file
> <basePath>//path/file
> 
> These should all work fine. (Note that duplication of the slash is
> generally not a problem. Depends on the file system of the server, but
> at least with Windows and Linux systems this works.)

Actually, it doesn't (or at least it shouldn't) depend on the filesystem at
all, only on the web server software. Only an extremely primitive server
would pass URL paths directly to the filesystem without doing some serious
sanitizing of its own first -- it's a huge security hole, otherwise.
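As a rough illustration of the kind of cleanup I mean, here is a minimal sketch in Python (my choice of language; nothing in this thread depends on it). The function name safe_relative_path is made up for the example, and a real server does considerably more than this, so treat it as a picture of the idea and not as a complete defense.

```python
import posixpath
from urllib.parse import unquote

def safe_relative_path(url_path):
    """Reduce a URL path to a relative path that cannot climb above the root.

    Hypothetical helper: it only handles percent-decoding, duplicate
    slashes, and "." / ".." segments.
    """
    decoded = unquote(url_path)  # "%2e%2e" becomes ".."
    # Anchor the path at "/" so that normpath() discards any ".." that
    # would otherwise climb above the root, and collapses "//" and "/./".
    normalized = posixpath.normpath("/" + decoded.lstrip("/"))
    # Drop the leading slash; the result is meant to be joined onto the
    # server's document root.
    return normalized.lstrip("/")

print(safe_relative_path("/a//b/./c"))
print(safe_relative_path("/../../etc/passwd"))
```

Only after a step like this would the server join the result onto its own document root, which is why the filesystem never sees the doubled slashes or the dot segments.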

> That's what a relative path is: it's the path portion after the base
> path. So I don't really understand what you think you may need to parse.
> Just append the relative path to the base path (with a trailing slash,
> e.g. "http://myserver/myTopLevelDir/") and you should be done.
> 
> To find out whether a path is relative or absolute depends on what kind
> of paths you can expect on the input. If it's either a complete http URL
> (starting with "http:") or a relative path, then that's it: check for a
> starting "http:".

The presence or absence of the optional scheme field does not determine
whether a URL path is absolute or relative -- it's the leading slash after
that which tells you: a double slash introduces a server name, and a single
slash starts an absolute path on the current server.
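To make that concrete, here is a small sketch in Python using the standard library's urllib.parse.urlsplit. The function name path_kind is hypothetical, and the test URLs are made-up examples.

```python
from urllib.parse import urlsplit

def path_kind(ref):
    """Return "absolute" or "relative" for the path in a URL reference.

    Hypothetical helper: the scheme is deliberately ignored, because it
    says nothing about whether the path is absolute.
    """
    parts = urlsplit(ref)
    if parts.netloc:
        # A double slash introduced a server, so the path starts at its root.
        return "absolute"
    if parts.path.startswith("/"):
        # A single leading slash: absolute path on the current server.
        return "absolute"
    return "relative"

for ref in ("http://myserver/dir/file", "//myserver/dir/file",
            "/dir/file", "dir/file", "http:dir/file"):
    print(ref, "->", path_kind(ref))
```

The last case is the interesting one: "http:dir/file" has a scheme but still carries a relative path, so checking for a leading "http:" would misclassify it.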

> It seems to me that a regex parser is a bit overboard for this :)

Yes.

What's actually needed here is a specific "URL parser" function that can
separate a URL into its components, each of which is optional:

   scheme, user, password, host, port, path, item, query, fragment

i.e.,

   scheme://user:password@host:port/path/item?query#fragment

along with an indication of whether the path is absolute or relative.

Note that some of these components are specific to the "http" scheme.
Note also that most documentation (e.g., the RFCs) does not mention what
I'm calling "item" here (the last name in the path) -- but in most cases,
it's significant enough to call out separately.
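A parser along these lines can be sketched in a few lines of Python on top of urllib.parse.urlsplit. The function name split_url and the dictionary keys are my own invention for illustration, as is the example URL; splitting the "item" off the path is done by hand, since the library does not do it.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into its components (hypothetical helper).

    Components that are absent come back as None.
    """
    parts = urlsplit(url)
    # Everything up to and including the last "/" is the path;
    # whatever follows it is the item.
    head, slash, item = parts.path.rpartition("/")
    return {
        "scheme": parts.scheme or None,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "path": (head + slash) or None,
        "item": item or None,
        "query": parts.query or None,
        "fragment": parts.fragment or None,
        # Absolute if a server was given or the path starts with "/".
        "absolute": bool(parts.netloc) or parts.path.startswith("/"),
    }

full = split_url("http://user:secret@example.com:8080/path/item?query#fragment")
for name, value in full.items():
    print(name, "=", value)
```

Feeding it something like "../img/logo.png#top" yields only a path, an item and a fragment, with "absolute" set to False.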

For simplicity, in what follows, I'll just use the word "server" to refer
to the collection of user, password, host and port (RFC2396 calls this the
"authority component"), and "item" will include any fragment or query.

In the completely general case, you process three URLs in sequence to
figure out what the pieces of the final URL need to be:

1. the URL of the original XML file

   This gives you a default scheme, server, and path; ignore the item
   (although you may need it if the partial URL begins with "#").

2. the base URL given inside the XML file
   (in the general case, this may be optional)

   This may or may not update the scheme and/or server components.
   In any case, if it's absolute, you replace the original path component;
   otherwise, you combine the two together.

3. the "partial" URL for a new item given inside the XML file

   This gives you at a minimum the item (which might be just a fragment).
   This may or may not also update the scheme and/or server components.
   And again, if it's absolute, you replace the original path component;
   otherwise, you combine the path given here with the results of the
   previous step.

Only then can you put the components back together correctly in order to
create the full absolute URL for the new item.
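If you happen to be working in Python, the three steps above map fairly directly onto two calls to the standard library's urllib.parse.urljoin. This is only a sketch: the function name resolve and all of the URLs below are made up for illustration, and I'm assuming the base URL from the XML file is passed as None when it is absent.

```python
from urllib.parse import urljoin

def resolve(document_url, base_url, partial_url):
    """Resolve partial_url to a full absolute URL (hypothetical helper).

    document_url: URL the XML file was fetched from (step 1)
    base_url:     base URL given inside the XML file, or None (step 2)
    partial_url:  the partial URL for the new item (step 3)
    """
    # Step 2: the base may replace the scheme/server, replace the path
    # (if absolute), or be combined with the document's path (if relative).
    effective_base = urljoin(document_url, base_url) if base_url else document_url
    # Step 3: the same rules again, applied to the partial URL.
    return urljoin(effective_base, partial_url)

doc = "http://myserver/docs/index.xml"
print(resolve(doc, "/myTopLevelDir/", "sub/file.html"))
print(resolve(doc, None, "#top"))
print(resolve(doc, "//other:8080/base/", "file"))
```

Note that urljoin also collapses any "./" and "../" segments on its own while it combines the paths.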

I'd still let the server worry about any /../ or /./ in the final path.

-- Dave Tweed