URI parsing using Bash built-in features

A bit of background

A while ago I posted an article describing how to parse complete URIs in Bash using the sed program. Since then, I have realized that there is a much better way to do it: Bash built-in pattern matching.
Here are some benefits of this improvement:

  • It no longer executes external programs (i.e. sed) for pattern matching. This translates to higher speed and lower memory and CPU usage, which means that you could use this parser for much more intense URI crunching.
  • The new regular expressions are drastically simplified thanks to the ${BASH_REMATCH[*]} array, which can hold more than 9 matched sub-expressions, unlike sed, which only supports single-digit escapes: \1-\9 (yuck!).
  • The parsing algorithm is contained in a single Bash function, so no external file is needed to hold the regular expressions. This also means, obviously, that the pattern file is no longer loaded from disk on every execution (so disk access is saved as well).
  • The generated variables are named identically to the first version, so you should be able to upgrade your scripts to this version with absolutely minimal effort.
  • [Edit]
    No eval instruction is needed (unlike in the first version), further improving performance.
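To illustrate the point about more than 9 sub-expressions, here is a minimal sketch that is not part of the parser itself. The eleven-letter input and the variable names group10 and group11 are made up for this example; the only thing it relies on is that [[ ... =~ ... ]] fills BASH_REMATCH with one entry per capture group.

```bash
#!/bin/bash

# Eleven capture groups, one per letter (sample data made up for this sketch).
pattern='^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)$'
input='abcdefghijk'

if [[ $input =~ $pattern ]]; then
    # Index 0 holds the whole match; capture groups start at index 1.
    echo "whole match = ${BASH_REMATCH[0]}"
    # Groups 10 and 11 are reachable like any other array element.
    group10=${BASH_REMATCH[10]}
    group11=${BASH_REMATCH[11]}
    echo "group 10    = $group10"
    echo "group 11    = $group11"
fi
```

With sed you would have to split such a pattern into several passes, because \10 is read as \1 followed by a literal 0.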

Functionality

  • The function takes the URI as its parameter(s) and parses it. If parsing succeeds it returns 0 (zero); if the URI is malformed it returns 1 (callers should treat any non-zero value as failure).
  • The URI components are placed in global variables named identically to those from the first article. There is an example script further down, so you will get the picture either way.

Expected URI syntax

The URI format that is understood by this script (hopefully without bugs) is as follows:

[schema://][user[:password]@]host[:port][/path][?[arg1=val1]...][#fragment]
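In other words, only the host is mandatory and everything else may be left out. A rough sketch of what that means in practice follows. The pattern is modeled on the parser's top-level expression, while the describe helper and the sample URIs are made up purely for this illustration.

```bash
#!/bin/bash

# Pattern modeled on the parser's top-level expression (an assumption of this sketch).
pattern='^(([a-z]{3,5})://)?((([^:\/]+)(:([^@\/]*))?@)?([^:\/?#]+)(:([0-9]+))?)(\/[^?#]*)?(\?[^#]*)?(#.*)?$'

# Hypothetical helper: print the components that were found in a URI.
describe() {
    if [[ $1 =~ $pattern ]]; then
        echo "$1 -> schema='${BASH_REMATCH[2]}' user='${BASH_REMATCH[5]}' host='${BASH_REMATCH[8]}' port='${BASH_REMATCH[10]}' path='${BASH_REMATCH[11]}'"
    else
        echo "$1 -> no match"
    fi
}

describe 'www.example.com'                 # host only
describe 'ftp://user@ftp.example.com/pub'  # schema, user, host and path
describe 'localhost:8080?debug=1'          # host, port and query
describe '://broken'                       # no host before the ':', so it is rejected
```

Components that are absent simply come back as empty strings, which is how the parser's global variables behave as well.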

The actual parser code

This is the actual function. You can embed this straight into your Bash script and it will be available for execution. This method greatly simplifies the usage and even debugging of this parser.

#
# URI parsing function
#
# The function creates global variables with the parsed results.
# It returns 0 if parsing was successful or non-zero otherwise.
#
# [schema://][user[:password]@]host[:port][/path][?[arg1=val1]...][#fragment]
#
function uri_parser() {
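    # uri capture (all arguments joined into one string)
    uri="$*"

    # working variables stay local; only the uri_* results are global
    local pattern path query name value name_re count

    # top level parsing ('#' ends the host and the path, so a fragment
    # is recognized even when there is no query string)
    pattern='^(([a-z]{3,5})://)?((([^:\/]+)(:([^@\/]*))?@)?([^:\/?#]+)(:([0-9]+))?)(\/[^?#]*)?(\?[^#]*)?(#.*)?$'
    [[ "$uri" =~ $pattern ]] || return 1

    # component extraction
    uri=${BASH_REMATCH[0]}
    uri_schema=${BASH_REMATCH[2]}
    uri_address=${BASH_REMATCH[3]}
    uri_user=${BASH_REMATCH[5]}
    uri_password=${BASH_REMATCH[7]}
    uri_host=${BASH_REMATCH[8]}
    uri_port=${BASH_REMATCH[10]}
    uri_path=${BASH_REMATCH[11]}
    uri_query=${BASH_REMATCH[12]}
    uri_fragment=${BASH_REMATCH[13]}

    # path parsing (the array is reset so repeated calls do not mix results)
    uri_parts=()
    count=0
    path="$uri_path"
    pattern='^/+([^/]+)'
    while [[ $path =~ $pattern ]]; do
        uri_parts[count]=${BASH_REMATCH[1]}
        path="${path:${#BASH_REMATCH[0]}}"
        count=$((count + 1))
    done

    # query parsing
    uri_args=()
    count=0
    query="$uri_query"
    pattern='^[?&]+([^=& ]+)(=([^&]*))?'
    name_re='^[A-Za-z0-9_]+(\[[0-9]+\])?$'
    while [[ $query =~ $pattern ]]; do
        name=${BASH_REMATCH[1]}
        value=${BASH_REMATCH[3]}
        query="${query:${#BASH_REMATCH[0]}}"
        uri_args[count]=$name
        count=$((count + 1))
        # uri_arg_<name> is assigned without eval, so the value is never
        # executed; names that are not plain identifiers (optionally with a
        # numeric index) are skipped. printf -v needs Bash 4.1+ for array
        # elements such as array[0].
        if [[ $name =~ $name_re ]]; then
            printf -v "uri_arg_${name}" '%s' "$value"
        fi
    done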

    # return success
    return 0
}
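Both while loops in the function rely on the same trick: match one token at the start of the string, store it, cut the matched prefix off and repeat until nothing matches any more. Here is a stripped-down sketch of just that loop. The sample path and the names rest and segments are assumptions made for this illustration, not part of the parser.

```bash
#!/bin/bash

# Sample input and variable names are made up for this sketch.
rest='/dir1//dir2/file.php'
pattern='^/+([^/]+)'
segments=()

while [[ $rest =~ $pattern ]]; do
    # BASH_REMATCH[1] is the segment itself,
    # BASH_REMATCH[0] is the segment plus its leading slashes.
    segments+=("${BASH_REMATCH[1]}")
    # Drop the matched prefix so the next iteration starts at the next token.
    rest=${rest:${#BASH_REMATCH[0]}}
done

printf 'segment: %s\n' "${segments[@]}"
```

Because the pattern is anchored with ^, the loop stops as soon as the remaining text no longer starts with a valid token, so it cannot run forever on odd input.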

A demonstration

OK, I wouldn’t leave you hanging without an example, would I? This example uses the sample URI from the first version and parses it. The difference is that this is an actual Bash script, so you can paste the parser function at the top (after #!/bin/bash) and just run it to see the parser in action.

#!/bin/bash

# parser function goes here...

# sample URI (including an injection attack)
uri='http://user:pass@www.example.com:19741/dir1/dir2/file.php?param=some_value&array[0]="123"&param2=\`cat /etc/passwd\`#bottom-left'

# perform parsing and handle failure
uri_parser "$uri" || { echo "Malformed URI!"; exit 1; }

# main uri
echo "uri               = $uri"

# main uri components
echo "uri_schema        = $uri_schema"
echo "uri_address       = $uri_address"
echo "uri_user          = $uri_user"
echo "uri_password      = $uri_password"
echo "uri_host          = $uri_host"
echo "uri_port          = $uri_port"
echo "uri_path          = $uri_path"
echo "uri_query         = $uri_query"
echo "uri_fragment      = $uri_fragment"

# path segments
echo "uri_parts[0]      = ${uri_parts[0]}"
echo "uri_parts[1]      = ${uri_parts[1]}"
echo "uri_parts[2]      = ${uri_parts[2]}"

# query arguments
echo "uri_args[0]       = ${uri_args[0]}"
echo "uri_args[1]       = ${uri_args[1]}"
echo "uri_args[2]       = ${uri_args[2]}"

# query arguments values
echo "uri_arg_param     = $uri_arg_param"
echo "uri_arg_array[0]  = ${uri_arg_array[0]}"
echo "uri_arg_param2    = $uri_arg_param2"

Expected output

The demonstration above should output the following lines of text.
Check to see if it matches and give me some feedback with the results! 😀

uri               = http://user:pass@www.example.com:19741/dir1/dir2/file.php?param=some_value&array[0]="123"&param2=\`cat /etc/passwd\`#bottom-left
uri_schema        = http
uri_address       = user:pass@www.example.com:19741
uri_user          = user
uri_password      = pass
uri_host          = www.example.com
uri_port          = 19741
uri_path          = /dir1/dir2/file.php
uri_query         = ?param=some_value&array[0]="123"&param2=\`cat /etc/passwd\`
uri_fragment      = #bottom-left
uri_parts[0]      = dir1
uri_parts[1]      = dir2
uri_parts[2]      = file.php
uri_args[0]       = param
uri_args[1]       = array[0]
uri_args[2]       = param2
uri_arg_param     = some_value
uri_arg_array[0]  = "123"
uri_arg_param2    = \`cat /etc/passwd\`

Conclusion

So, that’s it! A far better URI parser implemented straight in Bash. However, I think it could be taken even further with your help so please give it a try and write some feedback!

Enjoy! 😀

5 comments

  1. Works perfectly.

    I copied the test code to ‘test.sh’, then copied your expected output to ‘expected’. Then ran diff on them:

    $ bash test.sh | diff -rcs expected -
    Files expected and - are identical
