A bit of background
A while ago I posted an article describing how one could parse complete URIs in Bash using the sed program. Since then, I have realized that there is a better way to do it, a much better way: via Bash built-in pattern matching!
Here are some benefits of this improvement:
- It no longer executes external programs (i.e. sed) for pattern matching. This translates to higher speed and lower memory and CPU usage, which means that you could use this parser for much more intense URI crunching.
- The new regular expressions are drastically simplified thanks to the ${BASH_REMATCH[*]} array, which can hold more than 9 matched sub-expressions, unlike sed, which can only work with single-digit escapes: \1-\9 (yuck!).
- The parsing algorithm is contained in a single Bash function, so no external file is needed to hold the regular expressions. This also means, obviously, that the pattern file is no longer loaded from disk on every execution (so disk access is saved as well).
- The generated variables are named identically to the first version, so you should be able to upgrade your scripts to this version with absolutely minimal effort.
- [Edit] No eval instruction is needed (unlike in the first version), further improving performance.
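To illustrate the point about sub-expressions, here is a minimal sketch of Bash's built-in [[ =~ ]] matching with more than nine capture groups and no external process. The sample string and the variable names (sample, pattern) are invented for this illustration and are not part of the parser.

```bash
#!/bin/bash

# Hypothetical sample: eleven dot-separated fields, one capture group each.
sample='a.b.c.d.e.f.g.h.i.j.k'
pattern='^(.)\.(.)\.(.)\.(.)\.(.)\.(.)\.(.)\.(.)\.(.)\.(.)\.(.)$'

if [[ $sample =~ $pattern ]]; then
    # BASH_REMATCH[0] is the whole match; the groups start at index 1.
    echo "groups captured: $(( ${#BASH_REMATCH[@]} - 1 ))"
    echo "group 10 = ${BASH_REMATCH[10]}"
    echo "group 11 = ${BASH_REMATCH[11]}"
fi
```

Groups 10 and 11 are read by plain array index, which is exactly what single-digit escapes cannot express.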
Functionality
- The function takes the URI as its parameter(s) and parses it. If parsing succeeds it returns 0 (zero); if not (bad URI) it returns 1 (or any other non-zero integer).
- The URI components are placed in global variables identical to those from the first article. Either way, there is an example script further down, so you will get the picture.
Expected URI syntax
The URI format that is understood by this script (hopefully without bugs) is as follows:
[schema://][user[:password]@]host[:port][/path][?[arg1=val1]...][#fragment]
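As a rough check of which components are optional, the sketch below runs a few sample URIs through the top-level pattern used by the parser function in this article. The helper name uri_matches and the sample URIs are made up for this illustration.

```bash
#!/bin/bash

# Top-level pattern copied from the parser function in this article.
pattern='^(([a-z]{3,5})://)?((([^:\/]+)(:([^@\/]*))?@)?([^:\/?]+)(:([0-9]+))?)(\/[^?]*)?(\?[^#]*)?(#.*)?$'

# Hypothetical helper: reports whether one candidate URI matches the pattern.
uri_matches() {
    if [[ $1 =~ $pattern ]]; then
        echo "match:    $1 (host=${BASH_REMATCH[8]})"
    else
        echo "no match: $1"
    fi
}

uri_matches 'www.example.com'                    # host only
uri_matches 'ftp://user@example.com/pub'         # schema, user, host, path
uri_matches 'http://example.com:8080/?q=1#top'   # port, query, fragment
uri_matches 'http://:8080/'                      # rejected: the host is mandatory
```

Only the host is required; every bracketed part of the syntax line may be left out.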
The actual parser code
This is the actual function. You can embed this straight into your Bash script and it will be available for execution. This method greatly simplifies the usage and even debugging of this parser.
#
# URI parsing function
#
# The function creates global variables with the parsed results.
# It returns 0 if parsing was successful or non-zero otherwise.
#
# [schema://][user[:password]@]host[:port][/path][?[arg1=val1]...][#fragment]
#
function uri_parser() {
    # uri capture
    uri="$*"

    # top level parsing
    pattern='^(([a-z]{3,5})://)?((([^:\/]+)(:([^@\/]*))?@)?([^:\/?]+)(:([0-9]+))?)(\/[^?]*)?(\?[^#]*)?(#.*)?$'
    [[ "$uri" =~ $pattern ]] || return 1

    # component extraction
    uri=${BASH_REMATCH[0]}
    uri_schema=${BASH_REMATCH[2]}
    uri_address=${BASH_REMATCH[3]}
    uri_user=${BASH_REMATCH[5]}
    uri_password=${BASH_REMATCH[7]}
    uri_host=${BASH_REMATCH[8]}
    uri_port=${BASH_REMATCH[10]}
    uri_path=${BASH_REMATCH[11]}
    uri_query=${BASH_REMATCH[12]}
    uri_fragment=${BASH_REMATCH[13]}

    # path parsing (plain array assignment, so no eval is required)
    count=0
    uri_parts=()
    path="$uri_path"
    pattern='^/+([^/]+)'
    while [[ $path =~ $pattern ]]; do
        uri_parts[$count]="${BASH_REMATCH[1]}"
        path="${path:${#BASH_REMATCH[0]}}"
        count=$((count + 1))
    done

    # query parsing
    count=0
    uri_args=()
    query="$uri_query"
    pattern='^[?&]+([^= ]+)(=([^&]*))?'
    name_pattern='^[A-Za-z0-9_]+(\[[0-9]+\])?$'
    while [[ $query =~ $pattern ]]; do
        name="${BASH_REMATCH[1]}"
        value="${BASH_REMATCH[3]}"
        query="${query:${#BASH_REMATCH[0]}}"
        uri_args[$count]="$name"
        # printf -v stores the value verbatim in uri_arg_<name> without eval
        # (array subscripts need Bash 4.1+); names that cannot form a valid
        # variable name are only listed in uri_args
        if [[ $name =~ $name_pattern ]]; then
            printf -v "uri_arg_$name" '%s' "$value"
        fi
        count=$((count + 1))
    done

    # return success
    return 0
}
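The path and query loops in the function share one technique: match a pattern anchored at the start of the string, record the captured piece, then cut the matched prefix off and repeat until nothing matches. A minimal standalone sketch of that loop follows; the function name split_path and the array name segments are illustrative only and do not appear in the parser.

```bash
#!/bin/bash

# Hypothetical helper: splits a path into the global array "segments".
split_path() {
    local rest="$1"
    local pattern='^/+([^/]+)'
    segments=()
    while [[ $rest =~ $pattern ]]; do
        segments+=("${BASH_REMATCH[1]}")     # keep the captured segment
        rest="${rest:${#BASH_REMATCH[0]}}"   # drop the matched prefix
    done
}

split_path '/dir1//dir2/file.php'
printf '%s\n' "${segments[@]}"
```

Because the pattern starts with /+, repeated slashes are skipped rather than producing empty segments.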
A demonstration
OK, I wouldn’t leave you hanging there without an example, would I? This example basically uses the sample URI from the first version and parses it accordingly. The difference is that this is an actual Bash script, so you can paste the parser function at the top (after #!/bin/bash) and just run it to see the parser in action.
#!/bin/bash

# parser function goes here...

# sample URI (including an injection attack)
uri='http://user:pass@www.example.com:19741/dir1/dir2/file.php?param=some_value&array[0]="123"&param2=\`cat /etc/passwd\`#bottom-left'

# perform parsing and handle failure
uri_parser "$uri" || { echo "Malformed URI!"; exit 1; }

# main uri
echo "uri = $uri"

# main uri components
echo "uri_schema = $uri_schema"
echo "uri_address = $uri_address"
echo "uri_user = $uri_user"
echo "uri_password = $uri_password"
echo "uri_host = $uri_host"
echo "uri_port = $uri_port"
echo "uri_path = $uri_path"
echo "uri_query = $uri_query"
echo "uri_fragment = $uri_fragment"

# path segments
echo "uri_parts[0] = ${uri_parts[0]}"
echo "uri_parts[1] = ${uri_parts[1]}"
echo "uri_parts[2] = ${uri_parts[2]}"

# query arguments
echo "uri_args[0] = ${uri_args[0]}"
echo "uri_args[1] = ${uri_args[1]}"
echo "uri_args[2] = ${uri_args[2]}"

# query arguments values
echo "uri_arg_param = $uri_arg_param"
echo "uri_arg_array[0] = ${uri_arg_array[0]}"
echo "uri_arg_param2 = $uri_arg_param2"
Expected output
The demonstration above should output the following lines of text.
Check to see if it matches and give me some feedback with the results! 😀
uri = http://user:pass@www.example.com:19741/dir1/dir2/file.php?param=some_value&array[0]="123"&param2=\`cat /etc/passwd\`#bottom-left
uri_schema = http
uri_address = user:pass@www.example.com:19741
uri_user = user
uri_password = pass
uri_host = www.example.com
uri_port = 19741
uri_path = /dir1/dir2/file.php
uri_query = ?param=some_value&array[0]="123"&param2=\`cat /etc/passwd\`
uri_fragment = #bottom-left
uri_parts[0] = dir1
uri_parts[1] = dir2
uri_parts[2] = file.php
uri_args[0] = param
uri_args[1] = array[0]
uri_args[2] = param2
uri_arg_param = some_value
uri_arg_array[0] = "123"
uri_arg_param2 = \`cat /etc/passwd\`
Conclusion
So, that’s it! A far better URI parser implemented straight in Bash. However, I think it could be taken even further with your help so please give it a try and write some feedback!
Enjoy! 😀
I replaced the host part ([^:/?]+) with ([^:/?]*) to handle the common file:///home/… case. Nice job anyway!
Excellent :D, I’m very glad this is useful to you!
Works perfectly.
I copied the test code to ‘test.sh’, then copied your expected output to ‘expected’. Then ran diff on them:
$ bash test.sh | diff -rcs expected -
Files expected and - are identical