Case study: fixed number of iterations with LPeg

Words seem to fail to describe just how awesome LPeg is. Designed as a Lua implementation of the PEG concept, it is a true programming gem! Please, if you dont’t know what it is, take some time to familiarize yourself with it! It’s not the easiest thing to grasp, but you will not regret it! It is certainly one of the most worthwhile learning efforts you can make in generic parsing.

One great feature of LPeg is that it’s binary-safe, meaning that (unlike regular expressions) it can be safely used to parse binary data! This makes it an excellent tool for parsing binary protocols, especially network communication protocols, such as the Action Message Format (used by Adobe Flash for making remote calls and even in FLV movie files). I’ll leave it to you to explore the possibilities…

Beware that from here on, I assume that you know your way around Lua, LPeg and how they work.

The problem

That being said, this article is actually about an unusual roadblock I hit while using LPeg to build a Lua-based AMF parser, and the various solutions I found and/or came up with to overcome it (you didn’t think that I mentioned AMF before by accident, did you?).

The issue is LPeg’s implementation of repetitive patterns: in particular, its inability to match (or capture) a fixed number of occurrences of a certain pattern, although it can match a minimum or a maximum number of such occurrences, which is perfect for stream-oriented parsing (such as parsing programming languages) but insufficient for binary data.

Just to clarify, here’s a small list of LPeg patterns which correspond to the typical PCRE repetitive constructs (in each case we’re trying to match the string ‘cloth’):

Matching occurrences of ‘cloth’ PCRE pattern LPeg pattern
0 or more (at least 0) /(cloth)*/ lpeg.P'cloth'^0
1 or more (at least 1) /(cloth)+/ lpeg.P'cloth'^1
X or more (at least X) /(cloth){X,}/ lpeg.P'cloth'^X
1 or less (at most 1) /(cloth)?/ lpeg.P'cloth'^-1
X or less (at most X) /(cloth){,X}/ lpeg.P'cloth'^-X
A: precisely X (no more, no less) /(cloth){X,X}/ -- unavailable --
B: anywhere between X and Y /(cloth){X,Y}/ -- unavailable --

For the last two cases (i.e. A and B), LPeg does not offer any simple constructs so we have to find a complex one. Let’s put aside the case B for now, and try to tackle A…

Attempt 1 (brute force)

The first solution was rather straight-forward: we explicitly compose (i.e. append) a pattern for every occurrence we need to match. I call this the brute force method (for obvious reasons). For example, to match the word ‘cloth’ exactly three times we would use the pattern lpeg.P'cloth' * lpeg.P'cloth' * lpeg.P'cloth'.

Naturally, this is quite inflexible: what if we need to match 183 occurrences? To achieve that, we use a pattern generation function. Here’s a complete working example:

local lpeg = require'lpeg'

-- Version 1.
-- Generates a big pattern where every interation has
-- a corresponding instance of the initial pattern. The
-- resulting pattern captures all occurrences in a table.
function multiply_1(item, count)
    local set = lpeg.P(true)
    while count > 0 do
        set = set * item
        count = count - 1
    end
    return lpeg.Ct(set)
end

-- Generate subject ('cloth,cloth,...').
local subject = string.rep('cloth,', 10)

-- A capture of the word 'cloth' optionally followed by a comma.
local item = lpeg.C(lpeg.P'cloth') * lpeg.P','^-1

-- Match exactly 10 items.
local result = lpeg.match(multiply_1(item, 10), subject)

-- Display capture table.
print(unpack(result))

This function adds a composition operation to the big pattern for each occurrence. One method of improving this is to reduce the amount of compositions to the a bare minimum. We do this by duplicating (i.e. growing in powers of 2) the pattern until the required size is reached (also used by LPeg’s regexp emulator):

-- Version 1(b).
-- Generates a (potentially large) pattern using only
-- the minimum possible amount of compositions.
function multiply_1b(item, count)
    local set = lpeg.P(true)
    while count >= 1 do
        if count % 2 >= 1 then
            set = set * item
        end
        item = item * item
        count = count / 2
    end
    return lpeg.Ct(set)
end

Looks good, however, this method presents a serious problem: the resulting pattern can get very large, forcing the limits of the Lua stack. It works well enough for small jobs, but it is unwise to generate more than a few hundred repetitions using this method (and even that may be pushing it). This means that the brute force method is unusable for heavy binary parsing, where we would often need to match even thousands of pattern repetitions.

We need to do better…

Attempt 2 (capture folding)

The next approach is a bit more complex and makes use of the more advanced features of LPeg: the lpeg.Cf() and lpeg.Cmt() captures.

In essence, we match at most the desired number of occurrences (using the ^-N operation described in the table at the top) and on every match (i.e. iteration) we execute a folding function (via lpeg.Cf()) that adds the item to the result table and also decreases a counter. At the end, we execute a checking function (via lpeg.Cmt()) which will test if the counter reached 0 (thus ensuring exact matching) and will fail if not.

-- Version 2.
-- Somewhat better approach, which starts with an empty table and gradually
-- folds every occurring capture of the initial pattern into this table, at
-- the same time decrementing the counter. Eventually, it also checks if the
-- counter reached 0 and fails otherwise.
function multiply_2(item, count)
    return lpeg.Cmt(

        -- Fold capture.
        lpeg.Cf(
            lpeg.Ct'' * item^-count,

            -- Append capture.
            function(set, item)
                count = count - 1
                table.insert(set, item)
                return set
            end),

        -- Final check.
        function(s, i, set)
            if count > 0 then
                return false
            end
            return i, set
        end)
end

Note that the capture starts with an empty table created from the empty string (i.e. lpeg.Ct''). This table is the first parameter given to the folding function (i.e. set); after all the iterations, it will contain all captured items and will then be given to the check-up function for evaluation. Eventually, if all goes well, it will be returned back as the final result of multiply().

Overall this method will behave much better than the first; the generated pattern is much more elegant. However, the generated pattern can still grow quite large in memory, even though it’s not that obvious (i.e. I was surprised to find that it also crashed when generating more than a few thousand iterations).

The issue is the item^-count bit. Checking the C code of the LPeg implementation, quickly reveals that whenever count is not 0 (i.e. which is a special case), the engine is forced to allocate count objects in memory, and while these objects are lighter than a complete item pattern (as in the first solution) they are still scaling-up the problem badly!

Eventually, this method only allows generating repetitive patterns of a few thousand iterations (that is as far as I was able to push it). But, what if we need to handle a rogue pattern of 35 000 occurrences? What then? No!… we still need better!

Attempt 3 (classic loop)

The only thing left to try is a (slightly) more rudimentary technique. Basically, it is the same as the second approach but with a significant twist: instead of using lpeg.Cf() to “accumulate” (i.e. fold) every occurrence of item to the result table, we just use a straight-forward loop that will advance through the input stream at every iteration.

That makes the pattern completely dynamic, i.e. totally non-proportional to the number of matched occurrences. As a result, this method allows for an unlimited number of iterations with no memory wastes.

-- Version 3.
-- Best method so far, capable of an unlimited number of iterations. It uses
-- an explicit loop to walk over all all iterations at match-time, so there
-- is no potentially huge pattern ever created. This method simply populates
-- the result table at each iteration, or fails the entire set if the count
-- is not reached precisely.
function multiply_3(item, count)
    return lpeg.Cmt(lpeg.P(true),
        function(s, i)
            local set, offset = {}, i
            for j = 1, count do
                set[j], offset = lpeg.match(item * lpeg.Cp(), s, offset)
                if not offset then
                    return false
                end
            end
            return offset, set
        end)
end

When applied to a string, this will start matching at the current position (i.e. due to the lpeg.P(true) construct). The lpeg.Cmt() capture will trigger the iteration function immediately and there, in turn, will loop for ‘count’ number of times, matching the item at every iteration and appending it to the results table. At the same time, at each iteration, the offset is advanced by capturing the position immediately following the matched item (i.e. the lpeg.Cp() construct). If anything goes wrong, then the entire match fails (i.e. via return false).

This method works flawlessly (as far as I could tell). The generated pattern is minimal (memory-wise), the source code is very small, and the function is overall very fast (only slightly slower than the first two approaches, when matching small-sized patterns). But most importantly, with this version, matching huge numbers of item repetitions is trivial (e.g. tens of thousands, millions…).

Intervals

From here, implementing the case B (i.e. multiply(X, Y)) is quite easy…

One approach would be to use multiply_3(item, X) to ensure the minimum number of occurrences (i.e. X) and then append the pattern item^-(Y-X) which will optionally match the remaining occurrences up to the maximum amount allowed. This method will work, and can also be used with the first two solutions, but will still (potentially) waste memory when (Y-X) is large.

A better (and still easy) approach would be to modify the loop inside multiply_3() to also accept an optional number of iterations (above the minimum). Here’s how…

-- Version 3(b).
-- The final variant with support for interval matching
-- based on the third capturing approach (classic loop).
function multiply_3b(item, min, max)
    return lpeg.Cmt(lpeg.P(true),
        function(s, i)
            local set, offset, attempt = {}, i, 0
            for j = 1, max or min do
                set[j], attempt = lpeg.match(item * lpeg.Cp(), s, offset)
                if not attempt then
                    if j > min then
                        break
                    else
                        return false
                    end
                end
                offset = attempt
            end
            return offset, set
        end)
end

In conclusion…

So, that’s that. Enjoy!
I would be surprised if anyone would actually read this banter.
If you did, hat’s off! 😉

2 comments

  1. multiply_3b() never matches for me. Am I using it wrong? 

    print(lpeg.match(lpeg.C(multiply_3b(‘a’, 2, 2)), ‘aa’))

    1. Yes, you are doing it wrong. The problem is that the lpeg.C() construction should be the (first) parameter of multiply_3b() and not viceversa. Additionally, the result from lpeg.match() will be a table, so in order to print its members you must use unpack() first.

      The right way to do what you want (in a single line) is this:
      print(unpack(lpeg.match(multiply_3b(lpeg.C(‘a’), 2, 2), ‘aa’)))

Comments are closed.