This is a read-only snapshot of the ComputerCraft forums, taken in April 2020.
EveryOS's profile picture

How to build a regex engine

Started by EveryOS, 02 August 2018 - 07:57 PM
EveryOS #1
Posted 02 August 2018 - 09:57 PM
I was wondering how I would go about building a regex engine
In computercraft, this was the file containing the regex engine. To me, it looks like a million lines of gibberish.
So far, this is the code I had for my version

--rules
--"?": optional
--"+": repeating
--"*": interuppting
local function copy(t)
  local rtn = {}
  for k, v in pairs(t) do rtn[k] = v end
  return rtn
end
local function find(str, char)
  for i=1, #str do
    if string.sub(str, 1, 1) == char then
	  return true
    end
  end
  return false
end
local function resolveRuleData(token)
  token = copy(token)
  local function resolve(c)
    return not not find(token[1], c)
  end
  local dat = {
    optional = resolve("?"),
    repeating = resolve("+"),
	  interuppting = resolve("*"),
	  multi = #token > 2
    }
    table.remove(token, 1)
    return dat, token
  end
  local match
  match = function(rule, token)
  rule = copy(rule)
  token = copy(token)
  local dat = resolveRuleData(rule, 1)
  if dat.optional then
    rule[1] = string.gsub(rule[1], "?", "")
    local s, nrule, ntoke = match(rule, token)
    if s then
	  rule, token = nrule, ntoken
    end
    return true, rule, token
  elseif dat.interuppting then
  else
  end
end
It already looks like gibberish

I have multiple reasons to write my own regex engine. For example, I need to compare tables of strings rather than single letters. And I need to sneak in additional logic.
What I really can't wrap my head around is repeating characters/symbols, because there is the chance that the "+" token is used multiple times

In addition, It would be nice to do this without a recursive function, maybe a tree or something

Actually, I'm clueless on how to build a regex engine…
Edited on 02 August 2018 - 08:06 PM
Bomb Bloke #2
Posted 03 August 2018 - 03:17 AM
In computercraft, this was the file containing the regex engine.

To be clear, that's Lua's string library, and it uses patterns - not regex.

I have multiple reasons to write my own regex engine. For example, I need to compare tables of strings rather than single letters.

Hrmmrmm, this wouldn't be related to performing Burrows–Wheeler transforms for MZIP, would it? If it is, then fair warning: the amount of RAM needed to store a table with all the required string permutations is about equal to the length of the data squared. For a 10kb file, that's 100mb. For a 100kb file, that's 10gb. Building a string table is the naive approach, and it's not feasible for anything other than tiny amounts of information.

Also bear in mind that regular Lua pattern matching can totally handle more than "single letters". It may be worth providing some more details about your use-cases, as you might not need this "engine" at all.

In addition, It would be nice to do this without a recursive function, maybe a tree or something

I doubt you'd overflow the stack with a project like this, but if you really want to avoid recursion, you could set up a regular loop with a job table.
EveryOS #3
Posted 03 August 2018 - 07:40 PM
Nothing to do with MZIP.

Ok, so first off I did not give the whole file

How this will work is I'll input like a rule, which might look like this:
{{"+", "letter", "comma"}, {"?", "semicolon"}}

And then I might ask it to match against these:

{"letter", "comma", "letter", "comma", "semicolon"} --Works
{"letter", "comma"} --Works
{"semicolon"} --Does not work
{"letter", "semicolon"} --Does not work


Actually, it's a bit more advanced.
I am also going to carry additional data with the things we are matching against, so instead of this:
{"letter", "comma", "letter", "comma", "semicolon"}
It would be maybe like this, though what I probably would do is strip the extra data (so it looks like the above):
{{"letter", "X"}, {"comma"}, {"letter", "A"}, {"comma"}, {"semicolon"}}

In addition, if the first portion matches, but then there is extra data, I would simply like to return the matching portion

In addition, there is a "*" rule, which might be tricky, but this can be just ignored for now
Edited on 03 August 2018 - 06:13 PM
SquidDev #4
Posted 03 August 2018 - 10:27 PM
If you just want a basic regular expression system (so truly regular, rather than extensions to it), I'd recommend having a read of Regular expression matching can be simple and fast. You can think of a regular expression as a finite state machine (or a sort of flowchar) - in fact, the two are equivalent.

Consider your above expression (represented as a Lua pattern of "(%a,)+;?". We can draw this out as so:



You start at at state 0 and, for every letter in your string, you find an arrow going from that state whose label matches the current character. So, for an input "b,c,;" we first look at b and state 0 - this matches our %a and so we move to state 1 and ,. Again, this matches our arrow and so we move to state 2. Here we've got some options: the ε represents an empty string, so we also add the states those arrows point to - thus we're now looking at three possible states: 0, 2 and 3. Out input here is a, so only one arrow matches - thus we move onto state 1. Eventually, you'll either end up in a position where no arrows match (and so the regex fails), or you're at the end of the string.

The above page(s) go into a lot more detail into ways to speed this up (such as avoiding having to be in multiple states). Similar techniques to these are used in several major regex libraries (though many just do naive backtracking).
EveryOS #5
Posted 04 August 2018 - 02:08 AM
Thanks, I'll look at the site in more detail tommorow