Ok one thing to note here is that they’re dropping some of the features that contribute to the performance costs of other re engines.
There’s another difference which is that by being on .net, which doesn’t effect their performance claims (they have the perf, how does not matter), but things like PCRE don’t use JITs - as doing so comes with significant security costs so it’s not a trivial trade off.
It's quite possible to make a regex engine faster (I've seen consistent 10x improvements in realistic use cases) if you can drop support for PCRE features that don't fit within an automata theory model, because then you can move significant processing overhead to compile-time. Many of those features are obscure, anyway.
PCRE absolutely has a JIT, though. You can run it with flags that disable it, but it's pretty important to performance otherwise.
Sorry, I was being confusing. It's a lot better than its interpreter (especially for complex inputs), but it's not as good as I'd expect from a JIT. The rust/regex entries are running an interpreter too, remember.
I have a lexer that basically works by repeatedly matching a regex until the input is exhausted, and tried dropping pcre2 in, in place of rust/regex, and turning on the JIT only made it somewhat faster, and it was still way slower than rust/regex. This was consistent even for very simple lexer regexes. When I poked at it directly and tried informally benchmarking it with whatever I could think of, I had similar results.
PCRE's operational model is still quite expensive, because it has many features that require backtracking. A JIT reduces the costs of operations, but won't keep backtracking from running them an exponential number of times.
I'm suspicious of one of the benchmarks mentioned in the referenced paper (it seems like it would stress-test the performance of utf-8 validity or boundary checks more than the regex engine itself), but I've seen some interesting discussions happen around this project.
Haven't read the paper yet, but the dotnet implementation did quite well on burntsushi's rebar benchmarks two months ago https://github.com/burntsushi/rebar.
olliej | a day ago
Ok one thing to note here is that they’re dropping some of the features that contribute to the performance costs of other re engines.
There’s another difference which is that by being on .net, which doesn’t effect their performance claims (they have the perf, how does not matter), but things like PCRE don’t use JITs - as doing so comes with significant security costs so it’s not a trivial trade off.
silentbicycle | 20 hours ago
It's quite possible to make a regex engine faster (I've seen consistent 10x improvements in realistic use cases) if you can drop support for PCRE features that don't fit within an automata theory model, because then you can move significant processing overhead to compile-time. Many of those features are obscure, anyway.
PCRE absolutely has a JIT, though. You can run it with flags that disable it, but it's pretty important to performance otherwise.
olliej | 14 hours ago
yeah, someone else said it had a jit, it's been a long time since I worked with it - we used to use it in JSC
[OP] wareya | a day ago
PCRE2 actually has a JIT. It's not enough to make it particularly competitive, though, in the cases that I've tested.
hyperpape | a day ago
I'd recommend https://github.com/burntsushi/rebar, which shows pcre2/jit as being quite good.
[OP] wareya | a day ago
Sorry, I was being confusing. It's a lot better than its interpreter (especially for complex inputs), but it's not as good as I'd expect from a JIT. The rust/regex entries are running an interpreter too, remember.
I have a lexer that basically works by repeatedly matching a regex until the input is exhausted, and tried dropping pcre2 in, in place of rust/regex, and turning on the JIT only made it somewhat faster, and it was still way slower than rust/regex. This was consistent even for very simple lexer regexes. When I poked at it directly and tried informally benchmarking it with whatever I could think of, I had similar results.
silentbicycle | 20 hours ago
PCRE's operational model is still quite expensive, because it has many features that require backtracking. A JIT reduces the costs of operations, but won't keep backtracking from running them an exponential number of times.
[OP] wareya | a day ago
I'm suspicious of one of the benchmarks mentioned in the referenced paper (it seems like it would stress-test the performance of utf-8 validity or boundary checks more than the regex engine itself), but I've seen some interesting discussions happen around this project.
hyperpape | a day ago
Haven't read the paper yet, but the dotnet implementation did quite well on burntsushi's rebar benchmarks two months ago https://github.com/burntsushi/rebar.