More

c0nstantine · on Feb 8, 2025

The grammar is underspecified. The full grammar is more complex. I guess I need just remove the current version from docs. Now it is confusing indeed.

> Why is "c" not being replaced with "da"?

It is all about precedence. According to the discussion I think I've chosen a wrong one and it raises confusion. Current version of precedence table is this:

| 1 | Escaped characters | \<special character> | | 2 | Bracket expression | [] | | 3 | Grouping | () | | 4 | Single-character-ERE duplication | * + ? {m,n} | | 5 | Transduction | : | | 6 | Concatenation | . (implicit) | | 8 | Alternation | | |

So the ':' is stronger then '.' (implicit concatenation).

c0nstantine · on Feb 8, 2025

That's true. Thank you for elaborating.

There is a hidden operator of concatenation as for usual regular expressions. In the code I denote it as lower dot '.' (as in the old Thompson's implementation).

c0nstantine · on Feb 8, 2025

[a-z] is equivalent to 'a|b|...|z' in the normal regex language.

So if we do [a-z]:[A-Z] it should be expanded to:

(a|b|...|z):(A|B|...|Z)

which is pretty legal in trre but has different meaning of mapping any a-z to ALL the A-Z (generating A-Z on each occurrence of lowercase letter).

kazinator · on Feb 8, 2025

[a-z] is a semantically equivalent regex to a|b|..|z, but the two are not equivalent syntactic forms.

Distinct syntactic forms can be given distinct semantics, as long as there is rhyme and reason.

Moreover, the right side of the colon is not the normal regex language, it only borrows its syntax. So there, we may be justified in denying that the usual equivalence holds between character class syntax and a disjunction of the symbols denoted by the class.

c0nstantine · on Feb 8, 2025

The right side is a normal regex language syntactically. Semantically it is a generator instead of a parser (consumer).

But I got your point. Maybe there could be some ways to do it in consistent way. Just straight tr-like syntax won't work, e.g I really want it something like this to be valid:

[a-b]:(x|y) (pairs a:x, b:x, a:y, b:y)

and I prefer not handle these in some ad-hoc way.

kazinator · on Feb 8, 2025

I also go your point. The right side is a regular expression because it denotes a regular set.

c0nstantine · on Feb 8, 2025

Thank you for the link. I think I came across it some years ago. They implement weighted transducers. Nice tool for things like morphology from the era before the LLMs. I've implemented something similar 8 years ago: https://github.com/c0stya/fslib

c0nstantine · on Feb 8, 2025

I guess folks generally more interested in searching for the pattern then modifying it.

> Tools like sed build a transducer around the whole automaton: s/this/that/g.

That sounds reasonable. Could you provide any links on sed internals? Thanks.

c0nstantine · on Feb 7, 2025

Hey, I didn't claim it is something groundbreaking. The idea is quite old, indeed. You don not need AI or LLMs here.

The sed is superior, actually. I do not cover all the functions sed provides. I think of it more like 'tr' + regexp. But it has different underlying engine and might be faster and more expressive for some use cases (e.g. tokenization, morphology).

metadat · on Feb 7, 2025

Thanks for the clarification, totally on me to set expectations inappropriately based on only a headline. Take care.

c0nstantine · on Feb 7, 2025

Are you using MAC? For tests please try:

$ make && bash test.sh

with 'bash' instead.

For the second part it is a bug in the README. Thank you for pointing this out! I had to be more careful before the publication. Fixed. Try '-ma' flags instead.

$echo '' | trre -ma ':(0|1){,3}?'

simlevesque · on Feb 8, 2025

I'm using Linux.

c0nstantine · on Feb 10, 2025

Did it solve the problem? I guess the issue is the process substitution construction of bash "<()". Not all shells support this.

c0nstantine · on Feb 7, 2025

Thank you for the feedback. Yes, the precedence is a question for me. Maybe I will change this.

If I shift it behind concatenation there could be another problem. E.g. with non-associative : should be illegal. And I am not sure how to treat this:

cat:dog:mouse

In the current version I inject the epsilon (empty string). It looks natural E.g. to remove every second letter I could run '..:' which is technically '.(.:eps)':

echo 'abcde' | ./trre '..:'

result: 'ace'

actually ':' association could have a meaning as a composition of regular relations; but I found it too complicated for now.

kccqzy · on Feb 7, 2025

I do not understand the rules by which you inject the epsilon and I think this is a source of confusion for many people. I had thought that an epsilon could be injected anywhere REGEX can appear (effectively allowing epsilon as a REGEX) but of course that just leads to infinite number of parses. Manually injecting epsilon is a highly hacky thing to do; better to consider that when you design the grammar.

I would not worry about "cat:dog:mouse" because intuitively it is clearly correct and it means replacing cat with mouse. With parentheses it could be written as "((cat:dog):mouse)".

c0nstantine · on Feb 9, 2025

Epsilon injection appears whenever right or left side of ':' has no operand. E.g.

(:a)

(a:)

a:|b

a|b:

etc

I will try to change the precedence and see how it works. Btw what do you think about explicit operators '>' '<' where '<' works as usual regex matcher, and '>' as a generator. For example to change 'cat' to 'dog' there could be something like '<cat>dog' where '<cat' part is a parser and '>dog' is a generator. Thanks.

kccqzy · on Feb 10, 2025

I think your epsilon injection rule is trying to achieve this kind of production:

    TRRE <- TRRE ':' REGEX | ':' TRRE | TRRE ':' | REGEX | ...

I think this would work better, but ':a:' is still ambiguous: it has two parse trees.

c0nstantine · on Feb 7, 2025

Thank you! For the feedback and pointing to the typo. Fixed. Actually my C is very rusty and I am bit uncomfortable about this.

c0nstantine · on Feb 7, 2025

yeah. Transducers are very old topic. For some reason they were not connected to a specific language like regex.

> wrt syntax, are you sure you want ':' to bind stronger than concatenation 'ab' ?

That's something I am still not sure about. I took a hundred examples and it looked more natural this way (: lower then .). But I can change it with the change of one digit in the code, literally. That's why I'm posting here. I need some real feedback.

froh · on Feb 7, 2025

hm maybe juxtapose a number of examples in one precedence and then the other? and share them to gather feedback?

also,

* colon as member of a character class (and dash and how they interact)

* posix character classes (and the colon in the syntax)

* word boundaries are really useful in practice

* think of ünicøde early and look at what russ cox did there

boundaries, what do you decide to exclude? for example back references and grouping have fun with DFAs (again see russ cox and re2)

composition is fantastic, and a shortcut to exponential land because grammars (as in Chomsky hierarchy) can easily be expressed with transducers, yay.

boundaries will also clarify use cases and allow to state: "if you want to do xyz please use re2 and a bit of code instead"

and one "technicality" that hit me once with re2 was a static buffer for resumable iteration. I'd loved to reallocate the buffer and drop it's beginning and extend it at the end but alas, the state of re2 has hard pointers into the buffer, not relative ones. I think this was when re2 hit the end of the buffer mid-expression during incremental parse. so you can't reallocate and instead you have to drop the hard earned state and resume.

anyhow, it's been quite a while so I'm no longer in the thicket of this :-D

what's your driver? curiosity? or some itch?

but I really enjoy seeing your project :-)

c0nstantine · on Feb 8, 2025

Thank you for the detailed comment.

So Unicode is something on a top of my TODO list. Boundaries is a very interesting topic. Maybe I'll extend the doc to include the details.

> what's your driver? curiosity? or some itch?

It's an itch :). 8 years ago I explored automata theory and found finite transducers to be a handy and natural way to transform texts. And as regex corresponds to FSA (finite state acceptors) I wanted to create a language that corresponds to FST (finite state transducers). There is a lesser-known algorithm for FST determinization and I want to applied it to make the transformation fast and efficient. It turned out to be not that simple as I expected initially.

HN For You