tatr lets you recursively query files on your filesystem for tags/words, within indented trees of text,
that match a given set of POSIX regular expressions - and then
extract the exact paths within these trees that match your query. We use a lot of different
structured textual formats whose tree structure is based on text indentation - all of which are queryable
via tatr.
As tatr by default is independent of any syntax other than indentation, and as tatr extracts
only the parts of the trees you are interested in - you can query for
indented trees embedded within any other, possibly incompatible, format.
Note that the given regular expressions are anchored - so they will match the whole word or nothing at all.
This is the natural semantics for specifying tags.
You can override this by explicitly matching on any character sequence with .*.
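The anchoring rule can be illustrated with a small Python sketch. This is not tatr's implementation (tatr is written in OCaml and uses POSIX regular expressions, whereas Python's re module has its own dialect); the helper name word_matches is hypothetical and only shows the whole-word semantics described above.

```python
import re

def word_matches(pattern: str, word: str) -> bool:
    """Return True only if the pattern matches the whole word (anchored match)."""
    # re.fullmatch behaves as if the pattern were wrapped in ^...$
    return re.fullmatch(pattern, word) is not None

# 'fix' matches the word 'fix' but not 'prefix' ...
print(word_matches("fix", "fix"))       # True
print(word_matches("fix", "prefix"))    # False
# ... unless the query explicitly allows leading characters with .*
print(word_matches(".*fix", "prefix"))  # True
```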
A bunch of useful parameters exist - some are the tree-extraction functions that control what parts of the matched trees are returned:
--extract-completetree, --ect
Extract the complete tree where the query match somewhere within.
This includes all branches - even those that don't match the query.
--extract-fulltree, --eft
Extract the paths of tree that match query exactly + their subtrees
and ancestors.
--extract-matchtree, --emt
Extract only the paths of tree that match query exactly, excluding
the rest of the tree.
--extract-subtree, --est
Extract the paths of tree that match query exactly + their
subtrees. This is the default.
See tatr --help for the complete list.
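The terms subtree and ancestors used by these options can be made concrete with a rough Python sketch. It assumes that indentation is counted in leading spaces and that there are no blank lines; the helper names indent_of, subtree and ancestors are hypothetical and do not reflect how tatr is implemented internally.

```python
def indent_of(line: str) -> int:
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))

def subtree(lines: list[str], i: int) -> list[int]:
    """Indices of line i and all directly following lines indented deeper than it."""
    result = [i]
    for j in range(i + 1, len(lines)):
        if indent_of(lines[j]) <= indent_of(lines[i]):
            break  # back at the same or a shallower level: the subtree has ended
        result.append(j)
    return result

def ancestors(lines: list[str], i: int) -> list[int]:
    """Indices of the lines on the path from the root down to (excluding) line i."""
    result = []
    depth = indent_of(lines[i])
    for j in range(i - 1, -1, -1):
        if indent_of(lines[j]) < depth:
            result.append(j)
            depth = indent_of(lines[j])
    return list(reversed(result))

lines = [
    "(library",
    " (name tatr)",
    " (libraries",
    "  containers",
    "  re)",
]
print(subtree(lines, 2))    # [2, 3, 4]
print(ancestors(lines, 3))  # [0, 2]
```

In these terms, a subtree-style extraction keeps a matched path plus the subtree beneath it, while a fulltree-style extraction also keeps the ancestors.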
> tatr --match-file='*.txt' fix,code ~/my_notes
-- ~/my_notes/code/tatr/subtodos/20260302_some_new_feature.txt -----------------
00008: [ ] @code; would be nice to have this feature
00009: * which spans some lines
00010: [ ] @fix that we have no bugs
-- ~/my_notes/code/tatr/subtodos/20260302_some_new_feature.txt
00011: * fix that these lines are also matched because of '@' not being a word-character by default
00012: * a line with the word 'code'
-- ~/my_notes/code/tatr.txt ----------------------------------------------------
00012: [ ] @code; @fix some error
--------------------------------------------------------------------------------
.. where fix and code are words (a comma-separated list of regular expressions) that need to be present in the matched tree, and ~/my_notes is a directory to look in.
Indented trees of text in notes usually come in the form of deeply nested indented points.
I will probably write a blogpost in the future about how it's useful to take notes this way.
There are several options that allow you to configure tatr for your specific format - e.g. --include-chars lets you
specify that e.g. #/@ should be included in matched words, so you can limit yourself to match tags as specified in
the given text-format:
> tatr --match-file='*.txt' --include-chars=@ @fix,@code ~/my_notes
-- ~/my_notes/code/tatr/subtodos/20260302_some_new_feature.txt -----------------
00008: [ ] @code; would be nice to have this feature
00009: * which spans some lines
00010: [ ] @fix that we have no bugs
-- ~/my_notes/code/tatr.txt ----------------------------------------------------
00012: [ ] @code; @fix some error
--------------------------------------------------------------------------------
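The effect of --include-chars in the example above can be sketched in Python. This assumes that a word is a run of word-characters plus any extra included characters; the function split_words is a hypothetical illustration, not tatr's actual tokenizer.

```python
import re

def split_words(line: str, include_chars: str = "") -> list[str]:
    """Split a line into words; include_chars extends what counts as a word character."""
    char_class = r"\w" + re.escape(include_chars)
    return re.findall(f"[{char_class}]+", line)

# By default '@' is not part of a word, so '@fix' yields the word 'fix'
print(split_words("[ ] @fix that we have no bugs"))
# ['fix', 'that', 'we', 'have', 'no', 'bugs']

# With '@' included, the tag stays intact and a plain 'fix' no longer equals it
print(split_words("[ ] @fix that we have no bugs", include_chars="@"))
# ['@fix', 'that', 'we', 'have', 'no', 'bugs']
```

Under this assumption, an anchored query for @fix matches only the tagged line, while a line that merely contains the word fix is skipped.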
tatr can also be used to query your configs. Here we query all dune files of this ocaml repository for references to a
specific library, containers:
> tatr --extract-fulltree --match-file=dune libraries,containers .
-- ./lib/dune ------------------------------------------------------------------
00001: (library
......
00011: (libraries
00012: containers
--------------------------------------------------------------------------------
.. there are different extraction algorithms to let you choose what part of the matched tree is returned.
In this example --extract-fulltree both includes the subtree beneath the match (none here) and the ancestors
towards the root (here library). Line-ranges in matched trees that are hidden (as per the chosen extraction algorithm) are
printed as a row of dots (......).
As the tags are POSIX regular expressions separated by commas, you can express more complex patterns like tag0,(tag1|tag2),
where tatr will match on all paths in trees that include both tag0 and either tag1 or tag2.
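As a rough illustration of this query semantics, the Python sketch below treats a path as the list of lines from the root down to a node. It assumes that a path matches when every comma-separated regex matches at least one whole word somewhere on the path. The names parse_query and path_matches are hypothetical, and Python's regex dialect stands in for POSIX here.

```python
import re

def parse_query(query: str) -> list[re.Pattern]:
    """Split the query on commas; each part is one regular expression."""
    return [re.compile(part) for part in query.split(",")]

def path_matches(patterns: list[re.Pattern], path_lines: list[str]) -> bool:
    """True if every pattern fully matches at least one word on the path."""
    words = [w for line in path_lines for w in re.findall(r"\w+", line)]
    return all(any(p.fullmatch(w) for w in words) for p in patterns)

query = parse_query("tag0,(tag1|tag2)")
print(path_matches(query, ["tag0 notes", "tag2 detail"]))  # True
print(path_matches(query, ["tag0 notes", "tag3 detail"]))  # False
```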
To extract all library dependencies, preprocessors and the name of each executable/library you could do:
> tatr --eft --match-file=dune '(name|libraries|preprocess)' .
-- ./bin/dune ------------------------------------------------------------------
00001: (executable
......
00004: (name main)
......
00013: (libraries
00014: tatr
00015: cmdliner
00016: )
-- ./lib/dune ------------------------------------------------------------------
00001: (library
00002: (name tatr)
......
00011: (libraries
00012: containers
00013: re
00014: uuseg
00015: uunf
00016: unix
00017: )
00018: (preprocess
00019: (pps
00020: ppx_deriving.std
00021: )
00022: )
--------------------------------------------------------------------------------
Clone repo:
git clone https://github.com/rand00/tatr.git
cd tatr
Install dependencies for tatr via opam:
opam install dune containers cmdliner re ppx_deriving
Compile tatr itself:
dune build
Install the binary:
cp -f _build/default/bin/main.exe ~/bin/tatr
For textual search, one will often use something like a mix of grep and find, or ripgrep, which let you
search for patterns within a single line - but these tools don't operate at the tree-level like tatr.
A note-taking system like zim lets you query for pages that contain a set of tags - but not at the indented tree-level.
Another notetaking system, logseq, has a feature called flashcards where you tag a point-indented line with #flashcard and
then make a nested point with the answer. Then the system can play back questions to you and expose the answer upon request.
It will be relatively simple for the interested user to implement this via the existing tree-matchers of tatr.
For many years I've been organizing my notes in zim - which lets you organize your notes in trees that are represented as wiki-files placed in folders on your filesystem, and link your pages in a graph. These features are extremely powerful by themselves for creating your own custom organization for remembering what you are doing and have done before.
Another feature of zim is tags, which allow you to select pages that include several custom tags you've made.
I found that I often create long lists of indented notes within single pages, where a lot of relatively unrelated things are placed. What I really want is to be able to query for which specific sections of my pages include a set of tags - and to extract these sections.
I realized that we use a lot of other structured formats based on textual indentation to represent trees of related elements - which
is why tatr by default doesn't know about any specific textual format - but operates solely based on indentation.
This method is e.g. compatible with
- note-taking formats: markdown, wiki, org-mode, ...
- pretty-printed config formats: json, s-expressions, xml, ...
- indented trees within other textbased formats
There are some limitations of the default method of tatr when working with formats that hide their tree-nature within special syntax.
This e.g. includes headings from markdown and wiki-formats, certain config-formats like yaml, and a lot of syntax in programming languages.
This is a minor limitation though, as
- the headings of note-taking formats often duplicate text present under them
- including the tree-structure of headings etc. will lead to very deep trees being matched - where the matching words can be very far apart; which makes them less related
- tatr outputs the line-numbers, so you don't need headings to find the match in the source-file
- tatr was not made to match on abstract syntax trees of code - which needs a different kind of matcher that can parse the given programming language
- when using any extraction option but --extract-completetree, tatr will extract only the part of the tree you are interested in; which lets you query for anything that has an indented tree-structure within any other kind of format. For example, query for your todo-notes within the code of your codebase.
Another problem is if your structured format is not pretty-printed within each file - so the structure is not laid out via indentation.
To solve this you can pass your structured format to some pretty-printer like: cat my.json | jq '.' > my_pretty.json.
tatr could possibly get a feature to apply a user-specified script to text-files on recursive traversal.
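If jq is not at hand, the same pretty-printing step can be sketched with Python's standard json module. The function name pretty_print_json and the two-space indent are arbitrary choices for this illustration; any pretty-printer that lays the structure out via indentation should work equally well.

```python
import json

def pretty_print_json(text: str) -> str:
    """Re-serialize JSON so that nesting is laid out via indentation."""
    return json.dumps(json.loads(text), indent=2)

print(pretty_print_json('{"a":{"b":1}}'))
# {
#   "a": {
#     "b": 1
#   }
# }
```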
The query-DSL of tatr is a comma-separated list of regular expressions, which is currently parsed simply by splitting on comma.
This means that you can't use commas in your regexes - which I don't know why you would want. If this
becomes a need, a new multishot CLI option --tag could be added.
A current ideal of tatr is to be as independent as possible from specific formats - and let the user rely on existing tools to make the given
text compatible with the tatr indentation-based interpretation.
Future support for the hidden tree-structures of some of the aforementioned formats can be added to tatr by building in pre-indenters that
know parts of the syntax of each format - and in a streaming fashion update a synthetic indentation-state that
modifies what tatr thinks the real indentation-level is. If built into tatr in this way, the matched printed trees will
keep their original indentation.
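To make the pre-indenter idea more concrete, here is a speculative Python sketch for markdown headings. It is not an existing tatr feature; the function synthetic_indents, the step parameter and the heading rule (one nesting level per leading #) are all assumptions made for illustration. Each original line is yielded unchanged, paired with the indentation-level a matcher could use instead of the real one.

```python
import re
from typing import Iterable, Iterator

HEADING = re.compile(r"^(#+)\s")

def synthetic_indents(lines: Iterable[str], step: int = 2) -> Iterator[tuple[int, str]]:
    """Stream (synthetic_indent, original_line) pairs for markdown-like text."""
    level = 0  # depth of the most recent heading; 0 means no heading seen yet
    for line in lines:
        m = HEADING.match(line)
        if m:
            # A heading with n '#' characters sits at depth n - 1
            level = len(m.group(1))
            yield ((level - 1) * step, line)
        else:
            # Body lines are nested below the current heading,
            # on top of their own real indentation
            own = len(line) - len(line.lstrip(" "))
            yield (level * step + own, line)

doc = ["# A", "text", "## B", "  - item"]
print(list(synthetic_indents(doc)))
# [(0, '# A'), (2, 'text'), (2, '## B'), (6, '  - item')]
```

Because the original lines pass through untouched, printed matches would keep their source formatting while the tree-structure used for matching includes the headings.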