rand00/tatr

tatr - tagged tree matching

tatr recursively searches files on your filesystem for tags/words that match a given set of POSIX regular expressions within indented trees of text - and then extracts the exact paths within these trees that match your query. Many structured textual formats express tree-structures through text-indentation - and all of these are queryable via tatr.

As tatr by default depends on no syntax other than indentation, and as it extracts only the parts of the trees you are interested in - you can query for indented trees embedded within any format, even one that tatr would otherwise not understand.

Note that the given regular expressions are anchored - so they will match the whole word or nothing at all. This is the natural semantics for specifying tags. You can override this by explicitly matching on any character sequence with .*.
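To illustrate the anchoring rule, here is a minimal sketch in Python. It is not tatr's implementation: it assumes that Python's re.fullmatch behaves like tatr's anchored matching for these simple patterns (Python's regex dialect is not POSIX), and the helper name tag_matches is hypothetical.

```python
import re

def tag_matches(pattern, word):
    """Return True only if the pattern matches the whole word (anchored)."""
    return re.fullmatch(pattern, word) is not None

print(tag_matches("fix", "fix"))       # True: the whole word matches
print(tag_matches("fix", "prefix"))    # False: "fix" is only a part of the word
print(tag_matches(".*fix", "prefix"))  # True: .* explicitly allows a prefix
```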

Several useful parameters exist - among them the tree-extraction options, which control what parts of the matched trees are returned:

  --extract-completetree, --ect
      Extract  the  complete tree where the query match somewhere within.
      This includes all branches - even those that don't match the query.

  --extract-fulltree, --eft
      Extract the paths of tree that match query exactly + their subtrees
      and ancestors.

  --extract-matchtree, --emt
      Extract only the paths of tree that match query exactly,  excluding
      the rest of the tree.

  --extract-subtree, --est
      Extract  the  paths  of  tree  that  match  query  exactly  + their
      subtrees. This is the default.

See tatr --help for the complete list.
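The four extraction options can be pictured as different sets of lines selected around a matched path. The following Python sketch is one reading of the descriptions above, not tatr's actual code: the function names are hypothetical, "subtrees" is assumed to mean the lines nested beneath the deepest node of the matched path, and the input is assumed to contain no blank lines.

```python
def indent(line):
    """Number of leading spaces of a line."""
    return len(line) - len(line.lstrip(" "))

def ancestors(lines, i):
    """Indices of the ancestors of line i, root first."""
    found, level = [], indent(lines[i])
    for j in range(i - 1, -1, -1):
        if indent(lines[j]) < level:
            found.append(j)
            level = indent(lines[j])
    return found[::-1]

def descendants(lines, i):
    """Indices of all lines nested beneath line i."""
    found = []
    for j in range(i + 1, len(lines)):
        if indent(lines[j]) <= indent(lines[i]):
            break
        found.append(j)
    return found

def extract(lines, path, mode):
    """Select line indices; path holds the matched path, shallowest first."""
    if mode == "matchtree":
        keep = set(path)
    elif mode == "subtree":
        keep = set(path) | set(descendants(lines, path[-1]))
    elif mode == "fulltree":
        keep = (set(path) | set(descendants(lines, path[-1]))
                | set(ancestors(lines, path[0])))
    elif mode == "completetree":
        root = (ancestors(lines, path[0]) or [path[0]])[0]
        keep = {root} | set(descendants(lines, root))
    else:
        raise ValueError(mode)
    return sorted(keep)

lines = ["(library", " (name tatr)", " (libraries", "  containers", "  re", "  )"]
path = [2]  # the line " (libraries", e.g. matched by the query "libraries"
print(extract(lines, path, "matchtree"))     # [2]
print(extract(lines, path, "subtree"))       # [2, 3, 4, 5]
print(extract(lines, path, "fulltree"))      # [0, 2, 3, 4, 5]
print(extract(lines, path, "completetree"))  # [0, 1, 2, 3, 4, 5]
```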

Examples

Matching on indentation-trees containing specific words

> tatr --match-file='*.txt' fix,code ~/my_notes
-- ~/my_notes/code/tatr/subtodos/20260302_some_new_feature.txt -----------------

00008:  [ ] @code; would be nice to have this feature
00009:          * which spans some lines
00010:                  [ ] @fix that we have no bugs

-- ~/my_notes/code/tatr/subtodos/20260302_some_new_feature.txt -----------------

00011:  * fix that these lines are also matched because of '@' not being a word-character by default 
00012:          * a line with the word 'code'

-- ~/my_notes/code/tatr.txt ----------------------------------------------------

00012:  [ ] @code; @fix some error

--------------------------------------------------------------------------------

.. where fix and code are the words (given as a comma-separated list of regular expressions) that must all be present in the matched tree, and ~/my_notes is the directory to search. Indented trees of text in notes usually take the form of deeply nested indented points.
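The idea of matching tags along a path of an indentation-tree can be sketched in Python as follows. This is an illustration under assumptions, not tatr's algorithm: words are taken to be runs of alphanumerics and underscore, a path is the chain of ancestors from a root line down to the current line, and a path is reported at the first line where all tags have been seen. The name matching_paths is hypothetical.

```python
import re

def indent_of(line):
    """Width of the leading whitespace of a line."""
    return len(line) - len(line.lstrip(" \t"))

def satisfied(tags, chain_words):
    """Every tag must fully match at least one word on the path."""
    return all(any(re.fullmatch(t, w) for w in chain_words) for t in tags)

def matching_paths(lines, tags):
    """Return paths (lists of 1-based line numbers) that contain all tags."""
    stack = []  # (indent, line number, words) for the current ancestor chain
    paths = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines carry no structure
        indent = indent_of(line)
        while stack and stack[-1][0] >= indent:
            stack.pop()  # leave siblings and deeper lines behind
        before = [w for _, _, ws in stack for w in ws]
        stack.append((indent, number, re.findall(r"\w+", line)))
        after = before + stack[-1][2]
        # Report the path only where the last line completes the match.
        if satisfied(tags, after) and not satisfied(tags, before):
            paths.append([n for _, n, _ in stack])
    return paths

notes = [
    "[ ] @code; would be nice to have this feature",
    "    * which spans some lines",
    "        [ ] @fix that we have no bugs",
    "* unrelated point",
]
print(matching_paths(notes, ["fix", "code"]))  # [[1, 2, 3]]
```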

I will probably write a blogpost in the future about how it's useful to take notes this way.

Configuring tatr to include special characters to search for tags

Several options let you configure tatr for your specific format - e.g. --include-chars lets you specify that characters such as # or @ should be included in matched words, so you can restrict matches to tags as they are written in the given text-format:

> tatr --match-file='*.txt' --include-chars=@ @fix,@code ~/my_notes
-- ~/my_notes/code/tatr/subtodos/20260302_some_new_feature.txt -----------------

00008:  [ ] @code; would be nice to have this feature
00009:          * which spans some lines
00010:                  [ ] @fix that we have no bugs

-- ~/my_notes/code/tatr.txt ----------------------------------------------------

00012:  [ ] @code; @fix some error

--------------------------------------------------------------------------------
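The effect of --include-chars on how a line is split into words can be sketched like this. The Python below is only an approximation: it assumes that the default word-characters are alphanumerics and underscore, and the helper split_words is hypothetical rather than part of tatr.

```python
import re

def split_words(text, include_chars=""):
    """Split text into words, optionally treating extra characters as word-characters."""
    word_class = r"[\w" + re.escape(include_chars) + r"]+"
    return re.findall(word_class, text)

line = "[ ] @code; @fix some error"
print(split_words(line))       # ['code', 'fix', 'some', 'error']
print(split_words(line, "@"))  # ['@code', '@fix', 'some', 'error']
```

With '@' included, the anchored pattern @fix matches the word @fix but no longer matches a plain fix elsewhere in the text.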

Querying your configuration-files for specific library dependencies

tatr can also be used to query your configs. Here we query all dune files of this OCaml repository for references to a specific library, containers:

> tatr --extract-fulltree --match-file=dune libraries,containers .
-- ./lib/dune ------------------------------------------------------------------

00001:  (library
......
00011:   (libraries
00012:    containers

--------------------------------------------------------------------------------

.. there are different extraction algorithms that let you choose what part of the matched tree is returned. In this example --extract-fulltree includes both the subtree beneath the match (none here) and the ancestors towards the root (here library). Line-ranges in matched trees that are hidden (as per the chosen extraction algorithm) are printed as a line of six dots: ......
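How hidden line-ranges end up as dotted lines can be sketched in Python. The output format (five-digit line number, colon, two spaces) is copied from the transcripts above. The render function and the placeholder field lines are hypothetical, and the real printer in tatr may differ.

```python
def render(lines, shown):
    """Print the selected 1-based line numbers, marking hidden gaps with dots."""
    out = []
    previous = None
    for number in sorted(shown):
        if previous is not None and number > previous + 1:
            out.append("......")  # one marker per hidden line-range
        out.append(f"{number:05d}:  {lines[number - 1]}")
        previous = number
    return out

# Lines 2-10 are placeholders; their real content is not shown in the example.
dune = ["(library"] + [f" (field_{n})" for n in range(2, 11)] + [" (libraries", "  containers"]
for row in render(dune, {1, 11, 12}):
    print(row)
# 00001:  (library
# ......
# 00011:   (libraries
# 00012:    containers
```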

Querying your configuration-files for all library dependencies

As the tags are POSIX regular expressions separated by comma, you can express more complex patterns like tag0,(tag1|tag2), where tatr matches all paths in trees that include tag0 and also include either tag1 or tag2.
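The combination of comma (every expression must match) and alternation (either branch may match) can be sketched as below. This Python is an illustration only: it assumes that each comma-separated expression must fully match at least one word on the path, the function path_matches is hypothetical, and Python's re dialect stands in for POSIX.

```python
import re

def path_matches(query, path_words):
    """True if every comma-separated expression fully matches some word."""
    return all(
        any(re.fullmatch(tag, word) for word in path_words)
        for tag in query.split(",")
    )

print(path_matches("tag0,(tag1|tag2)", ["tag0", "tag2"]))  # True
print(path_matches("tag0,(tag1|tag2)", ["tag0", "tag3"]))  # False: neither tag1 nor tag2
print(path_matches("tag0,(tag1|tag2)", ["tag1", "tag2"]))  # False: tag0 is missing
```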

To extract all library dependencies, preprocessors and the name of each executable/library you could do:

> tatr --eft --match-file=dune '(name|libraries|preprocess)' . 
-- ./bin/dune ------------------------------------------------------------------

00001:  (executable
......
00004:   (name main)
......
00013:   (libraries
00014:    tatr
00015:    cmdliner
00016:    )

-- ./lib/dune ------------------------------------------------------------------

00001:  (library
00002:   (name tatr)
......
00011:   (libraries
00012:    containers
00013:    re
00014:    uuseg
00015:    uunf
00016:    unix
00017:    )
00018:   (preprocess
00019:    (pps
00020:     ppx_deriving.std
00021:     )
00022:    )

--------------------------------------------------------------------------------

Compiling

Clone repo:

git clone https://github.com/rand00/tatr.git
cd tatr

Install dependencies for tatr via opam:

opam install dune containers cmdliner re uuseg uunf ppx_deriving

Compile tatr itself:

dune build

Install the binary:

cp -f _build/default/bin/main.exe ~/bin/tatr

Related tools

For textual search one will often use a mix of grep and find, or ripgrep - which let you search for patterns within a single line - but these tools don't operate at the tree-level like tatr.

A note-taking system like zim lets you query for pages that contain a set of tags - but not at the indented tree-level.

Another note-taking system, logseq, has a feature called flashcards where you tag a point-indented line with #flashcard and then make a nested point with the answer. The system can then play back questions to you and expose the answer upon request. It would be relatively simple for the interested user to implement this on top of the existing tree-matchers of tatr.

History

For many years I've been organizing my notes in zim - which lets you organize your notes in trees, represented as wiki-files placed in folders on your filesystem, and link your pages in a graph. These features are extremely powerful by themselves for creating your own custom organization for remembering what you are doing and have done before.

Another feature of zim is tags which allow you to select pages that include several custom tags you've made.

I found that I often create long lists of indented notes within single pages, where a lot of relatively unrelated things are placed. What I really want is to be able to query for the specific sections of my pages that include a set of tags - and to extract these sections.

I realized that we use a lot of other structured formats based on textual indentation to represent trees of related elements - which is why tatr by default doesn't know about any specific textual format - but operates solely based on indentation. This method is e.g. compatible with

  • note-taking formats: markdown, wiki, org-mode, ...
  • pretty-printed config formats: json, s-expressions, xml, ...
  • indented trees within other textbased formats

Limitations

There are some limitations of the default method of tatr when working with formats that hide their tree-nature within special syntax. This e.g. includes headings from markdown and wiki-formats, certain config-formats like yaml, and a lot of syntax in programming languages. This is a minor limitation though, as

  • the headings of note-taking formats often duplicate text present under them
  • including the tree-structure of headings etc. will lead to very deep trees being matched - where the matching words can be very far apart; which makes them less related
  • tatr outputs the line-numbers, so you don't need headings to find the match in the source-file
  • tatr was not made to match on abstract syntax trees of code - which needs a different kind of matcher that can parse the given programming language
  • when using any extraction option other than --extract-completetree, tatr will extract only the part of the tree you are interested in; which lets you query for anything that has an indented tree-structure within any other kind of format. For example, query for your todo-notes within the code of your codebase.

Another problem arises when your structured format is not pretty-printed within each file - so the structure is not laid out via indentation. To solve this you can pass your structured format through some pretty-printer, like: cat my.json | jq '.' > my_pretty.json. tatr could possibly get a feature to apply a user-specified script to text-files on recursive traversal.
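If jq is not at hand, the same pretty-printing step can be sketched with Python's standard json module. The function name pretty and the sample document are hypothetical, and the indentation width of 2 is an arbitrary choice.

```python
import json

def pretty(json_text, indent=2):
    """Re-serialize JSON so that its nesting is laid out via indentation."""
    return json.dumps(json.loads(json_text), indent=indent)

compact = '{"library": {"name": "tatr", "libraries": ["containers", "re"]}}'
print(pretty(compact))
# {
#   "library": {
#     "name": "tatr",
#     "libraries": [
#       "containers",
#       "re"
#     ]
#   }
# }
```

After this step each nested key sits on its own indented line, which is the layout an indentation-based matcher needs.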

The query-DSL of tatr is a comma-separated list of regular expressions, which is currently parsed simply by splitting on comma. This means that you can't use commas in your regexes - though I don't know why you would want to. If this becomes a need, a new multishot CLI option --tag could be added.
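What splitting on comma means in practice can be shown with a short Python sketch. It assumes the query is split exactly as described, with no escaping. The helper split_query is hypothetical.

```python
def split_query(query):
    """Split a query into its regular expressions, as plain comma-splitting would."""
    return query.split(",")

print(split_query("fix,code"))  # ['fix', 'code']
print(split_query("a{1,2}"))    # ['a{1', '2}']: a bounded repetition is cut in two
```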

A current ideal of tatr is to be as independent as possible from specific formats - and let the user rely on existing tools to make the given text compatible with the tatr indentation-based interpretation.

Future features

Future support for the hidden tree-structures of some of the aforementioned formats could be added to tatr by building in pre-indenters that know parts of the syntax of each format - and that, in a streaming fashion, update a synthetic indentation-state that modifies what tatr thinks the real indentation-level is. If built into tatr in this way, the printed matched trees would keep their original indentation.
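As a rough picture of what such a pre-indenter might do for markdown headings, consider the Python sketch below. It is purely hypothetical and not a planned design: the function name synthetic_indents, the step width of 4 and the rule that text under a heading is nested one level beneath it are all assumptions. Each original line is passed through untouched, and only the indentation-level reported alongside it is synthetic.

```python
import re

HEADING = re.compile(r"(#+)\s")  # markdown ATX heading, e.g. "## Todo"

def synthetic_indents(lines, step=4):
    """Yield (synthetic indentation, original line) pairs in a streaming fashion."""
    depth = 0  # depth of the most recent heading
    for line in lines:
        heading = HEADING.match(line)
        if heading:
            depth = len(heading.group(1))
            yield (depth - 1) * step, line
        else:
            real = len(line) - len(line.lstrip(" "))
            yield depth * step + real, line

doc = ["# Notes", "* point", "## Todo", "  * nested"]
print(list(synthetic_indents(doc)))
# [(0, '# Notes'), (4, '* point'), (4, '## Todo'), (10, '  * nested')]
```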

About

Indentation-based tagged tree querying. Query notes or configs using a simple expressive DSL
