Description
As discussed with @linas in the email thread and with @OlegBaskov and @glicerico in Slack:
Currently, Grammar Learner (GL) clusters words into word categories, or Link Grammar (LG) rules, but it does not cluster links, so every link is just an individual reciprocal pair of connectors between an individual pair of clusters. We may need to cluster these links into categories as well. This may be useful for analysis of the grammar and may (or may not) improve parsing quality.
That is, in the current design of GL grammar induction and export to the LG dictionary:
Assuming we have categories/rules A,B,C,D.
AB- and AB+ are connectors from A to B and from B to A respectively.
CD- and CD+ are connectors between C and D.
I guess (@linas ?) there is a possibility that A and C are both nouns while B and D are both verbs.
In this sense, there should be just one label for both AB and CD.
I think that is what @linas means by "clustering links".
For example, let A be one subcategory of nouns, C another subcategory of nouns, and B and D subcategories of verbs, with link/connector labels AB and CD learned from the parses. Then we can generalise: A and C are child categories of X, B and D are child categories of Z, and AB and CD are child categories of XZ.
Another example, from a real LG dictionary created by GL:
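The A/B/C/D example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `parent` mapping and the `generalise_links` helper are hypothetical names, not part of GL, and the parent categories X and Z are taken as already known.

```python
from collections import defaultdict

# Hypothetical parent assignment: A and C are nouns (X), B and D are verbs (Z).
parent = {"A": "X", "C": "X", "B": "Z", "D": "Z"}

# Links learned from parses, as (left category, right category) pairs.
links = [("A", "B"), ("C", "D")]

def generalise_links(links, parent):
    """Group links by the parent categories of their two endpoints."""
    groups = defaultdict(list)
    for left, right in links:
        # A category without a known parent stays as its own parent.
        key = (parent.get(left, left), parent.get(right, right))
        groups[key].append((left, right))
    return dict(groups)

link_clusters = generalise_links(links, parent)
print(link_clusters)
```

Here AB and CD both end up under the single key `("X", "Z")`, which corresponds to the XZ link category.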
http://langlearn.singularitynet.io/data/clustering_2018/POC-English-2018-12-31/POC-English-Amb_LG-English_dILEd_no-gen/dict_37C_2018-12-31_0006.4.0.dict
...
% AB
"a":
(ABAK+) or (ABAL+) or (ABAM+) or (ABAO+) or (ABAP+) or (ABBB+) or (ABBD+) or (ABBE+) or (ABBH+);
% AC
"are":
(AFAC- & ACBH+);
% AD
"be":
(BGAD- & BJAD- & ADBA+);
...
AB, AC and AD are categories/rules.
ABAK+ is the connector between AB and AK.
ABAK- is the connector between AK and AB.
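To make the naming scheme concrete, here is a minimal sketch that splits connector names from the excerpt above into their two category labels and direction. It assumes that every category label is exactly two uppercase letters, which holds for this excerpt but may not hold for other GL exports; `parse_connectors` is a hypothetical helper, not an LG or GL function.

```python
import re

# Assumption: a connector is two 2-letter category labels followed by + or -.
CONNECTOR = re.compile(r"([A-Z]{2})([A-Z]{2})([+-])")

def parse_connectors(rule_text):
    """Return (first category, second category, direction) for each connector."""
    return CONNECTOR.findall(rule_text)

print(parse_connectors("(AFAC- & ACBH+);"))
print(parse_connectors("(BGAD- & BJAD- & ADBA+);"))
```

The resulting triples give each connector as a pair of categories, which is the representation link clustering would operate on.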
I guess that in the email thread @linas suggests clustering these connector labels in the space of the categories that the corresponding links connect, and then replacing the members of each clustered group with the label of its cluster when the grammar is exported.
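The export step could look roughly like the sketch below. It does not show how the link clusters are found; the `link_cluster` mapping and the cluster labels `LA` and `LB` are made up for illustration, and `relabel` is a hypothetical helper rather than existing GL code.

```python
import re

# Hypothetical result of link clustering: connector label -> cluster label.
link_cluster = {"ABAK": "LA", "ABAL": "LA", "ABAM": "LB"}

def relabel(rule_text, link_cluster):
    """Replace each clustered connector label, keeping its +/- direction."""
    def swap(match):
        label, sign = match.group(1), match.group(2)
        # Connectors that belong to no cluster are left unchanged.
        return link_cluster.get(label, label) + sign
    return re.sub(r"([A-Z]{4})([+-])", swap, rule_text)

print(relabel("(ABAK+) or (ABAL+) or (ABAM+) or (ABAO+)", link_cluster))
```

After relabelling, some disjuncts become identical (two copies of `(LA+)` in this case), so a real exporter would probably also want to remove duplicates.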
I can also imagine that clustering (generalisation) of rules and clustering (generalisation) of links may alternate iteratively in a loop, one after another, until some stable equilibrium is reached.
This improvement may take place in the existing "Rule Generalisation" component of GL, or there could be a "Grammar Generalisation" component with two sub-components, "Rule Generalisation" and "Link Generalisation", running in a finite loop based on some threshold.
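One possible shape for such a loop is sketched below. Everything here is an assumption: the two steps are passed in as plain functions, "stable" is taken to mean that one full pass leaves the grammar unchanged, and `max_iterations` stands in for the threshold, whose actual definition is still open.

```python
def generalise_grammar(grammar, generalise_rules, generalise_links,
                       max_iterations=10):
    """Alternate rule and link generalisation until nothing changes.

    Returns the final grammar and the number of passes that changed it.
    """
    for iteration in range(max_iterations):
        updated = generalise_links(generalise_rules(grammar))
        if updated == grammar:
            # Equilibrium: a full pass made no further merges.
            return grammar, iteration
        grammar = updated
    # Iteration limit reached before the grammar stabilised.
    return grammar, max_iterations
```

The grammar can be any value that supports `==`, for example a pair of frozensets holding the current rule clusters and link clusters.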
Also, in the re-factored GL, "Grammar Generalisation" could run without prior clustering in the sparse vector space, being seeded with "initial categories/rules" where each rule corresponds to a single word taken from the input parses.