Skip to content

New document class#20

Open
corentin-larose wants to merge 1 commit intocamspiers:masterfrom
qapa:new-document-class
Open

New document class#20
corentin-larose wants to merge 1 commit intocamspiers:masterfrom
qapa:new-document-class

Conversation

@corentin-larose
Copy link
Collaborator

Work in progress, just to know if you like this lead...

@camspiers
Copy link
Owner

I like the idea, but I am still thinking about the implications.

@corentin-larose
Copy link
Collaborator Author

Just figured out that I put my explanations in the commit message, don't know if you saw them (but yes, a lot to think... It's just a lead):

For instance, this document can be used directly in classify() method replacing the commented code and thus the related properties/accessors:

    public function classify(Document $document)
    {
    $results = array();

    /*
        if ($this->documentNormalizer) {
            $document = $this->documentNormalizer->normalize($document);
        }

        $tokens = $this->tokenizer->tokenize($document);

        if ($this->tokenNormalizer) {
            $tokens = $this->tokenNormalizer->normalize($tokens);
        }

        $tokens = array_count_values($tokens);
    */

    $tokens = $document;
    [...]

My pros

  • Strong contract for Documents through interface
  • Document is in a frequency state ASAP
  • Document API is very wide open (cf Unit Tests)
  • Document can still be manipulated as an array/Iterable (shame, Symfony config component (DataStore) doesn't like ArrayObject)
  • Since document is an object, it is more RAM-efficient (no multiple copies as with an array)
  • Agnostic approach using SPL
  • One can even use closures/built-in functions for normalizers/tokenizers (faster?)
  • Hydrators/Extractors made simplier
  • Some more document-level calculations could be done in the instance
  • TokenCountByDocument no longer necessary

My cons

  • Not sure if it should be in tokens state rather than in frequency state (need for calculation/count? we could store these information either)
  • Loose contracts for normalizers/tokenizers since it uses callables instead of classes with interfaces (could still be enforced though, but we would loose the closures/built-in functions advantage)
  • Slower than arrays? (not sure, needs a bench since SPL is incredibly fast, and it removes a lot of logic/iterations around)
  • Static approach for accessors which is sometimes hated by developpers (Unit Tests...)
  • Your cons?

For instance, this document can be used directly in classify() method replacing the commented code and thus the related properties/accessors:
```php
    public function classify(Document $document)
    {
	$results = array();

	/*
        if ($this->documentNormalizer) {
            $document = $this->documentNormalizer->normalize($document);
        }

        $tokens = $this->tokenizer->tokenize($document);

        if ($this->tokenNormalizer) {
            $tokens = $this->tokenNormalizer->normalize($tokens);
        }

        $tokens = array_count_values($tokens);
	*/

	$tokens = $document;
	[...]
```

My pros
- Strong contract for Documents through interface
- Document is in a frequency state ASAP
- Document API is very wide open (cf Unit Tests)
- Document can still be manipulated as an array/Iterable (shame, Symfony config component (DataStore) doesn't like ArrayObject)
- Since document is an object, it is more RAM-efficient (no multiple copies as with an array)
- Agnostic approach using SPL
- One can even use closures/built-in functions for normalizers/tokenizers (faster?)
- Hydrators/Extractors made simplier
- Some more document-level calculations could be done in the instance
- TokenCountByDocument no longer necessary

My cons
- Not sure if it should be in tokens state rather than in frequency state (need for calculation/count? we could store these information either)
- Loose contracts for normalizers/tokenizers since it uses callables instead of classes with interfaces (could still be enforced though, but we would loose the closures/built-in functions advantage)
- Slower than arrays? (not sure, needs a bench since SPL is incredibly fast, and it removes a lot of logic/iterations around)
- Static approach for accessors which is sometimes hated by developpers (Unit Tests...)
- Your cons?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants