| * `previous() -> Content?` | ||
| * Returns the previous `Content` element in the iterator, or `null` when reaching the beginning. | ||
| * `next() -> Content?` | ||
| * Returns the next `Content` element in the iterator, or `null` when reaching the end. |
There was a problem hiding this comment.
You should make it explicit that these methods make the iterator move forward or backward.
There was a problem hiding this comment.
In the documentation comment? I'm fine making it more explicit there.
I wouldn't change the function signature as it's pretty common though:
There was a problem hiding this comment.
Yeah, I was talking about the documentation.
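To make the point about cursor movement concrete, here is a minimal sketch of an in-memory iterator, assuming a cursor that sits between elements. `ArrayContentIterator` and the one-field `Content` shape are hypothetical stand-ins for illustration, not part of the proposal.

```typescript
// Hypothetical stand-in for the proposal's Content value object.
interface Content {
  text: string;
}

// Minimal in-memory iterator (an assumption for illustration only).
// The cursor sits between elements, so each call both returns an
// element and moves the cursor past it.
class ArrayContentIterator {
  private elements: Content[];
  private cursor: number = 0; // index of the element next() would return

  constructor(elements: Content[]) {
    this.elements = elements;
  }

  /** Moves the cursor forward and returns the element it passed, or null at the end. */
  next(): Content | null {
    const element = this.elements[this.cursor];
    if (element === undefined) return null;
    this.cursor += 1;
    return element;
  }

  /** Moves the cursor backward and returns the element it passed, or null at the beginning. */
  previous(): Content | null {
    const element = this.elements[this.cursor - 1];
    if (element === undefined) return null;
    this.cursor -= 1;
    return element;
  }
}
```

With this cursor model, calling `next()` and then `previous()` returns the same element twice, which is the kind of behavior the documentation comment should spell out.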
| ##### Properties | ||
|
|
||
| * `locator: Locator` | ||
| * Locator targeting this element in the `Publication`. |
There was a problem hiding this comment.
Will we be able to get a locator for any target data? Think of three successive images without HTML ids. I believe that neither fragments nor text after/before can be used.
There was a problem hiding this comment.
I can't guarantee that it will be the case for all media types, but for the ones we have so far I think so.
With image elements, you can use a cssSelector in the locations object. I filled it for all locators in the HtmlResourceContentIterator, as even with a text context it can help limit the search scope to a parent element.
If there's ever a case where we can't target precisely, this Locator should at least match the closest targetable parent.
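As a rough sketch of how a selector could target images without ids, assuming a simplified element tree: `SimpleNode`, `makeNode`, and `cssSelectorFor` are hypothetical helpers, and this is not the actual `HtmlResourceContentIterator` logic. The selector stops at the closest ancestor that has an id, otherwise it walks up to the root.

```typescript
// Simplified element model (hypothetical, for illustration only).
interface SimpleNode {
  tag: string;
  id: string | null;
  parent: SimpleNode | null;
  children: SimpleNode[];
}

// Creates a node and wires up the parent links of its children.
function makeNode(tag: string, id: string | null = null, children: SimpleNode[] = []): SimpleNode {
  const created: SimpleNode = { tag, id, parent: null, children };
  for (const child of children) child.parent = created;
  return created;
}

// Builds a selector from the closest ancestor with an id (or the root)
// down to the target, using :nth-child for elements without ids.
function cssSelectorFor(target: SimpleNode): string {
  const parts: string[] = [];
  let current: SimpleNode | null = target;
  while (current !== null) {
    if (current.id !== null) {
      parts.unshift(`#${current.id}`);
      break;
    }
    const parent: SimpleNode | null = current.parent;
    if (parent === null) {
      parts.unshift(current.tag);
      break;
    }
    const position = parent.children.indexOf(current) + 1;
    parts.unshift(`${current.tag}:nth-child(${position})`);
    current = parent;
  }
  return parts.join(" > ");
}

// Three successive images without ids inside a container that has one.
const firstImage = makeNode("img");
const secondImage = makeNode("img");
const thirdImage = makeNode("img");
makeNode("body", null, [makeNode("div", "gallery", [firstImage, secondImage, thirdImage])]);

// The href is a made-up example value.
const imageLocator = {
  href: "chapter1.xhtml",
  locations: { cssSelector: cssSelectorFor(thirdImage) },
};
```

Here `imageLocator.locations.cssSelector` is `#gallery > img:nth-child(3)`, so each of the three images gets a distinct selector even though none has an id.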
| * `style: TextStyle` | ||
| * Semantic style for this element. | ||
| * `spans: [TextSpan]` | ||
| * List of text spans in this element. |
There was a problem hiding this comment.
What's the rationale behind grouping multiple TextSpans into a single Content.Data.Text?
There was a problem hiding this comment.
For example if you have the body paragraph:
<body lang="en">
<p>The correct pronunciation is <span lang="fr">croissant</span>, and not croissant.</p>
</body>
We want the whole paragraph as a single semantic Text element. However, to not lose the information about the language (which is important for TTS), you can split the Text element into a list of spans, which are basically "attributed ranges" in the text element:
{
  "locator": {},
  "data": {
    "style": "body",
    "spans": [
      { "text": "The correct pronunciation is ", "language": "en" },
      { "text": "croissant", "language": "fr" },
      { "text": ", and not croissant.", "language": "en" }
    ]
  }
}
Like @danielweck mentioned on the call, the term span can be confusing when parsing HTML, as there is no 1-1 relationship between the HTML spans and the produced elements.
However, thinking more about this, I think the term is still the most accurate. The problem happens only with HTML media types, and the semantics seem to be correct:
- https://developer.android.com/guide/topics/text/spans
- https://api.flutter.dev/flutter/painting/TextSpan-class.html
I think it even matches the meaning from the HTML spec:
The span element doesn't mean anything on its own, but can be useful when used together with the global attributes, e.g. class, lang, or dir.
https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-span-element
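A minimal sketch of how the paragraph example in this comment could be flattened into spans, assuming a simplified inline tree where each element may carry a `lang`. `Inline`, `TextSpan`, and `collectSpans` are hypothetical names for illustration. Adjacent text with the same inherited language is merged, which is why HTML elements and produced spans do not map 1-1.

```typescript
// Simplified inline tree: either raw text or an element with an optional lang.
type Inline = string | { lang: string | null; children: Inline[] };

// An attributed range of text within a single Text element.
interface TextSpan {
  text: string;
  language: string | null;
}

// Walks the tree depth-first, inheriting the language from ancestors and
// merging consecutive text runs that share the same language.
function collectSpans(node: Inline, inherited: string | null, out: TextSpan[] = []): TextSpan[] {
  if (typeof node === "string") {
    const last = out[out.length - 1];
    if (last !== undefined && last.language === inherited) {
      last.text += node;
    } else {
      out.push({ text: node, language: inherited });
    }
    return out;
  }
  const language = node.lang ?? inherited;
  for (const child of node.children) {
    collectSpans(child, language, out);
  }
  return out;
}

// <body lang="en"><p>The correct pronunciation is <span lang="fr">croissant</span>, and not croissant.</p></body>
const paragraphTree: Inline = {
  lang: "en",
  children: [
    {
      lang: null,
      children: [
        "The correct pronunciation is ",
        { lang: "fr", children: ["croissant"] },
        ", and not croissant.",
      ],
    },
  ],
};
const paragraphSpans = collectSpans(paragraphTree, null);
```

For this input, `paragraphSpans` holds the same three entries as the `spans` array in the JSON example. An inline element without its own `lang`, such as `<b>`, would not produce a separate span.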
There was a problem hiding this comment.
@danielweck I'm renaming span to segment to remove any ambiguity. (Other candidates: part, chunk, portion)
There was a problem hiding this comment.
I will also rename style to role, to remove the notion of appearance that style brings.
|
|
||
| ### Extracting the data from `Content` elements | ||
|
|
||
| The `Content` elements are value objects containing: |
There was a problem hiding this comment.
Some new elements to take into account:
- SVG
  - can have an `aria-label` or `title` child element
- MathML
  - extracting the `tt` child elements could be useful
- MathJax
See the full proposal
Summary
This proposal introduces a new Publication Service to iterate through a publication's content extracted as semantic elements.
The Content Iterator service provides a building block for many high-level features requiring access to the raw content of a publication, such as:
Today, implementing such features is complex because you need to:
The Content Iterator handles all that in a media type agnostic way.
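As an illustration of how a feature could consume the service, here is a minimal sketch that drains an iterator and collects language-tagged text, for example to feed a TTS engine (TTS is the use case mentioned in the review thread). `ForwardIterator`, `TextContent`, `collectUtterances`, and `iteratorOver` are hypothetical names, and the data shape is reduced to the `spans` list from the proposal.

```typescript
interface SpokenSpan {
  text: string;
  language: string | null;
}

// Reduced Content shape: only the spans matter for this sketch.
interface TextContent {
  spans: SpokenSpan[];
}

// Only the forward half of the iterator is needed here.
interface ForwardIterator {
  next(): TextContent | null;
}

// Calls next() until it returns null and keeps every non-blank span.
function collectUtterances(iterator: ForwardIterator): SpokenSpan[] {
  const utterances: SpokenSpan[] = [];
  for (let content = iterator.next(); content !== null; content = iterator.next()) {
    for (const span of content.spans) {
      if (span.text.trim().length > 0) {
        utterances.push({ text: span.text, language: span.language });
      }
    }
  }
  return utterances;
}

// Hypothetical in-memory source, standing in for a real publication.
function iteratorOver(elements: TextContent[]): ForwardIterator {
  let cursor = 0;
  return {
    next(): TextContent | null {
      const element = elements[cursor];
      if (element === undefined) return null;
      cursor += 1;
      return element;
    },
  };
}
```

The consumer never looks at the underlying media type. It only depends on `next()` and on the extracted spans.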
Review notes
The `Content` class and its components need some review to account for all needs.