romazu commented Dec 2, 2025

Related to #24 and #145, but this PR only affects elements inside <author> or <contributor>. Specifically, it fixes the incorrect parsing of arXiv Atom feeds reported in #145, where author affiliations were lost.

Problem

Custom namespace elements (e.g. <arxiv:affiliation>) inside <author> or <contributor> are stored at entry level, causing:

  1. Loss of association between authors and their custom data
  2. Overwriting when multiple authors have the same custom element

Scope

This PR does not solve the general problem of how to handle unknown elements (whether to store on the parent, aggregate into a list, or assign as dict fields). However:

  1. It improves the current (incorrect) behavior where child elements overwrite parent-level fields
  2. It does not affect any other behavior: all existing tests pass

Example

<entry>
  <author>
    <name>Alice</name>
    <arxiv:affiliation>MIT</arxiv:affiliation>
  </author>
  <author>
    <name>Bob</name>
    <arxiv:affiliation>Stanford</arxiv:affiliation>
  </author>
</entry>

Before:

entry.authors[0]  # {'name': 'Alice'}
entry.authors[1]  # {'name': 'Bob'}
entry.arxiv_affiliation  # 'Stanford' (overwrites MIT, stored in entry)

After:

entry.authors[0]  # {'name': 'Alice', 'arxiv_affiliation': 'MIT'}
entry.authors[1]  # {'name': 'Bob', 'arxiv_affiliation': 'Stanford'}
'arxiv_affiliation' in entry  # False
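For consumers, the "After" shape means affiliations can be read directly from each author dict. The helper below is a minimal sketch, not part of this PR: `author_affiliations` is a hypothetical name, and it assumes the entry is dict-like and uses the `arxiv_affiliation` key shown above. It is exercised here with plain dicts rather than a real parsed feed.

```python
from typing import Any, List, Mapping, Optional, Tuple


def author_affiliations(
    entry: Mapping[str, Any], key: str = "arxiv_affiliation"
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (name, affiliation) pairs for every author in a dict-like entry.

    Hypothetical helper: `key` defaults to the field name shown in the
    "After" output. Authors without that field yield None as affiliation.
    """
    pairs = []
    for author in entry.get("authors", []):
        pairs.append((author.get("name"), author.get(key)))
    return pairs


# Plain-dict stand-in for the "After" structure.
entry = {
    "authors": [
        {"name": "Alice", "arxiv_affiliation": "MIT"},
        {"name": "Bob", "arxiv_affiliation": "Stanford"},
    ]
}
print(author_affiliations(entry))  # [('Alice', 'MIT'), ('Bob', 'Stanford')]
```

With the "Before" shape, the same call would return None for every affiliation, since the value was stored on the entry instead of on the authors.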

Alternatives considered

One alternative was to add explicit support for the arXiv namespace (like itunes, dc, media). However, this would increase the maintenance burden and wouldn't help other custom namespaces.

Implementation

The fix is small (~10 lines):

  • Add _maybe_get_author_context(), which returns the current author/contributor dict when the parser is inside one
  • Update pop() to store unknown elements in the author context when applicable
  • Ensure _end_author() / _end_contributor() exit the author/contributor context before calling pop()
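The idea behind these changes can be illustrated with a toy SAX handler. This is a simplified sketch and not the actual diff: `ToyHandler`, `parse_toy`, and the `person` attribute are hypothetical, and only the names `_maybe_get_author_context()` and `pop()` are borrowed from the list above. The namespace URI is the one arXiv uses in its Atom feeds.

```python
import xml.sax

FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <author>
      <name>Alice</name>
      <arxiv:affiliation>MIT</arxiv:affiliation>
    </author>
    <author>
      <name>Bob</name>
      <arxiv:affiliation>Stanford</arxiv:affiliation>
    </author>
  </entry>
</feed>"""


class ToyHandler(xml.sax.ContentHandler):
    """Toy stand-in for a feed parser (hypothetical, single-entry only)."""

    PERSON_TAGS = {"author", "contributor"}

    def __init__(self):
        super().__init__()
        self.entry = {"authors": [], "contributors": []}
        self.person = None  # dict of the open <author>/<contributor>, if any
        self.text = []

    def _maybe_get_author_context(self):
        # Returns the current person dict, or None when outside one.
        return self.person

    def startElement(self, name, attrs):
        self.text = []
        if name in self.PERSON_TAGS:
            self.person = {}
            self.entry[name + "s"].append(self.person)

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        if name in self.PERSON_TAGS:
            # Leave the person context; the container itself is not stored.
            self.person = None
            return
        if name in ("entry", "feed"):
            return
        self.pop(name, "".join(self.text).strip())

    def pop(self, name, value):
        key = name.replace(":", "_")  # arxiv:affiliation -> arxiv_affiliation
        target = self._maybe_get_author_context()
        if target is None:  # an empty person dict is falsy, so compare to None
            target = self.entry
        target[key] = value


def parse_toy(data):
    handler = ToyHandler()
    xml.sax.parseString(data, handler)
    return handler.entry


entry = parse_toy(FEED)
print(entry["authors"])
```

Routing every leaf element through one `pop()` that asks for the current context keeps the change local: elements outside a person container still land on the entry, as before.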

Tests

  • Added four well-formed Atom 1.0 test feeds under tests/wellformed/atom10/ to cover custom elements inside <author> and <contributor>.
  • The full test suite still passes.

Custom namespace elements (e.g. arxiv:affiliation) inside <author> or
<contributor> are now stored in the author/contributor dict rather than
at entry level. This preserves the association between authors and their
custom data when multiple authors have the same custom element.