Skip to content

Conversation

Copy link

Copilot AI commented Dec 23, 2025

Implementation Plan for Creator Biographical Information

  • Reverted previous commits to start over with better structure
  • 1. Refactor XML injection logic for better extensibility
  • 2. Add linked_agents to resolve parameter
  • 3. Import xml.sax.saxutils.escape
  • 4. Add get_creator_bioghist method
  • 5. Integrate bioghist into XML injection
  • 6. Update comment to reflect new behavior
  • 7. Add copilot-instructions.md with commit guidelines
  • 8. Clarify import guidelines in copilot-instructions.md
  • 9. Add newlines for XML readability
  • 10. Escape XML special characters in all labels

Summary

Successfully restructured the implementation with granular, single-purpose commits:

  • Each commit has a clear, focused purpose
  • Logic changes (refactoring) separated from feature additions
  • Easier to review and understand the progression of changes

Documentation Added

Created .github/copilot-instructions.md with onboarding instructions for GitHub Copilot agents, documenting the granular commit style guidelines including:

  • One commit per logical change
  • Imports included with their usage (not as separate commits)
  • Clear examples of good vs too-dense vs too-granular commits

XML Formatting

  • Injected XML elements now include newlines for better human readability
  • All XML content (recordgroup, subgroup, bioghist) is properly escaped to prevent syntax errors
Original prompt

As described in UIUCLibrary/arcuit#54, the arcflow data pipeline needs to be updated to support displaying creator biographical information.

The implementation should:

  1. Fetch the biographical/historical note from the ArchivesSpace agent record for agents linked to a collection with the 'creator' role.
  2. Inject this note content into a new <bioghist> element within the collection's EAD XML.
  3. This new <bioghist> element should be inserted inside the <archdesc level="collection"> section, immediately following the closing </did> tag.
  4. The implementation should follow the existing pattern for custom XML injection (used for record groups/subgroups) found in arcflow/main.py.

This pull request was created from Copilot chat.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 3 commits December 23, 2025 16:10
- Add get_creator_bioghist method to fetch biographical notes
- Update task_resource to resolve linked_agents
- Inject bioghist element into EAD XML after </did> tag
- Follow existing pattern for custom XML injection

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
- Use xml.sax.saxutils.escape for proper XML escaping
- Format bioghist notes as separate paragraphs
- Remove manual XML escaping in favor of library function

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
- Make error message more specific about the operation that failed

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copilot AI changed the title [WIP] Update arcflow pipeline to display creator biographical information Add creator biographical information to EAD XML exports Dec 23, 2025
Copilot AI requested a review from alexdryden December 23, 2025 16:20
arcflow/main.py Outdated
'ead3': 'false',
})

# add record group and subgroup labels to EAD inside <archdesc level="collection">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to update the comments to reflect the new behavior

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in commit db640ce. The comment now reflects that the section handles multiple types of custom XML elements (record group/subgroup labels and biographical/historical notes).

Copy link
Contributor

@alexdryden alexdryden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before I continue with the review, I'd like for you to start over re-work this with more frequent and verbose commits. So, keep everything the same, but parse out the changes more clearly through more frequent, single purpose commits.

Even if humans committing to this repo have dense commits, I'd like for you to be more explicit. For example, in the first commit to main.py you are making changes to the like locating the closing </did> tag, which is great, but not really part of "Add creator biographical information to EAD XML exports".

This makes it easier to review and helps me understand your specific logic and choices.

Copilot AI and others added 10 commits December 23, 2025 16:48
Restructure the XML injection code to make it easier to add
additional XML elements. Changes:
- Move XML decode/encode outside the label-conditional block
- Calculate insertion point (after </did> tag) upfront
- Use extra_xml variable to accumulate multiple XML elements
- Maintains existing recordgroup/subgroup functionality

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Include linked_agents in the resource API call resolution to
fetch agent information along with the resource. This prepares
for extracting creator biographical information.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Add import for xml_escape function to properly escape XML special
characters (ampersands, less-than, greater-than, quotes) when
injecting content into EAD XML.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Implement method to fetch biographical/historical notes from
ArchivesSpace agent records. The method:
- Filters agents by 'creator' role
- Fetches agent details via API
- Extracts note_bioghist content from subnotes
- Escapes XML special characters using xml_escape
- Wraps each note in <p> tags for proper XML structure
- Returns concatenated paragraphs or None if no notes found

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Call get_creator_bioghist to fetch biographical notes and inject
them into the EAD XML as a <bioghist> element. The bioghist element
is added after </did> tag, following recordgroup/subgroup labels.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Update the comment on the XML injection section to indicate that
it now handles multiple types of custom XML elements (record group/
subgroup labels AND biographical/historical notes), not just labels.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copy link
Author

Copilot AI commented Dec 23, 2025

Before I continue with the review, I'd like for you to start over re-work this with more frequent and verbose commits. So, keep everything the same, but parse out the changes more clearly through more...

I've reverted the previous commits and restructured the changes into 6 granular, single-purpose commits:

  1. b39d03b - Refactor XML injection logic (separates structure changes)
  2. a50571c - Add linked_agents to resolve parameter
  3. 600442f - Import xml.sax.saxutils.escape
  4. d7f1bc1 - Add get_creator_bioghist method
  5. 2ae937e - Integrate bioghist into XML injection
  6. db640ce - Update comment to reflect new behavior

Each commit now has a clear, focused purpose that's easier to review.

@alexdryden
Copy link
Contributor

@copilot Great! This commit style suits me much better for this repo. Can you add a .github/copilot-instructions.md file with some basic on boarding instructions that includes this commit style? Consult https://docs.github.com/en/enterprise-cloud@latest/copilot/tutorials/coding-agent/get-the-best-results if you need context for what this file is for. Keep it very simple for now and mainly focus the commit style. Make the file so it is easy and clear for another agent to add more instructions later.

Create onboarding instructions for GitHub Copilot agents with
focus on granular, single-purpose commit style. Includes:
- Guidelines for commit granularity
- Examples of good vs bad commit sequences
- Commit message format recommendations
- Instructions for extending the file

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copy link
Author

Copilot AI commented Dec 23, 2025

@copilot Great! This commit style suits me much better for this repo. Can you add a .github/copilot-instructions.md file with some basic on boarding instructions that includes this commit style? Consu...

Created in commit b485f55. The file documents the granular commit style with clear examples and is structured to be easily extended by other agents.

@alexdryden
Copy link
Contributor

@copilot This is great! Lets tune it a bit more. I noticed in your granular commits you have a commit for adding an import--that is too granular. You can silently add the import in the same commit where it is used. Update the instructions to make that clear.

Update copilot-instructions.md to specify that imports should
be added in the same commit where they're used, not as separate
commits. Added guideline and examples showing both good and
too-granular commit patterns.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copy link
Author

Copilot AI commented Dec 23, 2025

@copilot This is great! Lets tune it a bit more. I noticed in your granular commits you have a commit for adding an import--that is too granular. You can silently add the import in the same commit whe...

Updated in commit 76ac751. Added guideline "Include imports with usage" and examples showing both too-dense and too-granular commit patterns to clarify the right balance.

@alexdryden
Copy link
Contributor

Okay, this all looks good! @copilot I do notice that the xml output looks like this:

<?xml version="1.0" encoding="utf-8"?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 https://www.loc.gov/ead/ead.xsd"><eadheader countryencoding="iso3166-1" dateencoding="iso8601" langencoding="iso639-2b" repositoryencoding="iso15511"><eadid countrycode="US" mainagencycode="US-IU-Ar">UA.30.2.802</eadid><filedesc><titlestmt><titleproper type="filing">Illi Notes</titleproper><titleproper>Guide to the Illi Notes, 1963- <num>UA.30.2.802</num></titleproper></titlestmt><publicationstmt><publisher>University of Illinois Archives [UTF-8 test, fall 2023 data]</publisher><address><addressline>19 Library</addressline><addressline>1408 W. Gregory Dr.</addressline><addressline>Urbana, IL 61820</addressline><addressline>Business Number: (217) 333-0798</addressline><addressline>Fax Number: (217) 333-2868</addressline><addressline>illiarch@illinois.edu</addressline><addressline>URL: <extptr xlink:href="http://archives.library.illinois.edu" xlink:show="new" xlink:title="http://archives.library.illinois.edu" xlink:type="simple"/></addressline></address></publicationstmt></filedesc><profiledesc><creation>This finding aid was produced using ArchivesSpace on <date>2025-12-23 13:58:53 -0600</date>.</creation><langusage><language langcode='eng' scriptcode='Latn'>English</language></langusage></profiledesc></eadheader><archdesc level="collection">
  <did>
    <repository>
      <corpname>University of Illinois Archives [UTF-8 test, fall 2023 data]</corpname>
    </repository>
    <unittitle>Illi Notes</unittitle>
    <origination label="Creator">
      <corpname source="local">Moms Association</corpname>
    </origination>
    <unitid>UA.30.2.802</unitid>
    <unitid type="aspace_uri">/repositories/5/resources/1223</unitid>
    <physdesc altrender="whole">
      <extent altrender="materialtype spaceoccupied">0.60 Cubic Feet</extent>
    </physdesc>
    <unitdate datechar="creation" normal="1963/2025" type="inclusive">1963-</unitdate>
    <langmaterial id="aspace_7fff13d65a3cc22e093c56381ddc2c65"><language langcode='eng'>English</language></langmaterial>
    <container id="aspace_545f35486e54f5713a63e4c7b8d8778d" label="Mixed Materials" type="box">1</container>
    <container id="aspace_4d98edffae401c705a8d1f96ab9e1393" label="Mixed Materials" type="box">2</container>
  </did><recordgroup>UA 30 — Dads & Mothers Association</recordgroup><subgroup>UA 30.2 — Mothers Association</subgroup><bioghist><p>The University of Illinois Mothers Association was founded in 1923 after two years under the leadership of the Women's League.&lt;emph render='super'&gt;1&lt;/emph&gt; It is a non-profit organization dedicated to service by mothers to the University of Illinois and its students.&lt;emph render='super'&gt;2&lt;/emph&gt; The association is governed by an executive board consisting of a president, 3 vice-presidents, four directors (selected on an area basis), a treasurer, and an executive secretary.&lt;emph render='super'&gt;3&lt;/emph&gt; Funds are deposited with the University Business Office.&lt;emph render='super'&gt;4&lt;/emph&gt; Each county in the state has at least one county chairwoman.&lt;emph render='super'&gt;5&lt;/emph&gt; Membership is open to mothers of all university students.&lt;emph render='super'&gt;6&lt;/emph&gt; There is a quarterly newsletter, Illi Notes, which was first published in 1963, and is mailed to all contributing members.&lt;emph render='super'&gt;7&lt;/emph&gt; Some programs are co-sponsored with the Dads Association but each organization maintains separate administrative and continuing activities.&lt;emph render='super'&gt;8&lt;/emph&gt; Dues cover the cost of programs and activities, one of which is an annual scholarship for the amount of tuition and fees. This scholarship was first awarded in 1952.&lt;emph render='super'&gt;9&lt;/emph&gt;
	1. County Chairwoman's Handbook, 1971-1972, p. 4.
	2. Ibid., title page.
	3. Ibid., p. 5.
	4. Ibid., p. 5.
	5. Ibid., p. 5.
	6. University of Illinois Mothers Association (pamphlet)
	7. County Chairwoman's Handbook, 1971-1972, p. 10.
	8. University of Illinois Mothers Association (pamphlet)
	9. County Chairwoman's Handbook, 1971-1972, p. 10.</p></bioghist>
  <scopecontent id="aspace_5fc120c888db22dbfe899d0a2259be6d">
    <head>Scope and Contents</head>
<p>Illi Notes published for mothers of Illinois students four times a year since November 1963, and including "letters" from the Dean of Students, Dean of Men and Dean of Women; student morality; articles on Mothers Association activities and area meetings; Mother's Day; scholarships and awards. This series includes newsletters that are published jointly (beginning in 1984-) by the Dads and Mothers Associations.</p>  </scopecontent>
  <arrangement id="aspace_d8c176e1721d16ddc37e2d9c14f77adf">
    <head>Arrangement Note</head>
<p>Chronological</p>  </arrangement>
  <accruals id="aspace_7043a9d7699d9b2eaf1ef013a4f6dbff">
    <head>Accruals and Additions</head>
<p>9/1963, 2003</p>  </accruals>
  <controlaccess>
    <subject source="local">Mothers Association</subject>
  </controlaccess>
  <dsc/>
</archdesc>
</ead>

where there are no line breaks between the element being injected. This is mostly just a cosmetic issue, but the EAD is frequently referenced by humans, so I think this is an important readability issue.

Add newline characters before each injected XML element
(recordgroup, subgroup, bioghist) to improve readability
of the generated EAD XML for human reviewers.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copy link
Author

Copilot AI commented Dec 23, 2025

Okay, this all looks good! @copilot I do notice that the xml output looks like this:

...

Fixed in commit 0fd29ad. Added newline characters before each injected XML element so they appear on separate lines in the output for better readability.

@alexdryden
Copy link
Contributor

@copilot Great! Looks good. So, from that same example I'm seeing that there is an & in what is now line 21:

<?xml version="1.0" encoding="utf-8"?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 https://www.loc.gov/ead/ead.xsd"><eadheader countryencoding="iso3166-1" dateencoding="iso8601" langencoding="iso639-2b" repositoryencoding="iso15511"><eadid countrycode="US" mainagencycode="US-IU-Ar">UA.30.2.802</eadid><filedesc><titlestmt><titleproper type="filing">Illi Notes</titleproper><titleproper>Guide to the Illi Notes, 1963- <num>UA.30.2.802</num></titleproper></titlestmt><publicationstmt><publisher>University of Illinois Archives [UTF-8 test, fall 2023 data]</publisher><address><addressline>19 Library</addressline><addressline>1408 W. Gregory Dr.</addressline><addressline>Urbana, IL 61820</addressline><addressline>Business Number: (217) 333-0798</addressline><addressline>Fax Number: (217) 333-2868</addressline><addressline>illiarch@illinois.edu</addressline><addressline>URL: <extptr xlink:href="http://archives.library.illinois.edu" xlink:show="new" xlink:title="http://archives.library.illinois.edu" xlink:type="simple"/></addressline></address></publicationstmt></filedesc><profiledesc><creation>This finding aid was produced using ArchivesSpace on <date>2025-12-23 14:20:30 -0600</date>.</creation><langusage><language langcode='eng' scriptcode='Latn'>English</language></langusage></profiledesc></eadheader><archdesc level="collection">
  <did>
    <repository>
      <corpname>University of Illinois Archives [UTF-8 test, fall 2023 data]</corpname>
    </repository>
    <unittitle>Illi Notes</unittitle>
    <origination label="Creator">
      <corpname source="local">Moms Association</corpname>
    </origination>
    <unitid>UA.30.2.802</unitid>
    <unitid type="aspace_uri">/repositories/5/resources/1223</unitid>
    <physdesc altrender="whole">
      <extent altrender="materialtype spaceoccupied">0.60 Cubic Feet</extent>
    </physdesc>
    <unitdate datechar="creation" normal="1963/2025" type="inclusive">1963-</unitdate>
    <langmaterial id="aspace_7fff13d65a3cc22e093c56381ddc2c65"><language langcode='eng'>English</language></langmaterial>
    <container id="aspace_fd59649f8759b8e14f0f73c41efbac20" label="Mixed Materials" type="box">1</container>
    <container id="aspace_9e37d48141a4c0ae13ee3b001e8842e8" label="Mixed Materials" type="box">2</container>
  </did>
<recordgroup>UA 30 — Dads & Mothers Association</recordgroup>
<subgroup>UA 30.2 — Mothers Association</subgroup>
<bioghist><p>The University of Illinois Mothers Association was founded in 1923 after two years under the leadership of the Women's League.&lt;emph render='super'&gt;1&lt;/emph&gt; It is a non-profit organization dedicated to service by mothers to the University of Illinois and its students.&lt;emph render='super'&gt;2&lt;/emph&gt; The association is governed by an executive board consisting of a president, 3 vice-presidents, four directors (selected on an area basis), a treasurer, and an executive secretary.&lt;emph render='super'&gt;3&lt;/emph&gt; Funds are deposited with the University Business Office.&lt;emph render='super'&gt;4&lt;/emph&gt; Each county in the state has at least one county chairwoman.&lt;emph render='super'&gt;5&lt;/emph&gt; Membership is open to mothers of all university students.&lt;emph render='super'&gt;6&lt;/emph&gt; There is a quarterly newsletter, Illi Notes, which was first published in 1963, and is mailed to all contributing members.&lt;emph render='super'&gt;7&lt;/emph&gt; Some programs are co-sponsored with the Dads Association but each organization maintains separate administrative and continuing activities.&lt;emph render='super'&gt;8&lt;/emph&gt; Dues cover the cost of programs and activities, one of which is an annual scholarship for the amount of tuition and fees. This scholarship was first awarded in 1952.&lt;emph render='super'&gt;9&lt;/emph&gt;
	1. County Chairwoman's Handbook, 1971-1972, p. 4.
	2. Ibid., title page.
	3. Ibid., p. 5.
	4. Ibid., p. 5.
	5. Ibid., p. 5.
	6. University of Illinois Mothers Association (pamphlet)
	7. County Chairwoman's Handbook, 1971-1972, p. 10.
	8. University of Illinois Mothers Association (pamphlet)
	9. County Chairwoman's Handbook, 1971-1972, p. 10.</p></bioghist>
  <scopecontent id="aspace_5fc120c888db22dbfe899d0a2259be6d">
    <head>Scope and Contents</head>
<p>Illi Notes published for mothers of Illinois students four times a year since November 1963, and including "letters" from the Dean of Students, Dean of Men and Dean of Women; student morality; articles on Mothers Association activities and area meetings; Mother's Day; scholarships and awards. This series includes newsletters that are published jointly (beginning in 1984-) by the Dads and Mothers Associations.</p>  </scopecontent>
  <arrangement id="aspace_d8c176e1721d16ddc37e2d9c14f77adf">
    <head>Arrangement Note</head>
<p>Chronological</p>  </arrangement>
  <accruals id="aspace_7043a9d7699d9b2eaf1ef013a4f6dbff">
    <head>Accruals and Additions</head>
<p>9/1963, 2003</p>  </accruals>
  <controlaccess>
    <subject source="local">Mothers Association</subject>
  </controlaccess>
  <dsc/>
</archdesc>
</ead>

Look back through, I thought I recalled this being handled for record group/subgroup at some point in the past. Make sure that you haven't bypassed some xml formatting through one of your changes. Report back what you find along with a proposed fix.

Apply xml_escape to recordgroup and subgroup labels to prevent
XML syntax errors when labels contain special characters like
&, <, or >. This was a pre-existing issue but is now fixed to
be consistent with bioghist content escaping.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copy link
Author

Copilot AI commented Dec 23, 2025

@copilot Great! Looks good. So, from that same example I'm seeing that there is an & in what is now line 21:

<?xml version="1.0" encoding="utf-8"?>

...

Findings: The recordgroup/subgroup labels were never being XML-escaped in the original code - this was a pre-existing issue, not introduced by my refactoring. However, my bioghist implementation properly uses xml_escape for its content.

Fix: Applied in commit eabad7c. Now using xml_escape() on both rg_label and sg_label before inserting them into the XML, consistent with how bioghist content is handled. This prevents XML syntax errors when labels contain &, <, >, or other special characters.

@alexdryden
Copy link
Contributor

@copilot Great! Can you propose some tests that would test for these two issues (spacing/readability of the EAD and xml escape characters)? You can think about this request broadly, e.g., should the code be refactored into something that is more testable? Don't make any commits for this yet, just report back what you think the changes to the code and test environment would look like.

Copy link
Author

Copilot AI commented Dec 23, 2025

@copilot Great! Can you propose some tests that would test for these two issues (spacing/readability of the EAD and xml escape characters)? You can think about this request broadly, e.g., should the c...

Test Proposal

Current State Analysis

The XML injection logic is currently embedded in task_resource() method (lines 229-263), making it difficult to test without:

  • Live ArchivesSpace API access
  • Full ArcFlow instance initialization
  • File system dependencies

Recommended Refactoring for Testability

1. Extract XML injection into pure function:

def inject_custom_xml_elements(xml_content: str, rg_label: str = None, 
                                sg_label: str = None, bioghist_content: str = None) -> str:
    """
    Inject custom XML elements after </did> tag.
    Returns modified XML string with proper escaping and formatting.
    """
    # Current lines 232-259 logic here
    # Pure function - no self, no side effects

2. Extract bioghist formatting into testable method:

def format_bioghist_paragraphs(bioghist_notes: list) -> str:
    """Format biographical notes as escaped XML paragraphs."""
    # Current logic from get_creator_bioghist lines 549-555

Proposed Tests

Test file: tests/test_xml_injection.py

import pytest
from arcflow.main import inject_custom_xml_elements, format_bioghist_paragraphs

class TestXMLInjection:
    def test_xml_escaping_in_recordgroup(self):
        """Test that special chars in recordgroup are escaped"""
        xml = '<archdesc level="collection"><did></did></archdesc>'
        result = inject_custom_xml_elements(xml, rg_label="Group & Co")
        assert '&amp;' in result
        assert '<recordgroup>Group &amp; Co</recordgroup>' in result
    
    def test_xml_escaping_in_subgroup(self):
        """Test that < and > are escaped"""
        xml = '<archdesc level="collection"><did></did></archdesc>'
        result = inject_custom_xml_elements(xml, rg_label="RG1", sg_label="Sub<>group")
        assert '&lt;' in result and '&gt;' in result
    
    def test_newlines_between_elements(self):
        """Test readability: elements on separate lines"""
        xml = '<archdesc level="collection"><did></did><scope></scope></archdesc>'
        result = inject_custom_xml_elements(xml, rg_label="RG1", sg_label="SG1", 
                                            bioghist_content="<p>Bio</p>")
        lines = result.split('\n')
        assert any('<recordgroup>' in line for line in lines)
        assert any('<subgroup>' in line for line in lines)
        assert any('<bioghist>' in line for line in lines)
    
    def test_bioghist_escaping(self):
        """Test that bioghist content is properly escaped"""
        notes = ["John & Jane", "Born <1900", "Died >2000"]
        result = format_bioghist_paragraphs(notes)
        assert '&amp;' in result
        assert '&lt;' in result
        assert '&gt;' in result

Test Infrastructure Setup

Create:

  • tests/ directory
  • tests/__init__.py
  • tests/test_xml_injection.py
  • pytest.ini or setup.cfg for configuration

Update requirements.txt:

ArchivesSnake
pyyaml
pytest>=7.0.0
pytest-cov>=4.0.0  # for coverage reporting

Benefits

  • Tests run without ArchivesSpace connection
  • Fast feedback (milliseconds vs seconds/minutes)
  • Can test edge cases easily (malicious input, unicode, etc.)
  • CI/CD integration ready
  • Regression prevention for formatting/escaping issues

@alexdryden
Copy link
Contributor

@copilot Okay, I like where you are headed in terms of extracting injection logic and seperating concerns, but any kind of custom injection funtion should not take these super specific parameters that are tied to specific types of things we want to inject.

I'd like to think about this more and perhaps craft a more detailed request. Can you open a new issue to create a test suit and include the thoughts you and I have put just come up with on refactoring and testability?

Copy link
Author

Copilot AI commented Dec 23, 2025

@copilot Okay, I like where you are headed in terms of extracting injection logic and seperating concerns, but any kind of custom injection funtion should not take these super specific parameters that...

I don't have permission to create GitHub issues directly, but here's the issue content for you to create:


Title: Add test suite with refactored XML injection logic

Labels: enhancement, testing

Description:

Summary

Create a comprehensive test suite for XML injection functionality with refactored code to improve testability and maintainability.

Background

During PR review for adding creator biographical information (#[PR_NUMBER]), we identified that the XML injection logic is tightly coupled with the ArcFlow class, making it difficult to test without:

  • Live ArchivesSpace API connection
  • Full application initialization
  • File system dependencies

Goals

  1. Refactor for testability: Extract XML injection logic into pure, testable functions
  2. Create test infrastructure: Set up pytest framework with appropriate configuration
  3. Implement comprehensive tests: Cover XML escaping, formatting, and edge cases
  4. Enable CI/CD: Make tests runnable in automated pipelines

Proposed Refactoring

Design Principle

Create a flexible XML injection system that doesn't tie parameters to specific element types. Instead of:

inject_custom_xml_elements(xml_content, rg_label, sg_label, bioghist_content)  # Too specific

Use a more generic approach:

inject_xml_elements(xml_content, elements: List[XMLElement], insertion_point: str)

Where XMLElement could be a simple data structure:

@dataclass
class XMLElement:
    tag: str
    content: str
    escape: bool = True

Key Functions to Extract

  1. XML element injection - Pure function handling insertion logic
  2. XML escaping - Reusable escaping utility
  3. Bioghist formatting - Separate from API fetching logic

Test Coverage Needed

Critical Tests

  • XML Escaping: Verify special characters (&, <, >, ", ') are properly escaped
  • Readability: Confirm newlines between injected elements
  • Edge Cases: Empty values, unicode, malicious input
  • Integration: Verify correct insertion point (after </did> tag)

Test File Structure

tests/
├── __init__.py
├── test_xml_injection.py
├── test_xml_escaping.py
└── fixtures/
    └── sample_ead.xml

Implementation Checklist

  • Extract XML injection into pure functions
  • Design flexible element injection API
  • Set up pytest infrastructure
  • Add pytest and pytest-cov to requirements.txt
  • Implement unit tests for XML escaping
  • Implement unit tests for formatting/readability
  • Add integration tests with sample EAD fixtures
  • Configure CI/CD to run tests automatically
  • Document testing approach in README

Benefits

  • Fast test execution (no external dependencies)
  • Easy to add new XML element types
  • Regression prevention
  • Improved code maintainability
  • Better separation of concerns

After you create this issue, I can reference it in future work on this topic.

@alexdryden alexdryden marked this pull request as ready for review December 23, 2025 20:49
@alexdryden
Copy link
Contributor

@copilot can you briefly summarize in a comment the changes the pr makes to help support reviews

Copy link
Author

Copilot AI commented Dec 23, 2025

@copilot can you briefly summarize in a comment the changes the pr makes to help support reviews

PR Summary

This PR adds creator biographical information to EAD XML exports by extracting note_bioghist content from ArchivesSpace agent records.

Core Functionality Changes (arcflow/main.py)

  1. New method get_creator_bioghist() - Fetches biographical notes from agents with 'creator' role, extracts content, and formats as XML paragraphs with proper escaping
  2. Updated task_resource() - Adds linked_agents to resolve parameter and calls new bioghist method
  3. Enhanced XML injection - Injects <bioghist> elements after </did> tag alongside existing recordgroup/subgroup elements

Quality Improvements

  • XML escaping: All injected content (recordgroup, subgroup, bioghist) properly escaped using xml.sax.saxutils.escape() to prevent syntax errors
  • Readability: Newlines added between injected XML elements for human readability
  • Code structure: Refactored XML injection logic for better extensibility

Documentation

  • Added .github/copilot-instructions.md - Onboarding guide for GitHub Copilot agents with commit style guidelines

Files Changed

  • arcflow/main.py: +74 lines (new method, XML injection updates, escaping)
  • .github/copilot-instructions.md: +65 lines (new file)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants