Skip to content

feat(Crossref): add CrossrefPIDProvider#2177

Closed
mfenner wants to merge 21 commits intoinveniosoftware:masterfrom
front-matter:crossref_pid_provider
Closed

feat(Crossref): add CrossrefPIDProvider#2177
mfenner wants to merge 21 commits intoinveniosoftware:masterfrom
front-matter:crossref_pid_provider

Conversation

@mfenner
Copy link
Copy Markdown
Contributor

@mfenner mfenner commented Oct 12, 2025

  • register DOIs with Crossref, functionality analogous to DataCitePIDProvider
  • optionally multiple prefixes with per community prefix
  • CrossrefXMLSerializer with metadata export

❤️ Thank you for your contribution!

Description

This pull-request adds Crossref DOI registration support to InvenioRDM. The configuration is very similar to DataCite DOI registration, but multiple prefixes are supported via CROSSREF_PREFIXES. The new PID provider should not interfere with DataCite DOI registration. DOI registration with both DataCite and Crossref has not been tested and needs additional work (decision tree which PID provider is used for a record). Multiple prefixes with one prefix per community are currently supported via a community custom_field.

The Crossref PID provider makes heavy use of the commonmeta-py library, similar to how the datacite library is used in the DataCite PID provider. commonmeta-py already was a dependency, but a newer version (>= 0.183) is required for Crossref DOI registration functionality.

Checklist

Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:

Frontend

Reminder

By using GitHub, you have already agreed to the GitHub’s Terms of Service including that:

  1. You license your contribution under the same terms as the current repository’s license.
  2. You agree that you have the right to license your contribution under the current repository’s license.

* register DOIs with Crossref, functionality analogous to DataCitePIDProvider
* optionally multiple prefixes with per community prefix
* CrossrefXMLSerializer with metadata export
* register DOIs with Crossref, functionality analogous to DataCitePIDProvider
* optionally multiple prefixes with per community prefix
* CrossrefXMLSerializer with metadata export
Comment thread invenio_rdm_records/resources/serializers/crossref/__init__.py Outdated
Comment thread invenio_rdm_records/resources/serializers/crossref/__init__.py
Comment thread invenio_rdm_records/services/pids/providers/crossref.py Outdated
Comment thread invenio_rdm_records/services/pids/providers/crossref.py Outdated
Comment on lines +105 to +110
community_prefix = dig(community.data, "custom_fields.rs:prefix")
if community_prefix and community_prefix in [str(p) for p in prefixes]:
prefix = str(community_prefix)
current_app.logger.debug(
f"Using DOI prefix: {prefix} for community: {default_community}"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we need to discuss this internally (ping @slint) because there is a hidden assumption that this custom_field will exist.
I am wondering if we have a better way to handle this, for example by allowing a different custom field name, or using a config map instead of a custom field (just ideas).
Does this support minting it without a community?

Copy link
Copy Markdown
Contributor Author

@mfenner mfenner Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checks for the custom_field and falls back to the first prefix defined. My assumption was that a future update to invenio-communities would provide the prefix, either as an optional standard field or as a built-in custom field similar to the journal custom field. The current implementation works exactly as DATACITE_PREFIX if only one Crossref prefix is defined.

**options,
)

def dump_obj(self, record):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole function looks not really as a invenio serializer looks like. for me it is to much going on in the dump_obj method.

would it be possible to move this stuff into the CrossrefXMLSchema or a subclass created in rdm-records like class RDMCrossrefXMLSchema(CrossrefXMLSChema):

in my thinking it should not be necessary to override the dump_obj method. it schema.dump should do the magick and return a xml if that is necessary.

Copy link
Copy Markdown
Contributor Author

@mfenner mfenner Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified the dump_obj method, using the write_crossref_xml method from commonmeta-py instead. What makes serialization complicated is that Crossref XML has a different field order depending on content type, therefore the field order has to be changed depending on type before XML serialization.

@mfenner mfenner force-pushed the crossref_pid_provider branch from 899d7f6 to d26656d Compare October 18, 2025 08:20
@mfenner mfenner force-pushed the crossref_pid_provider branch from 4e4fddf to a380b57 Compare October 24, 2025 16:54
@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Oct 28, 2025

@slint what is your recommendation regading allowing prefixes per community? Currently this requires a community custom_field, falling back to one Crossref prefix per repo - the same as the DataCite PID Provider. Should prefix become an optional standard field in invenio-communities? Should this also work with the DataCite PID Provider?

Or should this be handled internally in the Crossrer PID Provider? I could write an optional function for the Crossref PID Provider that determines which prefix to use. And this could be overwritten in invenio.cfg, e.g. if the prefix is determined not by community but some other criterium, e.g. content type. In this case no changes to invenio-communities would be needed, and at a later time similar functionality could be implemented for the DataCite PID Provider.

@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Oct 28, 2025

@ntarocco @slint I looked into how I can determine the prefix based on the default community without code changes to invenio-communities or custom_fields.

In my instance I currently have about 175 blog communities and almost all use a standard prefix that could be provided by CROSSREF_PREFIX. Eight (soon 10) communities use a custom prefix. Instead of providing the prefix as community metadatra, I could also define a dict configuration variable, e.g. CROSSREF_PREFIXES, that maps community slug to prefix and encode the logic in the generate_doi method of the CrossrefClient (where the community lookup currently happens). If users want to have a different logic for multiple prefixes, and/or wanted to do the same for DataCite DOIs, they could overwrite the generate_doi method in the Crossref DOI Provider (and/or DataCite DOI provider).

@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Oct 28, 2025

For the use case of using the DataCite and Crossref (and External) DOI Provider in the same instance, it may make sense to implement the "pick the prefix" logic further upstream, i.e. in the Base Provider generate_id method. Currently the method raises an NotImplementedError as it is handled by the specific DOI Provider, but there could be optional additional logic based on configuration variables.

The generate_doi method in the Crossref DOI Provider could then be as simple as in the DataCite DOI Provider, with the "prefix picked" based in via self.

@tmorrell
Copy link
Copy Markdown
Contributor

I'm thinking the prefix needs to be set at the provider level, and then there needs to be a way to configure what provider gets used. A map from community slug to provider could be a simple initial implementation. For DataCite the prefix gets passed into the client, so I don't think it could just live in the generate_id method without a lot of refactoring.

@ntarocco
Copy link
Copy Markdown
Contributor

ntarocco commented Oct 29, 2025

I would add that it depends if this should be a user choice, or not. In general, once the prefix and provider is configured for your community, it is hardly changed. However, should the community's manager define its own prefix and provider from the web interface?

Given that DOI minting requires a contract with the provider, I would say that users should not be able to configure it.
If we agree with that, I would go with a config mapping as mentioned above.

@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Oct 29, 2025

Thinking more about this, I would suggest to separate out the "prefix per community" into a separate PR, as this is not specific to Crossref but equally could be used with DataCite DOIs. I would then refactor this PR to do DOI assignment basically as in the DataCite DOI Provider. Until this second PR is approved and merged, I can use site-specific code for the prefix by community feature in my instance.

For. the "prefix per community" second PR would would be good to decide beforehand whether a) we first want to update invenio-communities to include optional prefix metadata, b) whether this PR should also cover using both DataCite and Crossref in one instance as this use case is closely related (use DataCite for the instance with the exception of a handful communities that use Crossref).

And I agree with Nico that this doesn't need to be configured in the UI, a COMMUNITY_PREFIXES configuration option that maps communities to prefixes that override the default (ideally using the community slug and not id) would also work

@ntarocco
Copy link
Copy Markdown
Contributor

Before you go ahead, another question: in InvenioRDM, you can move records from one community to the other (I don't mean including them in multiple communities, I mean transfer it from the main community). How do we deal with that? Does it create any issue if the DOI was minted with the community prefix, and then transferred to another community? What happens if the user uploads a second version of the record after the transfer? The record will have a different prefix.
One way to solve this could be the following:

  1. First upload: use the community prefix and provider if defined, or the global one.
  2. Next versions: use the parent/concept DOI prefix and provider, ignoring the community configuration.

a) we first want to update invenio-communities to include optional prefix metadata

If we don't want the user to act on it, is there any advantage to add the prefix as community's metadata? I would only display the configuration in the community's settings, as a greyed out info, disabled, cannot be changed. Just for information (if really interesting to show?)

b) whether this PR should also cover using both DataCite and Crossref in one instance as this use case is closely related

For me, the important part is to keep PRs small, clean, easy to review and understand the impact. There are more chances that it can be merged more quickly.

ideally using the community slug and not id)

Watch out! The community slug can be changed by the user via the web interface. We use the UUID everywhere instead for that reason.

@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Oct 29, 2025

@ntarocco you have convinced me that there really is no strong reason to have prefixes configured in invenio-communities, but that a central config variable mapping id (not slug!) to prefix makes more sense. And that a prefix should never change if a record default community is changed or the record gets a new version. Enough detail to think about that separating this out into a separate PR makes sense.

@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Oct 29, 2025

I have now refactored the pull request and deployed to my production instance.

The prefix by community logic is now implemented in a RDM_COMMUNITY_DOI_PREFIXES config variable, which is optional and a dict that maps community_ids to DOI prefixes. It is implemented in the base class for the PIDProvider, so works for the Crossref and DataCite DOI Provider (and hopefully also for a combination, but not tested). I had to make a very small change to the DataCitePIDProvider, adding a kwargs argument that can pass in the community prefix to the DataCiteClient generate_doi method.

The CrossrefPIDProvider is now simplified and looks very similar to the DataCite Provider, no use of community custom_fields and the prefixes config variable has been renamed to CROSSREF_PREFIX.

I ended up not separating this into another pull request as the code change is minimal (optional new config variable and 8 lines of new code in the base class of the PIDProvider, and we are having a longer discussion on this in this PR already. Plus, this makes it easier for me to run this code in a live instance.

@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Oct 29, 2025

Something to keep in mind in this multiple DOI prefix discussion: while this makes DOI registration more complicated, it is not a cost factor. Prefixes are free with DataCite and Crossref (and they are not charged by the DOI Foundation), so if your use case requires multiple DOI prefixes, go for it.

* enable crossref pid provider in default config
* add crossref_mock_client
@mfenner mfenner force-pushed the crossref_pid_provider branch from 274695b to 08b278a Compare October 29, 2025 18:00
@mfenner mfenner force-pushed the crossref_pid_provider branch 3 times, most recently from fd28b4a to a134ae6 Compare November 3, 2025 12:13
@slint
Copy link
Copy Markdown
Member

slint commented Nov 5, 2025

I'll drop my two cents here:

  1. I think having a CrossRef DOI provider as an alternative option for registering DOIs for InvenioRDM is a good feature. Instances should be able to swap the DataCite DOI provider with it as the default DOI provider, and things should work seamlessly.
  2. Regarding having support for more complex DOI provider and/or DOI prefix selection based on complex criteria (e.g. record community, resource type, etc.), I would either:
    1. implement a base class DynamicDOIProvider that downstream users can configure with a list of DOI providers + their function/logic that handles which of the providers should be used.
    2. not do this at all, but make it clear in code/docs that if one wants to do so, they should do so by implementing a custom DOI provider that handles their logic (to replace the vanilla DataCitePIDProvider

Given the complexity (but also good set of use-cases) of DynamicDOIProvider, I think it would be better to discuss that in a separate PR/issue/discussion.

@mfenner mfenner force-pushed the crossref_pid_provider branch from a134ae6 to b4e3300 Compare November 10, 2025 11:31
@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Nov 10, 2025

@slint thank you for your feedback. I have removed all code that enables multiple DOI prefixes, and will implement this in a separate PR. The Crossref DOI Provider is now a drop-in replacement for the DataCite DOI Provider (with the limitation that Crossref only supports textual resource types).

@mfenner mfenner force-pushed the crossref_pid_provider branch from b4e3300 to 05a0f83 Compare November 13, 2025 17:31
@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Nov 13, 2025

I now have implemented a CustomPIDProvider class (currently in invenio.cfg) as a subclass of the CrossrefPIDProvider that overrides the generate_id method to support community-specific DOI prefixes. This should work the same way to support community-specific DOI prefixes with the DataCitePIDProvider.

@mfenner mfenner force-pushed the crossref_pid_provider branch 6 times, most recently from 3aaf5e3 to cafce5b Compare November 30, 2025 18:02
@mfenner mfenner force-pushed the crossref_pid_provider branch from cafce5b to 9ef5c19 Compare November 30, 2025 18:25
@utnapischtim utnapischtim added this to v14 Dec 19, 2025
@utnapischtim utnapischtim moved this to 👀 In review in v14 Dec 19, 2025
@github-actions
Copy link
Copy Markdown
Contributor

This PR was automatically marked as stale.

@github-actions github-actions bot added the stale label Jan 30, 2026
@tmorrell
Copy link
Copy Markdown
Contributor

tmorrell commented Feb 2, 2026

This needs a rebase. And we should really move this forward since it's now a fairly straightforward addition of CrossRef in a way that doesn't impact the rest of RDM.

@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Feb 2, 2026

I can try that rebase tomorrow.

@github-actions github-actions bot removed the stale label Feb 3, 2026
@mfenner
Copy link
Copy Markdown
Contributor Author

mfenner commented Feb 3, 2026

Closing this PR in favor of #2249 as too many code changes have happened since originally opening the PR.

@mfenner mfenner closed this Feb 3, 2026
@github-project-automation github-project-automation bot moved this from 👀 In review to To release 🤖 in v14 Feb 3, 2026
@mfenner mfenner mentioned this pull request Feb 3, 2026
10 tasks
@mfenner mfenner deleted the crossref_pid_provider branch February 27, 2026 06:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: To release 🤖

Development

Successfully merging this pull request may close these issues.

5 participants