prodigy data-to-spacy - retain metadata information

Hi,

I'm converting my Prodigy annotation output into spaCy's JSON format. Is there a way of pulling through any of the 'meta' information, please?

Thanks

Anna

Hi! This is an interesting idea! We should probably at least store the hashes of the merged examples somewhere with each example in the data, which would allow you to look them up afterwards and fetch any meta information from the examples. This is a bit more flexible because we don't need any restrictions around where the metadata should be.

In spaCy v3, the training data are regular Doc objects, so we could then store the hashes in the user data and/or Prodigy-specific extension attributes.
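To make the lookup idea concrete, here's a minimal sketch in plain Python. It assumes you still have the original Prodigy examples as dictionaries and that each one carries a `_task_hash` and a `"meta"` dict. The helper name `build_meta_index` and the sample data are made up for illustration, and this only shows the lookup side, not anything the converter does today.

```python
def build_meta_index(examples):
    """Map each example's _task_hash to its "meta" dict (hypothetical helper)."""
    return {
        eg["_task_hash"]: eg.get("meta", {})
        for eg in examples
        if "_task_hash" in eg
    }


# Made-up examples in the shape Prodigy stores them (the values are assumptions).
examples = [
    {"text": "First doc", "_input_hash": 1, "_task_hash": 11, "meta": {"title": "Report A"}},
    {"text": "Second doc", "_input_hash": 2, "_task_hash": 22, "meta": {"title": "Report B"}},
]

meta_by_hash = build_meta_index(examples)

# Given a hash stored alongside a converted example, fetch its metadata.
print(meta_by_hash[22]["title"])
```

If the hashes were kept with each converted example, a dictionary like this would be all you need to get back to the original metadata.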

Btw what's your end goal and what are you looking to do with the meta information later on?

Hi Ines,

I have annotated data from a different source which is in a very challenging format. I’m combining it with more annotations done in Prodigy but need to dedupe by document title.

Sounds like the easiest option may be to do the deduping before I convert into spaCy format, while I still have the metadata.

Thanks

Anna

Ah, yes, in that case, it sounds like the easiest solution would be to do the deduplication before the conversion and save the result to a new dataset. (You could even do this programmatically via Prodigy's Database API.)
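Here's a rough sketch of what that could look like. It assumes the document title lives under `meta["title"]` in each example, which may not match your data, and the function name `dedupe_by_title` and the dataset names are placeholders. The Prodigy calls are left as comments so the deduplication logic runs on its own; the exact method names can differ between Prodigy versions, so check them against the Database API docs for your install.

```python
def dedupe_by_title(examples):
    """Keep the first example seen for each title (assumes meta["title"])."""
    seen = set()
    unique = []
    for eg in examples:
        title = (eg.get("meta") or {}).get("title")
        if title is None:
            # No title to compare on, so keep the example as-is.
            unique.append(eg)
            continue
        key = title.strip().lower()  # ignore case and surrounding whitespace
        if key in seen:
            continue
        seen.add(key)
        unique.append(eg)
    return unique


# Made-up examples for illustration.
sample = [
    {"text": "a", "meta": {"title": "Annual Report"}},
    {"text": "b", "meta": {"title": "annual report "}},
    {"text": "c", "meta": {"title": "Press Release"}},
    {"text": "d"},
]
deduped = dedupe_by_title(sample)
print([eg["text"] for eg in deduped])

# With Prodigy installed, something along these lines should work
# (dataset names are placeholders):
#
# from prodigy.components.db import connect
# db = connect()
# examples = db.get_dataset("my_annotations")
# db.add_dataset("my_annotations_deduped")
# db.add_examples(dedupe_by_title(examples), datasets=["my_annotations_deduped"])
```

You can then run data-to-spacy on the new dataset instead of the original one.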