Processing annotated data


I have used Prodigy to annotate and train a Named Entity Recognition model.
Now, I want to use that dataset as the input data for annotating and training a custom Relation Extraction model on top of it (a mix of RelEx and Coref).
So I run rel.manual myRelDataset dataset:myNERDataset --label myLabels (which I have tried and works as intended). I see the data with the correct Named Entities, so all I need to do is annotate the relationships between them, where needed (no need to use --span-label).

Before moving forward, I decided to do some processing on the data to facilitate the annotation.
First, I assume that since I only want relationships between Named Entities, it is safe to create a new dataset from the original one by removing the entries with no NER labels. I have implemented this, knowing it will speed up the annotation process significantly, but I wanted to ask if there are any dangers in doing this that I have not thought of.

Second, after investigating my data further, I found that in most cases I can ignore text written between quotes. I know that ideally I should have noticed this before starting any kind of annotation, but it would help greatly if I could salvage it, at least at this point. I wrote a Python script that uses a RegEx to replace all text inside quotes with "...". This will greatly decrease the length of some entries. My main problem here is that if I import my NER dataset and apply that script to each entry, I only shorten the text, without updating the spans and tokens.

To make it clear, I use the following custom example.

For the text: This text has a NE1 that is described as "long text" and relates to NE2,
I have already annotated NE1 and NE2 as NE. Then I need to annotate the relationship between them, but before that, I want to replace the "long text" with "...". I import the db, get the entries and apply my regex script to their text.

examples = db.get_dataset("myNERDataset")
for eg in examples:
    # replace quoted text with "..." (spans/tokens are not updated!)
    eg["text"] = re.sub(r'"[^"]*"', '"..."', eg["text"])

Then the sentence becomes: This text has a NE1 that is described as "..." and relates to NE2, but the 'start', 'end' and 'id' of each token are not updated. This will affect all the spans and especially the labels, so I need to address this issue. How would you suggest solving this?

I hope I made my case clear and I look forward to your answer!
Thank you as always for your time!

Hi! You typically wouldn't want to remove examples with no entities because it's important for your model to also learn about texts that contain no entities. Otherwise, your model can end up struggling with those examples at runtime because it's never seen them during training. That said, you can filter them out and add them to the new dataset automatically so you're only manually annotating examples that contain entities.
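That automatic filtering can be sketched as a small helper. The filtering logic itself is just a list comprehension over the example dicts; the Prodigy DB calls and the dataset names in the usage comments are illustrative, not a verbatim recipe:

```python
def keep_annotated(examples):
    """Return only the examples that contain at least one entity span."""
    return [eg for eg in examples if eg.get("spans")]

# Usage with Prodigy's database (dataset names are placeholders):
#   from prodigy.components.db import connect
#   db = connect()
#   filtered = keep_annotated(db.get_dataset("myNERDataset"))
#   db.add_dataset("myNERDataset_filtered")
#   db.add_examples(filtered, datasets=["myNERDataset_filtered"])
```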

If you modify the text after annotation, the token indices and span annotations will be out of sync, and there's no easy way around that. If you're using regular expressions for the replacement, you can match them on your text and get the start/end character offsets of each replacement. You can then write a script that replaces the now-removed tokens with a token for ..., adjusts all following tokens and spans by subtracting len(replaced_text) - len("...") from their start and end offsets, and updates the token IDs on the tokens and spans accordingly.
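A minimal sketch of that adjustment, assuming the example dicts have the usual "text"/"tokens"/"spans" keys with character offsets, and assuming your entity spans never overlap a quoted region (tokens that do overlap one are simply dropped):

```python
import re

QUOTED = re.compile(r'"[^"]*"')
PLACEHOLDER = '"..."'

def redact_quotes(eg):
    """Replace each quoted region with "..." and shift the character
    offsets and token IDs of the tokens and spans to match."""
    text = eg["text"]
    tokens = [dict(t) for t in eg["tokens"]]
    spans = [dict(s) for s in eg.get("spans", [])]
    # Go right-to-left so earlier character offsets stay valid
    for m in reversed(list(QUOTED.finditer(text))):
        start, end = m.span()
        delta = (end - start) - len(PLACEHOLDER)
        text = text[:start] + PLACEHOLDER + text[end:]
        kept = []
        for tok in tokens:
            if tok["start"] >= end:      # after the quote: shift left
                tok["start"] -= delta
                tok["end"] -= delta
                kept.append(tok)
            elif tok["end"] <= start:    # before the quote: untouched
                kept.append(tok)
            # tokens inside (or overlapping) the quote are dropped
        kept.append({"text": PLACEHOLDER, "start": start,
                     "end": start + len(PLACEHOLDER)})
        kept.sort(key=lambda t: t["start"])
        tokens = kept
        for span in spans:               # shift spans after the quote
            if span["start"] >= end:
                span["start"] -= delta
                span["end"] -= delta
    # Reassign token IDs and re-link spans via character offsets
    for i, tok in enumerate(tokens):
        tok["id"] = i
    start_to_id = {tok["start"]: tok["id"] for tok in tokens}
    end_to_id = {tok["end"]: tok["id"] for tok in tokens}
    for span in spans:
        span["token_start"] = start_to_id[span["start"]]
        span["token_end"] = end_to_id[span["end"]]
    return {**eg, "text": text, "tokens": tokens, "spans": spans}
```

On the example sentence above, this turns "long text" into "..." as a single token and moves the NE2 span and its token IDs left by the length difference, so the annotations still line up with the shortened text.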
