I have used Prodigy to annotate and train a Named Entity Recognition model.
Now I want to use that dataset as input to annotate and train a custom Relation Extraction model on top of it (a mix of RelEx and Coref).
So I use rel.manual myRelDataset dataset:myNERDataset --label myLabels, which I have tried and which works as intended. I see the data with the correct Named Entities, so all I need to do is annotate the relationships between them where needed (no need to use --span-label).
Before moving forward, I decided to do some processing on the data to facilitate the annotation.
First, I assume that since I only want relationships between Named Entities, it is safe to create a new dataset from the original one by removing the entries that have no NER labels. I have implemented this, knowing it will speed up annotation significantly, but I wanted to ask whether there are any dangers in doing this that I have not thought of.
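A minimal sketch of such a filter, assuming each entry is a dict in Prodigy's task format with an optional "spans" list. The helper names has_entities and keep_annotated are hypothetical, and the sample entries are made up:

```python
def has_entities(example):
    """Return True if the entry carries at least one annotated span."""
    return bool(example.get("spans"))


def keep_annotated(examples):
    """Keep only the entries that contain at least one NER span."""
    return [eg for eg in examples if has_entities(eg)]


# Made-up entries for illustration only.
sample = [
    {"text": "NE1 relates to NE2",
     "spans": [{"start": 0, "end": 3, "label": "NE"},
               {"start": 15, "end": 18, "label": "NE"}]},
    {"text": "Nothing annotated here", "spans": []},
    {"text": "No spans key at all"},
]

filtered = keep_annotated(sample)
```

Entries with an empty or missing "spans" list are dropped; everything else is passed through unchanged.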
Second, after investigating my data further, I found that in most cases I can ignore text written between quotes. Ideally I should have noticed this before starting any annotation, but it would help greatly if I could salvage the existing work at this point. I wrote a Python script that uses a regex to replace all text inside quotes with "...". This greatly decreases the length of some entries. My main problem is that if I import my NER dataset and apply that script to each entry, I only shorten the text, without updating the spans and tokens.
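For illustration, a replacement of this kind could look like the sketch below. The pattern and the function name collapse_quotes are assumptions (the actual script is not reproduced here), and the pattern only handles straight double quotes:

```python
import re

# Assumed pattern: a straight double quote, any non-quote characters, a closing quote.
QUOTED = re.compile(r'"[^"]*"')


def collapse_quotes(text):
    """Replace every quoted passage with the placeholder "..." (quotes included)."""
    return QUOTED.sub('"..."', text)


shortened = collapse_quotes('described as "long text" and relates to NE2')
```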
To make this clear, consider the following made-up example.
For the text: This text has a NE1 that is described as "long text" and relates to NE2,
I have already annotated NE1 and NE2 as NE. Next I need to annotate the relationship between them, but before that I want to replace "long text" with "...". I import the db, get the entries and apply my regex script to their text.
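from prodigy.components.db import connect
db = connect()  # connect to the Prodigy database
examples = db.get_dataset("myNERDataset")
for eg in examples:
    eg['text'] = myRegex(eg['text'])  # only the text changes; spans and tokens keep their old offsets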
Then the sentence becomes: This text has a NE1 that is described as "..." and relates to NE2, but the 'start', 'end' and 'id' of each token are not updated. This affects all the spans, and especially the labels, so I need to address it. How would you suggest solving this?
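To make the problem concrete, here is a rough sketch of one possible direction: do the replacement and shift the character offsets of the spans in the same pass. It assumes spans are dicts with "start", "end" and "label" keys, as in the made-up example above. The function name and the pattern are hypothetical. Spans that overlap a replaced passage are dropped, and the stale "tokens", "token_start" and "token_end" values are removed on the assumption that the text will be re-tokenized afterwards, which is not shown here.

```python
import re

QUOTED = re.compile(r'"[^"]*"')  # assumed pattern, straight double quotes only
PLACEHOLDER = '"..."'


def collapse_quotes_with_spans(example):
    """Replace quoted text with a placeholder and shift span offsets to match."""
    text = example["text"]
    pieces, edits = [], []  # edits: (old_start, old_end, cumulative_delta)
    last = 0
    delta = 0
    for match in QUOTED.finditer(text):
        pieces.append(text[last:match.start()])
        pieces.append(PLACEHOLDER)
        delta += len(PLACEHOLDER) - (match.end() - match.start())
        edits.append((match.start(), match.end(), delta))
        last = match.end()
    pieces.append(text[last:])

    new_spans = []
    for span in example.get("spans", []):
        shift, overlaps = 0, False
        for old_start, old_end, delta_after in edits:
            if span["end"] <= old_start:
                break  # edits are ordered, so later ones cannot affect this span
            if span["start"] >= old_end:
                shift = delta_after  # span lies entirely after this edit
            else:
                overlaps = True  # span touches replaced text
                break
        if overlaps:
            continue  # drop spans inside a replaced passage
        # Token indices are stale after the edit, so they are not copied over.
        new_span = {k: v for k, v in span.items()
                    if k not in ("token_start", "token_end")}
        new_span["start"] = span["start"] + shift
        new_span["end"] = span["end"] + shift
        new_spans.append(new_span)

    # Old tokens no longer match the text, so they are left out of the result.
    result = {k: v for k, v in example.items() if k != "tokens"}
    result["text"] = "".join(pieces)
    result["spans"] = new_spans
    return result


text = 'This text has a NE1 that is described as "long text" and relates to NE2'
example = {
    "text": text,
    "spans": [
        {"start": text.index("NE1"), "end": text.index("NE1") + 3, "label": "NE"},
        {"start": text.index("NE2"), "end": text.index("NE2") + 3, "label": "NE"},
    ],
}
updated = collapse_quotes_with_spans(example)
for span in updated["spans"]:
    print(span["label"], updated["text"][span["start"]:span["end"]])
```

In this sketch the character offsets of NE1 and NE2 still point at the right substrings after the replacement, but the tokens would have to be regenerated, and I assume any stored hashes that depend on the text would need recomputing as well.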
I hope I have made my case clear, and I look forward to your answer!
Thank you as always for your time!