ner.teach does not exclude dataset even after '--exclude'


(Arulmozhi Palanisamy) #1

I am trying to label another round of data with ner.teach, on top of my existing training dataset. I already have one set annotated in the dataset “training_1” (silver). My input file is a CSV with a lot of text, part of which was used as input for “training_1” in the first round. When I run the command below with these arguments, I expect Prodigy to only show text that is not in ‘training_1’. But in the interface, I am getting text that was already labeled in the ‘training_1’ dataset.

prodigy ner.teach training_2 trained_models_spacy long_text_train.csv --label Labels.txt --patterns Prodigy_Patterns.jsonl --exclude training_1

I don't know why this is not working. Should I match and drop the already tagged text before passing the CSV as input?

(Ines Montani) #2

Hi! Are you sure the examples you’re seeing are actually the same examples? So, the same span suggestion on the same text? The thing is, the exclude mechanism will only look at identical examples to ensure that you’re never annotating the same question twice. But if there’s a different question on the same text – for example, with a different entity span suggested – you’ll still get to see that example, because it’s a different question.
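To make the distinction concrete, here's a minimal sketch of the idea in plain Python. It doesn't use Prodigy's own hashing: the `input_hash` and `task_hash` functions, the SHA-1 choice and the example sentence are all assumptions made up for illustration. The point is only that two questions about the same text share an input-level hash but differ at the question level.

```python
import hashlib
import json


def input_hash(task):
    """Hypothetical stand-in: hash only the raw input (here, the text)."""
    return hashlib.sha1(task["text"].encode("utf-8")).hexdigest()


def task_hash(task):
    """Hypothetical stand-in: hash the text plus the suggested spans,
    i.e. the actual question being asked."""
    payload = json.dumps(
        {"text": task["text"], "spans": task.get("spans", [])}, sort_keys=True
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


text = "Apple opened a new office in Berlin."
# Same text, two different entity suggestions
first = {"text": text, "spans": [{"start": 0, "end": 5, "label": "ORG"}]}
second = {"text": text, "spans": [{"start": 29, "end": 35, "label": "GPE"}]}

same_input = input_hash(first) == input_hash(second)
same_question = task_hash(first) == task_hash(second)
print(same_input, same_question)  # True False
```

An exclude check that compares questions treats `first` and `second` as different, so the second one is still shown even though the text was annotated before.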

If you don’t want to include examples with texts you’ve already annotated something on, you could write your own stream filter that gets the input hashes from the dataset and only sends out examples with a different input hash. For details, check out the docs on the filter_inputs helper in your PRODIGY_README.html.
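As a rough sketch of what such a stream filter does, here's a plain-Python version. It's a stand-in under stated assumptions, not Prodigy's implementation: `text_hash` and `filter_seen_texts` are hypothetical names, and the hashes are computed from the text directly instead of being read from the dataset. With Prodigy itself, you'd take the input hashes from the dataset and use the filter_inputs helper as described in the README.

```python
import hashlib


def text_hash(text):
    """Hypothetical input hash: depends only on the text."""
    return hashlib.sha1(text.strip().encode("utf-8")).hexdigest()


def filter_seen_texts(stream, seen_hashes):
    """Yield only examples whose text hash is not in seen_hashes."""
    seen = set(seen_hashes)
    for eg in stream:
        if text_hash(eg["text"]) not in seen:
            yield eg


# Texts that were already annotated in an earlier round (made-up data)
annotated = [{"text": "First sentence."}, {"text": "Second sentence."}]
seen_hashes = {text_hash(eg["text"]) for eg in annotated}

# New input stream that overlaps with the earlier round
incoming = [{"text": "First sentence."}, {"text": "Third sentence."}]
remaining = list(filter_seen_texts(incoming, seen_hashes))
print(remaining)
```

Because the filter is a generator, it can wrap the stream lazily and doesn't need to load the whole input file up front.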

(Arulmozhi Palanisamy) #3

Ohh, ok. The entity suggestions are different now. But my original corpus (from which I got the model in the loop) was already manually corrected using ner.teach. Now I am using the improved model in the loop. Would you suggest adding more raw text that excludes these texts, or looping over the same text again and correcting the different labels that are suggested?

(Ines Montani) #4

I guess it depends on how much raw data you have! If you have a lot more raw text, then yes, maybe you can try showing it something else instead of looping over the same data again. If you only have limited examples, then I’d say it’s okay to start at the beginning again. When you first annotated the data, you also didn’t get to see every example. The active learning will skip examples and only show you the most relevant ones. When you loop over it again with a different model, you might also see different suggestions.
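Here's a simplified illustration of why a second pass can surface different examples. This is not Prodigy's actual sorter: the `prefer_uncertain_sketch` function, the fixed margin around 0.5 and the scores are all made up for the example. It only shows the general idea that examples the model is unsure about get through and the rest are skipped.

```python
def prefer_uncertain_sketch(scored_stream, margin=0.2):
    """Yield examples whose score lies within `margin` of 0.5.
    Hypothetical simplification of uncertainty-based selection."""
    for score, eg in scored_stream:
        if abs(score - 0.5) <= margin:
            yield eg


texts = ["text A", "text B", "text C", "text D"]

# Made-up scores from an earlier model and from an improved model
old_scores = [0.95, 0.55, 0.10, 0.40]
new_scores = [0.60, 0.90, 0.45, 0.05]

shown_first_pass = list(prefer_uncertain_sketch(zip(old_scores, texts)))
shown_second_pass = list(prefer_uncertain_sketch(zip(new_scores, texts)))
print(shown_first_pass)   # ['text B', 'text D']
print(shown_second_pass)  # ['text A', 'text C']
```

The same four texts produce a different selection once the scores change, which is why looping over the same data with an updated model isn't just a repeat of the first round.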

(Arulmozhi Palanisamy) #5

Thank you. I have a good amount of raw data. I will exclude these texts and give it new text.