Annotated Dataset and NER task with Prodigy

Hi @sudarshan85!

Sorry for the delay. We're trying to close out old tickets.

By default, ner.correct does sentence segmentation (unlike ner.manual). You can turn it off by adding --unsegmented.

That's tough to confirm. Let me go through a reproducible example of what should happen.

Start with this source file:
nyt_text_dedup.jsonl (18.5 KB)

Step 1: Label 10 records into dataset ner_correct1

python -m prodigy ner.correct ner_correct1 en_core_web_sm nyt_text_dedup.jsonl --label LOC

I then labeled the first 10 records. You can see them by running:

python -m prodigy print-dataset ner_correct1

Step 2: Rerun but use --exclude to exclude records in ner_correct1

python -m prodigy ner.correct ner_correct2 en_core_web_sm nyt_text_dedup.jsonl --exclude ner_correct1 --label LOC

Notice that annotation now starts on record 10 (see the metadata in the bottom right), which means the first 10 records were skipped.
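If it helps to picture what the exclude step is doing, here is a rough sketch in plain Python. It is only an illustration and not Prodigy's actual implementation: `text_hash` and `filter_excluded` are hypothetical helpers I made up, the hash is a simple digest of the text rather than Prodigy's own hashing, and it ignores sentence segmentation.

```python
import hashlib


def text_hash(record):
    """Hypothetical stand-in for a task hash: a digest of the record's text."""
    return hashlib.sha1(record["text"].encode("utf-8")).hexdigest()


def filter_excluded(stream, annotated):
    """Yield only the records whose hash is not in the already-annotated set."""
    seen = {text_hash(record) for record in annotated}
    for record in stream:
        if text_hash(record) not in seen:
            yield record


# Made-up data: 15 source records, the first 10 of which were already annotated.
source = [{"text": f"Article {i}"} for i in range(15)]
annotated = source[:10]

remaining = list(filter_excluded(source, annotated))
print(remaining[0]["text"])  # Article 10
print(len(remaining))        # 5
```

The point is just that records already present in the excluded dataset are dropped from the stream, so the first one you see is the first unannotated one.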

Yes. ner.manual and ner.correct present examples in the order in which the documents appear in the source. This is different from ner.teach, which uses active learning and reorders the documents based on uncertainty scoring.
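To make the difference concrete, here is a toy comparison in plain Python. Treat it as a simplified sketch under my own assumptions: the scores are invented, `sequential_order` and `uncertainty_order` are hypothetical names, and the real active learning in ner.teach works on a stream rather than sorting everything up front.

```python
def sequential_order(scored_docs):
    """Keep documents in source order, as ner.manual and ner.correct do."""
    return [text for _score, text in scored_docs]


def uncertainty_order(scored_docs):
    """Put documents whose score is closest to 0.5 (most uncertain) first."""
    ranked = sorted(scored_docs, key=lambda pair: abs(pair[0] - 0.5))
    return [text for _score, text in ranked]


# Invented (model score, document) pairs in their source order.
scored_docs = [(0.95, "doc A"), (0.52, "doc B"), (0.10, "doc C"), (0.40, "doc D")]

print(sequential_order(scored_docs))   # ['doc A', 'doc B', 'doc C', 'doc D']
print(uncertainty_order(scored_docs))  # ['doc B', 'doc D', 'doc C', 'doc A']
```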

Let us know if you have any other questions!