I am going to create a new corpus for a NER model that will recognize organizations.
I have a database with many (500k+) organization names, so basically I have created the corresponding patterns to label them via Prodigy.
Now, I have a problem.
Let’s suppose one company name is ACME and another is ACME DIGITAL.
My doubt is: should I ACCEPT the “ACME” suggestion if the sentence I am labelling contains ACME DIGITAL?
ACME is a company, but in that context I should also keep “DIGITAL”.
So my questions are two:
- If I accept ACME, will Prodigy never ask me again to label “ACME DIGITAL” in other sentences?
- If I reject it, will Prodigy never ask me to label companies containing the word “ACME”?
The annotations you’re collecting always refer to this particular context. So if the sentence is “He works at ACME DIGITAL” and Prodigy suggests “ACME” as the ORG entity, you should always reject that. Otherwise, the feedback the model is getting is: “Yes, in this particular context, ‘ACME’ is U-ORG and all other conflicting spans (including ‘ACME DIGITAL’) are definitely incorrect.” This is not what you want.

By rejecting, you’re telling the model: “In this context, ‘ACME’ is not a U-ORG entity. We know this. We don’t know anything about the other possible analyses – it could be B-ORG and ‘DIGITAL’ could be L-ORG. So try again!”
I actually have a segment about this exact question in my tips and tricks video:
Thank you @ines ! I am going to watch the video.
However, is it correct to convert 500k organizations into patterns? Will it cause problems? (Too many patterns?)
Ah, I missed that part. 500k is probably too much, yes. Do you have any frequency information about those organisations? Like, which of those are actually common and relevant, and which are not?
For the patterns, you want to be focusing on names and examples that are more common, so you have a higher chance of finding those in context and annotating as many as possible. So maybe you’d want a few hundred or thousand patterns for the most common organisations.
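Assuming you have (or can compute) mention counts for each name, a minimal sketch of keeping only the top names and writing them out in Prodigy’s match-pattern JSONL format might look like this (the `org_counts` data and file name are made up):

```python
import json
from collections import Counter

# Hypothetical frequency data: organisation name -> mentions in your corpus
org_counts = Counter({"ACME": 1200, "ACME DIGITAL": 340, "Tiny Shop LLC": 2})

TOP_N = 1000  # keep only the most common names as patterns

with open("org_patterns.jsonl", "w", encoding="utf8") as f:
    for name, _freq in org_counts.most_common(TOP_N):
        # Token-based pattern: one {"lower": ...} entry per whitespace token
        pattern = [{"lower": tok.lower()} for tok in name.split()]
        f.write(json.dumps({"label": "ORG", "pattern": pattern}) + "\n")
```

Splitting on whitespace is a simplification – for names with punctuation you’d want to tokenize with the same tokenizer your pipeline uses, so the pattern tokens line up with spaCy’s.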
Could another approach be to set the entities in the .jsonl directly and then confirm them via Prodigy?
Could it work? Basically, pre-label the sentences before processing them via Prodigy… and then just accept or reject.
Maybe a custom recipe?
@damiano Yes, that’d work, too! This is pretty much what the ner.match recipe does – only that it generates the matches in the recipe and not in a pre-processing step. But if you have your own matching logic set up, pre-processing is actually pretty good, because you can do it once and then use one static file with the mark recipe to collect accept/reject annotations.
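As a rough sketch of that pre-processing step (the sentences, name list, and file name here are invented – a real version would use your own matching logic, e.g. spaCy’s PhraseMatcher), each match becomes a task with a pre-filled "spans" list using character offsets:

```python
import json

sentences = ["He works at ACME DIGITAL", "ACME opened a new office"]
org_names = ["ACME DIGITAL", "ACME"]  # longest names first, so the full name wins

tasks = []
for text in sentences:
    spans = []
    for name in org_names:
        start = text.find(name)
        # Skip matches that overlap a span we already added (e.g. "ACME"
        # inside "ACME DIGITAL"):
        if start != -1 and not any(s["start"] <= start < s["end"] for s in spans):
            spans.append({"start": start, "end": start + len(name), "label": "ORG"})
    tasks.append({"text": text, "spans": spans})

# One task per line, ready to load as a static JSONL file
with open("prelabeled.jsonl", "w", encoding="utf8") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")
```

Because the overlap check runs longest-first, “He works at ACME DIGITAL” gets a single ACME DIGITAL span rather than a conflicting ACME one.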
Thank you so much @ines !