I am going to create a new corpus for a NER model that will recognize organizations.
I have a database with many (500k+) organization names, so basically I have created the corresponding patterns to label them via Prodigy.
Now, I have a problem.
Let’s suppose one company name is ACME and another is ACME DIGITAL.
My doubt is: should I ACCEPT the “ACME” suggestion if the sentence I am labelling contains ACME DIGITAL?
ACME is a company, but in that context I should also keep “DIGITAL”.
So my questions are two:
- If I accept ACME, will Prodigy never ask me again to label “ACME DIGITAL” in other sentences?
- If I reject it, will Prodigy never ask me to label companies containing the word “ACME”?
The annotations you’re collecting always refer to this particular context. So if the sentence is “He works at ACME DIGITAL” and Prodigy suggests “ACME” as the ORG entity, you should always reject that. Otherwise, the feedback the model is getting is: “Yes, in this particular context, ‘ACME’ is U-ORG and all other conflicting spans (including ‘ACME DIGITAL’) are definitely incorrect.” This is not what you want.

By rejecting, you’re telling the model: “In this context, ‘ACME’ is not a U-ORG entity. We know this. We don’t know anything about the other possible analyses – it could be B-ORG and ‘DIGITAL’ could be L-ORG. So try again!”
I actually have a segment about this exact question in my tips and tricks video:
Thank you @ines ! I am going to watch the video.
However, is it correct to convert 500k organizations into patterns? Will it cause problems? (Too many patterns?)
Ah, I missed that part. 500k is probably too much, yes. Do you have any frequency information about those organisations? Like, which of those are actually common and relevant, and which are not?
For the patterns, you want to be focusing on names and examples that are more common, so you have a higher chance of finding those in context and annotating as many as possible. So maybe you’d want a few hundred or thousand patterns for the most common organisations.
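Assuming you have (or can compute) mention counts for each name, a minimal sketch of keeping only the top names and writing them out in Prodigy’s match-pattern JSONL format might look like this (the `org_counts` data and file name are made up):

```python
import json
from collections import Counter

# Hypothetical frequency data: organisation name -> mentions in your corpus
org_counts = Counter({"ACME": 1200, "ACME DIGITAL": 340, "Tiny Shop LLC": 2})

TOP_N = 1000  # keep only the most common names as patterns

with open("org_patterns.jsonl", "w", encoding="utf8") as f:
    for name, _freq in org_counts.most_common(TOP_N):
        # Token-based pattern: one {"lower": ...} entry per whitespace token
        pattern = [{"lower": tok.lower()} for tok in name.split()]
        f.write(json.dumps({"label": "ORG", "pattern": pattern}) + "\n")
```

Splitting on whitespace is a simplification – for names with punctuation you’d want to tokenize with the same tokenizer your pipeline uses, so the pattern tokens line up with spaCy’s.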
Could another approach be to set the entities in the .jsonl directly and then confirm them via Prodigy?
Could it work? Basically, pre-label the sentences before processing them via Prodigy… and then just accept or reject.
Maybe a custom recipe?
@damiano Yes, that’d work, too! This is pretty much what the ner.match recipe does – only that it generates the matches in the recipe and not in a pre-processing step. But if you have your own matching logic set up, pre-processing is actually pretty good, because you can do it once and then use one static file with the mark recipe to collect accept/reject annotations.
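As a rough sketch of that pre-processing step (the sentences, name list, and file name here are invented – a real version would use your own matching logic, e.g. spaCy’s PhraseMatcher), each match becomes a task with a pre-filled "spans" list using character offsets:

```python
import json

sentences = ["He works at ACME DIGITAL", "ACME opened a new office"]
org_names = ["ACME DIGITAL", "ACME"]  # longest names first, so the full name wins

tasks = []
for text in sentences:
    spans = []
    for name in org_names:
        start = text.find(name)
        # Skip matches that overlap a span we already added (e.g. "ACME"
        # inside "ACME DIGITAL"):
        if start != -1 and not any(s["start"] <= start < s["end"] for s in spans):
            spans.append({"start": start, "end": start + len(name), "label": "ORG"})
    tasks.append({"text": text, "spans": spans})

# One task per line, ready to load as a static JSONL file
with open("prelabeled.jsonl", "w", encoding="utf8") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")
```

Because the overlap check runs longest-first, “He works at ACME DIGITAL” gets a single ACME DIGITAL span rather than a conflicting ACME one.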
Thank you so much @ines !