Hello,
I’m a spaCy + Prodigy newbie trying to get my head around some basic concepts, as well as to apply some of the answers here to my current project. This support forum has been a great help so far, but I would appreciate a little boost/correction, because I’m under the impression that I might be overcomplicating things.
My task is very similar to the one discussed here (NER document Labeling).
TL;DR: company imprint HTML pages as input -> NER for entities like company name, address, phone number and so on.
What (I think) I know, so please correct me:
- For this specific problem, using a blank model is advised (this thread helped me a lot; it explains the approach for Russian, while I need it for German: Trying to teach NER from blank model for Russian language). My attempt at creating such a model is the first sketch after this list.
- I have no annotated training data as a base, but I can use Prodigy with ner.manual to generate some. Afterwards I use this annotation data to batch-train and save the model. Then I can use this model with ner.teach (maybe constrained to only one of my labels at a time, right?) to expand my training data a bit faster, because I only have to make binary decisions.
- Prodigy expects JSONL as an input format. As mentioned in the cited post, I could plug in each line (separated by '\n') as a separate entry, using a form like {"text": "XXXXXXX", "meta": {"source": "company1.txt"}}; how I would generate this file is the second sketch after this list.
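To check my understanding of the first point, here is roughly how I would create the blank German starting model (spaCy v2-style API; the label names are just placeholders for my own scheme):

```python
import spacy

# Empty German pipeline: tokenizer only, no trained components.
nlp = spacy.blank("de")

# Add a fresh, untrained NER component and register the labels
# I plan to annotate (placeholder names, to be adjusted).
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for label in ("COMPANY", "ADDRESS", "PHONE"):
    ner.add_label(label)

# Save to disk so Prodigy can load it as the base model.
nlp.to_disk("./blank_de_model")
```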
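And for the third point, generating the JSONL could look something like this (the folder name and file layout are made up; one entry per non-empty line, as in the cited post):

```python
import json
from pathlib import Path

# One JSONL entry per non-empty line of the cleaned text files.
with open("imprints.jsonl", "w", encoding="utf8") as out:
    for path in Path("cleaned_imprints").glob("*.txt"):  # hypothetical folder
        for line in path.read_text(encoding="utf8").splitlines():
            line = line.strip()
            if line:
                entry = {"text": line, "meta": {"source": path.name}}
                out.write(json.dumps(entry, ensure_ascii=False) + "\n")
```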
What remains rather unclear to me:
- I can clean the HTML file in a way that leaves several interesting paragraphs. For the whole document, imagine something like:
\n
Some unimportant stuff\n
Stuff\n
\n
\n
\n
Companyname\n
Street No. 3\n
12345 City\n
\n
\n
Stuff\n
\n
\n
Phone Number:\n
0123 012654\n
Because of this structure, I would guess it is more useful to feed these “information blocks” to the text property of the JSONL entries instead of every single line. So one entry (as it would be presented to me in Prodigy during manual annotation) might be:
“Companyname\nStreet No. 3\n12345 City”
I assume even a blank model would further split this into sentences? I only want to ensure that the model can learn from structures like “Ah, a street name + number is often followed by a PLZ + city” or “Ah, if there is a number preceded by ‘Tel.’ or ‘Telefon:’, it is likely a phone number”. I’m not sure whether I break this concept when putting in the bunch of one- or two-word sentences that a pure newline approach would generate.
Am I right about this? Or is there a more elegant way in spaCy to do this, like introducing a custom/artificial paragraph separator such as ‘\n\n\n’? My naive attempt is the first sketch after this list.
- Speaking of elegant ways: whereas some of my desired entities will benefit from manual annotation (company names, persons…), others might allow for a rule-based annotation approach (phone numbers, for instance, or maybe a dictionary for cities). If I remember correctly, some of the ner recipes accept a patterns.jsonl file which is applied (in the form of an EntityRuler?) before the statistical model does any NER of its own, right? I would greatly appreciate your opinion and a roughly sketched workflow here; my own guess at the patterns file is the second sketch after this list.
- Assuming the above topics are handled somehow, does my training/annotation workflow make sense?
1. Generate the magic JSONL file from my company HTMLs and manually annotate all my labels (those that can’t be addressed by patterns) in Prodigy.
2. Use the saved annotation dataset for batch-train and save the model.
3. Use ner.teach with one of my labels at a time to further annotate via binary decisions.
4. Run batch-train again to save a hopefully reasonably functioning model (a quick sanity check is the third sketch below).
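To make the first question concrete, my naive approach would be to split the text myself before it ever reaches spaCy/Prodigy, treating runs of three or more newlines as paragraph stops (the BeautifulSoup usage and the threshold are just my assumptions):

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup for the HTML cleanup

def html_to_blocks(html):
    # Extract the visible text, one newline per HTML element.
    text = BeautifulSoup(html, "html.parser").get_text("\n")
    # Treat runs of 3+ newlines as paragraph stops and split there.
    blocks = re.split(r"\n{3,}", text)
    return [b.strip() for b in blocks if b.strip()]
```

Each block (e.g. “Companyname\nStreet No. 3\n12345 City”) would then become one {"text": ...} entry instead of one entry per line.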
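For the second question, I imagine generating the patterns file along these lines (the token patterns are my guess at the Matcher format; the PHONE/CITY labels, the regex and the city list are made up):

```python
import json

patterns = [
    # "Tel" / "Tel." / "Telefon", optional punctuation, then digit-like tokens.
    {"label": "PHONE", "pattern": [
        {"LOWER": {"IN": ["tel", "tel.", "telefon"]}},
        {"IS_PUNCT": True, "OP": "?"},
        {"TEXT": {"REGEX": r"^[\d/()\-]+$"}, "OP": "+"},
    ]},
    # City dictionary entries as exact lowercase matches.
    {"label": "CITY", "pattern": [{"LOWER": "berlin"}]},
    {"label": "CITY", "pattern": [{"LOWER": "hamburg"}]},
]

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")
```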
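And for the workflow question, after step 4 I would expect to sanity-check the result roughly like this (the path is whatever I pass as the output of batch-train; the example text is invented):

```python
import spacy

# Load the model written by ner.batch-train.
nlp = spacy.load("./imprint_model")

doc = nlp("Musterfirma GmbH\nMusterstr. 3\n12345 Musterstadt\nTelefon: 0123 012654")
for ent in doc.ents:
    print(ent.label_, "->", ent.text)
```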
I know this is a lot to take in, being a mix of fundamental and rather detailed questions, but I would be very grateful for your help!
Greets
KS