HTML to jsonl and NER task workflow

Hello,

I’m a spaCy + Prodigy newbie trying to get my head around some basic concepts, as well as to apply some of the answers here to my current project. This support forum has been a great help so far, but I would appreciate a little boost/correction, because I have the impression that I might be overcomplicating things.

My task is very similar to the one discussed here (NER document Labeling).

TLDR: Company imprint HTML pages as input -> NER for entities like company name, address, phone number and so on

What (I think) I know, so please correct me:

  • For this specific problem, using a blank model is advised (this thread helped me a lot; I need it for German, but there it is explained for a Russian example: Trying to teach NER from blank model for Russian language).

  • I have no annotated training data to start from, but I can use Prodigy with ner.manual to generate some. Afterwards I use this annotation data to batch-train and save the model. Then I can use this model with ner.teach (maybe constrained to only one of my labels at a time, right?) to expand my training data a bit faster, because I only have to make binary decisions.

  • Prodigy expects jsonl as an input format. As mentioned in the cited post, I could plug in each line (separated by '\n') as a separate entry, using a form like {"text": "XXXXXXX", "meta": {"source": "company1.txt"}} (see the small sketch below for how I'd generate this).
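
Just to make sure I have the format right, I imagine generating that file with something like this (the clean-up step and file names are placeholders for my own preprocessing):

```python
import json

def write_jsonl(blocks, source, out_file):
    # One jsonl line per text block, keeping the originating page in the meta
    with open(out_file, "a", encoding="utf8") as f:
        for block in blocks:
            entry = {"text": block, "meta": {"source": source}}
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# e.g. write_jsonl(["Companyname\nStreet No. 3\n12345 City"], "company1.txt", "imprints.jsonl")
```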

What remains rather unclear to me:

  • I can clean the HTML file so that several interesting paragraphs remain. For the whole document, imagine something like:
\n
Some unimportant stuff\n
Stuff\n
\n
\n
\n
Companyname\n
Street No. 3\n
12345 City\n
\n
\n
Stuff\n
\n
\n
Phone Number:\n
0123 012654
\n

Because of this structure, I would guess it is more useful to feed these "information blocks" into the "text" property of the jsonl entries instead of every single line. So one entry (which will be presented to me in Prodigy during manual annotation) might be:

“Companyname\nStreet\n1234 City”

I assume even a blank model would further split this into sentences? I only want to ensure that the model can learn from structures like "Ah, a street name + number is often followed by a postcode + city" or "Ah, a number preceded by 'Tel.' or 'Telefon:' is likely a phone number". I'm not sure if I break this concept when putting in a bunch of one- or two-word sentences, which is what I get when splitting on single newlines only.

Am I right about this? Or is there a more elegant way in spaCy to do this, like introducing a custom/artificial paragraph separator such as '\n\n\n'? (There is a rough sketch of what I mean after this list.)

  • Speaking of elegant ways: whereas some of my desired entities will benefit from manual annotation (company names, persons…), others might allow for a rule-based annotation approach (like phone numbers, or maybe using a dictionary for cities). If I remember correctly, some of the ner.* recipes allow a patterns jsonl file, which is applied (in the form of an EntityRuler?) before the statistical model does any NER of its own, right? I would greatly appreciate your opinion and a roughly sketched workflow here.

  • Assuming the above topics are handled somehow, does my training/annotation workflow make sense?
    1. Generate the magic jsonl file from my company HTMLs and manually annotate all the labels that can't be addressed by patterns in Prodigy.
    2. Use the saved annotation dataset to batch-train and save the model.
    3. Use ner.teach, selecting one of my labels at a time, to annotate further via binary decisions.
    4. Batch-train again to end up with a hopefully reasonably functioning model.
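
(To make the segmentation question from above concrete: by an artificial paragraph separator like '\n\n\n' I mean a naive split along these lines; my real pages will certainly need more clean-up first.)

```python
import re

def split_into_blocks(text, min_newlines=2):
    # Treat runs of `min_newlines` or more newlines as block separators
    blocks = re.split(r"\n{%d,}" % min_newlines, text)
    return [b.strip() for b in blocks if b.strip()]

# "Companyname\nStreet No. 3\n12345 City" stays together as one block,
# while the surrounding "Stuff" paragraphs become separate entries.
```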

I know this is a lot to take in, being a confusing mix of fundamental and rather detailed questions, but I would be very grateful for your help!

Greets
KS

Hi Kevin,

Yes, I think starting from a blank model will be best for your use-case. You might also benefit from moving to a custom recipe early, so that you have full control of the processing steps. I would concentrate on using the manual view at first, probably with a rule-based process suggesting annotations, and not worry too much about stuff like ner.teach. The ner.teach recipe requires the model to be reasonably accurate already, because it uses the model to suggest specific annotations, which you say yes or no to. If the model isn’t already somewhat accurate, the suggestions aren’t very helpful. You can experiment with ner.teach later once you have a model that’s fairly accurate, but for now I would concentrate on the basics of setting up a manually annotated training, testing and evaluation set.

Regarding your question about sentence segmentation, you won’t get any sentence splitting if you don’t add it to the pipeline. So if you want the NER to work on your chunks without further segmentation, you can definitely have it do that. I would say this is a good option for you.
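
To make that concrete: a blank pipeline is just the tokenizer plus whatever you add yourself, so nothing sets sentence boundaries on the Doc. Roughly:

```python
import spacy

nlp = spacy.blank("de")  # tokenizer only, no parser or sentencizer
doc = nlp("Companyname\nStreet No. 3\n12345 City")
# list(doc.sents) would raise an error here, because no component has set
# sentence boundaries: the NER simply sees the whole chunk as one unit.
```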

Hi Matt,

thank you for your reply and your remarks regarding the ner.teach recipe. Could you maybe elaborate on your custom recipe suggestion a bit more? From my understanding, I would do the following:

  • use ner.manual for the labels that can hardly be described by rules (like person or company names)
  • use ner.match to annotate the other labels faster, because it only needs binary decisions (pattern accept/reject). I can plug in patterns for multiple labels here via a jsonl file, right? (Is the snippet below roughly what such a file would look like?)
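
For example (just my guess at the format, with made-up patterns), the file might contain lines like:

```
{"label": "PHONE", "pattern": [{"LOWER": "telefon"}, {"IS_PUNCT": true, "OP": "?"}, {"IS_DIGIT": true, "OP": "+"}]}
{"label": "CITY", "pattern": "Berlin"}
```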

My question is: how would one reasonably combine these two steps into one recipe? Some abstract hints on how to inject the rule-based suggestions will be enough for me to comb through the documentation and the forum.

And a second, more conceptual question regarding the HTML document splitting: even with my 'block-wise' segmentation method, in some cases, instead of a small block like "Companyname, Street No. 3, 12345 City", the algorithm will only get "City" as the one and only word in the 'sentence', possibly preceded by another sentence "12345".
I'm unsure whether the model will still recognize the pattern here (after enough training).

In my initial jsonl dataset I used the {"meta": {"source": ...}} scheme that was suggested here, and I see that this information is displayed to me during annotation. I just want to be sure that these lines are considered to belong to ONE document (as sentences, for example)… or am I missing a point here?

Thank you for your help!

Well, all you need to do to pre-highlight suggestions is to add a "spans" entry to the task objects. So in your custom recipe, just have your generator produce your stream of examples, and inside the loop run the matcher or some other process to predict the suggested spans. You can find example custom recipes here: GitHub - explosion/prodigy-recipes: 🍳 Recipes for the Prodigy, our fully scriptable annotation tool
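
Very roughly, it could look something like this (untested sketch, so check the exact arguments against the recipes repo; the labels and the phone rule are just placeholders):

```python
import spacy
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from spacy.matcher import Matcher

@prodigy.recipe("ner.manual.with-rules")
def ner_manual_with_rules(dataset, source):
    nlp = spacy.blank("de")
    matcher = Matcher(nlp.vocab)
    # Placeholder rule: "Telefon", optional punctuation, then digit tokens (spaCy v2 signature)
    matcher.add("PHONE", None, [{"LOWER": "telefon"}, {"IS_PUNCT": True, "OP": "?"},
                                {"IS_DIGIT": True, "OP": "+"}])

    def add_suggestions(stream):
        for eg in stream:
            doc = nlp(eg["text"])
            spans = []
            for match_id, start, end in matcher(doc):
                span = doc[start:end]
                spans.append({"start": span.start_char, "end": span.end_char,
                              "label": nlp.vocab.strings[match_id]})
            eg["spans"] = spans
            yield eg

    stream = JSONL(source)            # your jsonl file with {"text": ..., "meta": ...}
    stream = add_suggestions(stream)  # pre-highlight the rule-based matches
    stream = add_tokens(nlp, stream)  # the manual view needs "tokens" on each task

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": ["COMPANY", "PHONE", "CITY"]},  # placeholder labels
    }
```

You'd then start it with something like `prodigy ner.manual.with-rules your_dataset your_data.jsonl -F recipe.py`.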

I don't think I understand this bit, sorry. You're using your own logic to split up the documents, right? If so, I'm not sure what the question is: if your logic over-segments, then you should be able to fix that, right?


Hi Matt,

thank you very much for your fast answer, and I would kindly ask you to bear with me one last time (I hope).

Ah OK, that makes sense. The ner.make-gold recipe looks promising as a base to build on and add the custom matcher to. I will definitely try this!

In principle, yes. The problem is that in my case it is not trivial to perform this segmentation, because of the variety of website designs. So sometimes I get my ideal information block as one example segment, and sometimes the crucial pieces of information arrive one by one in separate examples.

In short, the question was more of a "Is it worth the time to improve the segmentation, or how much will the algorithm still benefit from the (occasional) one- or two-word examples consisting only of a postcode or city name, for example?"

I know that without surrounding structure my pattern-based augmentation will be limited in these cases...

I don't want to stretch your patience, but regarding this last aspect I would like to come back to the initial workflow question: I guess it would be wiser to plug my (roughly cleaned, stripped of scripts etc.) HTML documents AS A WHOLE into the "text" entries of my jsonl training dataset and embed the segmentation in my model?

So far I have used a blank model, with only tokenization and the added NER. Please correct me if I'm wrong, but my possibilities here seem to be:

  • I could add a sentencizer or a custom component to split my HTML doc into my information segments as sentences, in a rule-based fashion (there is a small sketch of what I mean after this list)

  • I could add a parser and base the sentence/segment selection on it. But then I still can't use one of your pretrained models, because my documents don't have a "normal" syntactic structure, and I would have to train one manually... so I don't think this is the way to go here...
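
For the first option, what I have in mind is roughly this small component (spaCy v2 style, just a sketch of the idea):

```python
import spacy

def segment_on_blank_lines(doc):
    # Start a new "sentence" after every whitespace token that contains 2+ newlines
    for i in range(1, len(doc)):
        prev = doc[i - 1]
        doc[i].is_sent_start = prev.is_space and prev.text.count("\n") >= 2
    return doc

nlp = spacy.blank("de")
nlp.add_pipe(segment_on_blank_lines)  # would go before the NER component
```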

One last question:
If my model managed to split the docs into sentences, would Prodigy display these as annotation tasks, or still the whole doc? In the end, I want a model that I can give a whole document and get the NER from, so I'm still confused about which steps belong in the model and which in a custom recipe.

Cheers
Kevin

Ah, I think I understand better now.

Your training process will definitely suffer if you have a lot of segmentation errors in your data. My suggestion would be to find a very conservative segmentation policy that very rarely over-segments, so you can work on somewhat more convenient chunks, without introducing errors from relevant entities being split over documents. You can also try just not segmenting the texts. You’ll have to scroll a little bit, but it should still work.

One thing to keep in mind is that you’ll need some way to create entirely correct annotations, for evaluation. The evaluation set shouldn’t have errors in it from the segmentation — otherwise you’ll have no way of estimating how many errors the segmentation is actually introducing. You’ll also want a dataset that lets you evaluate future efforts to improve the segmentation.

If you can’t easily make rules that don’t over-segment, and it’s impractical to work on the unsegmented documents, one solution is to have a fix-up process that you do at the end, possibly just in a text editor. You could use the “reject” button to mark that there’s a segmentation error, and then have some sort of process to repair the problems when you’re done annotating. The repair process could be manual or a script, depending on the nature of the problem and how you’ve managed to mark it.

Thank you Matt, this cleared up my general workflow questions!