Training a new model using annotations from ner.manual

Hi,

I’m using a customized ner.manual recipe to annotate the sentences in an article that I think are the most relevant. I have a dataset with the annotations (that I get after using my recipe), but I’m not sure how I can use it to train the en_core_web_lg model so it can decide for me whether an article is important or not, and if it is, suggest the relevant sentences.

I tried using terms.to-patterns to create a patterns file, but it returns the whole text of the document rather than the selected sentences.

If you’re annotating at the sentence level, you should really try using the textcat recipes instead of the NER recipes. The NER model doesn’t really have good features for the internal structure of the span — it reads the text one word at a time, and tries to predict the start and end positions.
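If you want to reuse the annotations you’ve already collected, one option is to re-segment each accepted article into sentences and treat a sentence as a positive example if it overlaps one of your annotated spans. Here’s a rough sketch, assuming the standard JSONL output of ner.manual (with "text", "spans" and "answer" keys); the file names are just placeholders:

```python
import spacy
import srsly  # bundled with spaCy, handy for reading/writing JSONL

nlp = spacy.load("en_core_web_lg")

def overlaps(sent, spans):
    # True if the sentence overlaps any annotated character span
    return any(s["start"] < sent.end_char and s["end"] > sent.start_char for s in spans)

examples = []
# "ner_annotations.jsonl" is a placeholder for your exported ner.manual dataset
for eg in srsly.read_jsonl("ner_annotations.jsonl"):
    if eg.get("answer") != "accept":
        continue  # rejected articles could also become all-negative examples if you like
    doc = nlp(eg["text"])
    for sent in doc.sents:
        examples.append({
            "text": sent.text,
            "label": "IMPORTANT",
            "answer": "accept" if overlaps(sent, eg.get("spans", [])) else "reject",
        })

srsly.write_jsonl("sentence_textcat.jsonl", examples)
```

You could then train a text classifier on `sentence_textcat.jsonl` with the textcat training recipe for your Prodigy version.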

After you’ve trained the model to predict the importance label, you can then use it to filter sentences for further processing, e.g. if you have more entity type labels you want to apply.
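For example, here’s a minimal sketch of that filtering step. It assumes your trained pipeline still has a parser for sentence boundaries (it will if you trained on top of en_core_web_lg) and a text classifier with an IMPORTANT label; the model path and the 0.5 threshold are just placeholders:

```python
import spacy

# "./my_importance_model" is a placeholder for wherever your trained pipeline lives
nlp = spacy.load("./my_importance_model")

def important_sentences(article_text, threshold=0.5):
    doc = nlp(article_text)
    keep = []
    for sent in doc.sents:
        # Score each sentence on its own, so the classifier sees it in isolation
        scores = nlp(sent.text).cats
        if scores.get("IMPORTANT", 0.0) >= threshold:
            keep.append(sent.text)
    return keep

print(important_sentences("Target reported record profits last quarter. The weather was nice."))
```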

We’re planning to write a more detailed guide about this, but I think a common difficulty people have with NLP is that a lot of the most important decisions are about how to break your application needs down into some set of learning problems. For instance, in your case you’ve seen that you need labelled spans, where the spans are sentences and the label is “is it important?”. The NER model takes text and outputs labelled spans – so it seems like a natural choice! However, there are other possible solutions, especially by composing rule-based and ML approaches.

If you know the spans will be sentences, the learning problem becomes much easier if you impose that constraint upfront, since sentences are reasonably easy to detect. The model then only has to learn the importance label, instead of having to learn the label jointly with the definition of a sentence. The joint approach takes a lot more data to train, because the model has to get two pieces of information correct at once. While it’s still making errors on the sentence boundaries, it can’t tell that it’s making progress on the importance labelling, and vice versa.
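To make that concrete, you can do the segmentation upfront with the pipeline you already have and create one task per sentence, so the annotation (and the model) only has to decide on the label. A rough sketch; the file names and "meta" fields are just placeholders:

```python
import spacy
import srsly

nlp = spacy.load("en_core_web_lg")

tasks = []
# "articles.jsonl" is a placeholder: one record per article with a "text" field
for article_id, article in enumerate(srsly.read_jsonl("articles.jsonl")):
    doc = nlp(article["text"])
    for sent_id, sent in enumerate(doc.sents):
        # One task per sentence; "meta" is optional and just shows up in the annotation card
        tasks.append({
            "text": sent.text,
            "label": "IMPORTANT",
            "meta": {"article": article_id, "sent": sent_id},
        })

srsly.write_jsonl("sentences.jsonl", tasks)
```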

Thank you for your reply!

I’m still not entirely sure how I should approach my problem.

I’m only using one label, let’s say IMPORTANT, and I have a bunch of articles that contain certain keywords that can have more than one meaning, such as “target”. At the moment I’m using ner.manual to display the whole body of the article. If the article refers to “target” as a company and contains some relevant information (e.g. latest profits etc.), I label the relevant sentences (not the whole text) as IMPORTANT and choose accept. However, when I come across an irrelevant document (one that mentions “target” in a context other than the company name), I simply click reject. I would like to be able to train the model so that in the future it can filter the relevant articles automatically and pick out the important sentences for me.

Do I understand you correctly that it should be split into two separate tasks? First picking the relevant documents, and then picking the important sentences?