I have around 70 documents, each with 15 pages of text on average. I want to extract self-defined entities to create a knowledge base for each of the documents. Preferably, I would also like to train a model that extracts the same entities from future documents without manual labeling again.
As you can see, 9 of my 15 entities basically overlap with spaCy's pre-defined entities. However, consulting @ines' flowchart, I concluded that since i) I cannot find good patterns to match my entities and ii) I am adding more than 3 entities, I should use ner.manual and manually annotate all 15 entities.
Right now I upload the 70 raw documents and label over the long texts in Prodigy.
Is there a difference between labeling over the very long documents and feeding in the documents at sentence level?
Can I use ner.manual to annotate 3-4 documents and then use the annotations for ner.teach, e.g. as a patterns file, to overcome ner.teach's cold start?
Let's say I am satisfied with my customized NER model and I have stored the named entities per document. Is there any way I can map back from the entities to the sentences I extracted them from? For instance, I could display all paragraphs that contain the named entities x1, x2 and x3, and then use something like Python's Whoosh to query them for keywords. I think this would be useful for building a search engine.
I think you have the right ideas about your workflow here, so these are all great questions. I hope you'll be able to complete your task quickly.
I think you should consider writing a script to divide your documents into paragraph level texts, maintaining the document IDs and paragraph IDs in the records you create --- possibly in the meta field (which would display the information to you during annotation). If you can host the documents, you could also add a link to the document and paragraph within the meta field, so that during annotation you can see the paragraph in context.
While dividing up the data this way, for extra safety you might also want to maintain an index mapping the paragraph hashes to the document and paragraph IDs outside of the dataset --- so that you know you can't lose the information. You can use the prodigy.util.set_hashes function to assign the content and task hashes. The prodigy.util.INPUT_HASH_ATTR constant provides the key that the input hash will be assigned to in the example.
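A minimal sketch of what that pre-processing script could look like. The function name, the `my_documents` input format and the blank-line paragraph split are my own assumptions, not anything Prodigy prescribes; `set_hashes` and `INPUT_HASH_ATTR` are used as described above, but double-check against your Prodigy version:

```python
# Sketch (assumptions: documents are dicts with "id" and "text"; paragraphs
# are separated by blank lines -- adjust to your real data).
import json
from prodigy.util import set_hashes, INPUT_HASH_ATTR

def make_tasks(documents):
    """Yield one Prodigy task per paragraph, plus an external hash index."""
    index = {}   # input hash -> (doc_id, paragraph_id), kept outside the dataset
    tasks = []
    for doc in documents:
        for i, para in enumerate(doc["text"].split("\n\n")):
            para = para.strip()
            if not para:
                continue
            task = {"text": para, "meta": {"doc_id": doc["id"], "paragraph_id": i}}
            task = set_hashes(task)                    # adds the input/task hashes
            index[task[INPUT_HASH_ATTR]] = (doc["id"], i)
            tasks.append(task)
    return tasks, index

if __name__ == "__main__":
    my_documents = [{"id": "doc_001", "text": "First paragraph.\n\nSecond paragraph."}]
    tasks, index = make_tasks(my_documents)
    with open("paragraphs.jsonl", "w", encoding="utf8") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")
    with open("hash_index.json", "w", encoding="utf8") as f:
        json.dump({str(h): ids for h, ids in index.items()}, f)
```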
To answer your other questions: yes, after you've done some manual annotation, you could try using ner.batch-train to train a model, and then try either ner.teach or ner.make-gold.
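For example, something along these lines (dataset and label names are placeholders, and the exact recipe arguments can vary between Prodigy versions, so check `prodigy ner.batch-train --help` and `prodigy ner.teach --help`):

```bash
# Train a temporary model from the manual annotations collected so far
prodigy ner.batch-train my_manual_dataset en_core_web_lg --output /tmp/model --label "LABEL_A,LABEL_B"

# Then use that model in the loop to collect more annotations
prodigy ner.teach my_teach_dataset /tmp/model paragraphs.jsonl --label "LABEL_A,LABEL_B"
```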
For your last question, I think this would just be a matter of storing the right metadata when you extract the entities from the dataset --- so you'd just do this in a custom Python script, I think. If you're using spaCy for the data manipulation, you can get the sentence of an entity with the sent property. It doesn't give you an index into the list of sentences, but it's easy to build that yourself, probably as a dict keyed by the start of the sentence, e.g. sentences = {sent.start_char: i for i, sent in enumerate(doc.sents)}. Then you can just do sentences[ent.sent.start_char] to retrieve the sentence index of a given entity.
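Putting that together, a small sketch (the model name and example text are placeholders for your own):

```python
import spacy

nlp = spacy.load("en_core_web_lg")  # or your custom-trained model
doc = nlp("Apple opened a new office in Berlin. Tim Cook announced the move on Monday.")

# Key each sentence by its start character, as described above
sentences = {sent.start_char: i for i, sent in enumerate(doc.sents)}

for ent in doc.ents:
    sent_index = sentences[ent.sent.start_char]
    print(ent.text, ent.label_, "-> sentence", sent_index, repr(ent.sent.text))
```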
Finally, an additional suggestion. Given that many of the entities you're interested in do overlap with spaCy's built-in entities, you could try getting those annotated on your data first, before you work on your extra entities. This should make the annotation easier. You could use the ner.make-gold recipe for this.
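A rough example of what that command could look like (the label set and file name are placeholders, and the arguments may differ in your Prodigy version):

```bash
prodigy ner.make-gold spacy_entities en_core_web_lg paragraphs.jsonl --label "PERSON,ORG,GPE,DATE,MONEY"
```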
Apply the model with ner.teach and/or ner.make-gold to increase accuracy. Use --exclude safety_new to get new examples. I am not sure if I should use the original data "data.json" or export and use the annotated data from step 1?
Again, train an improved model with ner.batch-train on the new annotated dataset db_2. I am not sure whether to use en_core_web_lg or the model from before, /tmp/model?
Export db_final with its annotations to a db_final.jsonl file for the next phase.
Export model_final.
Phase 2:
8. Start annotating my own entities with ner.manual, using db_final.jsonl from the previous step. This also gives me the nice opportunity to change or delete wrong previous annotations, I suppose?
Export the labeled dataset and re-use it to annotate the next entity. Iterate for all of my own entities until all old and new entities are annotated in one dataset.
Use ner.batch-train to train with this dataset.
Apply the new model with ner.teach and/or ner.make-gold to increase accuracy.
Now we have both a model for future use in this domain and annotated data for the knowledge base. A rough sketch of the commands I have in mind is below.
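Roughly what I am planning to run for these steps (dataset and label names are made up, and I will double-check the exact arguments against the recipe docs):

```bash
# Phase 1: spaCy's built-in entities
prodigy ner.make-gold db_1 en_core_web_lg data.json --label "PERSON,ORG,LAW"
prodigy ner.batch-train db_1 en_core_web_lg --output /tmp/model
prodigy ner.teach db_2 /tmp/model data.json --label "PERSON,ORG,LAW" --exclude safety_new   # step 3
prodigy ner.batch-train db_2 en_core_web_lg --output /tmp/model_2                           # step 4: base model or /tmp/model?
prodigy db-out db_final > db_final.jsonl

# Phase 2: my own entities on top
prodigy ner.manual db_own /tmp/model_2 db_final.jsonl --label "MY_LABEL_1,MY_LABEL_2"        # step 8
prodigy ner.batch-train db_all en_core_web_lg --output /tmp/model_final                      # step 10: which base model?
```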
For step 3), is it better to use the original input data data.json or the annotated data from step 1)?
For step 4), do I re-train en_core_web_lg or the model from the first round in step 2), /tmp/model?
For step 10), using my own entities on top of spaCy's entities: I suppose I have to train on model_final from step 7)?
If you could clarify those last questions, I'd be very grateful.
Thank you!
Both ner.teach and ner.make-gold will add suggestions to the raw text, so there's no advantage in using annotated data here. Also, you typically want to use new texts that the model hasn't seen to improve it further, not annotations it was already trained on.
When annotating with a model in the loop, you can use the latest artifact you've created with ner.batch-train.
When you train the final model, you usually want to start with the base model and train with all collected annotations (instead of updating one artifact in small increments). This lets you avoid possible side-effects from the frequent updates and gives you a cleaner and more easily reproducible process.
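Sketching that out (dataset and path names are placeholders; check each recipe's `--help` for the exact arguments):

```bash
# Annotate with the latest trained artifact in the loop ...
prodigy ner.teach db_new /tmp/model_2 data.json --label "PERSON,ORG,LAW"

# ... but train the final model from the clean base model,
# using a dataset that contains all collected annotations
prodigy ner.batch-train db_all en_core_web_lg --output /tmp/model_final
```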
My task is also similar to this one, so I would like to make use of some of the spaCy entities (person, law & organisation), but then add totally new ones (policy mix & sector...).
I am planning to follow this workflow, but one thing in @ines's response is not clear.
Do you mean that the model_final created at step 7 should be trained further on the data annotated for all the other entities at step 9? Or do you mean a base model (in this example's case, en_core_web_lg) should be trained at once with the annotated data created through phases 1 & 2?
Thanks a lot in advance, this place has been already very useful!
Yes, exactly. Ideally, you should always start with the same clean slate or base model, and then update it from all annotations (instead of doing it in multiple steps).