I need some help with choosing and customizing an annotation interface.
I am working with longish (2-10 pages) legal documents. I need to extract the names of the parties the contract applies to. The parties are names of corporations. Sometimes there will be a sentence that says something like “This contract is between ACME corporation and Big Industries Inc.” Other times there will be a sentence that says something like “This contract is between ACME corporation and the companies listed in the appendix of this document.” Then in the appendix there will be a bullet list of several company names. I want to be able to churn through a large number of documents and come out with span annotations that look like “This contract is between **ACME corporation** and **Big Industries Inc.**”, with the party names highlighted.
This is basically a named entity extraction task, but it is a little non-standard because I’m not trying to extract every company name in the document. There will be some company names that are not parties to the contract. There are contextual cues that allow a human to tell the difference, and I need my model to learn those.
I’m running the ner.teach recipe with seed patterns to find likely company names. There are a couple of problems with the default way Prodigy has me do this annotation task that I’d like to rectify by writing a custom recipe:
1. I need to see more than just the default one-sentence context that Prodigy supplies. Entire paragraphs would be helpful, or maybe even the entire document.
2. If we’re looking at just a portion of the document, it would be helpful to have an approximate sense of where we are: something like a line or paragraph number displayed in the Prodigy UI.
3. I need to see the document title while annotating, because that can be very helpful to the annotator.
I think I can fix (1) by returning the right kind of stream object from my custom recipe, but I’m not sure. Is there example code for this?
(2) and (3) are basically metadata: data that should be visible to the annotator but is not part of the literal text. Is this supported by recipes?
Also, would the boundaries interface or the new fully manual NER annotation interface be better for this task than ner.teach? Are there example recipes that run these?
Yes, the sentence splitting is done in this simple preprocessor:

```python
from prodigy.components.preprocess import split_sentences

stream = split_sentences(model.orig_nlp, stream)
```
If you remove this from ner.teach, or don't include it in your custom recipe, the tasks will just be streamed through without modification. So you can choose to split up your documents however you want when you create the stream of examples.
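To illustrate, here's a rough sketch of what a custom recipe along those lines could look like, modelled on the structure of the built-in ner.teach. The recipe name, the ORG label and the naive paragraph-splitting heuristic are all placeholders, so adapt them to your data and Prodigy version:

```python
import spacy
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.ner import EntityRecognizer

@prodigy.recipe("ner.teach-paragraphs")  # placeholder name
def teach_paragraphs(dataset, spacy_model, source):
    """Like ner.teach, but streams whole paragraphs, not sentences."""
    nlp = spacy.load(spacy_model)
    model = EntityRecognizer(nlp, label=["ORG"])  # placeholder label

    def get_stream():
        for doc in JSONL(source):  # assumes one document per line
            for para in doc["text"].split("\n\n"):  # naive paragraph split
                yield {"text": para}

    # No split_sentences() here: the paragraphs go to the model as-is,
    # and prefer_uncertain() sorts them by the model's uncertainty.
    stream = prefer_uncertain(model(get_stream()))
    return {
        "view_id": "ner",
        "dataset": dataset,
        "stream": stream,
        "update": model.update,
    }
```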
Yes, this is exactly what the task's "meta" is for! It's a dictionary, and each entry will be rendered in the bottom right corner of the annotation card, with the key used as the bolded label. For example:
```json
{
    "text": "Some long text here",
    "meta": {
        "paragraph": "632",
        "document": "Super Important Legal Document"
    }
}
```
This will show up as:
PARAGRAPH: 632 DOCUMENT: Super Important Legal Document
It should also be very easy to generate the meta programmatically when you're reading in and pre-processing the documents to create your stream.
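For example, extending the paragraph stream from the sketch above (the "title" field is an assumption about your input format), something like this would attach the meta as you chunk each document:

```python
def make_tasks(documents):
    """Yield one task per paragraph, with the paragraph number and
    document title attached as meta so they show up on the card."""
    for doc in documents:
        for i, para in enumerate(doc["text"].split("\n\n"), start=1):
            yield {
                "text": para,
                "meta": {"paragraph": i, "document": doc["title"]},
            }
```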
The built-in recipe that's currently using the manual interface is ner.manual. It only uses the model for tokenization and doesn't do any active learning. So this might be useful if you're starting from zero and want to collect a few examples of the category that you can't easily cover with seed terms and patterns. It's also useful to create gold-standard data for evaluation – so, before you start training with a model in the loop, you could annotate a few hundred paragraphs manually and add them to an evaluation dataset. This will make it much easier to reliably evaluate the model's performance later on.
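For instance, to collect gold-standard ORG annotations for evaluation, the invocation would look something like this (the dataset and file names are placeholders):

```bash
prodigy ner.manual eval_contracts en_core_web_sm ./contracts.jsonl --label ORG
```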
Btw, I hate teasing future features, BUT we're also just working on a new version of the ner.make-gold recipe that combines the manual NER mode with a model's entity recognizer. So you'll be able to see all entities the model recognised in a text and correct them, either by removing predictions or adding new ones. It's already been working well in our tests, so we'll likely be ready to push another release soon.