I need some help with choosing and customizing an annotation interface.
I am working with longish (2-10 pages) legal documents. I need to extract the names of the parties the contract applies to. The parties are corporations. Sometimes there will be a sentence that says something like “This contract is between ACME corporation and Big Industries Inc.” Other times there will be a sentence that says something like “This contract is between ACME corporation and the companies listed in the appendix of this document.” Then in the appendix there will be a bullet list of several company names. I want to be able to churn through a large number of documents and come out with span annotations that look like "This contract is between ACME corporation and Big Industries Inc.", with "ACME corporation" and "Big Industries Inc." marked as spans.
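To make the target output concrete, here is a rough sketch of what I mean by a span annotation. The PARTY label, the make_spans helper, and the dict layout with character offsets are my own assumptions for illustration, not something I have confirmed against Prodigy's format.

```python
# Hypothetical illustration of the span annotations I want to end up with.
text = "This contract is between ACME corporation and Big Industries Inc."
parties = ["ACME corporation", "Big Industries Inc."]


def make_spans(text, names, label="PARTY"):
    """Return character-offset spans for the first occurrence of each name."""
    spans = []
    for name in names:
        start = text.find(name)
        if start == -1:
            # Name does not occur in this text, so there is nothing to mark.
            continue
        spans.append({"start": start, "end": start + len(name), "label": label})
    return spans


annotation = {"text": text, "spans": make_spans(text, parties)}
```

Company names that appear in the document but are not parties to the contract would simply get no span.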
This is basically a named entity extraction task, but it is a little non-standard because I’m not trying to extract every company name in the document. There will be some company names that are not parties to the contract. There are contextual cues that allow a human to tell the difference, and I need my model to learn those.
I’m running the ner.teach recipe with seed patterns to find likely company names. There are a couple of problems with the default way that Prodigy has me do this annotation task that I’d like to rectify by writing a custom recipe.
1. I need to see more than just the default one-sentence context that Prodigy supplies. Entire paragraphs would be helpful. Or maybe even the entire document.
2. If we’re looking at just a portion of the document it would be helpful to have an approximate sense of where we are. Something like a line or paragraph number displayed in the Prodigy UI.
3. I also need to see the document title when I’m annotating, because that can be very helpful to the annotator.
I think I can change the amount of displayed context in (1) by returning the right kind of stream object from my custom recipe, but I’m not sure. Is there example code for this?
(2) and (3) are basically metadata: data that should be visible to the annotator that is not part of the literal text. Is this supported by recipes?
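To make (1) through (3) concrete, this is roughly the kind of stream I imagine returning. It is only a sketch under assumptions I have not verified: that each task can be a plain dict with a "text" key, and that a "meta" dict on the task would be displayed to the annotator. The paragraph_stream function, the input layout (dicts with "title" and "text"), and the blank-line paragraph splitting are all hypothetical.

```python
import re


def paragraph_stream(docs):
    """Yield one task per paragraph, carrying the title and position as metadata.

    Assumes each doc is a dict with "title" and "text" keys and that
    paragraphs are separated by one or more blank lines.
    """
    for doc in docs:
        # Split on blank lines and drop empty fragments.
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc["text"]) if p.strip()]
        total = len(paragraphs)
        for index, paragraph in enumerate(paragraphs, start=1):
            yield {
                "text": paragraph,
                # Assumed to be shown in the UI alongside the text.
                "meta": {"title": doc["title"], "paragraph": index, "of": total},
            }


docs = [{"title": "Supply Agreement", "text": "Para one.\n\nPara two.\n\n\nPara three."}]
tasks = list(paragraph_stream(docs))
```

If whole documents turn out to be better than paragraphs, I assume the same idea would work by yielding one task per document instead.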
Also, would the boundaries interface or the new fully manual NER annotation interface be better for this task than ner.teach? Are there example recipes that run these?