Yes, the sentence splitting is done in this simple preprocessor:
stream = split_sentences(model.orig_nlp, stream)
So if you remove this from ner.teach, or don't include it in your custom recipe, the tasks will just be streamed through without modification. You can then split up your documents however you want when you create the stream of examples.
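For instance, a custom stream could yield one task per paragraph instead of per sentence. This is just a minimal sketch in plain Python (the `paragraph_stream` helper and the document format are hypothetical, not part of Prodigy's API) to show the idea:

```python
def paragraph_stream(documents):
    # Split each document on blank lines and yield one task per
    # paragraph, so no further sentence splitting is applied.
    for doc in documents:
        for paragraph in doc.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                yield {"text": paragraph}

docs = ["First paragraph.\n\nSecond paragraph.", "Another doc."]
tasks = list(paragraph_stream(docs))
```

Since the stream is just an iterable of dictionaries, you can plug a generator like this straight into a custom recipe in place of the sentence-split stream.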
Yes, this is exactly what the task's "meta" is for! It's a dictionary, and each entry will be rendered in the bottom right corner of the annotation card, with the key used as the bolded label. For example:
"text": "Some long text here",
"document": "Super Important Legal Document"
This will show up as:
PARAGRAPH: 632 DOCUMENT: Super Important Legal Document
It should also be very easy to generate the meta programmatically when you’re reading in and pre-processing the documents to create your stream.
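For example, a simple generator can attach the meta as you build the stream. This is a hedged sketch: the `add_meta` helper and the `(title, paragraphs)` document structure are assumptions for illustration, not Prodigy's API.

```python
def add_meta(documents):
    # documents: iterable of (title, paragraphs) pairs.
    # Each task carries a "meta" dict, which Prodigy renders in the
    # bottom right corner of the annotation card.
    for title, paragraphs in documents:
        for i, paragraph in enumerate(paragraphs):
            yield {
                "text": paragraph,
                "meta": {"paragraph": i, "document": title},
            }

docs = [("Super Important Legal Document", ["First part.", "Second part."])]
stream = list(add_meta(docs))
```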
The built-in recipe that currently uses the manual interface is ner.manual. It only uses the model for tokenization and doesn't do any active learning. So this can be useful if you're starting from zero and want to collect a few examples of a category you can't easily cover with seed terms and patterns. It's also useful for creating gold-standard data for evaluation – so, before you start training with a model in the loop, you could annotate a few hundred paragraphs manually and add them to an evaluation dataset. This will make it much easier to reliably evaluate the model's performance later on.
Btw, I hate teasing future features, BUT we're also working on a new version of the ner.make-gold recipe that combines the manual NER mode with a model's entity recognizer. So you'll be able to see all entities the model recognised in a text and correct them – either by removing predictions or adding new ones. It's already been working well in our tests, so we'll likely be ready to push another release soon.