Prodigy splitting sentences for annotation

If I am annotating entities across multiple lines, ner.teach separates the annotation into separate docs.
For instance if incoming data looks like:
{"text": "sentence 1. sentence 2. "}
Then when annotating I am presented with sentence 1 and sentence 2 separately. I would like to override this and accept whatever is inside my "text" attribute.

Thanks!

Yes, by default, Prodigy splits sentences to ensure that beam search works as intended. This is done by the split_sentences preprocessor – so you can disable this behaviour by removing the following line from your recipe:

stream = split_sentences(model.orig_nlp, stream)

If you’re training an NER model and your input texts are longer than around 100-200 words, segmenting them will usually increase accuracy and performance. Instead of using the built-in sentence segmenter, you can also implement your own logic, or pre-process your input stream.
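If you'd rather handle segmentation yourself, pre-processing the stream can be as simple as a generator that rewrites incoming examples before Prodigy sees them. A minimal sketch (the paragraph-splitting rule here is just an illustrative stand-in for whatever custom logic you need):

```python
import re

def split_paragraphs(stream):
    """Split each incoming example into one task per paragraph.

    A stand-in for custom segmentation logic; here, paragraphs are
    assumed to be separated by blank lines.
    """
    for eg in stream:
        for para in re.split(r"\n\s*\n", eg["text"]):
            para = para.strip()
            if para:
                # copy the example so shared fields like "meta" are preserved
                yield {**eg, "text": para}

stream = [{"text": "First paragraph.\n\nSecond paragraph.", "meta": {"doc": 1}}]
tasks = list(split_paragraphs(stream))
```

Each paragraph becomes its own annotation task, and because the generator copies the original example, any metadata travels along with it.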

Feature request: add a command line option to ner.teach that lets you turn off sentence segmentation. Right now I’ve got a custom recipe that is an identical cut-and-paste copy of ner.teach except with that line removed.

+1

Finding the source, copying it, and making the necessary changes to the import statements is a bit of a pain. I think not having the sentence splitting should be the default, and then you can write a simple custom recipe to update teach.

Most of the custom recipes I’ve written have not been cut-and-paste jobs. Usually I’m able to override a couple streams and pass the results along to the original recipe in just a few lines of Python code, which feels correct. The fact that I’m not able to do this here is indication that a command line switch is justified.

Thanks for the feedback – I see your point and we can definitely make this happen :+1: However, I would advocate against disabling sentence splitting by default, for the following reasons:

  • Beam search. As I mentioned above, the active learning-powered recipes take advantage of the beam parser, which works best on shorter inputs (and thus fewer possible parses to keep track of).
  • Prodigy should support loading in streams with inputs of different lengths – we want the user to start streaming in their data without preprocessing it first (at least, for simple use cases and quick experiments). And even if you do pre-process your stream, there’s always the risk of ending up with one rogue long example, which can throw off the parser and make things very slow. It’s possible to work around that – but you need to be aware of the problem.
  • Improving the model with ner.teach works best if the annotator can move through examples quickly, and focus on one entity at a time. If texts are too long, this becomes problematic, making the annotation workflow significantly less efficient.
  • In most cases, if the annotation decision needs significantly more context (whole paragraphs and more), the model is also much less likely to learn that decision. It’s still possible to make it work, but it requires some experimentation and careful thought – @wpm’s work on legal texts is a good example of that.

In terms of implementing the setting: We could either add a command-line option, or allow configuring the sentence splitting in the prodigy.json (global or project-based). This might be a little more flexible, as it’d allow defining more parameters – like a minimum character count threshold. This would let you take care of the stream pre-processing yourself, while ensuring that rogue examples won’t throw off the beam search.

Consider the following config – here, sentences will be split, but only if the example text is longer than 2000 characters:

{
    "split_sents": true,
    "split_sents_threshold": 2000
}
  • split_sents (bool, default: true) – Enable or disable sentence splitting.
  • split_sents_threshold (int, default: false) – Minimum character count for sentence splitting, if splitting is enabled. If set to false, sentences are always split.
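That threshold behaviour could be implemented as a thin wrapper around whatever splitter is in use. A sketch only – the function name is hypothetical, and a naive splitter stands in for spaCy's sentence segmentation:

```python
def split_long_examples(stream, threshold=2000, split_fn=None):
    """Split examples only if their text exceeds `threshold` characters.

    `split_fn` is whatever sentence splitter you use (spaCy's parser,
    the sentencizer, or a custom rule). The default here is a naive
    stand-in that splits on ". ".
    """
    if split_fn is None:
        split_fn = lambda text: [s.strip() for s in text.split(". ") if s.strip()]
    for eg in stream:
        if len(eg["text"]) > threshold:
            for sent in split_fn(eg["text"]):
                # one task per sentence, preserving other fields
                yield {**eg, "text": sent}
        else:
            # short enough: pass the example through unchanged
            yield eg

stream = [{"text": "Short text."}, {"text": "One. Two. Three."}]
tasks = list(split_long_examples(stream, threshold=12))
```

With a low threshold like this, the first example passes through untouched while the second is split into one task per sentence – rogue long examples get segmented, everything else is left alone.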

Yes. Splitting on sentences should definitely be the default, but it should be easy to customize this.

One thing I’m finding with the project I’m currently using Prodigy for is that segmentation is a big challenge. In a lot of the academic literature you don’t even think about the segmentation. It’s just a tweet, or a sentence, or an IMDb review. The notion of segment is part of the definition of the problem. In the real-world task I’m working on, discovering the segments themselves is part of the work. That’s why it’s good for me to have an easy way to have Prodigy present entire documents. Not because I want to throw a long document up in the Prodigy interface, but because I need to preprocess my documents with custom segmentation code, and putting that flexibility into the recipe command line seems like a bad idea. It’s better if I just write a separate preprocessing step in Python.

Definitely! This is why spaCy usually does sentence segmentation as a post-process, once the syntactic parse is available. This lets it feed the whole input to the statistical models, so they can train from the raw text. Unfortunately, the need for beam search during training makes working on the raw text difficult if the text is long.

Adding an option to disable the rule-based sentence segmentation will be a nice change for short texts.

Right, though here when I say “segments” I don’t necessarily mean sentences. I mean something more like “the most salient context for the task at hand”. This is always smaller than an entire document (which can run up to twenty pages) but often longer than a sentence. For now, pattern matching to split things into paragraphs works pretty well, but ultimately this may need to be framed as an attention mechanism.

Good day,
My JSONL file, called Lease-7.jsonl, looks like this:

[screenshot of the JSONL file]
So when I run the following in the Git Bash command line:

prodigy ner.manual news_headlines en_core_web_sm Lease-7.jsonl --label "end, renew, payment, term, lessor, lessee"

then I see the following in the web UI:

[screenshot of the annotation interface]
I believe I want to view less text per annotation, so that the model doesn't become inaccurate by taking into account irrelevant context too far away from the entities I am labeling.
So, please, how do I set the amount of text displayed per annotation to be less?
I tried changing the setting split_sents_threshold in the prodigy.json file, of which I have a copy in my home directory and also in my local project's folder (is that superfluous?).

But changing it to 2000, for example, made no noticeable difference. Please let me know if you need more information.
Thanks for the great service. Keep up the great work!
Best wishes,
Yishai

Hi! The ner.manual recipe doesn't split sentences – that's why the split_sents_threshold has no effect here. There are two options to split your texts:

  1. Update the recipe in recipes/ner.py (or use this template to write your own custom version of ner.manual). The split_sentences preprocessor takes an nlp object that can split sentences (either with a parser or the sentencizer) and your stream. For example:
from prodigy.components.preprocess import split_sentences

# after your stream is loaded etc.
stream = split_sentences(nlp, stream)
  2. Preprocess your JSONL file in Python and use spaCy to split sentences. Then save it to a new file and use that in Prodigy. For example:
import spacy
import srsly  # to easily read/write JSONL etc.

nlp = spacy.load("en_core_web_sm")  # or whatever you need
examples = srsly.read_jsonl("./Lease-7.jsonl")
texts = (eg["text"] for eg in examples)

new_examples = []
for doc in nlp.pipe(texts):
    for sent in doc.sents:
        new_examples.append({"text": sent.text})
srsly.write_jsonl("./Lease-7-with-sentences.jsonl", new_examples)

Thank you for this reply! I am looking for recipes/ner.py in order to change it. Where should it be located?

Ah, sorry if my comment was unclear! You can find the location of your Prodigy installation in prodigy stats or by running this little one-liner:

python -c "import prodigy; print(prodigy.__file__)"

Just hacking it into the built-in recipe should be fine for testing, but if you like the solution, you might want to write your own custom recipe using this template.

I see you used doc.sents. For my purposes, that splits the text into sections that are a bit too small. Is there any chance you have an attribute called doc.paragraphs?
The reason is that I want to annotate and allow the model to incorporate the appropriate amount of surrounding context for each entity we wish to classify. Do you think 200 characters is a good amount for that? On the other hand, I don't want to determine it purely based on the number of characters, because that could cut sentences off in the middle.
Rather, I would want to present, let's say, any sentence that is at least 400 characters long.

We don't have a doc.paragraphs attribute, no. I think the best solution for you will be to split up your text as a preprocessing step. You can split it up however you like, and record an ID in the meta field that marks the original document, so you can reconstruct things later. The input to Prodigy just needs to be a JSONL file, so it's easy to process the data beforehand.
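For instance, the sentences could be merged greedily into segments of a minimum length, with the source document recorded in "meta". A rough sketch – all names here are hypothetical, and the sentence list would come from doc.sents or your own splitter:

```python
def chunk_sentences(sents, min_chars=400):
    """Greedily merge consecutive sentences until a chunk reaches
    `min_chars`, so no segment cuts a sentence in the middle and
    every segment (except possibly the last) meets the minimum length."""
    chunk = ""
    for sent in sents:
        chunk = (chunk + " " + sent).strip()
        if len(chunk) >= min_chars:
            yield chunk
            chunk = ""
    if chunk:
        # leftover sentences that never reached the threshold
        yield chunk

def make_tasks(doc_id, sents, min_chars=400):
    """Build Prodigy tasks, recording the source document in "meta"."""
    return [{"text": chunk, "meta": {"doc_id": doc_id}}
            for chunk in chunk_sentences(sents, min_chars)]

tasks = make_tasks("lease-7",
                   ["This is sentence one.", "Sentence two.", "And three."],
                   min_chars=30)
```

Writing the resulting tasks out with srsly.write_jsonl, as in the earlier example, gives you a file you can feed straight into ner.manual, and the doc_id in each task's meta lets you map segments back to their original documents later.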