How to split the paragraph into sentences after annotation

I imported a paragraph into Prodigy and used ner.manual to annotate it. I didn't split the paragraph first because I need the context to help me identify entities. Now I want to split the annotated paragraph into several independent sentences that keep the labels I annotated, and remove the sentences without labels. In other words, I want to split one long annotated sample into several short annotated samples.


Hi Wentao,

what recipe are you using? As explained here, the ner.teach and ner.correct recipes both split sentences, but both also offer an --unsegmented flag to disable this behavior.
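For reference, a typical invocation might look something like this (the dataset name, label set, and source file below are placeholders, not from the original post):

```shell
# Sentence splitting is on by default for ner.correct / ner.teach;
# pass --unsegmented to keep the original long texts intact.
prodigy ner.correct my_dataset en_core_web_sm ./paragraphs.jsonl --label PERSON,ORG --unsegmented
```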

I'm currently dealing with a very similar situation.

Indeed, as described in the official Prodigy docs on Named Entity Recognition,

For NER annotation, there’s often no benefit in annotating long documents at once, especially if you’re planning on training a model on the data. ... NER model implementations also typically use a narrow contextual window of a few tokens on either side. If a human annotator can’t make a decision based on the local context, the model will struggle to learn from the data.

But also, just like @wentao-uw wrote

I need the context to help me to identify entities.

And I don't think they are contradictory: sometimes a domain expert wants to see the whole context, then zoom into a few single sentences and annotate a few phrases in them. Saved as such, this becomes a single data item in Prodigy. But if there were an automated way to extract those sentences, the result would be more realistic and useful: a few sentences, each with a few annotations. With, say, 10 long text items, this is the difference between 10 data items for training and (assuming each had at least 5 annotated sentences) 50 data items for training, each a single sentence with annotations and the narrow contextual window the Prodigy documentation suggests.

In other words, wouldn't the following be useful?

  • For the domain expert wishing so: show the whole long text to help the expert decide.
  • The domain expert annotates the text (e.g. with entity types).
  • Extract the annotated sentences, so that training a custom NER model using narrow contextual windows becomes more feasible (compared to training directly on the long pieces of annotated text).
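The extraction step above can be sketched in plain Python, without Prodigy itself. This is a minimal sketch under two assumptions of mine: sentence boundaries are found naively (splitting on ". ", where a real pipeline would use spaCy), and tasks follow Prodigy's {"text": ..., "spans": [...]} convention. Spans are re-based to the start of their sentence, and sentences without spans are dropped.

```python
# Split one long annotated task into per-sentence tasks, keeping only
# sentences that contain at least one span. Minimal sketch: naive
# sentence splitting on ". " instead of a proper sentencizer.

def split_annotated_task(task):
    text = task["text"]
    out = []
    start = 0
    while start < len(text):
        end = text.find(". ", start)
        end = len(text) if end == -1 else end + 1  # include the period
        sent = text[start:end].strip()
        offset = text.index(sent, start)
        # keep spans that fall entirely inside this sentence, re-based to it
        spans = [
            {"start": s["start"] - offset, "end": s["end"] - offset, "label": s["label"]}
            for s in task["spans"]
            if s["start"] >= offset and s["end"] <= offset + len(sent)
        ]
        if spans:
            out.append({"text": sent, "spans": spans})
        start = end + 1
    return out

task = {
    "text": "Apple opened a store. It rained. Tim Cook spoke.",
    "spans": [{"start": 0, "end": 5, "label": "ORG"},
              {"start": 33, "end": 41, "label": "PERSON"}],
}
print(split_annotated_task(task))
# the middle sentence has no spans, so only two tasks come back
```

Here one 3-sentence item becomes two single-sentence training items, which is exactly the 10-items-to-50-items multiplication described above.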

As Emre said, this is exactly what I want.
My solution is:

  1. use ner.manual to annotate the long text
  2. export the annotations as a JSONL file
  3. write a Python script to convert the JSONL format to CSV format. In the CSV file, each record has only one sentence and one span. If a sentence has multiple spans, it maps to multiple rows in the CSV file.
  4. the CSV schema contains text, start index, end index, entity name, and label. These are needed for converting the CSV back to JSONL.
  5. convert the CSV to a list of tasks: task = {"text": your text here, "spans": [{"start": start index, "end": end index, "label": label name}]}
  6. use split_sentences to split the text and only keep the split sentences that contain a span.
from prodigy.components.preprocess import split_sentences
from prodigy.util import write_jsonl
import spacy

nlp = spacy.load("en_core_web_sm")
stream = split_sentences(nlp, list_of_tasks, min_length=30)
lst = []
for s in stream:
    # keep only the split sentences that still carry at least one span
    if len(s["spans"]) != 0:
        lst.append(s)
write_jsonl("fileName.jsonl", lst)
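Steps 3-5 (regrouping the one-span-per-row CSV back into tasks) could be sketched like this. The column names ("text", "start", "end", "label") are my assumption for illustration, matching the schema in step 4; rows that share the same text are merged into one task with multiple spans.

```python
import csv
from collections import OrderedDict

def csv_to_tasks(path):
    """Regroup a one-span-per-row CSV back into Prodigy-style tasks.

    Assumes columns named text, start, end, label (an illustrative
    schema, not necessarily the original poster's exact one).
    """
    tasks = OrderedDict()  # text -> task, preserving file order
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            task = tasks.setdefault(row["text"], {"text": row["text"], "spans": []})
            task["spans"].append({
                "start": int(row["start"]),
                "end": int(row["end"]),
                "label": row["label"],
            })
    return list(tasks.values())
```

The returned list can then be fed straight into split_sentences as in the snippet above.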

I expected this to improve my model's performance, but it didn't. Now I have more short annotated samples, but the model performs worse than the one trained on the long text samples... :frowning_face:
