Sentence Segmentation and Annotations


I have several questions about segmentation into sentences and how to annotate these sentences.

I work with French legal documents and their layout is very specific.

I use Prodigy as follows:

  1. Load my document
  2. Pass it through the French pipeline (small)
  3. Iterate through doc.sents and write each sent.text to a JSONL file for annotation
  4. Annotate with Prodigy
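Step 3 above can be sketched like this. Note this is a minimal stand-in: a blank French pipeline with a rule-based sentencizer replaces fr_core_news_sm so the snippet runs without a model download, and the file name sentences.jsonl is just an example.

```python
import json

import spacy

# Stand-in for the real pipeline: swap in spacy.load("fr_core_news_sm")
# to reproduce the actual workflow.
nlp = spacy.blank("fr")
nlp.add_pipe("sentencizer")

def sents_to_jsonl_lines(text):
    """Build one {"text": ...} JSON line per sentence, ready for Prodigy."""
    doc = nlp(text)
    return [json.dumps({"text": sent.text}, ensure_ascii=False)
            for sent in doc.sents]

lines = sents_to_jsonl_lines(
    "Le 24 juillet 2002 a comparu John DOE-WALKER. Il demande réparation."
)
with open("sentences.jsonl", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```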

When I analyze the segmentation that spaCy produces (doc.sents), I sometimes end up with sentences consisting of a single word. This greatly complicates the annotation: sometimes a sentence is just one word, and sometimes it is missing a piece that is important for the annotation.

The best example that shows this is when I have a last name composed of two different names, ex:
"John DOE-WALKER asks"
Result :
sent 1: "John DOE-"
sent 2: "WALKER"

Or :
"On July 24, 2002 appeared"
Result :
sent 1: "On"
sent 2: "July 24, 2002 appeared"

The problem here is that I cannot correctly annotate John DOE-WALKER as a PER; instead I have to produce two different annotations: "John DOE-" as a person and "WALKER" too. Or is it a tokenizer problem? I don't know if it affects the quality of the NER.

My first question: given that my objective is to improve the NER module, do I need to work on better segmentation?

If so, what is the best way to do it? Maybe training the parser on my specific data? Or writing a custom sentence segmenter? I tried the Blackstone sentence segmenter, which seems efficient, but with each new document new cases arise, so this option seems very time-consuming.

I had another question: is it possible to annotate entire documents in Prodigy? I imagine so, but do you have any advice on how to do it? I have the feeling that my task is better suited to annotating whole documents.

Thanks a lot !

Hi! You don't have to use sentences as the minimum unit you're annotating. It often makes sense for general-purpose text because it's a simple logical unit – but if sentence segmentation is tricky because of the nature of the specific texts, you could also choose different units. For instance, maybe you could try preprocessing the raw text and splitting it on \n\n. That should (hopefully) give you more or less logical paragraphs that make sense as single annotation units.
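That preprocessing step could look something like this (a minimal sketch in plain Python; the regex treats any run of blank lines as a paragraph break, and the example text is made up):

```python
import json
import re

def text_to_paragraph_tasks(raw_text):
    """Split raw text on blank lines and build one Prodigy task per paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", raw_text) if p.strip()]
    return [{"text": p} for p in paragraphs]

raw = "Par ces motifs, le tribunal statue.\n\n\nLe 24 juillet 2002 a comparu John DOE-WALKER."
tasks = text_to_paragraph_tasks(raw)

# One JSON object per line, as Prodigy expects in a JSONL source file.
jsonl = "\n".join(json.dumps(t, ensure_ascii=False) for t in tasks)
```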

If you do want sentence segmentation, you probably want to add a rule-based component before the parser so you're not relying on the parser for setting boundaries (the base model was trained on the Universal Dependencies data so it's not surprising the parser struggles with legal text). We're also working on a new trainable sentence segmentation component that you can train on custom data, which should be very useful for cases like this.

One thing to consider, though, if your goal is to train an entity recognizer: spaCy's entity recognizer implementation includes a constraint that prevents the model from predicting entities that span sentence boundaries. This is typically pretty useful (just like preventing it from predicting whitespace-only entities), because it limits the number of possible analyses and can lead to better results. But in your case, it means that you want to make sure that you either add the entity recognizer before the parser when you train, or remove the parser altogether.
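Both options can be sketched like this, again on a blank pipeline so it runs without a model (with a real model you'd reorder or remove the components on the loaded nlp object):

```python
import spacy

nlp = spacy.blank("fr")
nlp.add_pipe("parser")
nlp.add_pipe("ner", before="parser")         # option 1: NER runs before the parser
order_with_parser = list(nlp.pipe_names)     # ['ner', 'parser']

nlp.remove_pipe("parser")                    # option 2: drop the parser entirely
order_without_parser = list(nlp.pipe_names)  # ['ner']
```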

Yes, check out this section: Ultimately, it comes down to defining the best possible logical unit that's easy to annotate and also creates a realistic setting to collect annotations that the model can learn from (and allows you to spot potential problems early on).

Thank you so much !