Annotation strategy for varied pdf layouts

Hi @alphie,

Going from general to specific and organizing the pipeline in a way that makes the models' job as simple as possible is usually the best strategy.

In your case, I think that means 1) extracting the relevant sections and 2) extracting the entities specific to those sections. Otherwise it would be harder for the NER model to differentiate between numbers that look very similar, because the decision would be much more complex. That said, if the contexts are different enough, NER on its own could still work, so it's worth trying as well.

That's what I meant above. The question is what modelling strategy you will employ for finding the relevant areas. The options are: 1) a computer vision solution, i.e. finding the relevant areas via bounding boxes and then applying OCR to these regions, or 2) converting the entire PDF to text, splitting it into sections using text-based layout rules and then training a text classification model to label each segment.
Text classification is likely to work better than span categorization if the segments are anything like entire paragraphs with headers etc. Bear in mind that the span categorizer generates hypotheses for spans of certain lengths, so if the annotated spans are really long, the number of hypotheses can be so high that it becomes a performance issue (because of memory constraints).
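To make the hypothesis growth concrete, here's a quick back-of-the-envelope calculation for an n-gram span suggester (the token counts and span lengths are just illustrative numbers, not your data):

```python
def ngram_candidates(n_tokens: int, max_len: int) -> int:
    """Number of span hypotheses an n-gram suggester generates for a
    document of n_tokens when it proposes all spans of length 1..max_len."""
    return sum(max(n_tokens - length + 1, 0) for length in range(1, max_len + 1))

# A 500-token document with short, entity-like spans (lengths 1-5):
print(ngram_candidates(500, 5))    # -> 2490
# The same document when annotated spans are paragraph-sized (up to 100 tokens):
print(ngram_candidates(500, 100))  # -> 45050
```

The candidate count scales roughly with `n_tokens * max_len`, which is why paragraph-length spans are better handled as text classification over pre-split chunks.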
You would need to experiment to see which works better, but combining two models in the pipeline is a very good strategy that is easy to implement with spaCy.
If you go for textcat + NER the annotation workflow would be:

  1. Preprocess the PDFs: convert to text & extract chunks/paragraphs based on layout features
  2. Use textcat.manual to create a dataset for training a spaCy text classifier
  3. Use ner.manual to annotate the spans for entities for each textcat class (you would end up with several NER datasets). Since you'd be annotating NER in the context of entire paragraphs/chunks, the local context should be preserved.
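A minimal version of step 1 could look like this. The splitting rule here (blank lines between paragraphs) is just a placeholder for your actual layout heuristics, and the output dicts follow the `{"text": ..., "meta": ...}` shape that Prodigy's manual recipes consume:

```python
import re

def split_into_chunks(text: str, doc_id: str):
    """Split extracted PDF text into paragraph chunks, keeping a
    reference to the source document in the meta of each task."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        {"text": p, "meta": {"doc_id": doc_id, "chunk": i}}
        for i, p in enumerate(paragraphs)
    ]

tasks = split_into_chunks(
    "Header\n\nFirst paragraph.\n\n\nSecond paragraph.", "report_001.pdf"
)
# Each task can be written out as one line of JSONL and fed to textcat.manual.
```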

If you go with a computer vision model for region extraction the steps would be:

  1. Annotate bounding boxes for training the region classifier.
  2. Pre-process bounding boxes by converting them to text.
  3. Annotate the texts from bounding boxes with ner.manual as in 3) above.
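For step 2, most of the work is coordinate arithmetic before you hand the crop to OCR. A sketch, assuming the boxes are stored as relative x/y/width/height (one common convention; your annotation tool may differ):

```python
def box_to_pixels(box: dict, page_width: int, page_height: int):
    """Convert a relative bounding box to absolute pixel coordinates
    (left, top, right, bottom) for cropping the page image before OCR."""
    left = round(box["x"] * page_width)
    top = round(box["y"] * page_height)
    right = round((box["x"] + box["width"]) * page_width)
    bottom = round((box["y"] + box["height"]) * page_height)
    return left, top, right, bottom

# Then, e.g. with Pillow and pytesseract (not run here):
# crop = page_image.crop(box_to_pixels(box, *page_image.size))
# text = pytesseract.image_to_string(crop)
```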

If you end up splitting PDFs, make sure to store a reference to the source document of each chunk in the meta, as we've discussed before. The split into train/dev/test should be done at the document level to avoid training data leaking into the test set; it also makes the split future-proof, e.g. in case you decide to change the splitting strategy later.
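A document-level split can be made deterministic by hashing the document id, so all chunks of one document always land in the same split even if you re-run preprocessing. The 80/10/10 ratios below are just an example:

```python
import hashlib

def assign_split(doc_id: str) -> str:
    """Deterministically assign a document to train/dev/test based on a
    hash of its id, so every chunk of one document shares the same split."""
    bucket = int(hashlib.sha1(doc_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "dev"
    return "test"

# Every chunk inherits its document's split:
# chunk["meta"]["split"] = assign_split(chunk["meta"]["doc_id"])
```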

This way your pipeline would be modular and the models would have smaller, less complex tasks to solve. As discussed before, you could have a separate NER model per section, or experiment with joint learning; for joint learning, though, you'd need to implement the network architecture yourself, as the defaults in Prodigy and spaCy prefer the separation of tasks because it's easier to evaluate and improve independent components.
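At inference time, the modular version amounts to running the classifier and dispatching to the section-specific NER model. A routing sketch with stand-in names (the labels, threshold and model handles are placeholders for your own trained pipelines):

```python
def route_to_ner(cats: dict, ner_models: dict, threshold: float = 0.5):
    """Pick the NER model for the highest-scoring textcat label,
    or None if no label is confident enough."""
    label, score = max(cats.items(), key=lambda kv: kv[1])
    if score < threshold:
        return None
    return ner_models.get(label)

# e.g. cats = textcat_nlp(chunk_text).cats
#      ner = route_to_ner(cats, {"invoice": invoice_ner, "contract": contract_ner})
#      entities = ner(chunk_text).ents if ner is not None else []
```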

Since you'll need NER annotations, I'd probably start with NER and see how it works. If it's not satisfactory, I would add a textcat component in front to see if more specific NER models are an improvement.

Also, just for reference, here's a relevant discussion on combining textcat with NER that may be of interest: Combining NER with text classification - #6 by honnibal