Annotation strategy for varied PDF layouts

I am annotating PDFs to extract structured data. I extract the text using a custom loader and apply NER or span categorization. I started with PDFs that share the same layout as a proof of concept. I am getting good results by creating

  • annotations1 from pdflayout1 to train model1
  • annotations2 from pdflayout2 to train model2

But when I train a new model (model1&2) using both annotations1 and annotations2, model performance drops significantly. That suggests to me that I need to do more training with more pdflayout1 and pdflayout2 examples.

I am now starting a more concentrated experimentation phase, collecting a lot more annotations. But there are multiple varied pdflayouts, maybe 20 or more. Given the time-consuming nature of annotation (even with patterns and ner/span .correct and .teach), I want to devise a good strategy.

I favour annotating PDFs with the same layout at the same time, as annotation is quick with patterns. (I have tried annotating a mix of PDF layouts and annotation is more painful.)

Question 1: I am trying to decide on an efficient way to approach the annotation. If I end up with, say, 20 sets of annotations for 20 different pdflayouts, I need a strategy to train the model with the 20 datasets. I found you can train with multiple datasets, but is there a limit? Would 20 datasets be OK? What if it ends up being 40?

Question 2: I also considered annotating and training with one pdflayout first, then incrementally annotating/training with more pdflayouts, making use of correct and teach. But that is a big investment, and if I later decide I don't want to include some pdflayouts, I am potentially faced with restarting the annotation process. Is there a way to split out annotations?

Related to Q2: I saw this support post

which states: "When you add the samples of new data types (syllabi and curricula) it's probably best to add some data type identifier to the meta of each example so that you can easily do your experimentation"

Question 3: Can you explain (or point to docs) how to add a data type identifier to the meta of each example?

I know experimentation will be needed, but your best ideas on accumulating annotations, and on being able to use them flexibly as I realise I need to do something slightly different, would be appreciated.

Hi @alphie,

What you're observing is not uncommon. The drop in performance may be due to:

  • Conflicting patterns between layouts
  • Insufficient data to generalize across layouts
  • Overfitting to specific layout features

In this case, another option to consider would be an ensemble of layout-specific models, where each model is an "expert" in a given layout. At run time, you could then choose the prediction with the highest confidence, which should come from the right "expert". Alternatively, you could have a classifier determine the layout type and then apply the appropriate model.
However, if you have 20 layouts so different that they each likely need a separate model (it's worth confirming this empirically), that becomes burdensome.
Getting more data for each layout to make sure the model can generalize from all the patterns is definitely reasonable.

You might also consider a more modular approach where one model is responsible for extracting the relevant section from the different PDF layouts, and a separate NLP model, independent of the PDF structure, extracts the entities from that section.

Re Question 1
I think you would just merge or concatenate these 40 datasets and pass them as one dataset. The train recipe allows you to pass a comma-separated list of datasets.
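For example, with placeholder dataset names and output path, the command could look something like:

    prodigy train ./output --ner annotations_layout1,annotations_layout2,annotations_layout3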

Re Question 2
I think using correct and teach would make sense if your base model has been trained on a sample of all layouts. In other words, the process would be to annotate a little bit from each layout and then improve the model and dataset incrementally with teach. Otherwise, the model won't be able to give sensible suggestions.

As to meta information of each example: Since Prodigy examples are Python dictionaries, you can add the meta key to each example which will contain your custom data identifier.
For example in your custom loader, you could add a function that modifies the example like so:

from typing import Dict

def add_layout_id(example: Dict, layout_id: str) -> Dict:
    # Create the "meta" dict if it doesn't exist yet,
    # then store the layout identifier under the "layout" key
    meta = example.setdefault("meta", {})
    meta["layout"] = layout_id
    return example

Then, if you want to do error/performance analysis that takes the layout into account, it should be easier to filter out the right examples to look at.
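For instance, after exporting a dataset (e.g. with db-out), a minimal sketch of filtering by that identifier (the file name and layout value are placeholders):

import json

# Minimal sketch: filter exported annotations (e.g. from `prodigy db-out`)
# by the layout identifier stored in the meta. The file name and layout
# value below are placeholders.
with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

layout_1_examples = [
    eg for eg in examples if eg.get("meta", {}).get("layout") == "layout_1"
]
print(f"{len(layout_1_examples)} examples from layout_1")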

In summary:

  • Continue with the annotation process that you find most efficient, i.e. annotate each layout separately.
  • Aim for a balanced dataset across the different layouts to prevent bias towards any particular format.
  • Once you have a good baseline, it's worth putting the model in the loop to make the fine-tuning more efficient with the teach and correct recipes.
  • Finally, I'd explore how different those layouts really are. Maybe it's worth having a layout classifier that can separate them into 5-7 classes, and then you could try the ensemble approach.
  • Yet another alternative would be one model for extracting the relevant sections and a separate NLP model that is agnostic to the PDF layout and could be trained on the entire text dataset.

In any case, you will need a good sample of annotations for each layout, and for that I would recommend annotating each layout separately with an identifier as discussed above, so that you can easily control the representativeness of your dataset and see how each layout performs individually and in the collective model.


Thank you so much. This was just the sort of advice I was after.

Annotating the different layouts is going well and the model is reasonable at finding entities.

However, my text contains a lot of numbers, e.g. site number, transect distances, sample depths, sample number, test number, and survey details such as easting, northing, level. Whilst some are distinctive, a lot are similar and sometimes cross over (sometimes because users did different things!!!). As a human annotator I can tell which numbers belong to which label, partly by their position in the annotation text and partly because I can look at the PDF!

I am thinking about trialling your suggestion of having a model for extracting relevant sections. I am thinking I could do a span model to define the groups, e.g. site information, sample details, tests, species, habitat, and then an NER model to define the entities, e.g. for samples: soil distance, soil depth, vegetation distance, faeces, pond. But I am unsure of the workflow.

Is the workflow something like:
• do a span model to identify text for site info, samples, tests etc.
• extract the spans into separate text files for site info, samples, tests etc.
• annotate the new text files and train separate NER models for samples, tests etc.

I am slightly concerned that this takes the numbers out of context, away from nearby words like sample, test, species etc. that may be helping the model (and also the annotator!).
It is also a lot of models and a lot of steps.

Or could I somehow use two models together: one to find the general area for (say) samples, another to find the specific labels (sampleNumber, sampleDistance, sampleDepth, sampleMedia, sampleComments)?

As an aside, I am using a PDF extraction technique that collects bounding boxes. I do not know if the model takes account of the bounding boxes by default, or if there is something I need to do to leverage the bounding box information. This information may also help the model know the relative positioning of the information.

Many thanks as ever for your very useful ideas.

Hi @alphie,

Going from general to specific, and organizing the pipeline in a way that makes the models' job as simple as possible, is usually the best strategy.

In your case, I think that means 1) extracting the relevant sections and 2) extracting the entities specific to those sections. Otherwise, it would be harder for the NER model to differentiate between numbers that look very similar, because the decision would be much more complex. It could still be possible, though, if the contexts are different enough, so it's worth trying NER on its own as well.

That's what I meant above. The question is what modelling strategy you will employ for finding the relevant areas. The options are: 1) a computer vision solution, i.e. finding the relevant areas via bounding boxes and then applying OCR to those regions, or 2) converting the entire PDF to text, splitting it into sections using text-based layout rules, and then training a text classification model to label each segment.
Text classification is likely to work better than span categorization if the segments are anything like entire paragraphs with headers etc. Bear in mind that the span categorizer generates hypotheses for spans of certain lengths, so if the annotated spans are really long, the number of hypotheses gets so high that it might become a performance issue (because of memory constraints).
You would need to experiment to see which works better, but combining two models in the pipeline is a very good strategy that is easy to implement with spaCy (see the sketch after the textcat + NER workflow below).
If you go for textcat + NER the annotation workflow would be:

  1. Preprocess the PDFs: convert to text & extract chunks/paragraphs based on layout features
  2. Use textcat.manual to create dataset for training a spaCy text classifier
  3. Use ner.manual to annotate the spans for entities for each textcat class (you would end up with several NER datasets). Since you'd be annotating NER in the context of entire paragraphs/chunks the local context should be preserved.
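At inference time, the two trained components could be combined roughly like this (a sketch; the model paths and section labels are placeholders for whatever you end up training):

import spacy

# Rough sketch: route each text chunk through a section classifier first,
# then apply the NER model trained for that section. The paths and section
# labels below are placeholders.
section_clf = spacy.load("./textcat_model")
ner_models = {
    "SAMPLES": spacy.load("./ner_samples_model"),
    "TESTS": spacy.load("./ner_tests_model"),
}

def extract(chunk_text: str):
    doc = section_clf(chunk_text)
    section = max(doc.cats, key=doc.cats.get)  # highest-scoring section label
    ner = ner_models.get(section)
    ents = [(ent.text, ent.label_) for ent in ner(chunk_text).ents] if ner else []
    return section, ents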

If you go with a computer vision model for region extraction the steps would be:

  1. Annotate bounding boxes for training the region classifier.
  2. Pre-process bounding boxes by converting them to text.
  3. Annotate the texts from bounding boxes with ner.manual as in 3) above.

If you end up splitting PDFs, make sure to store a reference to the document each chunk comes from in the meta, as we've discussed before. The splitting into train/dev/test should be done at the document level, to avoid training data leaking into the test set and to make the splitting future-proof, e.g. in case you decide to change the splitting strategy later.
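A minimal sketch of such a document-level split, assuming each example stores a reference to its source document in, say, meta["source"] (the file name and split ratio are placeholders):

import json
import random
from collections import defaultdict

# Group examples by source document so that all chunks from one PDF end up
# in the same split. The file name and the 80/20 ratio are placeholders.
with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

by_doc = defaultdict(list)
for eg in examples:
    by_doc[eg["meta"]["source"]].append(eg)

docs = list(by_doc)
random.Random(0).shuffle(docs)
n_train = int(len(docs) * 0.8)
train = [eg for doc in docs[:n_train] for eg in by_doc[doc]]
dev = [eg for doc in docs[n_train:] for eg in by_doc[doc]]
print(f"{len(train)} training examples, {len(dev)} dev examples")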

This way your pipeline would be modular and the models would have smaller, less complex tasks to solve. As discussed before, you could have a separate NER model per section, or experiment with joint learning; but for joint learning you'd need to implement the network architecture yourself, as the defaults in Prodigy and spaCy prefer the separation of tasks, since it is easier to evaluate and improve independent components.

Since you'll need NER annotations, I'd probably start with NER and see how it works. If it's not satisfactory, I would add a textcat component in front to see if more specific NER models are an improvement.

Also, just for reference, here's a relevant discussion on combining textcat with NER which may be of interest: Combining NER with text classification - #6 by honnibal

Thank you very much for this. I have now developed my custom loader to extract the different parts of the PDF based on layout features, and the text is now coming out much more consistently, top to bottom, left to right. (And with a bit of luck I may be able to just do NER and not need the textcat route, but we will see.)

As a result, lots of my entities are now extracted close together in the text rather than being scattered through it (giving the model the context of nearby words to help it). However, there are still some entities that end up in funny places in the text that don't reflect their location in the PDF, and the helpful context words may be too far away to be 'seen' by the model.

I am wondering whether information on the bounding boxes / coordinates of the various bits of text could be helpful to the model, e.g. "label_X is generally located in the top left of the PDF."

My question is: does the model automatically see the x0, y0, x1, y1 information and take it into account? Or do I need to do something to help the model know that the bbox/coords are important features?

To give you a bit more context:

My custom loader extracts the bounding box coordinates and the text for each block into the variable "text_data", which is then added to the output as "blocks":

# code to read the stream of PDFs

for page_num in range(len(doc)):
    page = doc.load_page(page_num)  # Load each page
    blocks = page.get_text("blocks")  # Get text blocks
    text_data = []  # List to store text data with coordinates
    full_text = ""  # String to store concatenated text
    for b in blocks:
        # Extract the bounding box coordinates and text
        x0, y0, x1, y1, text = b[:5]
        text_data.append({"x0": x0, "y0": y0, "x1": x1, "y1": y1, "text": text})
        full_text += text + " "  # Concatenate text blocks
    # Yield a dictionary for each page with 'text', 'meta' and 'blocks'
    yield {
        "text": full_text.strip(),
        "meta": {"source": str(pdf_file), "page_number": page_num, "layout_id": layout_id},
        "blocks": text_data,
    }

the resulting jsonl used for annotation looks like:

"blocks": [{"x0": 142.492, "y0": 794.2093984, "x1": 314.1191055999999, "y1": 808.5211984, "text": "great crested newt"}

After annotation, "blocks" is in the JSONL, but it is not obvious to me how the model 'sees' x0 etc.

The labels in "spans" include references to tokens (and I think characters?) but not the coordinates.

  "spans": [
    {
      "start": 28,
      "end": 32,
      "token_start": 7,
      "token_end": 7,
      "label": "SSSI"
    },

So to recap:
does the model automatically take account of the coordinates or do I need to do something to make that happen?

Thank you once again for this very helpful forum.

Hiya

It was a busy weekend with lots of questions, so I just wanted to check whether you had missed that I'd added a new question to this thread?

Not being impatient, just checking. Hopefully, you know I love your forum and think you do a great job!

Hi @alphie!

It's great to hear that you've managed to extract text consistently from the different layouts.

Regarding your question about coordinates and NER:

By default, the spaCy NER model does not automatically take into account any information other than the text itself.
To leverage the spatial information, you would need to explicitly incorporate it into your training data and into the model input in production. You would need to find the best way to embed the spatial information and implement a custom model architecture capable of processing it as a feature.

One possible experiment you could do with the default spaCy NER transformer architecture is to express the coordinates in natural language (maybe even using descriptors such as bottom-left etc.), concatenate them with the span text, and try using that. I'm not really sure whether that would make a difference, plus I imagine it would be hard to obtain such spatial descriptors in production, i.e. on unseen test data. Note that it needs to be a transformer, because the CNN doesn't have a big enough context window to take such additional information into account.
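Just to make that concrete, a rough sketch of what such a preprocessing step could look like (the coarse descriptors, thresholds and default page size are assumptions, and the coordinates are assumed to have a top-left origin, as in PyMuPDF):

def bbox_to_descriptor(x0: float, y0: float, x1: float, y1: float,
                       page_width: float = 595.0, page_height: float = 842.0) -> str:
    # Map the centre of the bounding box to a coarse position such as "top left".
    # Default page size is A4 in points; coordinates are assumed to have a
    # top-left origin with y increasing downwards (as in PyMuPDF).
    cx = (x0 + x1) / 2 / page_width
    cy = (y0 + y1) / 2 / page_height
    vertical = "top" if cy < 0.33 else "middle" if cy < 0.66 else "bottom"
    horizontal = "left" if cx < 0.33 else "centre" if cx < 0.66 else "right"
    return f"{vertical} {horizontal}"

# e.g. prepend the descriptor to a block's text before annotation/training:
# desc = bbox_to_descriptor(block["x0"], block["y0"], block["x1"], block["y1"])
# block["text"] = f"[{desc}] {block['text']}"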

Ideally though, you would tweak your text extraction function to make sure the entities are placed within their context, or try to extract them with rules, as adding the spatial information may increase model complexity and will likely require more training data overall.
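If some of those stray numbers follow predictable formats, a rule-based component might be enough for them. A minimal sketch using spaCy's EntityRuler (the EASTING label and the token pattern are made-up placeholders, not your schema):

import spacy

# Minimal sketch: match a keyword such as "Easting" followed by a number.
# The label and pattern are illustrative placeholders only.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "EASTING",
     "pattern": [{"LOWER": "easting"}, {"IS_PUNCT": True, "OP": "?"}, {"IS_DIGIT": True}]},
])

doc = nlp("Easting: 654321")
print([(ent.text, ent.label_) for ent in doc.ents])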

Thank you Magdaaniol - I now know it will be quite difficult to do what I imagined (learning all the time!). But your answer has given me another idea: the transformer might be worth trying anyway for the larger context. Thanks also for the reminder that more complexity means more training data, something I hadn't fully appreciated.