I am annotating PDFs to extract structured data. I extract the text using a custom loader and then apply NER or span annotation. As a proof of concept, I started with PDFs that all share the same layout. I am getting good results by creating:
- Annotations1 from pdflayout1 to train model1
- Annotations2 from pdflayout2 to train model2
But when I train a new combined model (model1&2) using both annotations1 and annotations2, model performance drops significantly. That suggests to me that I need more training examples for both pdflayout1 and pdflayout2.
I am now starting a more concentrated experimentation phase in which I will collect a lot more annotations. But there are many varied pdflayouts, maybe 20 or more. Given how time-consuming annotation is (even with patterns and ner/span.correct/teach), I want to devise a good strategy.
I favour annotating PDFs with the same layout together, as annotation is quick with patterns. (I have tried annotating a mix of PDF layouts, and annotation is more painful.)
Question 1: I am trying to decide on an efficient way to approach the annotation. If I end up with, say, 20 sets of annotations for 20 different pdflayouts, I need a strategy for training the model on all 20 datasets. I found that you can train with multiple datasets, but is there a limit? Would 20 datasets be OK? What if it ends up being 40?
Question 2: I also considered annotating and training with just one pdflayout, then incrementally annotating/training with more pdflayouts, making use of correct and teach. But that is a big investment, and if I later decide to exclude some pdflayouts, I am potentially faced with restarting the annotation process. Is there a way to split out annotations?
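To make Question 2 concrete, this is roughly the kind of splitting I have in mind, as a minimal sketch only. It assumes the annotations have been exported as a list of dicts and that each one carries a layout label under `meta`; the `layout` key and the helper names `split_by_layout` and `exclude_layouts` are my own hypothetical names, not something I found in the docs.

```python
from collections import defaultdict


def split_by_layout(examples, key="layout"):
    """Group annotated examples by the layout label stored in their meta."""
    groups = defaultdict(list)
    for eg in examples:
        # Examples without a meta dict or without the key go to "unknown".
        layout = (eg.get("meta") or {}).get(key, "unknown")
        groups[layout].append(eg)
    return dict(groups)


def exclude_layouts(examples, excluded, key="layout"):
    """Keep only the examples whose layout label is not in `excluded`."""
    excluded = set(excluded)
    return [
        eg for eg in examples
        if (eg.get("meta") or {}).get(key, "unknown") not in excluded
    ]


# Hypothetical exported annotations (texts and labels are made up).
annotations = [
    {"text": "Invoice number: 12345", "meta": {"layout": "pdflayout1"}},
    {"text": "Total due: 99.00", "meta": {"layout": "pdflayout2"}},
    {"text": "Customer: ACME", "meta": {"layout": "pdflayout1"}},
    {"text": "No label on this one"},
]

by_layout = split_by_layout(annotations)
without_layout2 = exclude_layouts(annotations, ["pdflayout2"])
```

With something like this, dropping a pdflayout later would just mean filtering it out before training instead of re-annotating.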
Related to Q2: I saw this support post
which states: "When you add the samples of new data types (syllabi and curricula) it's probably best to add some data type identifier to the meta of each example so that you can easily do your experimentation."
Question 3: Can you explain, or point me to the docs on, how to add a data type identifier to the meta of each example?
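My guess is that it would look something like the sketch below, applied in my custom loader before the examples are sent for annotation. It assumes each example is a dict with a `"text"` key and an optional `"meta"` dict, written out as JSONL; the `layout` key and the `tag_examples` helper are names I made up for illustration, so please correct me if the recommended approach is different.

```python
import json


def tag_examples(examples, layout_id):
    """Return copies of the examples with a layout identifier added to meta."""
    tagged = []
    for eg in examples:
        eg = dict(eg)  # shallow copy so the input example is not modified
        meta = dict(eg.get("meta") or {})  # keep any existing meta fields
        meta["layout"] = layout_id  # hypothetical key name
        eg["meta"] = meta
        tagged.append(eg)
    return tagged


def to_jsonl(examples):
    """Serialize examples as JSONL, one JSON object per line."""
    return "\n".join(json.dumps(eg) for eg in examples)


# Hypothetical examples as my loader might produce them.
examples = [
    {"text": "Invoice number: 12345"},
    {"text": "Total due: 99.00", "meta": {"page": 2}},
]

tagged = tag_examples(examples, "pdflayout1")
print(to_jsonl(tagged))
```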
I know experimentation will be needed, but I would appreciate your best ideas on how to accumulate annotations and still be able to use them flexibly when I realise I need to do something slightly different.