First of all, sorry for any confusion I might have caused, and thank you again for your detailed answers and suggestions! Without this support forum I would have been entirely lost.
What I find difficult about Prodigy is that I didn't really know how to get the result I'm using it for, namely a model that predicts disease and medication entities in user-generated content. Watching your videos and reading the documentation, everything seemed so easy. But when handling my own project, I was (and still am a bit) unsure about the workflow and which recipes to use when. That's why I tried ner.teach, ner.manual, ner.make-gold and ner.batch-train more or less at random.
I know it's difficult to generalize across all projects, but I think it would have helped me a lot to have some kind of guideline covering, for example:
- When should I use which recipe, and how can I make the most of using them in turn?
- How many annotations are advisable during each run of a recipe?
I did that. In general, pre-processing my data definitely helped a lot to improve my model's suggestions, even though ner.teach still seems to ignore my patterns file and suggests random words.
Anyway, here is what I've been doing. After starting from a blank model as you suggested, and after some 3000 annotations with ner.teach and ner.manual that didn't seem to help my model, I focussed only on the medication entity and, as explained in Issue 638 Annotation strategy for gold-standard data - #4 by idealley, used ner.make-gold (500 annotations) and ner.batch-train in turn.
Since my model's suggestions have improved, I'm now going to start properly on the disease entity.
Andy suggested to just use the same database and overwrite my model. Will that really work? Considering the problems I had at the beginning with my medication model, I fear that the combined model is going to perform badly (I think this is what is meant by the catastrophic forgetting problem).
Would it make more sense instead to use a separate dataset to train this new entity in its own disease model and combine the two models at the end (if that even works)?