Sorry for the many questions - I hope I'm not wasting too much of your time. I couldn't agree more with this user's comment the other day.
For the sake of simplicity, let's say I want to make an entity SEK_AMOUNT that captures just the 10 from the expression SEK 10. I'd like to teach a NER model to do this - it's a toy example.
Using patterns, I can easily capture SEK 10 with only a few false positives, whereas capturing just 10 is not possible without a lot of false positives. How do you propose I proceed?
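For illustration, here's the kind of pattern I mean (a sketch in spaCy v2's EntityRuler token-pattern syntax, with the label from my toy example):

```python
# Token patterns in spaCy EntityRuler syntax (a sketch, spaCy v2 style).
# Matching "SEK" followed by a number-like token captures "SEK 10" fairly
# safely; a pattern of just [{"LIKE_NUM": True}] would fire on every number.
patterns = [
    {"label": "SEK_AMOUNT", "pattern": [{"TEXT": "SEK"}, {"LIKE_NUM": True}]},
]
print(patterns[0]["label"])
```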
In a real-world example, I have a custom component that uses the EntityRuler. It creates entities, but with false positives. Still, it's a good starting point for collecting entities from scratch, using the existing entities plus some logic around them - but the entities from my component should NOT be saved as entities. Should I write my own recipe for this? Probably something close to ner.match?
Off-topic: when do you announce spaCyIRL for 2020? Hopefully you'll continue the great success from this year!
I think a custom recipe would work well for your problem. You could recognise SEK 10 and then trim the entity down with a rule afterwards. Alternatively, you could have the model recognise the whole phrase and then only use the numeric part in your application.
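As a sketch of the trimming idea (plain Python with a hypothetical helper name - in practice you'd apply this to the text of the matched Span):

```python
import re

def trim_amount(span_text):
    """Extract the numeric part from a matched span like 'SEK 10'."""
    match = re.search(r"\d+(?:[.,]\d+)?", span_text)
    return match.group(0) if match else None

print(trim_amount("SEK 10"))     # -> 10
print(trim_amount("SEK 10.50"))  # -> 10.50
```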
Using the patterns to train a model - probably with a custom recipe, so you have better control and can customise things - seems like a good approach.
I did around 200 annotations and got a model with 40% accuracy (I just wanted to sanity-check the model) using en_vectors_web_lg. Then I tried ner.teach with the new model, but it suggested almost every token as an entity, which puzzles me. I then rejected a whole lot, so I now have 265 accepted and 1176 rejected in total. When I try to run ner.batch-train again, I get the following error:
```
ValueError: [E103] Trying to set conflicting doc.ents: '(166, 167, '!M_AMT')' and '(154, 167, '!M_AMT')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
```
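For reference, overlapping spans like these can be resolved by keeping only the longest span, which is roughly what spacy.util.filter_spans does - a plain-Python sketch using the offsets from the error:

```python
def filter_overlapping(spans):
    """Keep only non-overlapping (start, end, label) spans, preferring
    longer spans (roughly what spacy.util.filter_spans does)."""
    # Sort longest first; on ties, prefer the earlier span.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, claimed = [], set()
    for start, end, label in ordered:
        tokens = set(range(start, end))
        if not tokens & claimed:  # no overlap with already-kept spans
            kept.append((start, end, label))
            claimed |= tokens
    return sorted(kept)

# The two conflicting spans from the error message:
print(filter_overlapping([(166, 167, "!M_AMT"), (154, 167, "!M_AMT")]))
# -> [(154, 167, '!M_AMT')]
```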
I'm guessing it has to do with the binary annotation task (one task per matched span instead of one task per document)? The initial batch-train output puzzles me as well:
```
Loaded model en_vectors_web_lg
Using 50% of accept/reject examples (210) for evaluation
Using 100% of remaining examples (305) for training
Dropout: 0.2 Batch size: 10 Iterations: 10
```
The binary annotations work best for improving an existing model. If you're starting from scratch, the model often struggles to refine the definition of the task, given the weak supervision - so that might be what's happening here. You could try the `--no-missing` flag, which declares that any entities not annotated are incorrect. If your annotations don't have many missing entities, this should work quite well.
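For example (hypothetical dataset and output paths, assuming Prodigy v1's ner.batch-train):

```
prodigy ner.batch-train your_dataset en_vectors_web_lg --output /tmp/model --no-missing
```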