I have a task of finding the quantity of oil mentioned in a bunch of texts, without using regular expressions.
For example,
Preliminary estimates of the size of the discovery range between 0.5 and 1.5 million standard cubic metres (Sm3) of recoverable oil equivalents.
Preliminary estimates place the size of the discovery between 1.5 and 4.0 million standard cubic metres of oil equivalents.
Preliminary estimates of the size of the discovery range between 0.5 and 3 million standard cubic metres (Sm3) of recoverable oil equivalents.
I tried making a seed list of 30-40 such instances and running ner.teach on it. After covering 15% of the data, I get a notification saying “No tasks available.” I assume the reason for this notification is that the model cannot cross a certain threshold to make suggestions for annotations.
Is there a way in Prodigy to find such patterns, or to find range-based patterns, without using regular expressions?
The patterns you can pass in via the --patterns argument don't use regular expressions – instead, they are token-based and follow spaCy's Matcher style. When you run ner.teach, the patterns are used to find potential candidates and update the model. This makes it easier to get over the "cold start problem" and ensure that the model sees enough positive examples. So you can do stuff like:
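For instance, something like this in your patterns JSONL file – the OIL_QUANTITY label is just a placeholder, so substitute whatever label you're actually annotating:

```
{"label": "OIL_QUANTITY", "pattern": [{"LIKE_NUM": true}, {"LOWER": "and"}, {"LIKE_NUM": true}]}
```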
The above pattern will match three tokens: one that's like a number, followed by one whose lowercase form matches "and", plus another one that's like a number.
It's okay if the patterns are ambiguous and yield false positives – in fact, this can be nice, because it gives you both positive and negative examples to learn from, especially examples that are otherwise quite similar. I'm sure there are a lot of creative solutions you can come up with for the patterns – see spaCy's Matcher docs for an overview of the available token attributes. When you write the patterns, I'd also recommend double-checking spaCy's tokenization on an example, to make sure the text is split the way you expect and the pattern can actually match.
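For example, a quick check could look like this (assuming you have the en_core_web_sm model installed – a blank model would also work for checking tokenization):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Preliminary estimates place the size of the discovery between 1.5 and 4.0 million standard cubic metres of oil equivalents.")

# Inspect the individual tokens – the pattern above can only match if
# "1.5", "and" and "4.0" end up as separate tokens
print([token.text for token in doc])
```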
You could also try combinations of different recipes – e.g. ner.teach, ner.match (will only let you annotate pattern matches) or ner.make-gold to correct an existing model's predictions by hand (e.g. after you've trained a first test model).
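For instance, an ner.match session could look roughly like this – the dataset, model and file names are placeholders, and the exact arguments may vary slightly between Prodigy versions, so check the recipe docs for your version:

```
prodigy ner.match oil_quantities en_core_web_sm /path/to/texts.jsonl --patterns /path/to/patterns.jsonl
```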
Thank you for providing a solution to the issue I raised.
Using the pattern matching approach you suggested did the job! Thank you.
I was able to collect a batch of annotations. Surprisingly, I came across a decent number of positive examples.
However, I was only able to annotate up to 28% of the data before I reached the point of “No tasks available”. I did a ner.batch-train on the saved annotations and the accuracy was 0%.
Is this because of lack of data?
I even tried tweaking parameters such as the batch size, but that didn't change the accuracy either.
Please let me know your thoughts about this. Thank you so much.
28% = the progress shown in the sidebar? If you're using ner.teach, the progress you see isn't the percentage of the data you've annotated, but the expected progress until the loss reaches 0. How much of your data is annotated in total isn't that important – what Prodigy's active learning recipes care about is how the annotations you collect are improving the model.
How many texts do you have in your data in total? Depending on the size of your corpus and the entities available in the data, it's definitely possible you've reached the end of it and Prodigy can't find any more pattern matches or model suggestions. (Keep in mind that ner.teach doesn't ask you about every example – it only picks the ones the model is most uncertain about, and which will have the most relevant gradient for training).
How many annotations are in your "Oil" dataset? If the model never actually predicts your entity, you might end up with 0% accuracy – especially if you start off with spaCy's pre-trained model, which comes with existing categories trained on lots of examples (probably significantly more than you have for your new category).
In our video tutorial example, this is less problematic, because we're training a label DRUG that applies to words that previously weren't recognised as anything (and if they were, maybe only as PRODUCT with a low probability). However, in your case, the numbers contained in your range entities are likely very strong predictions for CARDINAL, ORDINAL or MONEY. So teaching the model that these boundaries are wrong and your boundaries are right will need a lot of data – or a different base model.
One thing you could try is to start off with a "blank" spaCy model:
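For example, something like this – the output path is just a placeholder:

```python
import spacy

# create a blank English model with no pre-trained pipeline components
nlp = spacy.blank("en")
nlp.to_disk("/path/to/blank_en_model")
```

You can then pass that directory in as the model argument when you run ner.teach or ner.batch-train, instead of a pre-trained model like en_core_web_sm.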
This may make it easier to evaluate your model and test your data in an isolated environment, and with a model that doesn't know any other entity types yet.