I want to create an annotations JSONL file using patterns with the es_core_news_lg pipeline. In this case, I'm trying to annotate MONEY entities using a JSONL file I have created with the patterns.
Is there any way to do that, perhaps with a custom recipe? I tried using get_stream for the data JSONL file and PatternMatcher.from_disk for the patterns JSONL file (the code is below). The problem is that if a text contains more than one entity, each match is written as a separate JSON line in the pre-annotation file. How can I solve this?
import spacy
import prodigy
from prodigy.components import printers
from prodigy.components.loaders import get_stream
from prodigy.core import recipe, recipe_args
from prodigy.models.matcher import PatternMatcher
from prodigy.util import log
import json


@prodigy.recipe('ner.preannotate-patterns',
                spacy_model=recipe_args['spacy_model'],
                patterns=('Path to match patterns file', 'positional'),
                source=recipe_args['source'],
                api=recipe_args['api'],
                loader=recipe_args['loader'])
def print_pattern_stream(spacy_model, patterns, source=None, api=None, loader=None):
    log("RECIPE: Starting recipe ner.preannotate-patterns", locals())
    model = PatternMatcher(spacy.load(spacy_model)).from_disk(patterns)
    stream = get_stream(source, api, loader, rehash=True, input_key='text')
    # The "with" block closes the file, so no explicit close() is needed
    with open('preannotations_money.jsonl', 'w') as file:
        # Each item is a (score, example) tuple; one line is written per match
        for line in model(stream):
            json_line = json.dumps(line[1])
            file.write(json_line)
            file.write("\n")
Just a quick note on how you would solve this if you wanted to implement this in a custom recipe (although you shouldn't have to, because ner.manual has you covered).
Under the hood, the pre-annotation works like this:
Stream in all your input examples and load the patterns into a matcher.
For each example in the stream, process it with spaCy and match your patterns on it.
If relevant: Make sure that you filter the matches so you don't end up with overlapping spans. You can do this using spaCy's filter_spans utility.
Add a list of "spans" to each example containing the matches. spaCy's matcher will let you create Span objects for each match, so you can do {"start": span.start_char, "end": span.end_char, "label": "MONEY"} for each span.
Add tokens to your stream using Prodigy's add_tokens helper.
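If you already have one JSON line per match, as produced by the recipe in the question, the merging and overlap-filtering steps can be sketched in plain Python. This is only a rough illustration: merge_matches and drop_overlaps are hypothetical helpers, not part of Prodigy or spaCy, the sample records are made up, and the overlap rule is a stand-in that imitates filter_spans by preferring the longest span. It also keeps only the "text" and "spans" keys, so tokens would still have to be added afterwards.

```python
import json
from collections import defaultdict


def drop_overlaps(spans):
    """Keep the longest spans first and drop any span that overlaps a kept one."""
    kept = []
    # Sort by length (longest first), then by start offset to break ties.
    for span in sorted(spans, key=lambda s: (s["start"] - s["end"], s["start"])):
        if all(span["end"] <= k["start"] or span["start"] >= k["end"] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


def merge_matches(examples):
    """Combine one-span-per-example records into a single record per text."""
    by_text = defaultdict(list)
    for eg in examples:
        by_text[eg["text"]].extend(eg.get("spans", []))
    return [{"text": t, "spans": drop_overlaps(s)} for t, s in by_text.items()]


# Hypothetical per-match output: the same text appears once per match.
text = "It cost $ 75 million, up from $ 5 million."
per_match = [
    {"text": text, "spans": [{"start": 8, "end": 20, "label": "MONEY"}]},
    {"text": text, "spans": [{"start": 10, "end": 12, "label": "MONEY"}]},
    {"text": text, "spans": [{"start": 30, "end": 41, "label": "MONEY"}]},
]

merged = merge_matches(per_match)
for task in merged:
    print(json.dumps(task))
```

Here the three records collapse into one task with two MONEY spans; the shorter span covering only "75" is dropped because it overlaps the longer match.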
I am using ner.manual with --patterns but I wanted to add to a larger JSONL-file (which includes many patterns already) another pattern that would pre-annotate money amounts as predicted by the en_core_web_sm model. When using this pattern {"label": "MONEY", "pattern": [{"ent_type": "MONEY"}]}, however, I am getting this result below:
What I had in mind was that $ 75 million is pre-annotated as MONEY combined (instead of the individual elements) like the en_core_web_sm model would typically do in the case of NER.
To clarify, I am not using ner.correct as shown under this link, since it does not allow me to add a patterns file in which I would collect patterns other than the MONEY entity described above.
How can I combine both, the classical patterns collected in a JSONL file and named entities (such as MONEY) predicted by an existing model, for the task of pre-annotating a text?
Hello @ben.k,
Thank you for your question, and welcome to the Prodigy community.
The problem with the MONEY label results from your pattern and the model's tokenization. As described in the Prodigy docs, a pattern is a list of dictionaries where each dictionary describes one individual token. Your pattern {"label": "MONEY", "pattern": [{"ent_type": "MONEY"}]} is just one token long, but because spaCy tokenizes $ 75 million into three tokens, each having the entity type MONEY, your pattern is matched three times, leading to the result you described.
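The pattern suggested at this point seems to be missing from the thread. Judging by the follow-up about the "OP" key, it was presumably something along these lines, where "+" asks for one or more consecutive tokens; treat the exact operator as an assumption rather than the original wording. The snippet only builds the pattern and prints it as one line for a patterns JSONL file.

```python
import json

# Assumed pattern: "OP": "+" lets the single token description match
# one or more consecutive tokens whose entity type is MONEY.
money_pattern = {"label": "MONEY", "pattern": [{"ent_type": "MONEY", "OP": "+"}]}

# Each line of a patterns JSONL file is one JSON object like this.
line = json.dumps(money_pattern)
print(line)
```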
Another solution could be to write a custom recipe to solve this. However, since your problem should be solvable using the operators, I would not recommend this.
I hope this solves your problem. If not, or if you have any further questions, please let me know.
Many thanks for your answer, this is exactly what I was looking for. Now I understand the role of the "OP" key: it is similar to quantifiers in classic regex patterns, and it works as expected.
One additional question. Now that spaCy has recognised $ 75 million as a named entity of type MONEY, is there (by any chance) a utility function that allows me to split this into the currency unit and the number, i.e. ('USD', 75000000), or would I do that using a procedure as described e.g. here?
In principle, with my initial pattern the different elements were labelled separately, but I would need to know whether each element is a currency classifier, the amount or the unit multiplier to proceed further. Hence, maybe it is easier to combine it using the "OP" key and then pass the entire recognised named entity into such parser-function.
Hi @ben.k,
I'm happy to hear that the proposed solution works.
Regarding your question:
If you have a solution that already works, I would go with it. Prodigy does not have such a utility function.
What you could do is incorporate this parser function into your own Prodigy recipe. Another option could be to extend the patterns you already wrote, for example by adding a regex key and splitting the pattern for MONEY into two parts like this:
However, depending on your text and the variations that you'd have to cover (e.g., different currencies), this might add overhead, especially since you'd still have to convert "75 million" to "75000000".
A third possibility could be to create your own spaCy pipeline component to implement this step, which might be useful if you already have a processing step that uses spaCy. If you have more specific questions about custom components, I'd refer you to the spaCy Discussions forum, where my colleagues will help you with your spaCy questions.
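For illustration, a parser for the matched entity text could look roughly like the sketch below. Everything in it is an assumption: parse_money is a hypothetical helper (neither Prodigy nor spaCy ships one), the currency and multiplier tables are deliberately tiny, and commas are treated as thousands separators.

```python
import re
from decimal import Decimal

# Hypothetical lookup tables; extend them for the currencies and units in your data.
CURRENCIES = {"$": "USD", "€": "EUR", "£": "GBP"}
MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

# Currency symbol, a number such as 75 or 1,250.5, and an optional unit word.
MONEY_RE = re.compile(
    r"(?P<symbol>[$€£])\s*(?P<number>\d[\d,]*(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)?"
)


def parse_money(text):
    """Return (currency_code, amount) for strings like '$ 75 million', else None."""
    match = MONEY_RE.fullmatch(text.strip())
    if match is None:
        return None
    unit = (match["unit"] or "").lower()
    if unit and unit not in MULTIPLIERS:
        return None  # unknown unit word, better to skip than to guess
    amount = Decimal(match["number"].replace(",", "")) * MULTIPLIERS.get(unit, 1)
    # Return a plain int for whole amounts, otherwise keep the Decimal.
    if amount == amount.to_integral_value():
        amount = int(amount)
    return CURRENCIES[match["symbol"]], amount


print(parse_money("$ 75 million"))
```

Such a function could be called on span.text for each MONEY span, either in a custom recipe or in a later processing step.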
I hope one of the possibilities suits you. Like I said, I would use the parser that you proposed, especially if you have already implemented this.