I have a large corpus of data, let's say 500,000 sentences.
And I am aiming to train an NER model to detect FORMS (legal/official documents in the content). For example, a P45 form in the UK.
I have a list of 1,000 FORMS and match patterns, although there are definitely more FORMS in the corpus.
Realistically, the occurrence will be sparse, unlike @ines's example for food items, where most Reddit comments were likely to contain a food item.
If I use match patterns for the 1,000 FORMS, does Prodigy 'surface' those examples for annotation? I would like to avoid spending a long time trawling through thousands of irrelevant example sentences before I find useful sentences to annotate. Does having match patterns fix the problem of sparse occurrences of entities within a large corpus?
It depends a bit on how you prefer to label and how much trust you have in your patterns.
You could use your patterns to pre-load a ner.manual task (docs link) via the --patterns parameter. That way you merely need to confirm if the patterns are accurate instead of highlighting everything yourself. This would save a lot of time, but you might still want to spend time labeling just to confirm that there are no hiccups.
For example, I can imagine that text like P45/44 can be written with the intention of referring to two forms. But the tokenizer might split this up as 44. Your patterns might not catch this, but examples like this do deserve attention while labeling.
Alternatively, you might also choose to prioritize examples where no form was detected by your rules. The easiest way to do this is to prepare a separate .jsonl file that only contains examples of interest and to pass those to Prodigy.
Thanks @koaning for your detailed answer.
However, I'm not sure if I phrased my question clearly.
I'm more concerned about getting to those match patterns quickly when doing ner.manual or something like that.
For example, let's say I have a .jsonl file with 100,000 sentences. The first 99,000 sentences are about drugs. The final 1,000 sentences are about food.
I also have a patterns.json file with 300 food patterns.
I want to do ner.manual on the 100,000 sentence dataset. However, it would take me a long time to even get to the sentences about food.
Given I have match patterns, is there any way I can leverage them to prioritise sentences that contain matches?
In a perfect scenario, those food sentences at the end of the dataset would be streamed with priority, as they contain pattern matches.
Does that make sense?
I see what you mean there, and it sounds like you may want to precompute here.
If you have 100K examples, of which only 1K are interesting, then it sounds like you may want to fetch the interesting examples out beforehand. Filtering that many examples is something you want to do only once, preferably upfront and not while you're labeling.
I might take a Jupyter notebook where I load the matcher and go through all the examples once, keeping only the examples where a match is detected. I would save these examples in a separate file, something like subset-food.jsonl, and then provide this file to ner.manual via a command like:
prodigy ner.manual dataset-name blank:en subset-food.jsonl --label FOOD
Note that you could also attach patterns to this command if you want the entities to be pre-filled in Prodigy:
prodigy ner.manual dataset-name blank:en subset-food.jsonl --label FOOD --patterns patterns.jsonl
This is a very direct way of assigning priority during labelling, but that's also why I like to work this way! You have very tight control over what you label and that might be exactly what you want. The only downside is that you will need to write a little bit of extra code to handle the filtering in a notebook first.
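The upfront filtering step described above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not Prodigy's or spaCy's matcher: it assumes every pattern is a plain phrase and that each input line is a JSON object with a "text" key. The names `build_matcher`, `keep_matching` and `filter_jsonl` are hypothetical helpers invented for this example. For token-based patterns you would swap the regex for a spaCy matcher instead.

```python
import json
import re
from pathlib import Path


def build_matcher(phrases):
    """Compile one case-insensitive regex matching any phrase as a whole word.

    Assumption: patterns are plain phrases, not spaCy token patterns.
    """
    escaped = sorted((re.escape(p) for p in phrases), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def keep_matching(lines, matcher):
    """Yield the parsed examples whose "text" contains at least one match."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        example = json.loads(line)
        if matcher.search(example["text"]):
            yield example


def filter_jsonl(source, target, matcher):
    """Write the matching examples from `source` to `target`; return the count."""
    kept = 0
    with Path(source).open(encoding="utf8") as src, \
            Path(target).open("w", encoding="utf8") as dst:
        for example in keep_matching(src, matcher):
            dst.write(json.dumps(example) + "\n")
            kept += 1
    return kept
```

Calling something like `filter_jsonl("massive_dataset.jsonl", "subset-food.jsonl", build_matcher(["pizza", "olive oil"]))` once would then produce the subset file that the ner.manual commands above read from. The file names and phrases here are placeholders.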
Alternatively, you could write a custom recipe that wraps the stream in some custom filtering logic like:
stream = (e for e in stream if pattern_match(e['text']))
This would theoretically work, but the main downside is that because we're working with generators here, the filtering happens at runtime. You may need to wait while the stream filters through 10K examples before it hits one example about food. It's certainly possible to go down this route, but preparing a subset upfront seems like the more pragmatic and direct approach.
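To make the runtime cost concrete, here is a small self-contained sketch of a lazily filtered stream. It does not use Prodigy itself. `pattern_match` is a stand-in for whatever matching logic you have, and `counting` is a hypothetical helper that only exists to show how many examples get consumed before the first match comes out.

```python
def pattern_match(text):
    """Stand-in matcher (assumption): a simple substring check."""
    return "pizza" in text.lower()


def counting(stream, counter):
    """Pass examples through unchanged while counting how many were read."""
    for example in stream:
        counter["seen"] += 1
        yield example


# 99 irrelevant examples followed by a single relevant one.
examples = [{"text": f"drug sentence {i}"} for i in range(99)]
examples.append({"text": "I like pizza"})

counter = {"seen": 0}
stream = counting(iter(examples), counter)
stream = (e for e in stream if pattern_match(e["text"]))

first = next(stream)      # blocks until the first match is found
print(first["text"])      # I like pizza
print(counter["seen"])    # 100
```

The generator has to read all 100 examples before it can yield the first one, and the annotator waits for that work. With a slower matcher and a larger corpus, the wait grows in proportion.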
Thanks again @koaning for the very detailed and prompt reply.
Thanks for clearing that up! I had incorrectly assumed that the --patterns flag would mean that the matches are surfaced first. I think this would be a neat feature, especially for large datasets with sparse entity occurrence.
Thanks for the notebook tip too. Seems like the logical approach.
I also just stumbled upon the match recipe, which means I can accept a sentence into a database if a match is detected.
prodigy match food_matches blank:en Data/massive_dataset.jsonl --label INGRED --patterns Data/food_patterns.jsonl --label-span --combine-matches
Then I could use ner.manual on that dataset to reannotate.
Thanks again for all the help!
Happy to help.
My experience with labelling is that there are indeed multiple ways of going about it. My best advice is to remain pragmatic but also to reflect once in a while and to experiment a little.
In case you're interested, my recent video on data deduplication with Prodigy serves as a nice example. After labeling I realized that I could iterate on the way that I label, which greatly improved the experience. The relevant segment can be viewed here.
I suppose there's one final comment I might add here now that I think of it.
The downside of only labeling examples with patterns is that a model might "learn" that every sentence needs to have an entity in it. In general, it's also a good idea to provide examples that don't have any entities in them, just to balance the dataset a bit. Your training/test data will need to resemble the data you'll get in production as much as possible, and I imagine that not every sentence will have a food entity.
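One way to act on this is to mix a sample of non-matching sentences back into the subset before annotating. The sketch below is only an illustration: `mix_in_negatives`, `has_match` and the `negative_ratio` parameter are hypothetical names, and the ratio is a placeholder rather than a recommendation. The right mix depends on how often entities show up in your production data.

```python
import random


def mix_in_negatives(examples, has_match, negative_ratio=0.25, seed=0):
    """Return all matching examples plus a random sample of non-matching ones.

    negative_ratio is the number of negatives to add per positive
    (hypothetical parameter; tune it to resemble production data).
    """
    positives, negatives = [], []
    for example in examples:
        bucket = positives if has_match(example["text"]) else negatives
        bucket.append(example)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    n_negatives = min(len(negatives), round(len(positives) * negative_ratio))
    mixed = positives + rng.sample(negatives, n_negatives)
    rng.shuffle(mixed)  # avoid showing all positives first
    return mixed
```

For example, `mix_in_negatives(all_examples, lambda t: "pizza" in t.lower(), negative_ratio=0.5)` would keep every sentence that mentions pizza and add half as many sentences that do not.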
Thanks @koaning. Noted.
In reality the NER training model will have a few other categories in the training data, and I will be sure to include blank sentences.