We're still new to ML and have a few questions about data balance and which documents we should be accepting.
For example, in the data we have, we may train on 500 documents, and only about 40 of those documents actually contain the context we are looking for. We don't have a way to scan ahead, because some of the context is unknown until we read the document.
That means for every 500 documents we may only see 30 or 40 documents with the entity or phrase we need to tag. Does hitting ACCEPT on that many documents that don't contain the context cause an imbalance relative to the documents that do contain it? Is there a general guideline for this?
How will the model be used in practice? If, in reality, only about 10% of the documents contain the entity of interest, then it's not unreasonable to have a dataset that reflects that, if only for testing purposes.
I can imagine that you may want plenty of documents that do contain the phrase of interest, though, just as a starting point. You can influence the data points that you label by using the ner.teach recipe or by using the --label argument in the ner.manual recipe.
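For example, a minimal sketch of restricting manual annotation to a single label (the dataset, model, and file names here are just placeholders):

```
prodigy ner.manual context_manual blank:en ./documents.jsonl --label CONTEXT
```

This way the interface only offers the CONTEXT label, which keeps each annotation decision small and fast.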
Class imbalance can be an issue in machine learning problems, but even if only 1% of the data contains the pattern of interest, you can still end up with a well-performing pipeline as long as the pattern is clear.
If you can share some more specifics of the problem, then I might be able to give more bespoke advice.
Thank you for the response. We're looking for contextual phrases describing what a person was doing during an event. In a lot of cases this may not be described, but if it is, we want to catch it. It may be "playing football" or "driving a car" or "eating snacks".
We are currently doing the following:
1. We bootstrap with ner.manual, assigning --label CONTEXT, starting with 500 documents.
2. Then we train.
3. Then we run ner.correct on 500 more documents not used in bootstrapping, saving to a new data_corrected dataset.
4. Then we train again using both the bootstrapping and corrected datasets.
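In commands, that's roughly the following (dataset and path names simplified; the train syntax assumes Prodigy v1.11+):

```
# 1. bootstrap with manual annotation
prodigy ner.manual data_bootstrap blank:en ./batch_1.jsonl --label CONTEXT

# 2. train a first model
prodigy train ./model_v1 --ner data_bootstrap

# 3. correct the model's predictions on 500 new documents
prodigy ner.correct data_corrected ./model_v1/model-best ./batch_2.jsonl --label CONTEXT

# 4. train again on both datasets
prodigy train ./model_v2 --ner data_bootstrap,data_corrected
```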
Does this sound like the correct workflow, given the limited number of labels we can apply per set of 500?
We're looking for contextual phrases on what a person was doing during an event.
Are you looking for phrases describing what a specific person was doing? I can't help but notice that you're mainly collecting verbs here, and there are some linguistic features in spaCy that might help you surface examples of interest that you can then feed to Prodigy. Could you explain in more detail when a verb is or is not of interest?
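For example, here's a minimal sketch using spaCy's rule-based Matcher, just to give an idea (the pattern and the example sentence are assumptions on my part, not your actual data):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Hypothetical pattern: an "-ing" verb, optionally a determiner, then a noun,
# e.g. "playing football", "driving a car", "eating snacks"
matcher.add("ACTIVITY", [[
    {"POS": "VERB", "TAG": "VBG"},
    {"POS": "DET", "OP": "?"},
    {"POS": "NOUN"},
]])

doc = nlp("He was driving a car when the incident happened.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "driving a car"
```

Matches like these could also be turned into a patterns file to pre-highlight candidates in Prodigy.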
Coincidentally, I released a small tutorial on spaCy's linguistic features on our YouTube channel today. It's on a different task, but it highlights some techniques that you may appreciate.
That was a great video, thanks for sharing. I did try to go as far as I could with a rule-based approach initially, but because I'm dealing with multiple sentences with unknown context (think medical documentation), I couldn't find a way to reliably identify the sentence that contains the context I'm looking for. So we took the ML-based approach so the model could learn from real-world examples via Prodigy. Unfortunately, the intent is too broad to capture with rules, as far as I can tell.
The workflow seems good as a starting point. Having limited labels seems like a fact of life for now.
You might also like to have a look at ner.teach. This is a binary interface and should make it much easier to skip through the examples that have no labels.
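Something along these lines, assuming you point it at a model that already predicts the CONTEXT label (names are placeholders again):

```
prodigy ner.teach data_teach ./model_v1/model-best ./batch_3.jsonl --label CONTEXT
```

The model suggests spans and you only accept or reject them, which is much faster than highlighting everything by hand.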
As a final side note: what I usually try to do is label a batch of a few hundred examples, try some active learning approaches, and then run the model to see where it makes mistakes. The moments when I understand what kinds of errors the model tends to make are also the moments when inspiration strikes for improvement.
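As a sketch of that last step, assuming a pipeline trained via `prodigy train` in a `./model_v2` folder and some held-out texts of your own:

```python
import spacy

# load the trained pipeline (path is a placeholder)
nlp = spacy.load("./model_v2/model-best")

# a few held-out texts (placeholders) that weren't part of the training data
texts = [
    "The patient was playing football when the injury occurred.",
    "No activity at the time of the event was documented.",
]

for text in texts:
    doc = nlp(text)
    predicted = [(ent.text, ent.label_) for ent in doc.ents]
    print(text)
    print("  predicted:", predicted or "no entities")
```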