I want to use Prodigy to identify side effects in social media posts. Here I use the classification recipe. I've got unbalanced data, and therefore I decided to add a pattern list of some side effect terms for bootstrapping. Now, in the annotation process, Prodigy highlights the terms from the list within the posts. But when the highlighted term within a post represents a symptom, and another term which is not part of the pattern list represents a side effect, Prodigy cannot recognize that the post should still be labeled as a side effect mention.
E.g. the list contains the word "headache". The post looks like this: "To avoid having a headache I take aspirin. But after taking aspirin I always have the flu". The word "headache" is not a side effect in this context, but the post itself should still be classified as "side effect" because the user is talking about having the flu.
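For reference, my pattern list is a JSONL file with one match pattern per line, roughly like this (the exact terms here are just illustrative):

```
{"label": "SIDE_EFFECT", "pattern": [{"lower": "headache"}]}
{"label": "SIDE_EFFECT", "pattern": [{"lower": "nausea"}]}
```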
Do you have any tips on how I can proceed?
And three more questions:
1. Does Prodigy save the pattern list within the generated model?
2. How does the classification work internally?
3. I have a predefined list of drugs. How can I use the classification recipe in such a way that Prodigy realizes that there is only a side effect in a post in combination with one of the predefined drugs?
I've got a few questions about the training parameters of Prodigy.
When training a model, I can set the number of iterations. What does Prodigy do within the different iterations? So what are the differences between those iterations?
Another question is about the parameter "batch-size". In my understanding, the batch size can only be as high as the number of training samples. For that reason I would expect that when I use a batch size which is bigger than the number of training samples, I would receive the same result as when I use a batch size which is equal to the number of training samples. But when testing, I receive a different result. How is this possible?
I also want to know what the parameters in the evaluation.jsonl mean. What is meant by the "priority"? And on which parameter does the classification depend?
Is there any detailed documentation on what the parameters mean?
Regarding the number of iterations: it's normal when training models like neural networks to use iterative algorithms, which require multiple updates to converge. This requires multiple passes over the data, termed "epochs" or "iterations". During each epoch, the weights are being updated as the model learns. So the state is changing --- specifically, the weights of the network that we're learning --- which is why the iterations are different.
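To make that concrete, here's a toy sketch (not Prodigy's internals) of why each pass is different: the weights carry over from one pass to the next.

```python
# Toy illustration (not Prodigy internals): a single weight trained to
# predict a constant with squared-error loss. Each epoch starts from the
# weights the previous epoch produced, so no two epochs do the same thing.
weight = 0.0
learning_rate = 0.1
targets = [1.0, 2.0, 3.0]  # stand-in training data

for epoch in range(3):
    for y in targets:
        gradient = 2 * (weight - y)   # derivative of (weight - y)**2
        weight -= learning_rate * gradient
    print(f"after epoch {epoch + 1}: weight = {weight:.3f}")
```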
I do think setting a batch size equal to the number of training examples should produce the same results as an even higher batch size. There might be some quirk of the implementation that introduces some small effect though. Possibly it's also just a difference in the random seed.
Setting batch size equal to or larger than the data set isn't really expected though. You definitely want a smaller batch size, or the training won't proceed properly. You need to update on lots of different batches of data. If you use the whole dataset, training is likely to get stuck with a suboptimal solution.
Thanks for your response.
With respect to the number of iterations: I thought that, based on the chosen batch size, the number of iterations is automatically fixed? So as an example: I have 10000 training samples and a batch size of 1000. Then there should be 10 iterations, right? But within Prodigy I can specify the batch size as well as the number of iterations. That's why I am wondering what is meant by the number of iterations.
E.g. when I use the example above and additionally specify the number of iterations as 25, what does the input for the neural network look like? For the first 10 iterations I would assume that 1000 of the 10000 samples are the input each time. But what is the input for the next 15 iterations?
Or did I get it wrong, and one iteration is something like an epoch? So that in my example we've got 25 epochs, and in each epoch we have a specific number of batch updates based on the batch size. And after all the data in an epoch has passed through the network, a new epoch is started with the same data?
Yeah, sorry if our terminology choice is confusing. The "iterations" parameter refers to the number of passes over the data -- what is elsewhere sometimes called an "epoch". So in your example, 25 iterations with 10 batches per pass means 250 updates, with 25 passes over the data.
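In pseudo-Python, the relationship looks roughly like this (a sketch of the loop structure, not the actual implementation):

```python
import random

# Using the numbers from your example.
n_samples, batch_size, n_iterations = 10000, 1000, 25
data = list(range(n_samples))  # stand-in for the training examples

n_updates = 0
for iteration in range(n_iterations):           # 25 passes over the data
    random.shuffle(data)                        # typically reshuffled per pass
    for start in range(0, n_samples, batch_size):
        batch = data[start:start + batch_size]  # 10 batches per pass
        # the model's weights would be updated on this batch here
        n_updates += 1

print(n_updates)  # 250 updates in total
```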
Perfect, thank you! Can you also explain the following parameters of a dataset after annotating the data:
- score
- _input_hash
- _task_hash
- priority
- spans
- _session_id
- _view_id
And the two parameters "score" and "priority" have the same value in every case. Is this normal?
Thanks in advance!
Nadine
No problem! Btw, if you didn't receive the readme.html with Prodigy (sometimes it's not passed through if you're using it in a company), you might find it helpful to have --- if you can't track it down internally send me a quick email and I'll pass it along.
_input_hash: A content ID of the input data. For text it'll be the "text" content, for images the "image" content, etc.
_task_hash: A content ID of the input data paired with the annotations, e.g. the text as well as the label, the spans, etc.
score: The score the model assigned to the suggested annotation, i.e. how likely the model believed it to be.
priority: If an active-learning sorter is being used, this is the field the sorter will reference. For instance, with the prefer_uncertain strategy, the priority will be 1.0 - (abs(score - 0.5) * 2) (see the sketch after this list).
spans: This is where NER annotations will be recorded. A textcat dataset could also have spans, if you're highlighting parts via the pattern matcher.
_session_id: An ID of the annotation session. If you're not using named sessions, it'll just be a timestamp.
_view_id: The annotation view used to create the annotation.
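As a small sketch of the priority formula mentioned above for prefer_uncertain (just the arithmetic, not Prodigy's actual sorter code):

```python
def prefer_uncertain_priority(score: float) -> float:
    # Scores near 0.5 (most uncertain) map to priorities near 1.0;
    # confident scores near 0.0 or 1.0 map to priorities near 0.0.
    return 1.0 - abs(score - 0.5) * 2

print(prefer_uncertain_priority(0.5))   # 1.0 -> shown first
print(prefer_uncertain_priority(0.75))  # 0.5 -> less urgent
```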
Thank you very much for the quick response.
I do have the readme file, but I could not find any explanation of the parameters.
Did I miss something here, or are the parameters not described in the file? Or are there different readme files?
I wanted to ask whether you can answer the question in the above post.
In addition, I would like to know the following (hopefully you can help me here).
When training a classification model, I use the input parameter "patterns" to show the network which words are very relevant for the classification. But I also have a predefined list of drugs. So a side effect should only be matched when someone is talking about a specific drug. Sometimes the text contains multiple drugs, where some of them are relevant to my use case and some are not. How can I tell Prodigy that the name of the drug is a fixed prerequisite for the classification? So that Prodigy looks for the side effect and at the same time takes the predefined drugs into account?
I think you should probably train a model to detect side-effects in general, and then have a rule-based process to output the side-effect only if a drug of interest is mentioned. Otherwise you'll require training examples for all of the different (drug, side effect) pairs you're interested in. It's often much easier to enforce rules about how the output should behave in a rule-based way, based on recognised categories, rather than trying to train the model to obey your rules.
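For example, such a rule-based filter could look roughly like this (a sketch assuming a spaCy v3 pipeline with a textcat component and a "SIDE_EFFECT" label; the model path and drug names are placeholders):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("my_side_effect_model")      # placeholder model name
drugs_of_interest = ["aspirin", "ibuprofen"]  # your predefined drug list

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DRUG", [nlp.make_doc(drug) for drug in drugs_of_interest])

def report_side_effect(text: str, threshold: float = 0.5) -> bool:
    doc = nlp(text)
    mentions_drug = len(matcher(doc)) > 0
    predicts_side_effect = doc.cats.get("SIDE_EFFECT", 0.0) >= threshold
    # Only output a side effect if one of the predefined drugs is mentioned
    return predicts_side_effect and mentions_drug

print(report_side_effect("After taking aspirin I always have the flu."))
```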
This sort of question is much more general than Prodigy and spaCy, however --- you'll face the same sort of consideration no matter which technologies you use to label and train statistical models. The decisions do depend somewhat on the specifics of your problem, so there's only so much help we can offer.
Some of the attributes of the task are internal details, so they might not be fully documented. The different recipes should tell you what attributes they expect from the task objects (e.g. "text") and how they set the annotations (e.g. spans).