I’m going through the getting started docs to get familiar with Prodigy, and I was wondering how I can go about creating a custom loader from scratch. I have all the documents I’m going to annotate in an Elasticsearch index, and I want to feed them to Prodigy in order to produce a corpus for training a custom NER model, using only specific labels related to my business.
The easiest solution would probably be to write a loader script and then pipe its output forward to the recipe. If the source argument in a recipe is not set, it defaults to stdin – so you can pipe in annotation tasks from a previous process:
Your loader script needs to retrieve the data and then output the individual examples in Prodigy’s JSON format, e.g. {"text": "some text"}. Here’s a minimal example:
import json

texts = load_your_texts_from_elasticsearch()  # load whatever you need
for text in texts:
    task = {'text': text}  # you can also add other fields like 'meta'
    print(json.dumps(task))
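Since your documents live in Elasticsearch, the loader could look roughly like the sketch below. This is only an illustration, not a definitive implementation – the host, index name and field names ("body" etc.) are placeholders you’d adapt, and it assumes the official elasticsearch Python client with its scan helper, which streams documents instead of loading everything at once:

import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://localhost:9200"])  # adjust host / auth for your cluster

# scan() streams all matching documents without holding them in memory at once
query = {"query": {"match_all": {}}}
for hit in scan(es, index="court-decisions", query=query):
    doc = hit["_source"]
    task = {
        "text": doc["body"],             # the field that holds the raw text
        "meta": {"doc_id": hit["_id"]},  # extra info shown on the annotation card
    }
    print(json.dumps(task))

You could then pipe this straight into a recipe, e.g. something like python load_from_es.py | prodigy ner.teach your_dataset en_core_web_sm (dataset and script names are placeholders) – since no source argument is set, Prodigy reads the tasks from stdin.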
Prodigy’s philosophy is usually: “If you can load it in Python, you can use it.” In fact, you could even write your loader in a different language if you prefer, as long as it outputs the right format. If you’re loading a lot of data, you probably also want to make sure you’re only retrieving what you need instead of preloading everything at once. This should also work well with Prodigy’s generator streams – you can start annotating pretty much immediately while your loader takes care of filling up the buffer.
I know it’s just an example, but what’s the effect of using the ner.teach recipe with the en_core_web_sm model? Would it serve as a pre-trained model in a scenario similar to transfer learning?
If you're working with text (NER, text classification etc.), the "text" field is usually where Prodigy expects the raw text, yes. You can find more examples of the JSON formats in the "Annotation task formats" section of your PRODIGY_README.html (available for download with Prodigy).
If you're loading in a pre-trained model, you can use the ner.teach recipe to improve its existing categories. If you're adding a new category, the existing categories in the pre-trained model will also have an impact on the entities that are suggested. By default, Prodigy uses beam search to get multiple possible parses of the same sentence, and the existing predictions define the constraints. For some cases, that can be a good thing – for others, it might be less efficient.
If you want to train a completely new model from scratch, you might want to start off with a "blank" model instead that only includes the language data and tokenization rules. You can easily export that from spaCy by calling spacy.blank('en') (or any other language) and saving the result to disk. Or you can use this handy one-liner:
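For example, something along these lines (the output path is just a placeholder):

python -c "import spacy; spacy.blank('en').to_disk('./blank_en_model')"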
Prodigy can load any model in spaCy's format – so you can pass in the name of a package, but also the directory path, e.g. ner.teach dataset /path/to/model etc.
Hi Ines, I tried doing as you suggested and I noticed that Prodigy splits up the text I'm feeding to it, following some criteria of its own. How can I force it to present to the annotator exactly what I'm sending to it – one task per text?
If you’re using ner.teach, the texts will be segmented into sentences. You can turn this off by setting the --unsegmented flag (see the docs for more info).
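For example, roughly like this (the dataset, label and file names here are just placeholders):

prodigy ner.teach court_decisions en_core_web_sm decisions.jsonl --label DECISION --unsegmented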
Turning the auto-segmentation off means that your script needs to make sure that the texts aren’t too long. The reason behind this is that the active learning-powered recipes use beam search, which is most efficient on shorter text (since it needs to compute all possible parses for the whole text). So if there’s a very long text somewhere in your stream, this can potentially make the process very slow.
Since you’re only annotating one entity at a time, it’s also usually more efficient to only look at the sentence containing the entity or a similar window around it. If the human annotator needs significantly more context around it to make a decision, the model is also much less likely to learn anything meaningful from it. That’s why we usually recommend to keep the annotation tasks short – at least if you want to annotate with a model in the loop. For manual annotation, this is less of a problem.
I'm thinking about creating some sort of analyzer of sentence statements issued by a court of law. In order to do this, I thought about presenting larger pieces of information from the sentence to a lawyer, and having them annotate the parts of the text denoting the plaintiff's requests, the judge's exact decision (accepting or denying the plaintiff's request), and references to laws and jurisprudence.
An example:
"3. REGARDING BREAK:
The plaintiff states that throughout the labor agreement, she always worked from Monday to Saturday, from 06:00 am to 2:00 p.m., with only a 40-minute break. Requires payment of one hour per day for the partial grant of the interval.
The complainant says that there was a reduction of the interval to 40 minutes and that there was compensation of overtime, according to a collective agreement between the company and the union of the category of the plaintiff. It records that the plaintiff worked only 7 hours and 20 minutes daily. However, it did not bring the plaintiff's timecards, nor collective bargaining agreements, to the process.
Thus, it is DEFERRED to pay one hour for not granting a minimum interval of one hour (since the defendant did not prove that she had authorization from the Ministry of Labor for such reduction, during the validity of the plaintiff's employment contract, in the form of paragraph 3 of Article 71 of the CLT, and that the partial concession implies the full payment of the period of the interval, according to the understanding embodied in Precedent 437, of the TST), during the term of the contract of employment. An increase of 50% (fifty percent) in the amount of the remuneration of normal working hours shall be deducted from the time of interval abolished, in accordance with article 71, paragraph 4, of the Consolidated Standard, with repercussions on prior notice, holidays + 1/3, FGTS + 40% and RSR.
Quantification in terms of the attached worksheet, which integrates this sentential command for all purposes."
The lawyer would annotate the sentence above, highlighting the words that denote the plaintiff's request (break / interval), the judge's decision (DEFERRED) and the law references used in the decision (paragraph 3 of article... etc). I believe this is a good fit for an NER task, right?
I also thought about having the lawyer annotate parts of the text that support the judge's decision, such as "the complainant did not prove that she had authorization from the Ministry of Labor", or "the complainant did not bring the plaintiff's timecards". For this part, it would probably be a text classification task, right?
How do you think prodigy can support me in this scenario?
Yes, this sounds like it's going to work pretty well – especially since those decisions can be made from a narrow window of surrounding context. Having a good legal NER model will also make it easier later on to extract relevant paragraphs for further annotation.
You're also on the right track in terms of how you're dividing the larger task into smaller subtasks that can be solved as individual machine learning problems. (We've actually found that this is one of the biggest problems for many projects – so this is also something we're trying to make easier with Prodigy.)
Yes, and ideally, you could also use the previously trained NER model to help with the selection. You'll likely also achieve better results if you can break down your categories into binary or multiple/single choice questions.
Labelling legal terms in texts is easier, because there's a clearer definition. Asking someone to highlight "parts of the text that support X" can be very difficult, because every person might have their own interpretation and you can end up with very different answers that also provide very little value for training a model. It also makes it more difficult to determine whether your annotators agree or not.
What usually works well are questions like "Is this text about X?" or "Which of these 5 categories apply to the text?". Prodigy's textcat recipes also come with a --long-text mode that will extract individual sentences from longer texts and present those in context (instead of asking the annotator to read a much longer text). In the end, you'll be averaging over a lot of individual predictions anyways – and if the annotator can confidently say that a sentence is about topic X, this is very valuable information (and much quicker to collect than annotations on whole documents only).
You might also want to check out @wpm's posts on the forum – if I remember correctly, he was working on a similar use case, and shared a lot of his findings, work in progress and custom recipes on here
But do you think it's a narrow window? I'm under the impression that the annotator would need to view that whole piece of text I pasted before, in order to actually label each token appropriately. Maybe not the legal terms, but the plaintiff's requests have some sort of context that characterizes them as requests.
I'll take a look at this recipe and study it some more!
I was mostly referring to the general legal terms, because this seems like a good place to start. So you could start off by adding / improving the domain-specific categories (like sections of law, or organisations like "the Ministry of Labor"). These annotations can be super fast and efficient, because they require less context. You can probably also do some smart stuff with match patterns to help bootstrap the categories.
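Just to sketch what such a patterns file could look like (the labels and token attributes here are only made up for illustration) – each line of the JSONL file is one pattern, using either an exact string or token descriptions in spaCy's Matcher syntax:

{"label": "ORG", "pattern": "Ministry of Labor"}
{"label": "LAW", "pattern": [{"lower": "article"}, {"is_digit": true}]}
{"label": "LAW", "pattern": [{"lower": "precedent"}, {"is_digit": true}]}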
Doing this is also helpful to determine how you want your model to work, and what label scheme works best. (As a really basic example: should the entity type be "paragraph 3 of Article 71 of the CLT"? Or should the model recognise "paragraph 3" and "Article 71" separately?) These details really depend on your application and often change during development – so ideally, you want to figure this out before you commission all the work from your domain experts.
Once you have a model that is really good at legal entities (or even, legal entities in your specific documents), you can use this to extract more complex relationships. Even if the lawyers you're asking to annotate will have to read longer texts, you can help them by highlighting an entity and asking a more specific question about the label. For example, you can highlight the legal references and ask whether it refers to a decision etc.
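For example, an annotation task with a pre-highlighted span could look something like this (the LAW_REF label is hypothetical, and the start/end values are character offsets into the text):

{"text": "in the form of paragraph 3 of Article 71 of the CLT", "spans": [{"start": 30, "end": 40, "label": "LAW_REF"}]}

The annotator then only has to answer a focused question about that one highlighted reference, instead of re-reading the whole decision.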
I’m actually struggling a bit with your framework – I’m still finding my way around the whole “recipes” concept.
In the NER training circuit, I’m more used to reading scientific articles and running different deep learning architectures on standard, general-content corpora. I’m even interested in learning more about spaCy’s architecture using beam search – can you point me to any paper on it?
I feel I need to get more familiar with the whole active learning concept, and the way Prodigy’s recipes work. Should I follow the Prodigy readme to absorb these basic concepts? I’m a person who learns better by example.
The beam training is the same as described here: https://aclanthology.info/pdf/Q/Q14/Q14-1011.pdf . Specifically, we use the dynamic oracle to search for the best set of parses that conform to the gold-standard constraints, and update the weights so that the parses in that gold-standard set become more likely under the model. Importantly, the constraints don’t need to fully specify the parses — so we can train this way from partial annotations.
So, I finished watching the videos, and I also watched the insults classifier training one. Here are some of the questions I have:
In the insults training video, the number of “accept” comments is way lower than the number of “reject” ones. Doesn’t this cause an imbalanced dataset issue? How do you handle this?
Regarding the terms.teach recipe:
2.1. Is it only intended to produce an enhanced list of seeds, based on the initial seeds provided to it, or does it also update the word vectors somehow?
2.2. I suppose the terms are recommended based on a similarity given by the word vector model used, right?
2.3. Do you also do some sort of traversal of the corpus used for annotation, feeding Prodigy the sentences that are closest to the terms provided in the seeds – steering the tool towards the texts using those terms?
In the NER training video, you used some English model from spaCy to bootstrap the ner.teach recipe. Although the model is used here, it’s not updated until you run the ner.batch-train recipe, right?
Do you consider the progress bar to be an estimate of how much text or how many annotations would be necessary to train custom models? I need to provide management with an estimate, so I need to understand the criteria behind it.
I was a bit confused about the “ner.print-stream” example on the new entity video. What is the difference between the “source” and the “–loader” parameters? I was under the impression that both were supposed to feed texts to the recipe.
Yes, that's true – but this isn't a bad thing here, because fundamentally, the data is very imbalanced, too. Most texts on Reddit are not insults. Using the active learning bootstrapping also helped in this case, because it made sure we annotated more insults proportionally than we would have if we had just streamed in the text in order.
We're not updating the actual word vectors here, but the target vector. So as you click accept and reject, the suggestions will change based on what you've accepted and rejected so far.
Yes, exactly. The recipe iterates over the model's vocab and will suggest the terms that are most similar to the target vector. The source of the recipes are included with Prodigy, so you can also see the implementation of this in the code: we keep one object for the accepted terms and one for the rejected terms, and then use their vectors to decide whether a new term should be suggested or not.
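The actual recipe code is more involved, but the basic idea is roughly this simplified sketch (the model name, example terms and the exact scoring are only illustrative, and it assumes a spaCy v2 model with word vectors, so that iterating over the vocab yields entries that have vectors):

import numpy
import spacy

nlp = spacy.load('en_core_web_lg')   # a model that ships with word vectors

accepted = ['grant', 'deferred']     # terms you said "accept" to
rejected = ['appeal']                # terms you said "reject" to

def mean_vector(terms):
    return numpy.mean([nlp.vocab[t].vector for t in terms], axis=0)

def cosine(a, b):
    return a.dot(b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

target = mean_vector(accepted)
avoid = mean_vector(rejected)

# score every word in the vocab by similarity to the accepted terms,
# penalised by similarity to the rejected ones
candidates = [w for w in nlp.vocab if w.has_vector and w.is_alpha and w.is_lower]
scored = sorted(candidates,
                key=lambda w: cosine(w.vector, target) - cosine(w.vector, avoid),
                reverse=True)
print([w.text for w in scored[:20]])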
Do you mean when using a recipe with patterns, like ner.teach? The patterns are only used as exact matches – otherwise, we'd be duplicating functionality from the model's active learning, which takes care of the context-based suggestions. Once you've annotated enough pattern matches, the model will start suggesting examples, too – and those are based on the context and context-based similarities. So if you have representative seeds/patterns, this should happen automatically.
The model used in ner.teach is updated in the loop as the user annotates, but it's not saved as the "final" model. You'll always achieve better accuracy with batch training, where you can make several passes over the data, use a dropout rate and tweak the other hyperparameters.
In the active learning-powered recipe, the progress bar shows an estimate of when the loss will hit 0. This is the best metric we can use to roughly estimate how many more annotations we'll need until the model has learned everything it can.
If you need to provide more detailed reports on the progress and future outlook, you might also want to check out the ner.train-curve recipe (see here for details and example output). It can help answer the question of "Will my model improve if I annotate more similar data?" The recipe will run ner.batch-train, but with different portions of the data: by default, 25%, 50%, 75% and 100%. As a rule of thumb, if you see an improvement within the last 25% (between using 75% and all annotations), it's likely that the model will improve with more annotations.
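The basic invocation is just the dataset and the model, e.g. something like prodigy ner.train-curve your_dataset en_core_web_sm (plus the same training options you'd pass to ner.batch-train) – your_dataset being a placeholder here.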
So in your report to management, you could include something like: "We collected 8,000 annotations and 2,000 evaluation examples. After training, the model achieved an accuracy of 84.2. The last 25% of the training examples were responsible for a 3% increase in accuracy on average. This indicates that we'll likely be able to improve the accuracy further by collecting more annotations of a similar type."
The ner.print-stream recipe (example here) is mostly just a quick way of previewing the model's predictions on an incoming stream of text. It uses the same loading mechanism as all other recipes that take input data.
In this example, we're loading in data from the Reddit comments corpus. Because the corpus is cool and something we also use a lot, Prodigy ships with a built-in loader for it. Normally, we try to be smart and guess the loader from your file extension (.jsonl, .txt). But you could also set that explicitly by specifying --loader txt or like in this case, --loader reddit.
What I meant is whether the sentences presented in Prodigy are sort of "filtered": if the user is annotating texts based on the insults seeds, then almost every text presented for annotation contains an insult or a swear-word, which would be quite a coincidence, considering the whole Reddit corpus. So I suppose you sort the texts considering the presence of the seed terms, right? That's what I meant by "steering the tool". If I'm using sys.stdin as the input for the model, is this also going to work?
Ok, understood this. What I meant is that if you're using --loader reddit, what was the point of the check/RC_2017-10-1.bz2 parameter (26:48 of the video)? I was under the impression that this was also some sort of input corpus for the model to show the previews.
Yes, as the texts stream through, Prodigy will start by only presenting the texts containing insults. As you annotate, the model will learn from those decisions and will start suggesting examples, too, which are also presented for annotation. This approach works especially well if you're dealing with imbalanced categories and a very large corpus like the Reddit data.
Stdin can only be used for the data source, i.e. the text – but yes. The way you load in the actual text isn't important; it will still flow through the same process, whether it comes from a file or a custom script.
Ah, I think I got it, sorry. That's correct, check/RC_2017-10-1.bz2 is the file path to the data and the --loader reddit argument only says "use the loader function for Reddit". The Reddit data is distributed as JSON in monthly archive files, so we need to do some pre-processing in order to load in the text (extract the comment body, strip out HTML etc.). Because we use the Reddit set a lot, Prodigy comes with a built in loader for that, which you can enable by setting --loader reddit.
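So the two come together roughly like this – the file path is the source, and --loader reddit tells Prodigy how to read it:

prodigy ner.print-stream en_core_web_sm check/RC_2017-10-1.bz2 --loader reddit

With a plain text file it would be --loader txt instead (or no --loader at all, since the .txt extension is usually enough to guess it).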
I did some testing here with some seed terms, but I was unable to produce a long list of terms, maybe because the vocabulary would be quite narrow in my situation. For example, I commented earlier that I want to identify sentences that contain decisions emitted by judges. So I thought about terms like “decide”, “grant”, “reject”, “deferred”, and ended up getting a very small list of terms (only about 10) that would carry the same meaning, after using terms.teach to scroll over about 200 terms. Most of these terms are verbs, so I’m not sure this would make any difference when applying the seeds to another task, such as ner.teach, considering named entities are usually nouns. Does it make sense to use ner.teach to identify these verbs? Maybe I should switch to a classification task instead, if it doesn’t make sense.
After producing the term seeds, I tried providing a corpus to do some annotation, using the following command:
but almost none of the texts presented to me contained any of the 10 terms in the patterns list. So I tried changing the load script to aim specifically at texts that would be more likely to contain the terms, but, even so, Prodigy never highlights the provided seed terms, only random nouns present in each text. Is this another hint that I should not be using verbs as “named entities”?
Waiting for your recommendations on my scenario. Thanks!
@ines or @honnibal, do you have anything to say about my previous comment? I’m reading some papers related to Argument Mining, and there is a considerable number of papers handling AM as a sequence tagging or RE problem, such as http://www.aclweb.org/anthology/P/P17/P17-1002.pdf and http://www.aclweb.org/anthology/P16-1105. In their datasets, they annotate long sequences of text and pass them to LSTM-based models.
The following annotation example makes me think that it would be reasonable to use other types of words for a task similar to NER. I’m not sure it would be feasible with a CNN-based model, such as the one from spaCy/Prodigy, since in theory it doesn’t capture long dependencies quite as well as a BiLSTM. What do you think?