annotating entities in text documents

I’d like to use Prodigy to annotate entities. Unfortunately, after several hours of testing, I couldn't figure out how to manage that. I created a blank model (language: de) and trained it by classifying documents, but this wasn't precise enough for this annotation task.

I found this example on the web page (first steps):

ANNOTATION TASK JSONL
{
  "text": "Apple",
  "label": "TECHNOLOGY",
  "spans": [{"start": 0, "end": 5, "label": "ORG"}],
  "meta": {"source": "My source", "foo": "bar"},
  "answer": "accept"
}

Is it possible to create those annotated entities using prodigy’s web app?

Hi,

I’m sure this has been frustrating — sorry about that!

Annotating entities is definitely a core use-case for Prodigy. Actually one of our motivations for creating Prodigy was that we need to do a lot of entity annotations ourselves — so this definitely won’t be a neglected feature.

I think the main problem you’re having is that the German models weren’t available for spaCy 2 when you were testing. This left you to start with a blank model, and we don’t currently have a good recipe for doing NER from a “cold start”.

We’re releasing a new version of spaCy today, which will be the first release candidate for v2.0.0. This release will include a German model with entity recognition for PER, LOC, ORG and MISC entities. You’ll be able to use this in the next version of Prodigy as a starting-point for the entity annotation, as described in this tutorial: https://prodi.gy/docs/workflow-named-entity-recognition

The other way to start doing entity annotations is to build a terminology list using the terms.teach recipe. The terminology list would then be used to suggest entities, which you would mark as Accept or Reject in context. We’re working on a tutorial for this workflow.
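A terms.teach session could look something like this (just a sketch – the dataset name, vectors model and seed terms are placeholders, and you'll need a model with word vectors loaded):

prodigy terms.teach org_terms de_model_with_vectors --seeds "Telekom,Vodafone,Lufthansa"

The resulting term list can then be used to suggest entity candidates in context.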

Matt

To add to Matt’s comment above, another suggestion: If you already have a lot of text and potential entities, you can also convert them to Prodigy’s JSONL format, and then use the mark recipe to simply collect annotations with no model in the loop or anything. The annotations will be stored in your dataset, and you can then use that to train a model – and keep improving by loading it into ner.teach and correcting its predictions.

A good strategy to extract entity candidates is to use spaCy v2.0’s new PhraseMatcher to find occurrences of the words from your terminology list in your data. (If you have word vectors trained on relevant data, you can also use Prodigy’s terms.teach recipe to help you select good candidates for your lists – or you can just do it manually.) For example:

import json
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('de')  # create a blank German pipeline
matcher = PhraseMatcher(nlp.vocab)  # initialise the matcher with the vocab

# Add match patterns for persons, orgs, whatever else you need
persons = ['Angela Merkel', 'Veronica Ferres', 'Dieter Bohlen']
matcher.add('PERSON', None, *[nlp(text) for text in persons])
orgs = ['Telekom', 'Vodafone', 'Lufthansa', 'Google']
matcher.add('ORG', None, *[nlp(text) for text in orgs])
# your data – ideally split into sentences or smaller chunks
texts = ['Some sentence', 'Another sentence', 'Lots of sentences']
examples = []  # store annotation examples here

for text in texts:
    doc = nlp(text)  # process the text
    matches = matcher(doc)  # find matches 
    # create spans for each match, using the match ID as the label, e.g.:
    # {'start': 0, 'end': 15, 'label': 'PERSON'}
    spans = [{'start': start, 'end': end, 'label': nlp.vocab.strings[m_id]}
            for m_id, start, end in matches]
    # add to list of examples in format supported by Prodigy
    examples.append({'text': text, 'spans': spans})

# write this to a .jsonl file
jsonl_data = '\n'.join([json.dumps(line) for line in examples])
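To get a file you can load into Prodigy, you could then simply write that string out – for example (the filename is just a placeholder):

with open('my_data.jsonl', 'w', encoding='utf8') as f:
    f.write(jsonl_data)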

Your data could then look like this:

{"text": "Angela Merkel zeigt sich optimistisch", "spans": [{"start": 0, "end": 13, "label": "PERSON"}]}

The .jsonl file can be loaded by Prodigy, and you can annotate them in context to confirm that they are indeed entities you’re looking for.

prodigy mark german_entities /path/to/my_data.jsonl

The above examples are probably fairly unambiguous – e.g. “Angela Merkel” will likely always be a PERSON entity. However, in the example of “Google”, you could easily end up with examples referring to both the company and the search engine (which you might want to label as a product, or something else). This really depends on your examples – so it’s always good to confirm the entities in context.

Thank you for that quick and helpful reply! I was lucky to have pre-annotated data, so I knew which documents I had to accept and which to reject. I split them up and held down 'a' for a minute and then 'x' for another minute. Compared to other language annotation tools, this was really comfortable!

You’ll be able to use this in the next version of Prodigy as a starting-point for the entity annotation

Sounds great!

The terminology list would then be used to suggest entities, which you would mark as Accept or Reject in context.

That's also a good way to get started. It would just be handier to mark the text inside the web app and tell it that this is the relevant information it needs to learn.

you can also convert them to Prodigy’s JSONL format, and then use the mark recipe to simply collect annotations with no model in the loop or anything.

I actually did that, but without the PhraseMatcher. This would definitely work, although I'd have to know which parts of my text are relevant before I even start the annotation task.

It sounds a little bit stupid, but in my case I want to annotate relevant information in the form of a whole sentence that is part of a text document.

To use Prodigy in its current state, I could parse the document sentence by sentence (using spaCy's sentence tokenizer) and save the document ID as meta information inside the .jsonl. The annotation task would then just be to accept or reject whether the parsed sentence is relevant or irrelevant.

Should work or am I wrong?

What's missing is the information about where to find this string in the text document, like you showed:
spans = [{'start': start, 'end': end, 'label': nlp.vocab.strings[m_id]} for m_id, start, end in matches]

Sounds like this might be a good case for the text classification mode? Just feed in the sentences, and annotate them with the label RELEVANT (or something like that). There's also a video tutorial we've recorded on this topic in which I'm training an insults classifier on Reddit data from scratch.
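For example, the sentences could be prepared along these lines (just a rough sketch – the model, file name and meta fields are placeholders, and any pipeline that sets sentence boundaries will do):

import json
import spacy

nlp = spacy.load('de_core_news_sm')  # any model with sentence boundaries works

documents = [('doc-1', 'Erster Satz. Zweiter Satz.'), ('doc-2', 'Noch ein Dokument.')]
examples = []
for doc_id, text in documents:
    doc = nlp(text)
    for sent in doc.sents:
        # one task per sentence, with the document ID as meta information
        examples.append({'text': sent.text, 'meta': {'doc_id': doc_id}})

with open('sentences.jsonl', 'w', encoding='utf8') as f:
    f.write('\n'.join(json.dumps(eg) for eg in examples))

You could then annotate those sentences with something like prodigy textcat.teach relevant_dataset de_core_news_sm sentences.jsonl --label RELEVANT (again, all names here are placeholders).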

After collecting a bunch of relevant sentences, you can export the dataset to a .jsonl file. This file can then be loaded back into Prodigy, for example, to annotate entities within the text.

prodigy db-out relevant_dataset /output_path

Yeah, this is where the NER model comes in. You need at least some model that will predict something, even if it's bad. When you call ner.teach, the model you load in will be used to predict entities in your text, and the candidates will then be shown in the app so you can decide if they're correct or not.

We just published spaCy v2.0.0a18 with a new German model that supports NER. We probably need to release an update of Prodigy that makes it compatible with the new spaCy version, so you can load in the new models.
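Once that update is out, a ner.teach session with the new German model could look roughly like this (the dataset, model and file names are just placeholders):

prodigy ner.teach german_entities de_core_news_sm /path/to/my_data.jsonl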

In the meantime, you can also pre-process your data with the new German model, generate the spans from the doc.ents and then annotate them one by one using Prodigy's mark recipe:

import json
import spacy

nlp = spacy.load('de_core_news_sm')

examples = []
for text in ['A document...', 'Another document...']:
    # if your texts are long, maybe iterate over the sentences, too?
    doc = nlp(text)
    spans = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
             for ent in doc.ents]
    examples.append({'text': doc.text, 'spans': spans})

jsonl_data = '\n'.join([json.dumps(line) for line in examples])

In the meantime, you can also pre-process your data with the new German model, generate the spans from the doc.ents and then annotate them one by one using Prodigy’s mark recipe...

Thanks for this suggestion. We'll have to evaluate this approach. But I think this would work if the sentence segmentation can be done properly.

In some cases, we also have to find information that is only relevant in context. For this annotation task, we would need the whole text of a document to be shown to the annotator. Maybe we have to wait until you implement that feature (annotating entities in a text containing more than one sentence).

I'm really looking forward to the release. Prodigy is much more handy than the tools we are using at the moment! By the way: when will that be?

Technically, you can already do that using the mark recipe for example. Prodigy will just render whatever you give it – so if you pass in a task with a really long text and multiple entities in the spans, this will be shown to the annotator just like this.

You could also duplicate the long texts and add one annotation task for each entity. This would let the annotator see the whole context, and give feedback on one entity at a time (which probably makes more sense).
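A rough sketch of that duplication step, assuming examples in the {'text': ..., 'spans': [...]} format created earlier:

# `examples` is a list of dicts like {'text': ..., 'spans': [...]} as above
tasks = []
for eg in examples:
    for span in eg['spans']:
        # repeat the full text for each entity, so the annotator sees the whole context
        tasks.append({'text': eg['text'], 'spans': [span]})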

However, in most cases, you do want to move through shorter examples quickly – which is why the built-in recipes usually split sentences. If the annotator needs a few minutes just to read the example, you'll lose a lot of the benefits of the binary interface and UX design.

Thanks! We're currently in the final stages of getting spaCy v2.0.0 stable released and after that, we'll switch back to Prodigy, get it ready for the new spaCy version, add more features and work on the v1.0 release :blush:

Thanks! We're currently in the final stages of getting spaCy v2.0.0 stable released and after that, we'll switch back to Prodigy, get it ready for the new spaCy version, add more features and work on the v1.0 release :blush:

Great!

However, in most cases, you do want to move through shorter examples quickly – which is why the built-in recipes usually split sentences. If the annotator needs a few minutes just to read the example, you’ll lose a lot of the benefits of the binary interface and UX design.

Maybe your design and this one specific annotation task just don't match. But thanks for the workaround. We'll try that anyway. Using Prodigy could protect our team from further mental breakdown :wink:

@tom In case you haven't seen it, spaCy v2.0 is now live! :tada:
Release v2.0.0: Neural networks, 13 new models for 7+ languages, better training, custom pipelines, Pickle & lots of API improvements · explosion/spaCy · GitHub

We'll start working on making Prodigy compatible with the new version today, so you'll be able to use it with all 13 new language models.

Yes, if you're running a large-scale annotation project, or you have a very clearly defined task that needs to be annotated "statically", you might want to use a different approach.

Prodigy's main focus is still the development/data science side of things. You often don't know whether an idea will work before you try it. So using Prodigy, you can quickly annotate a bunch of examples yourself, train a model, evaluate it and find out if your idea works. For example, even if your model isn't very good yet – if you run the train-curve recipes and see that the accuracy improves in the last 25%, this is usually a good indicator that more training examples will improve the model.
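For example, with an NER dataset and a base model (names are placeholders), that could look like:

prodigy ner.train-curve my_dataset de_core_news_sm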

So basically, you can make sure your idea works before you commission thousands of annotations. And you can do it all yourself, without having to write annotation manuals and wasting your time in lengthy meetings :stuck_out_tongue_winking_eye:


There is an issue with the example @ines. I'm running the code, but the start and end values of the entities are wrong.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('en')
matcher = PhraseMatcher(nlp.vocab)  # initialise the matcher with the vocab

# Add match patterns for persons, orgs, whatever else you need
condition = ['hearing', 'tympanic', 'discharge']
characteristic = ['intact']
location = ['bilateral']
body_part = ['ear', 'face', 'eye']
matcher.add('BODY_PART', None, *[nlp(text) for text in body_part])
matcher.add('CONDITION', None, *[nlp(text) for text in condition])
matcher.add('LOCATION', None, *[nlp(text) for text in location])
matcher.add('CHARACTERISTIC', None, *[nlp(text) for text in characteristic])

texts = open('PE_sample_data.txt', 'r').readlines()
texts = [text.strip() for text in texts if text]
examples = []  # store annotation examples here

for text in texts:
    doc = nlp(text)  # process the text
    matches = matcher(doc)  # find matches
    # create spans for each match, using the match ID as the label, e.g.:
    # {'start': 0, 'end': 15, 'label': 'PERSON'}
    spans = [{'start': start, 'end': end, 'label': nlp.vocab.strings[m_id]} for m_id, start, end in matches]
    print(spans)

    # add to list of examples in format supported by Prodigy
    examples.append({'text': text, 'spans': spans})

here is what I’m getting:

[{'start': 1, 'end': 2, 'label': 'CHARACTERISTIC'}]
[{'start': 2, 'end': 3, 'label': 'CHARACTERISTIC'}]

Sorry – my example was semi-pseudocode for demonstration purposes. You’re right – the start and end returned by the matcher are actually the token indices, not the character indices (which you need to highlight the exact span of text). So you need to create a Span for the entity first, using the start and end of the matched tokens, and then take its character offsets.

Something like this should work:

spans = []
for m_id, start, end in matches:
    entity = doc[start : end]  # get slice of the document
    spans.append({'start': entity.start_char, 'end': entity.end_char, 
                  'label': nlp.vocab.strings[m_id]})

Hey Honnibal,

I am currently working with a blank model (Canadian French) and I am looking for a tool to quickly annotate text. I built something for myself using the BRAT annotation tool, because I need to be able to correct the annotations when starting from scratch – otherwise the model would always be wrong.

I do get what you say in the Prodigy NER docs about wanting to send as few bits as possible and having the user perform as few actions as possible, but in my case, when I train a model from scratch, I would want to be able to correct a few annotations. Otherwise I would always be hitting the “X” button and end up without any training examples.

I think it is a totally different use case, but if you'd like somebody to work on the cold-start problem, I will definitely explore some ways to handle it in my school research.

Hello Ines,
Sorry for spamming this whole thread, but I have a couple of questions about your tool, which seems really interesting.

How do you propose examples, given a model and its performance? Do you try to find examples on which the model is bad, so that it will boost its performance when retraining? If that's not the case, I think it would be an insane feature :slight_smile:

A little bit in line with the question I asked Honnibal: proposing sentences on which the model is bad gives a high probability that the annotations won't be 100% correct, so being able to edit the annotations would, in some sense, be useful, I think.

You're right that using the active learning from a "cold start" isn't very efficient. I think if you're starting a new model the best way is actually to use the terms.teach recipe and create a word list for the entities you're interested in. Then you can create a rule-based system that starts suggesting you candidate entities, which gives you something to say yes/no to. This then gives the model something to learn from, so the active learning can get started.

You can also start the training by doing manual annotation, with something like BRAT. However I think there's a similar issue, because you want to make sure you select texts which have a decent number of the entities you're interested in. This means you end up using a rule-based approach to "bootstrap" as well.

We're working on a tutorial and an extra NER recipe, ner.bootstrap, that makes this workflow more explicit. It's working quite well in our testing so far, especially when using more detailed patterns with spaCy's Matcher class.
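To give an idea of what such patterns can look like with spaCy's Matcher (this is just generic Matcher usage with made-up example patterns, not the unreleased recipe):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('de')
matcher = Matcher(nlp.vocab)
# token-based pattern: matches "Deutsche Telekom" regardless of capitalisation
matcher.add('ORG', None, [{'LOWER': 'deutsche'}, {'LOWER': 'telekom'}])
# a number followed by the token "Euro", e.g. "500 Euro"
matcher.add('MONEY', None, [{'LIKE_NUM': True}, {'LOWER': 'euro'}])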

You can control this by setting the sorter in the recipe, but by default we use what the literature calls "uncertainty sampling": we pick examples where the confidence is closest to 0.5. This policy produces the largest expected gradient. There are some tricks to doing this nicely in a streaming setting, while keeping the application responsive. Sometimes it's good to bias the sampling towards predictions of "True", because we can use annotations that are marked "accept" more directly. If we answer "reject" we don't come away knowing the annotation, just that the model was wrong.

The uncertainty sampling is done by the function prefer_uncertain. The bias argument lets you shift towards predictions closer to 1.0 or 0.0. By default, the ner.teach recipe sets a bias of 0.8.
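In a custom recipe, that could look roughly like this (a sketch only – the exact class names and arguments may differ between versions, and the model and file names are placeholders):

import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.ner import EntityRecognizer

nlp = spacy.load('de_core_news_sm')
model = EntityRecognizer(nlp, label=['PER'])  # scores candidate entities
stream = JSONL('/path/to/my_data.jsonl')      # raw examples
# model(stream) yields (score, example) tuples; the sorter prefers uncertain scores,
# shifted towards "accept" candidates via the bias argument
stream = prefer_uncertain(model(stream), bias=0.8)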


Thank you very much Honnibal for your answer! Really interesting.