ner.manual - simple usage

Dear Support,
I have a couple of newbie questions about the usage of ner.manual.

Q1: In ner.manual, do I have to annotate a term (e.g. "gigigi") each time I see it across different tasks? This term may be repeated many times in the document. Is that useful for disambiguation?

Q2: In ner.manual I have around 600 phrases to annotate, but the annotator received only 120 of them. I believe there's no active learning in ner.manual – or am I wrong?

Thanks in advance for your support.

All the best

C.

Yes, if you're labelling manually, you usually want to label every instance of the term "gigigi" every single time. Named entity recognition is context-dependent, so you want your data to include the entities in a variety of contexts. For ambiguous entities, this is especially important.
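For reference, a single manually annotated task ends up looking roughly like this in Prodigy's format (the text, offsets and label here are made up for illustration; check your version's docs for the exact keys):

```python
# One annotated example: the entity is marked as a character-offset span
task = {
    "text": "I met gigigi at the conference.",
    "spans": [{"start": 6, "end": 12, "label": "PERSON"}],
}
# The offsets slice out exactly the annotated term
assert task["text"][6:12] == "gigigi"
```

Annotating "gigigi" in many different sentences gives the model many different contexts for the same surface form, which is what makes disambiguation learnable.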

Because labelling everything manually can be annoying and tedious, Prodigy tries to make this easier with semi-automated recipes like ner.teach (with active learning) or ner.match (without active learning) that suggest candidates and let you say yes or no.

No, the default ner.manual recipe should stream in all examples as they come in and not skip any. By "received", do you mean that they annotated everything, but you only have 120 tasks in the dataset? Some possible explanations could be:

  • Does your data contain any duplicate sentences? If so, Prodigy will filter those out.
  • If the annotator refreshes the browser, the Prodigy app will request the next batch of tasks – and until you've received all answers and the session is over, Prodigy can't know whether a task needs to be sent out again. This thread has more details on this and suggestions for a solution.
  • Always make sure to save your progress in the web app after you're done annotating. Otherwise, you might lose the last batch of annotations when you close the browser.
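The deduplication in the first point works on a hash of the input text. As a rough sketch of the idea (this is not Prodigy's actual implementation, just an illustration of input-hash filtering):

```python
import hashlib

def filter_duplicates(examples):
    """Yield only examples whose input text hasn't been seen before.
    Rough sketch of input-hash deduplication, not Prodigy's real code."""
    seen = set()
    for eg in examples:
        input_hash = hashlib.md5(eg["text"].encode("utf8")).hexdigest()
        if input_hash not in seen:
            seen.add(input_hash)
            yield eg

examples = [{"text": "Hello world"}, {"text": "Hello world"}, {"text": "Bye"}]
unique = list(filter_duplicates(examples))
assert len(unique) == 2  # the duplicate sentence is dropped
```

So if your 600 phrases contain many repeats, the number of tasks the annotator actually sees can be much lower.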

Thanks @ines for the answers!

I have another couple of questions:

Q3: I've also checked your demo server at https://prodi.gy/demo?view_id=ner_manual:

  • On your demo, the annotator never sees a "No tasks available" message until the progress reaches 100%, which is what we want.
  • On our server, the progress bar shows the infinity symbol instead. How can I set it up to show the same progress bar as the demo?

The source file is a TXT file, for example:

prodigy ner.manual export3 it_core_news_sm export1.txt --label "PERSON, ORG, LOC" &

Q4: How do I add the license to the product? Do I just download the file you provided via email and use it?

Thanks again for your support, and any suggestions would be really appreciated.

All the best.

C.

Ah, I think I know what's going on here: by default, Prodigy streams are generators, and if a file can be read in line by line, Prodigy will do so and start yielding out tasks immediately. Generators have no length, because they don't know how many items there are in total – and if Prodigy doesn't know how many items there are in total, it can't display the progress.
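You can see the difference with plain Python (nothing Prodigy-specific here):

```python
# A generator stream has no length, so progress can't be computed up front
stream = (text for text in ["First example.", "Second example."])
try:
    len(stream)
except TypeError:
    print("generators don't support len()")

# Converting to a list materializes all items and exposes __len__,
# at the cost of loading everything into memory up front
stream = list(stream)
assert len(stream) == 2
```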

A simple thing you could do is edit the recipe in recipes/ner.py and find the following line:

stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')

... and replace it with this:

stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')
stream = list(stream)

To find the source of your Prodigy installation, you can run the following:

python -c "import prodigy; print(prodigy.__file__)"

Do you mean the software license? The Prodigy library doesn't connect to the internet or otherwise "phone home", so you don't need to enter the license key when you use the software. However, you should keep it safe for future reference.

Many thanks @ines for your reply! I've modified ner.py following your instructions:

But without success. For example, in the ner.manual recipe I added stream = list(stream) as follows:

@recipe('ner.manual',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        label=recipe_args['label_set'],
        exclude=recipe_args['exclude'])
def manual(dataset, spacy_model, source=None, api=None, loader=None,
           label=None, exclude=None):
    """
    Mark spans by token. Requires only a tokenizer and no entity recognizer,
    and doesn't do any active learning.
    """
    log("RECIPE: Starting recipe ner.manual", locals())
    nlp = spacy.load(spacy_model)
    log("RECIPE: Loaded model {}".format(spacy_model))
    # Get the label set from the `label` argument, which is either a
    # comma-separated list or a path to a text file. If labels is None, check
    # if labels are present in the model.
    labels = label
    if not labels:
        labels = get_labels_from_ner(nlp)
        print("Using {} labels from model: {}"
              .format(len(labels), ', '.join(labels)))
    log("RECIPE: Annotating with {} labels".format(len(labels)), labels)
    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')

    stream = list(stream)

I'd be happy to know if there's something wrong in this edited ner.py, and whether I also need to modify the interface to make it work. :sunny:

Another thing that IMHO would be useful to add to the ner.manual interface is the total number of tasks alongside the progress, like "3% of 150".

Again, thanks for your support.

All the best.

C.

Hi @ines, if you have any ideas or suggestions, or just a pointer to the relevant documentation, please let me know.

All the best

C.

Sorry, I think I told you to add this in the wrong position: the idea is that the recipe needs to return a stream that exposes a __len__ (i.e. has a length, which is the case for regular lists, but not for generators). So try converting the stream as late as possible in the recipe:

stream = add_tokens(nlp, stream)
stream = list(stream)
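The "as late as possible" part matters because wrappers like add_tokens are themselves generators: converting the stream to a list before the last wrapper leaves you with a generator again at the end. A sketch with a stand-in wrapper (add_tokens_stub is made up here to mimic the lazy behaviour, it's not Prodigy's function):

```python
def add_tokens_stub(stream):
    """Stand-in for a lazy generator wrapper like Prodigy's add_tokens:
    whatever it wraps, its own output has no __len__ either."""
    for eg in stream:
        yield dict(eg, tokens=eg["text"].split())

stream = (eg for eg in [{"text": "hello world"}, {"text": "good bye"}])
stream = add_tokens_stub(stream)  # still a generator: no __len__ yet
stream = list(stream)             # materialize last, after all wrappers
assert len(stream) == 2
assert stream[0]["tokens"] == ["hello", "world"]
```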

Thanks a lot @ines! It works like a charm: https://imgur.com/HBL73Tu

I really appreciate your support.

All the best

C.
