Span annotation with ner.manual -- how to make use of ner.teach

Hi,
I am annotating several datasets in German using ner.manual. I want to extract phrases from sentences that indicate certain aspects. I use the following command:

python3 -m prodigy ner.manual diss_bavkbeg de_core_news_sm /root/.prodigy/diss_bavkbegcsv --label behandlung_A,alternativheilmethoden_A,vertrauensverhältnis_A,kinderfreundlichkeit_A,betreuung_engagement_A,gesamt_empfehlung_ATP

I have a large number of sentences, but most of them contain little to annotate. ner.batch-train does not get me good results. I wondered whether I could speed up the annotation process by training a model on the accepts/rejects (accept meaning we annotated something in the sentence). It would really help to be shown sentences that are more likely to be accepts; I can live without help finding the right spans.

I tried textcat.batch-train:

python3 -m prodigy textcat.batch-train diss_ARZT_bavkbeg_entity de_core_news_sm -n 14 -o /root/.prodigy/self-trained_models/self-trained_model_bavkbe

But it fails:

Loaded model de_core_news_sm
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.5/dist-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.5/dist-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/prodigy/recipes/textcat.py", line 205, in batch_train
    examples = convert_options_to_cats(examples, exclusive=exclusive)
  File "cython_src/prodigy/components/preprocess.pyx", line 277, in prodigy.components.preprocess.convert_options_to_cats
KeyError: 'label'

Am I missing something? It would be great to have this kind of support during annotation, but I'd like to avoid leaving Prodigy to program my own classifier and just preparing a text file of prospective accepts to use as a new input file.

By the way: is there any scientific foundation for this kind of model-assisted learning? I mean: is it methodologically okay to do so?

The idea of using textcat as a first filter definitely makes sense. I think there are a couple of small things that might be tripping you up.

One problem is that you should use the --no-missing setting in ner.batch-train. This tells the algorithm that the annotations are complete, i.e. that the entities you've labelled are the only ones present, which makes it easier to learn. Prodigy also supports annotation modes where you train from partial information, but that's not what you're doing here.
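
For example, with the dataset from your ner.manual command (the output path here is just a placeholder):

python3 -m prodigy ner.batch-train diss_bavkbeg de_core_news_sm --no-missing -o /root/.prodigy/self-trained_models/ner_model_bavkbeg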

The other thing is that the annotations you've collected are named entity annotations, so you can't train with the textcat.batch-train recipe straight away. It's really easy to add the category information though, based on whether you have any spans for each label. The easiest way would be to use prodigy db-out and pipe the data through a quick script like this:

import sys, json
# Set of labels in your data
all_labels = ["MY_LABEL1", "MY_LABEL2"]
for line in sys.stdin:
    eg = json.loads(line)
    # Get the set of labels we have a span for
    span_labels = set(span["label"] for span in eg.get("spans", []))  # use .get in case an example has no "spans" key
    cats = {}
    for label in all_labels:
        # Mark which labels are true for the example, which are false.
        if label in span_labels:
            cats[label] = 1.0
        else:
            cats[label] = 0.0
    eg["cats"] = cats
    # Print the updated example
    print(json.dumps(eg))
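
The whole round trip could then look something like this (assuming you save the script above as add_cats.py; the name of the new dataset and the file names are just suggestions):

python3 -m prodigy db-out diss_bavkbeg > ner_annotations.jsonl
python3 add_cats.py < ner_annotations.jsonl > textcat_annotations.jsonl
python3 -m prodigy db-in diss_bavkbeg_textcat textcat_annotations.jsonl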

Thank you for your answer! I tried it but failed to some extent: I had to save the results to a JSONL file and use db-in to feed textcat.batch-train. Some fields were still missing, so I added "options": ["MY_LABEL1", ...] and "accept": ["MY_LABEL1", "MY_LABEL2"] keys like in textcat output files, roughly like this:
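
(A sketch of what I added inside the loop of the script above; label names are placeholders as before.)

# Mirror the fields produced by textcat.manual: all labels as "options",
# and the labels that are true for this example in "accept"
eg["options"] = [{"id": label, "text": label} for label in all_labels]
eg["accept"] = [label for label in all_labels if eg["cats"][label] == 1.0]

The training then successfully worked. However, when using ner.teach, I get a quite long error message: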

Traceback (most recent call last):
  File "/usr/lib/python3.5/pickle.py", line 268, in _getattribute
    obj = getattr(obj, subpath)
AttributeError: module 'thinc.linear.linear' has no attribute 'lambda1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/pickle.py", line 911, in save_global
    obj2, parent = _getattribute(module, name)
  File "/usr/lib/python3.5/pickle.py", line 271, in _getattribute
    .format(name, obj))
AttributeError: Can't get attribute 'lambda1' on <module 'thinc.linear.linear' from '/usr/local/lib/python3.5/dist-packages/thinc/linear/linear.cpython-35m-x86_64-linux-gnu.so'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.5/dist-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.5/dist-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/prodigy/recipes/ner.py", line 122, in teach
    model = EntityRecognizer(nlp, label=label)
  File "cython_src/prodigy/models/ner.pyx", line 178, in prodigy.models.ner.EntityRecognizer.__init__
  File "/usr/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/usr/lib/python3.5/copy.py", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.5/copy.py", line 218, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/usr/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.5/copy.py", line 223, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/usr/lib/python3.5/copy.py", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.5/copy.py", line 174, in deepcopy
    rv = reductor(4)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 96, in __getstate__
    return srsly.pickle_dumps(self.__dict__)
  File "/usr/local/lib/python3.5/dist-packages/srsly/_pickle_api.py", line 14, in pickle_dumps
    return cloudpickle.dumps(data, protocol=protocol)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 954, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 284, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python3.5/pickle.py", line 408, in dump
    self.save(obj)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 814, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 840, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 774, in save_list
    self._batch_appends(obj)
  File "/usr/lib/python3.5/pickle.py", line 798, in _batch_appends
    save(x)
  File "/usr/lib/python3.5/pickle.py", line 495, in save
    rv = reduce(self.proto)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 96, in __getstate__
    return srsly.pickle_dumps(self.__dict__)
  File "/usr/local/lib/python3.5/dist-packages/srsly/_pickle_api.py", line 14, in pickle_dumps
    return cloudpickle.dumps(data, protocol=protocol)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 954, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 284, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python3.5/pickle.py", line 408, in dump
    self.save(obj)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 814, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 840, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 774, in save_list
    self._batch_appends(obj)
  File "/usr/lib/python3.5/pickle.py", line 798, in _batch_appends
    save(x)
  File "/usr/lib/python3.5/pickle.py", line 495, in save
    rv = reduce(self.proto)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 96, in __getstate__
    return srsly.pickle_dumps(self.__dict__)
  File "/usr/local/lib/python3.5/dist-packages/srsly/_pickle_api.py", line 14, in pickle_dumps
    return cloudpickle.dumps(data, protocol=protocol)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 954, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 284, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python3.5/pickle.py", line 408, in dump
    self.save(obj)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 814, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 840, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 774, in save_list
    self._batch_appends(obj)
  File "/usr/lib/python3.5/pickle.py", line 798, in _batch_appends
    save(x)
  File "/usr/lib/python3.5/pickle.py", line 495, in save
    rv = reduce(self.proto)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 96, in __getstate__
    return srsly.pickle_dumps(self.__dict__)
  File "/usr/local/lib/python3.5/dist-packages/srsly/_pickle_api.py", line 14, in pickle_dumps
    return cloudpickle.dumps(data, protocol=protocol)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 954, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 284, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python3.5/pickle.py", line 408, in dump
    self.save(obj)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 814, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 840, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 774, in save_list
    self._batch_appends(obj)
  File "/usr/lib/python3.5/pickle.py", line 801, in _batch_appends
    save(tmp[0])
  File "/usr/lib/python3.5/pickle.py", line 495, in save
    rv = reduce(self.proto)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 96, in __getstate__
    return srsly.pickle_dumps(self.__dict__)
  File "/usr/local/lib/python3.5/dist-packages/srsly/_pickle_api.py", line 14, in pickle_dumps
    return cloudpickle.dumps(data, protocol=protocol)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 954, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 284, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python3.5/pickle.py", line 408, in dump
    self.save(obj)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 814, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 840, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 774, in save_list
    self._batch_appends(obj)
  File "/usr/lib/python3.5/pickle.py", line 798, in _batch_appends
    save(x)
  File "/usr/lib/python3.5/pickle.py", line 495, in save
    rv = reduce(self.proto)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 96, in __getstate__
    return srsly.pickle_dumps(self.__dict__)
  File "/usr/local/lib/python3.5/dist-packages/srsly/_pickle_api.py", line 14, in pickle_dumps
    return cloudpickle.dumps(data, protocol=protocol)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 954, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 284, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python3.5/pickle.py", line 408, in dump
    self.save(obj)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 814, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 840, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 814, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 840, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.5/pickle.py", line 627, in save_reduce
    save(state)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 814, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 840, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 506, in save
    self.save_global(obj, rv)
  File "/usr/local/lib/python3.5/dist-packages/srsly/cloudpickle/cloudpickle.py", line 704, in save_global
    return Pickler.save_global(self, obj, name=name)
  File "/usr/lib/python3.5/pickle.py", line 915, in save_global
    (obj, module_name, name))
_pickle.PicklingError: Can't pickle <cyfunction LinearModel.<lambda> at 0x7fa56f166048>: it's not found as thinc.linear.linear.lambda1

Can you run pip list and check which versions of Prodigy, spaCy and Thinc you have installed?

Yes, it's currently:
spacy (2.1.4)
thinc (7.0.4)
prodigy (1.8.3)
de-core-news-sm (2.1.0)

So far I've avoided upgrading spaCy to 2.2.x because I'm afraid of changes in the tokenization and the problems that might follow.

-- Okay, updating to:
spacy (2.1.9)
and thus to
thinc (7.0.8)

leads to:

    ✨  ERROR: Can't find label 'behandlung_A' in model
  /root/.prodigy/self-trained_models/self-trained_model_bavkbeg
  ner.teach will only show entities with one of the specified labels. If a
  label is not available in the model, Prodigy won't be able to propose
  entities for annotation. To add a new label, you can specify a patterns file
  containing examples of the new entity as the --patterns argument or
  pre-train your model with examples of the new entity and load it back in.

Perhaps this comes from training a model with textcat.batch-train and then trying to use it with ner.teach? As mentioned above, that was the goal/idea.

The modified db-out data (annotated with ner.manual) that I use for training the model looks like this (inspired by the output of textcat.manual):

{"text": "Die Arzthelferinnen bekommen den Mund nicht auf und wenn doch mal, sind sie oft patzig.", "answer": "reject", "_task_hash": 369570306, "tokens": [{"id": 0, "text": "Die", "end": 3, "start": 0}, {"id": 1, "text": "Arzthelferinnen", "end": 19, "start": 4}, {"id": 2, "text": "bekommen", "end": 28, "start": 20}, {"id": 3, "text": "den", "end": 32, "start": 29}, {"id": 4, "text": "Mund", "end": 37, "start": 33}, {"id": 5, "text": "nicht", "end": 43, "start": 38}, {"id": 6, "text": "auf", "end": 47, "start": 44}, {"id": 7, "text": "und", "end": 51, "start": 48}, {"id": 8, "text": "wenn", "end": 56, "start": 52}, {"id": 9, "text": "doch", "end": 61, "start": 57}, {"id": 10, "text": "mal", "end": 65, "start": 62}, {"id": 11, "text": ",", "end": 66, "start": 65}, {"id": 12, "text": "sind", "end": 71, "start": 67}, {"id": 13, "text": "sie", "end": 75, "start": 72}, {"id": 14, "text": "oft", "end": 79, "start": 76}, {"id": 15, "text": "patzig", "end": 86, "start": 80}, {"id": 16, "text": ".", "end": 87, "start": 86}], "options": [{"id": "vertrauensverh\u00e4ltnis_A", "text": "vertrauensverh\u00e4ltnis_A"}, {"id": "behandlung_A", "text": "behandlung_A"}, {"id": "kinderfreundlichkeit_A", "text": "kinderfreundlichkeit_A"}, {"id": "alternativheilmethoden_A", "text": "alternativheilmethoden_A"}, {"id": "gesamt_empfehlung_ATP", "text": "gesamt_empfehlung_ATP"}, {"id": "betreuung_engagement_A", "text": "betreuung_engagement_A"}], "_input_hash": 1155386971}

Thanks!

Okay -- I trained a model with textcat.batch-train and then ner.batch-train. Now we are using ner.make-gold to annotate and it seems to work. However, when training with textcat, the results immediately reach an accuracy of 1.0, while ner stays at 0.31 at most after 30 epochs. That's okay, I don't expect any NER model to know our data well yet, but am I on the right track to get active learning working for us, i.e. will I get more accepts? And has the tokenization changed from spaCy 2.1.x to 2.2.x? There seem to be problems now. Isn't the tokenizer rule-based?

We have duplicates everywhere -- does this come from ner.make-gold? (It seems like ner.make-gold is starting over with the input file from the beginning, so maybe my idea isn't possible with Prodigy at all.)

I think you somehow ended up with slightly messy datasets that mix annotations of different types and from different processes. Ideally, you want to create a separate dataset for each annotation experiment. If you mix annotations from, say, ner.manual (fully manual, all entities gold-standard, no missing values) and ner.teach (binary, only one span at a time, all other tokens missing values) in the same set, you won't be able to train a useful model on them, because there's no way to tell which examples are gold-standard and which aren't, and you might even have a bunch of conflicts.

I'd recommend just exporting the data you have, going through it in the JSON file or with a Python script, and seeing if you can clean it up a bit. The _view_id of each record tells you the ID of the annotation interface, so you probably want to separate examples created with ner (binary) from those created with ner_manual (manual). Each example also has an _input_hash, so you can identify annotations created on the same input text. You can also call prodigy.set_hashes(example, overwrite=True) on each example to make sure you have no stale hashes, and then use the _task_hash to find duplicates.
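
Here's a rough sketch of that clean-up, assuming the data was exported with db-out to a JSONL file (all file names are placeholders):

import json
from prodigy import set_hashes

manual_examples, binary_examples = [], []
seen_task_hashes = set()

with open("mixed_dataset.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        # Recompute the hashes so stale values don't hide duplicates
        eg = set_hashes(eg, overwrite=True)
        if eg["_task_hash"] in seen_task_hashes:
            continue  # skip duplicate annotations
        seen_task_hashes.add(eg["_task_hash"])
        # Separate the examples by the interface they were created with
        if eg.get("_view_id") == "ner_manual":
            manual_examples.append(eg)
        elif eg.get("_view_id") == "ner":
            binary_examples.append(eg)

with open("ner_manual_clean.jsonl", "w", encoding="utf8") as f:
    for eg in manual_examples:
        f.write(json.dumps(eg) + "\n")
with open("ner_binary_clean.jsonl", "w", encoding="utf8") as f:
    for eg in binary_examples:
        f.write(json.dumps(eg) + "\n")

You can then re-import each file into its own dataset with db-in.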
