rel.manual to train NER and dependency parser

Hi Prodigy team,

I have upgraded Prodigy to v1.10 and created around 500 annotations with the rel.manual recipe, covering both spans and relations. Now I don't know how to train an NER model from the same dataset. Whenever I try to train NER using:

CMD: prodigy train ner rel_compdict_v1 ./ner_v1 -es 0.2 -o ./ner_teach_v3

where 'rel_compdict_v1' is the dataset annotated using the rel.manual recipe,

it shows me the following error:

✘ Invalid data for component 'ner'

spans -> 16 -> start field required
spans -> 16 -> end field required

and when I train the parser using:
prodigy train parser rel_compdict_v1 ./ner_v1 -es 0.2 -o ./ner_teach_v3

I get:

Created and merged data for 522 total examples
Using 418 train / 104 eval (split 20%)
Component: parser | Batch size: compounding | Dropout: 0.2 | Iterations: 10
:information_source: Baseline accuracy: 0.000

=========================== :sparkles: Training the model ===========================

:heavy_check_mark: Saved model: /home/sahil/py/matterhorn/sahil/ner_teach_v3

But the model is not trained.

Please help!

Regards,
Sahil


Hi! That's strange – this would indicate that somewhere in your data, it ended up with a span that doesn't specify a start and end :thinking: Are you able to find this example in your data and if so, can you share what it looks like?
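
If it helps, one quick way to locate it is to export the dataset with db-out and scan the JSONL for spans without those fields. A minimal sketch, untested and with a placeholder filename:

    import json

    # Scan a db-out export for spans missing "start" or "end"
    with open("rel_compdict_v1.jsonl", encoding="utf8") as f:
        for line_no, line in enumerate(f, 1):
            eg = json.loads(line)
            for i, span in enumerate(eg.get("spans", [])):
                if "start" not in span or "end" not in span:
                    print(f"line {line_no}, span {i}: {span}")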

Hi Ines,
Thanks for the quick reply.
Yes, I could figure it out. I don't know how, but a span with no 'start' and 'end' was present in the data. So now I can train NER.

But I'm stuck on training the dependency parser with entity relations. Any help there?

Regards,
Sahil

How did you create the data? Did you ever import anything manually, or did the source data maybe include any pre-defined spans? If not and you only used rel.manual on raw data, that's also something we should look into, because it could indicate a bug with how the spans are set.

I still need to look into that. Can you export your dataset with data-to-spacy and if so, what does the result look like? Does it contain dependency labels and heads?

No, I never added anything manually; the dataset was annotated through rel.manual only. To solve it, I exported the dataset using db-out, cleaned the annotations and re-imported them.
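
For reference, the round trip looked roughly like this (the cleaned dataset name is just what I chose):

prodigy db-out rel_compdict_v1 > rel_compdict_v1.jsonl
# fix or drop the offending spans in the JSONL, then re-import
prodigy db-in rel_compdict_v1_clean rel_compdict_v1.jsonl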

I tried data-to-spacy and got annotations like:

"tokens":[
              {
                "id":31,
                "orth":"a",
                "head":0,
                "dep":""
              },
              {
                "id":32,
                "orth":"Maryland",
                "head":-3,
                "dep":"juri"
              },
              {
                "id":33,
                "orth":"corporation",
                "head":0,
                "dep":""
              },
              {
                "id":34,
                "orth":"(",
                "head":0,
                "dep":""
              },
              {
                "id":35,
                "orth":"\u201c",
                "head":0,
                "dep":""
              },
              {
                "id":36,
                "orth":"Ashford",
                "head":0,
                "dep":""
              },
              {
                "id":37,
                "orth":"Select",
                "head":-8,
                "dep":"nick"
              },
              {
                "id":38,
                "orth":"\u201d",
                "head":0,
                "dep":""
              },
              {
                "id":39,
                "orth":")",
                "head":0,
                "dep":""
              },
              {
                "id":40,
                "orth":",",
                "head":0,
                "dep":""
              },
              {
                "id":41,
                "orth":"ASHFORD",
                "head":0,
                "dep":""
              },
              {
                "id":42,
                "orth":"HOSPITALITY",
                "head":0,
                "dep":""
              },
              {
                "id":43,
                "orth":"SELECT",
                "head":0,
                "dep":""
              },
              {
                "id":44,
                "orth":"LIMITED",
                "head":0,
                "dep":""
              },
              {
                "id":45,
                "orth":"PARTNERSHIP",
                "head":0,
                "dep":""
              },
              {
                "id":46,
                "orth":",",
                "head":0,
                "dep":""
              },
              {
                "id":47,
                "orth":"a",
                "head":0,
                "dep":""
              },
              {
                "id":48,
                "orth":"Delaware",
                "head":-3,
                "dep":"juri"
              },.......

Okay, that does suggest there's potentially a bug that causes some spans to be added incorrectly without the start/end, which is strange :thinking: I'll look into this.

Do you have an example of the dependencies you've annotated? Are they all between single tokens?

Yes, I am attaching a screenshot herewith.

Thanks for sharing! This at least partly explains things. If this is your annotation scheme, training a regular dependency parser is not going to work well, as it expects to predict dependencies between single tokens, not entity spans. So you probably want to export your annotations and use a different model implementation for general-purpose relation prediction, not a syntactic dependency parser.

At the moment, Prodigy will filter out all relations that are not between two single tokens, because the parser can't be updated with those. We should probably show at least one warning like "Excluding X relation annotations" when training a parser, so you know that there's a problem.
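
To check how many of your annotations are affected, you could count them in a db-out export, along these lines (a sketch assuming the standard rel.manual output format with "head_span" and "child_span"; the filename is a placeholder):

    import json

    single, multi = 0, 0
    with open("rel_compdict_v1.jsonl", encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            for rel in eg.get("relations", []):
                spans = (rel["head_span"], rel["child_span"])
                if all(s["token_start"] == s["token_end"] for s in spans):
                    single += 1  # usable for parser training
                else:
                    multi += 1   # would currently be filtered out
    print(single, "single-token relations,", multi, "multi-token relations")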


Thank you, Ines, for the insights.

Can you help me with your suggestions on two things, as I am new to NLP (but have worked on neural networks for image and video processing):

  1. Can I re-annotate the above example so the dependencies are between single tokens? If yes, can you give an example?

  2. Any model suggestions that would be good for my use case?

Hi Ines,

So, as you mentioned, dependencies can only be trained between single tokens. I re-annotated the dataset, making an alias for each entity as a single token (as shown in the attached pic), yet it is still not training. Any clue?

Hi! I think there might be another problem that somehow causes it to not extract the dependencies correctly – I'll take a look and we'll include a fix in the upcoming v1.10.2!

In the meantime, you could use Prodigy's db-out to export your annotations and then train with spaCy (or any other library) manually. The data will include the token indices, the labels and the heads, which is all you need to train a dependency parser.
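
With spaCy v2, that could look roughly like this: a minimal sketch, where TRAIN_DATA is a placeholder you'd build from the exported token indices, heads and labels.

    import random
    import spacy

    # Placeholder data in spaCy v2's simple training format:
    # heads are absolute token indices, deps are the relation labels
    TRAIN_DATA = [
        ("a Maryland corporation", {"heads": [2, 2, 2], "deps": ["dep", "juri", "ROOT"]}),
    ]

    nlp = spacy.blank("en")
    parser = nlp.create_pipe("parser")
    nlp.add_pipe(parser)
    for _, annots in TRAIN_DATA:
        for dep in annots["deps"]:
            parser.add_label(dep)

    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annots in TRAIN_DATA:
            nlp.update([text], [annots], sgd=optimizer, losses=losses)
        print(i, losses)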


Thank you for the quick reply, we shall export and continue in the meantime... :slight_smile:

Just released Prodigy v1.10.2, which should fix this problem! There were 2 issues here:

  • The sentence segmentation in dep.correct could cause tasks to incorrectly report mismatched tokenization. (If you saw a warning here, the easiest workaround would be to override the "text" property based on the tokens to make sure it matches; see the sketch below.)
  • The parsing results weren't printed correctly, so the training all worked; you just didn't get to see the results :sweat_smile:
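
For the first point, rebuilding "text" from the tokens could look like this (a sketch, assuming the standard token format where "ws" marks trailing whitespace):

    def rebuild_text(eg):
        # Reconstruct the text from the tokens so the tokenization always matches
        eg["text"] = "".join(
            t["text"] + (" " if t.get("ws", True) else "") for t in eg["tokens"]
        )
        return eg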

Hi Ines,

Thank you for the fix. Will test it later.

Hi @ines,

I have two questions and one note related to one of your answers above.

Is there any functionality to do this in spaCy?

Do you mean that using prodigy train will filter out the examples while training, or will they also be filtered out when exporting the data using db-out or data-to-spacy?

And last:

when doing what I described here, I did not get this message. It would be very helpful though. (I have Prodigy v1.10.2.)

Thanks

Not at the moment, no. So you'd have to implement that yourself.

db-out will just give you whatever is in the raw JSON data, but prodigy train and prodigy data-to-spacy will exclude annotations that can't be used, including relations spanning over multiple tokens for dependency parsing.

Ah, in this case, what you saw was a data validation error because the format was unexpected. The warning is shown if the format is alright but the annotations can't be used, e.g. because they don't map to valid tokens or because dependencies used to train the parser span over multiple tokens.
