Linguistic features configured for a non-English model

When training a new blank non-English model, it seems that the combination of Prodigy and spaCy isn’t taking the output of the POS tagger into account when suggesting words. For instance, when I'm looking for names for the “PERSON” label and the various names are correctly tagged as “PROPN” by the POS tagger, that information is not used.

Is that to be expected? If not, where do I look to correct this and to verify exactly which features the pipeline has enabled?

spaCy's model components are separate and don't share any features, so the part-of-speech tags have no influence on the named entity recognizer and vice versa.

Using the POS tagger's predictions to bootstrap the entity recognizer is a nice idea, though, and something you could do via a custom recipe. For example, your stream could process the text, extract all spans of consecutive proper nouns and accept/reject whether they're entities or not.
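Here's a rough sketch of what that stream logic could look like. It's not a complete recipe: the function names `propn_spans` and `make_tasks` are made up for illustration, `nlp` is assumed to be a loaded spaCy pipeline with a tagger, and the task dictionaries follow the usual `"text"` / `"spans"` layout with character offsets. Each run of consecutive `PROPN` tokens becomes its own task that you can accept or reject.

```python
def propn_spans(doc):
    """Return (start_char, end_char) for each run of consecutive PROPN tokens."""
    spans = []
    run = []  # tokens in the proper-noun run currently being collected
    for token in doc:
        if token.pos_ == "PROPN":
            run.append(token)
        elif run:
            # The run just ended: record its character offsets.
            spans.append((run[0].idx, run[-1].idx + len(run[-1].text)))
            run = []
    if run:  # the text ended while still inside a run
        spans.append((run[0].idx, run[-1].idx + len(run[-1].text)))
    return spans


def make_tasks(nlp, texts, label="PERSON"):
    """Yield one annotation task per proper-noun span (hypothetical helper)."""
    for doc in nlp.pipe(texts):
        for start, end in propn_spans(doc):
            yield {
                "text": doc.text,
                "spans": [{"start": start, "end": end, "label": label}],
            }
```

In a custom recipe you would return something like `make_tasks(nlp, texts)` as the stream. How well this works depends entirely on how reliable the tagger's `PROPN` predictions are for your language.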

Alternatively, you could use the POS information in your match patterns to narrow down the selection. For example, suggest the token "apple" as an ORG, but only if it's tagged as a proper noun.

{"label": "ORG", "pattern": [{"lower": "apple", "pos": "PROPN"}]}

If your tagger is good, this can really speed things up and improve the selection of examples to annotate.
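If you have a list of seed terms, you could generate patterns like this with a few lines of Python. This is only a sketch: `make_patterns` and `to_jsonl` are hypothetical helper names, and splitting terms on whitespace is an assumption that may not match your tokenizer.

```python
import json


def make_patterns(terms, label, pos="PROPN"):
    """Build one token pattern per term, requiring every token to have `pos`."""
    patterns = []
    for term in terms:
        # Assumption: whitespace splitting approximates the tokenizer's output.
        tokens = [{"lower": word.lower(), "pos": pos} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialize the patterns as JSONL, one pattern per line."""
    return "\n".join(json.dumps(p) for p in patterns)
```

Writing the result of `to_jsonl(make_patterns(["Apple"], "ORG"))` to a file gives you a patterns file containing the example line above.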

Thanks for the quick answer. I will look into a custom recipe then.

From the documentation here, I got the impression that the named entity recognizer depended on running after the POS tagger and the dependency parser:

Good to know that they are independent, since that means I can skip the dependency parser for now.
