How to implement Portuguese Language into Prodigy

Oswaldo · April 16, 2021, 3:25pm

I am interested in adding Portuguese language support to Prodigy, possibly by implementing doc.noun_chunks for pt.

ines · April 18, 2021, 1:02am

Hi! I hope I understand your question correctly! Prodigy integrates with spaCy out-of-the-box and supports Portuguese tokenization (blank:pt) and any trained spaCy models, including the Portuguese pipelines we provide (see "Portuguese" in the spaCy Models Documentation). So you can run any Prodigy recipe with your Portuguese text and a Portuguese model.

Adding doc.noun_chunks for Portuguese would be more related to spaCy itself. We always appreciate pull requests and you could, for instance, start off by copying the noun chunks iterator of a different language (e.g. Spanish) and adjusting it for Portuguese. See this discussion for details and example PRs: https://github.com/explosion/spaCy/discussions/7006

Oswaldo · April 19, 2021, 1:49pm

Hi Ines, thank you very much for your message. We will follow your guidelines.
Is there any recommendation for tagging compound proper names such as "Scientific Revolution of XVI Century"?

Best regards,

Oswaldo

ines · April 20, 2021, 1:43am

spaCy's Doc.noun_chunks iterators use the dependency parse and iterate over the tokens to extract base noun phrases. Here's an example of how this is implemented in Spanish and English (and you can find other implementations by looking around the source in spacy/lang):

github.com/explosion/spaCy/blob/master/spacy/lang/es/syntax_iterators.py

from typing import Iterator, Tuple, Union

from ...errors import Errors
from ...symbols import NOUN, PRON, PROPN
from ...tokens import Doc, Span


def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
    labels = [
        "nsubj",
        "nsubj:pass",
        "obj",
        "obl",
        "nmod",
        "pcomp",
        "appos",
        "ROOT",

This file has been truncated.

github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py

from typing import Iterator, Tuple, Union

from ...errors import Errors
from ...symbols import NOUN, PRON, PROPN
from ...tokens import Doc, Span


def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
    labels = [
        "oprd",
        "nsubj",
        "dobj",
        "nsubjpass",
        "pcomp",
        "pobj",
        "dative",
        "appos",

This file has been truncated.

I'm not sure how well the logic translates to Portuguese, but it could be a good starting point. You may have to change the labels it uses based on the dependency labels predicted by the Portuguese parser.
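To make the "change the labels" step more concrete, here is a simplified sketch of the general idea that does not use spaCy at all: walk over the tokens, and for every noun-like token whose dependency label is in a configurable set, emit the span from the left edge of its subtree up to the token itself. The `Token` class, the `PT_LABELS` set and the hand-annotated parse are hypothetical stand-ins for illustration only. The real iterators in spacy/lang work on Doc objects and cover more cases.

```python
from dataclasses import dataclass
from typing import Iterator, List, Set, Tuple


@dataclass
class Token:
    text: str
    pos: str   # coarse part-of-speech tag, e.g. "NOUN"
    dep: str   # dependency label assigned by the parser
    head: int  # index of the syntactic head (the root points to itself)


# Hypothetical label set: adjust it to what the Portuguese parser predicts.
PT_LABELS: Set[str] = {"nsubj", "obj", "obl", "nmod", "appos", "ROOT"}
NOUN_LIKE: Set[str] = {"NOUN", "PROPN", "PRON"}


def left_edge(tokens: List[Token], i: int) -> int:
    """Return the index of the leftmost token in the subtree of token i."""
    for j in range(i):
        k = j
        # Climb from j towards the root; stop early if we pass through i.
        while k != i and tokens[k].head != k:
            k = tokens[k].head
        if k == i:
            return j
    return i


def noun_chunks(
    tokens: List[Token], labels: Set[str] = PT_LABELS
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) token offsets of base noun phrases."""
    prev_end = -1
    for i, tok in enumerate(tokens):
        if tok.pos not in NOUN_LIKE or tok.dep not in labels:
            continue
        start = left_edge(tokens, i)
        if start <= prev_end:
            continue  # overlaps a chunk that was already emitted
        prev_end = i
        yield start, i + 1


# Hand-annotated parse of "O gato preto comeu o peixe" (not parser output).
sentence = [
    Token("O", "DET", "det", 1),
    Token("gato", "NOUN", "nsubj", 3),
    Token("preto", "ADJ", "amod", 1),
    Token("comeu", "VERB", "ROOT", 3),
    Token("o", "DET", "det", 5),
    Token("peixe", "NOUN", "obj", 3),
]

chunks = [" ".join(t.text for t in sentence[s:e]) for s, e in noun_chunks(sentence)]
print(chunks)
```

Note that this sketch ends each chunk at the head noun, so the adjective "preto" after "gato" is left out. Portuguese adjectives often follow the noun, so extending the right edge is likely one of the adjustments a Portuguese iterator would need.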

A good way to start would be to write a bunch of test cases (sentences and the correct noun chunks that should be extracted). You can then run your noun_chunks iterator against them and adjust it until it covers the most frequent cases. If you've found a solution that works, feel free to submit a PR to spaCy – we'd definitely appreciate it.
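A test-driven setup like this might look roughly as follows. The sentences, the expected chunks and the `check_cases` helper are made up for illustration, and `placeholder_extractor` stands in for whatever wraps your iterator, for example a function returning `[chunk.text for chunk in nlp(text).noun_chunks]` once a Portuguese iterator exists.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, List[str]]

# Hypothetical gold cases: each sentence with the noun chunks you expect.
TEST_CASES: List[Case] = [
    ("O gato preto comeu o peixe.", ["O gato preto", "o peixe"]),
    ("Maria comprou um livro novo.", ["Maria", "um livro novo"]),
]


def check_cases(
    extract_chunks: Callable[[str], List[str]], cases: List[Case]
) -> List[Dict[str, object]]:
    """Run the extractor on every sentence and collect the mismatches."""
    failures: List[Dict[str, object]] = []
    for text, expected in cases:
        found = extract_chunks(text)
        if found != expected:
            failures.append({"text": text, "expected": expected, "found": found})
    return failures


def placeholder_extractor(text: str) -> List[str]:
    """Stand-in that finds nothing; replace it with your own iterator."""
    return []


failures = check_cases(placeholder_extractor, TEST_CASES)
print(f"{len(TEST_CASES) - len(failures)}/{len(TEST_CASES)} cases pass")
for failure in failures:
    print(failure)
```

Keeping the cases in a plain list makes it easy to add new sentences whenever you find a construction the iterator gets wrong.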
