How to implement Portuguese Language into Prodigy

I'd like to add Portuguese language support to Prodigy, possibly by creating doc.noun_chunks for pt.

Hi! I hope I understand your question correctly. Prodigy integrates with spaCy out of the box and supports Portuguese tokenization (blank:pt) as well as any trained spaCy pipeline, including the Portuguese pipelines we provide (see the Portuguese page of the spaCy models documentation). So you can run any Prodigy recipe with your Portuguese text and a Portuguese model.
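To make that a bit more concrete, here's a minimal sketch of preparing Portuguese text for a recipe. It assumes you load your data as a JSONL file with one {"text": ...} object per line. The file name, example sentences, dataset name and labels are all placeholders I made up for illustration.

```python
import json
from pathlib import Path

# Hypothetical example sentences; replace with your own Portuguese text.
texts = [
    "A Revolução Científica começou no século XVI.",
    "O Oswaldo mora em São Paulo.",
]


def write_jsonl(path, texts):
    """Write one JSON object per line, each with a "text" key."""
    with Path(path).open("w", encoding="utf8") as f:
        for text in texts:
            # ensure_ascii=False keeps accented characters readable in the file
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


write_jsonl("pt_examples.jsonl", texts)

# Then, on the command line (dataset name and labels are placeholders):
#   prodigy ner.manual pt_dataset blank:pt ./pt_examples.jsonl --label PERSON,ORG
```

Using blank:pt only gives you the Portuguese tokenizer. If you want model suggestions, you'd pass the name of a trained Portuguese pipeline instead.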

Creating doc.noun_chunks for Portuguese would be more related to spaCy itself. We always appreciate pull requests, and you could, for instance, start off by copying the noun chunks iterator of a different language (e.g. Spanish) and adjusting it for Portuguese. See this discussion for details and example PRs: https://github.com/explosion/spaCy/discussions/7006

Hi Ines, Thank you very much for your message. We will follow your guidelines.
Is there any recommendation for tagging compound proper names like "Scientific Revolution of XVI Century"?

Best regards,

Oswaldo

spaCy's Doc.noun_chunks iterators use the dependency parse and iterate over the tokens to extract base noun phrases. The Spanish and English implementations are good examples of how this works, and you can find other implementations by looking around the source in spacy/lang.

I'm not sure how well the logic translates to Portuguese, but it could be a good starting point. You may have to change the labels it uses based on the dependency labels predicted by the Portuguese parser.
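To illustrate the general idea of a dependency-based iterator, here's a simplified sketch in plain Python. It is not spaCy's actual implementation: the Token class is a hypothetical stand-in for a parsed token, and the label sets (NP_DEPS, INNER_DEPS) are assumptions you'd need to replace with whatever the Portuguese parser really predicts. It also only looks at direct modifiers and assumes they sit next to the head noun.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a parsed token (not a spaCy class)."""
    text: str
    pos: str   # coarse part-of-speech tag, e.g. "NOUN"
    dep: str   # dependency label, e.g. "nsubj"
    head: int  # index of the head token (the root points to itself)


# Assumed label sets; adjust them to the labels the Portuguese parser predicts.
HEAD_POS = {"NOUN", "PROPN", "PRON"}
NP_DEPS = {"nsubj", "obj", "iobj", "obl", "nmod", "appos", "ROOT"}
INNER_DEPS = {"det", "amod", "nummod", "flat", "flat:name"}


def noun_chunks(tokens):
    """Yield (start, end) index pairs, end exclusive, for base noun phrases."""
    prev_end = 0
    for i, tok in enumerate(tokens):
        if tok.pos not in HEAD_POS or tok.dep not in NP_DEPS:
            continue
        # The chunk covers the head noun plus its direct inner modifiers.
        span = [i] + [
            j for j, t in enumerate(tokens)
            if j != i and t.head == i and t.dep in INNER_DEPS
        ]
        start, end = min(span), max(span) + 1
        if start < prev_end:
            continue  # skip chunks that would overlap the previous one
        prev_end = end
        yield start, end


# "O gato preto viu a casa grande." with a hand-written parse
sentence = [
    Token("O", "DET", "det", 1),
    Token("gato", "NOUN", "nsubj", 3),
    Token("preto", "ADJ", "amod", 1),
    Token("viu", "VERB", "ROOT", 3),
    Token("a", "DET", "det", 5),
    Token("casa", "NOUN", "obj", 3),
    Token("grande", "ADJ", "amod", 5),
    Token(".", "PUNCT", "punct", 3),
]
chunks = [" ".join(t.text for t in sentence[s:e]) for s, e in noun_chunks(sentence)]
print(chunks)
```

Note that in Portuguese the adjective typically follows the noun, so the chunk has to extend to the right of the head as well as to the left. That's one of the places where logic copied from English would likely need adjusting.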

A good way to start would be to write a bunch of test cases (sentences and the correct noun chunks that should be extracted). You can then test your noun_chunks iterator on that and adjust it until it covers the most frequent cases. If you've found a solution that works, feel free to submit a PR to spaCy. We'd definitely appreciate it!
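As a rough sketch of what such test cases could look like, here's a small harness. The sentences and expected chunks are examples I made up, and stub_chunker is a hypothetical stand-in with canned answers so the snippet runs without a trained pipeline. In practice you'd pass a function that runs your pipeline and returns the text of each noun chunk.

```python
def check_cases(chunker, cases):
    """Return (sentence, expected, actual) for every case the chunker gets wrong."""
    failures = []
    for sentence, expected in cases:
        actual = list(chunker(sentence))
        if actual != expected:
            failures.append((sentence, expected, actual))
    return failures


# Made-up test cases: a sentence and the noun chunks we expect from it.
CASES = [
    ("O gato preto dorme.", ["O gato preto"]),
    ("A menina leu o livro novo.", ["A menina", "o livro novo"]),
]


def stub_chunker(sentence):
    """Hypothetical stand-in that deliberately misses a post-nominal adjective."""
    canned = {
        "O gato preto dorme.": ["O gato preto"],
        "A menina leu o livro novo.": ["A menina", "o livro"],
    }
    return canned.get(sentence, [])


failures = check_cases(stub_chunker, CASES)
for sentence, expected, actual in failures:
    print(f"{sentence!r}: expected {expected}, got {actual}")
```

Keeping the cases as plain data makes it easy to grow the list as you find new constructions and to re-run everything after each change to the iterator.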