Retraining a dependency parser, or just changing the definition of noun chunks?

Hi,

I’ve been using the dependency parser quite extensively in English. I wanted to extend my code to cope with French, but I noticed that noun chunks behave quite differently:
“He gave an apple to Paul” gives 2 noun chunks, ‘an apple’ and ‘Paul’.
“Il donna une pomme à Paul” gives a single noun chunk, “une pomme à Paul”. We should be getting 2 here as well (“pomme” and “Paul”).

I suspect this comes from the dependency tree: in English, “to” sits between “apple” and “Paul”, while in French “à” is a child of “Paul”.

Does this mean that the English ClearNLP dependency tree and UDP are defined so differently that the definition of a noun chunk should be changed? Or is it just a training problem (i.e. this dependency is not parsed properly, and more training would solve it)?

Any insight appreciated.

Hey,

I don’t know French grammar very well, so I’m not sure what the target UD parse would be. I would expect this to be an issue with the noun chunk rules, though, rather than with parse quality. That should mean it’s pretty easy to fix: you can set a different function for the noun chunk iteration, so you can customise the rules.

Here are the current rules for French: https://github.com/explosion/spaCy/blob/master/spacy/lang/fr/syntax_iterators.py . You can see the FrenchDefaults language class here: https://github.com/explosion/spaCy/blob/master/spacy/lang/fr/__init__.py

The simplest way to customise the noun chunks logic would be to write to the French.Defaults.syntax_iterators class variable before loading your model.
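
A rough sketch of what that could look like, in case it helps. The names head_only_noun_chunks and install_for_french are made up for this example, and the rule itself is only an assumption about what you want: each chunk ends at the head noun instead of at the right edge of its subtree, and leading "case" tokens such as “à” are left out. I am also assuming that yielding the label as the plain string "NP" is accepted, so check that against your spaCy version.

```python
# Dependency labels and POS tags treated as noun-chunk heads.
# This set is an illustrative assumption, not spaCy's shipped French rules.
NP_DEPS = {"nsubj", "nsubj:pass", "obj", "iobj", "obl",
           "ROOT", "appos", "nmod", "nmod:poss"}
NOMINAL_POS = {"NOUN", "PROPN", "PRON"}


def head_only_noun_chunks(doclike):
    """Yield (start, end, label) triples for noun chunks.

    Unlike a rule that spans the head's whole subtree, each chunk stops
    at the head noun, so a prepositional phrase attached to the noun
    ("à Paul" under "pomme") is not swallowed by it.
    """
    for word in doclike:
        if word.pos_ not in NOMINAL_POS or word.dep_ not in NP_DEPS:
            continue
        # Keep left dependents such as determiners, but drop "case"
        # markers so the chunk is "Paul" rather than "à Paul".
        lefts = [t for t in word.lefts if t.dep_ != "case"]
        start = lefts[0].left_edge.i if lefts else word.i
        yield start, word.i + 1, "NP"


def install_for_french():
    """Point French.Defaults.syntax_iterators at the custom rule.

    Hypothetical helper: call it before loading the French model.
    """
    from spacy.lang.fr import French  # imported lazily; needs spaCy

    French.Defaults.syntax_iterators = {"noun_chunks": head_only_noun_chunks}
```

You would call install_for_french() first, then load your French model as usual and read doc.noun_chunks. Ending the chunk at the head noun also drops adjectives that follow the noun, so you may want to extend the rule if you need those.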