Help - first annotation process


Hello everyone. I'm posting this message to get opinions on the methodology we want to implement to annotate our corpus.

Our goal is to build an annotated corpus for a machine learning model; to be specific, we are annotating legal documents such as employment contracts. For the annotation, we need to label the following elements:

  • NER labels
  • POS

We are also wondering: could lemmatization help the model and improve performance?

To annotate our labels, we are thinking of starting with ner.manual. Ideally, after annotating a certain number of documents, we would like the annotations to be suggested to us. The problem is that we have more than 15 labels, and they are not the classic labels of standard NER models. So I was thinking of using ner.correct. What do you think of this choice? Will it save us time?

Concerning POS, we would like to have the POS tag of each token in the output file, since this usually improves model performance. Annotating every token with pos.manual seems very tedious to us. Wouldn't it be smarter to use spaCy and token.pos? But then we would have to combine the annotation files and the POS tags.

To merge all our annotations, which will be on the same texts, we were thinking of using data-to-spacy.

Any advice to help us optimize our process is welcome :wink:.

Hi and welcome! (And sorry, I don't know French, so I can only reply in English :sweat_smile:)

By default, the lemmas are not used as features in the models for NER or POS, so their accuracy won't make a difference. That said, if you're using rule-based lemmatization that takes the POS tags into account, the quality of the POS tags can impact the quality of the lemmas. And of course, lemmas might be useful for extracting information (e.g. to write more generic match patterns) – but that depends on your use case.
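If you do end up using lemmas for rule-based matching later on, here's a minimal sketch of a lemma-based match pattern (assuming an installed French pipeline like fr_core_news_sm, and made-up contract wording) that catches different inflections of the same phrase:

```python
import spacy
from spacy.matcher import Matcher

# Assumes the French pipeline is installed: python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")
matcher = Matcher(nlp.vocab)

# One pattern covers "résilier le contrat", "résilié le contrat", etc.,
# because it matches on the lemma rather than the surface form
matcher.add("TERMINATION", [[{"LEMMA": "résilier"}, {"LOWER": "le"}, {"LEMMA": "contrat"}]])

doc = nlp("L'employeur a résilié le contrat de travail.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # prints the matched span if the lemmatizer returns "résilier"
```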

This sounds like a good approach – after doing your initial annotations, you typically want to batch train a model (e.g. using prodigy train or spaCy directly) so it can learn as accurately as possible from the initial annotations. You can then use that pretrained model as the base model in ner.correct and correct its predictions. Prodigy lets you specify the name of an installed spaCy model or a local path, so you can run prodigy ner.correct your_dataset /path/to/model etc.
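For example, a possible workflow could look like this (assuming Prodigy v1.11+; dataset names, labels and paths are placeholders, and older versions use a slightly different train syntax):

```bash
# Batch-train an NER model from your initial ner.manual annotations
# (see the train docs for options like setting the language or a base model)
prodigy train ./ner_model --ner your_dataset

# Use the trained model to pre-highlight entities, then accept/correct its predictions
prodigy ner.correct your_dataset_v2 ./ner_model/model-best ./contracts.jsonl --label LABEL_A,LABEL_B
```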

I definitely think that using a model to suggest annotations can save you a lot of time :slightly_smiling_face: If you have several labels, make sure you collect enough initial examples to pretrain your model (and enough examples of every label). Maybe start with 200-500 examples (sentences or paragraphs) and then run your first training experiment.

When using NER, make sure that your entity types still follow the same conceptual idea of "named entities", otherwise your model might struggle to learn them efficiently. They don't have to be PERSON or ORG, but they should work in a similar way and describe distinct expressions like proper nouns with clear boundaries that can be determined from the local context. If that's not the case, a named entity recognition model might not be the right fit for what you're trying to do. Instead, you might want to experiment with a hybrid pipeline of more generic and classic NER labels + a text classification model.
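As a rough illustration of the hybrid idea (purely hypothetical model paths and clause labels, assuming you'd train the text classifier separately):

```python
import spacy

# Generic, classic entities (PER, ORG, LOC, MISC in the French pipelines)
nlp = spacy.load("fr_core_news_sm")
# A separately trained text classifier for clause types (hypothetical path and labels)
clause_clf = spacy.load("./clause_textcat_model")

text = "Le salarié percevra une rémunération mensuelle brute de 3 000 euros."
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])  # generic named entities
print(clause_clf(text).cats)                         # e.g. {"REMUNERATION": 0.97, ...}
```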

Yes, I think it ultimately depends on how accurate the POS tags predicted by the existing model are – assuming that you're working with a language that spaCy provides a pretrained pipeline for (or where you have an existing corpus like Universal Dependencies).
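For instance, a quick way to inspect what the pretrained French pipeline predicts (the sentence and pipeline name here are just examples):

```python
import spacy

nlp = spacy.load("fr_core_news_sm")  # assumes the pipeline is installed
doc = nlp("Le présent contrat est conclu pour une durée indéterminée.")
for token in doc:
    # token.pos_ is the coarse-grained Universal POS tag, token.tag_ the pipeline's fine-grained tag
    print(token.text, token.pos_, token.tag_)
```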

A good first experiment could be to stream in some random examples with their predicted POS tags and simply annotate whether they contain errors at all. You could go through them tag by tag, or remove all labels that are incorrect. At the end, you can calculate the error rate – if it's very low, you might not need much custom work. If it's higher, you can look at the particular cases the model gets wrong and collect manual annotations for those examples.
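One way to compute that error rate afterwards (sketch only; `pos_check` is a placeholder dataset name, exported with `prodigy db-out pos_check > pos_check.jsonl`):

```python
import json

total, rejected = 0, 0
with open("pos_check.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        # Prodigy stores your decision in the "answer" field: accept / reject / ignore
        if eg.get("answer") in ("accept", "reject"):
            total += 1
            if eg["answer"] == "reject":
                rejected += 1

print(f"Rejected {rejected} of {total} examples ({rejected / total:.1%})")
```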

Yes, that's perfect :+1:
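For example (assuming Prodigy v1.11+ and placeholder dataset names), since both datasets contain annotations on the same texts, data-to-spacy will merge them into shared Doc objects and export a corpus you can then train from with spacy train:

```bash
prodigy data-to-spacy ./corpus --ner ner_dataset --tagger pos_dataset --eval-split 0.2
```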