Help - first annotation process

Hello everyone. I'm posting this message to get opinions on the methodology we want to implement to annotate our corpus.

Our goal is to build an annotated corpus for a machine learning model; note that we are annotating legal documents such as employment contracts. Concerning the annotation, we need to annotate the following elements:

  • NER labels

  • POS

We also ask ourselves the question: can lemmatization help the model and increase performance?

For the annotation of our labels, we are thinking of using ner.manual at first. We would like annotations to be suggested to us after we have annotated a certain number of documents. The problem is that we have more than 15 labels, which are not the classic labels of NER models. So I was thinking of using ner.correct. What do you think of this choice? Will it save us time?

Concerning POS, we would like to have the POS tag of each term in the output file, because this usually increases model performance. Annotating every term with pos.manual seems very tedious to us. Wouldn't it be better to use spaCy and token.pos? But then we have to merge the annotation files with the POS tags.

To merge all our annotations, which will be on the same texts, we were thinking of using data-to-spacy.

Any advice to help us optimize our process is welcome :wink:.

Hi and welcome! (And sorry, I don't know French, so I can only reply in English :sweat_smile:)

By default, the lemmas are not used as features in the models for NER or POS, so their accuracy won't make a difference. That said, if you're using rule-based lemmatization that takes the POS tags into account, the quality of the POS tags can impact the quality of the lemmas. And of course, lemmas might be useful for extracting information (e.g. to write more generic match patterns) – but that depends on your use case.
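As a toy sketch of what "rule-based lemmatization that takes the POS tags into account" means (the lookup table and function below are hypothetical, not spaCy's actual lemmatizer):

```python
# Toy illustration of POS-dependent lemmatization: the same surface
# form maps to different lemmas depending on its part-of-speech tag.
# This table and function are hypothetical, not spaCy's lemmatizer.
LEMMA_RULES = {
    ("meeting", "NOUN"): "meeting",  # "a meeting" -> noun, lemma unchanged
    ("meeting", "VERB"): "meet",     # "we are meeting" -> verb, strip -ing
    ("left", "VERB"): "leave",       # "she left" -> past tense of "leave"
    ("left", "ADJ"): "left",         # "the left side" -> adjective
}

def lemmatize(word: str, pos: str) -> str:
    """Look up a lemma using the (word, POS) pair; fall back to the word."""
    return LEMMA_RULES.get((word.lower(), pos), word.lower())

print(lemmatize("meeting", "VERB"))  # meet
print(lemmatize("meeting", "NOUN"))  # meeting
```

The point is that if the tagger mislabels "meeting" as a NOUN in "we are meeting", the lemma comes out wrong too, which is the only way POS quality feeds into lemma quality here.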

This sounds like a good approach – after doing your initial annotations, you typically want to batch train a model (e.g. using prodigy train or spaCy directly) so it can learn as accurately as possible from the initial annotations. You can then use that pretrained model as the base model in ner.correct and correct its predictions. Prodigy lets you specify the name of an installed spaCy model or a local path, so you can run prodigy ner.correct your_dataset /path/to/model etc.

I definitely think that using a model to suggest annotations can save you a lot of time :slightly_smiling_face: If you have several labels, make sure you collect enough initial examples to pretrain your model (and enough examples of every label). Maybe start with 200-500 examples (sentences or paragraphs) and then run your first training experiment.

When using NER, make sure that your entity types still follow the same conceptual idea of "named entities", otherwise your model might struggle to learn them efficiently. They don't have to be PERSON or ORG, but they should work in a similar way and describe distinct expressions like proper nouns with clear boundaries that can be determined from the local context. If that's not the case, a named entity recognition model might not be the right fit for what you're trying to do. Instead, you might want to experiment with a hybrid pipeline of more generic and classic NER labels + a text classification model.

Yes, I think it ultimately depends on how accurate the POS tags predicted by the existing model are – assuming that you're working with a language that spaCy provides a pretrained pipeline for (or where you have an existing corpus like Universal Dependencies).

A good first experiment could be to just stream in some random examples and their predicted POS tags and just annotate whether there are even errors or not. You could go through tag by tag, or remove all labels that are incorrect. At the end of it, you can calculate the error rate – if that's super low, you might not need to do much custom work. If it's higher, you can look at the particular cases the model gets wrong and collect some manual annotations for those examples.
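A minimal sketch of the error-rate calculation for such a review pass (the record format below is made up; adapt it to however you store the reviewed tags):

```python
# Sketch: compute a POS error rate from reviewed (predicted, gold) tag
# pairs. The data format here is hypothetical; adapt the field names
# to however you export your review annotations.
reviewed = [
    {"token": "Le", "predicted": "DET", "gold": "DET"},
    {"token": "contrat", "predicted": "NOUN", "gold": "NOUN"},
    {"token": "stipule", "predicted": "NOUN", "gold": "VERB"},  # model error
    {"token": "que", "predicted": "SCONJ", "gold": "SCONJ"},
]

errors = [r for r in reviewed if r["predicted"] != r["gold"]]
error_rate = len(errors) / len(reviewed)
print(f"error rate: {error_rate:.1%}")  # error rate: 25.0%
print("tokens to re-annotate:", [r["token"] for r in errors])
```

If the rate is low across a random sample, the pretrained tagger is probably good enough; otherwise the `errors` list tells you which cases to collect manual annotations for.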

Yes, that's perfect :+1:

Hello :grinning: ,

So we carried out two different annotation runs, a short one and a long one, in order to compare the results.
We then trained a model for each annotation run to compare the performance. We used the following command line:
"prodigy train ner textannotationlong fr_core_news_sm --output modeletest -es 0.3 -n 20 -b 32 -d 0.2 -f 0.7"

We got better results with the short annotation.
However, we wanted to use tok2vec in order to train a model similar to the one we currently have.
We used the following command line, but we are having problems. Can you give us the correct syntax?
"prodigy train ner textannotationlong fr_core_news_sm -t2v --output modeletest -es 0.3 -n 20 -b 32 -d 0.2 -f 0.7"

Finally, we want to give a document to the model to visualize the results. I couldn't find out how to do this. Can you help us?

Thanks ! :pray:t2:

I'm not sure what the exact problems were that you were having, but one thing is that the -t2v argument expects a path to pretrained tok2vec weights, e.g. created with spacy pretrain. So it basically lets you initialise the model with a pretrained tok2vec layer (using a language-modelling objective).

Here are some example weights trained on Reddit: Release tok2vec · explosion/projects · GitHub

Hello Ines,

We downloaded the tok2vec_cd8_model289.bin weights and used the following command line:
prodigy train ner textannotationlong en_core_news_sm --init-tok2vec tok2vec_cd8_model289.bin --output modeletest -es 0.3 -n 20 -b 64 -d 0.2 -f 0.7

we have the following error:
ValueError: could not broadcast input array from shape (128,) into shape (96,)

Can you tell us how to proceed to solve the problem?
Does it work with French?

thank you !

I think the problem here is that the tok2vec weights were trained using the large English vectors (en_vectors_web_lg), which were also used to pretrain the tok2vec layer by predicting each word's vector. So you'll need the same vectors available during training and at runtime. The en_core_web_sm model doesn't have any vectors, so that's likely the problem here.

Here's the example project that shows the different results using no vectors, vectors and vectors + tok2vec weights: projects/ner-fashion-brands at master · explosion/projects · GitHub

In general, yes – however, the pretrained tok2vec weights from the example project I linked were trained on English text from Reddit using English vectors, so they won't be very useful for French. So you'd need to pretrain embeddings on French text using French vectors.

If you just want to run some experiments, it might make more sense to train with spaCy directly and even use spaCy v3, which will let you initialise your model with pretrained transformer embeddings. You can use the data-to-spacy command to export your data to spaCy's JSON format, and then use spaCy v3's spacy convert to convert them to the new binary training format. You can then train with any training config, including one that uses the transformer embeddings:

Hello Ines :wave:t2: :slightly_smiling_face:,

I have a few questions that I can't find the answers to, we are a bit lost, if you could help us it would be great.

First of all: we have annotated and trained a model (we are still testing the parameters to see which ones give us the best score).

  1. With the print-stream command we can feed a file to the model and see the results. Is it possible to export only the entities and labels in JSON format?

  2. Eventually we want to use our model on a daily basis and export the results of it. Would you advise us to use spacy to train and run our model rather than using Prodigy?

Thank you,

The print commands are mostly designed for quickly previewing your data. If you want to export your annotations, you can use the db-out command, which outputs a JSON file:
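As a sketch of pulling just the entities and labels out of such an export (the record below is a made-up example; Prodigy's NER annotations store the text plus "spans" with character offsets and a label, but double-check the field names against your own export):

```python
import json

# Sketch: extract only the entity text and labels from a db-out export.
# Each exported line is one JSON record; the sample below is made up.
sample_line = json.dumps({
    "text": "Le salarié est embauché par ACME SAS.",
    "spans": [{"start": 28, "end": 36, "label": "EMPLOYER"}],
})

record = json.loads(sample_line)
entities = [
    {"text": record["text"][s["start"]:s["end"]], "label": s["label"]}
    for s in record.get("spans", [])
]
print(entities)  # [{'text': 'ACME SAS', 'label': 'EMPLOYER'}]
```

Looping this over every line of the exported file gives you the entities-and-labels-only JSON you asked about.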

Prodigy's built-in training recipes are mostly designed for quick experiments and checking whether your results are improving, and how different datasets perform. This is usually something you want to run during development.

Once you're getting serious about training your final model for production use, we'd definitely recommend training with spaCy directly. This gives you more flexibility. You can use the data-to-spacy command to merge the datasets you want to use for training and output a file in spaCy's format. This is also easy to automate as part of a daily CI process etc.

Hi Ines,

so now we want to export our Prodigy project to spaCy. Will it be compatible with spaCy v3 (using data-to-spacy)?

Thank you !

Yes, you can use the spacy convert command to convert the JSON produced by data-to-spacy to spaCy v3's new format.


Hello Ines :wave: ,

thank you for all the information. So we'll move on to spaCy.
I would like to know how to manage the continued improvement of my model. Currently we have a dataset that I annotated in Prodigy, which will be used to generate a first version of the spaCy model. In the future I will have new annotated data that I want to integrate into my model to improve its performance. Is this possible? If so, with which module?

Thank you very much for helping us in our project! :smile:

Once you have trained a preliminary model, you can load it into recipes like ner.correct or ner.teach to create more annotations that correct the model's predictions. You can then retrain using all your collected annotations and check the results to see if your model is improving.

Hi Ines,

I know the ner.correct and ner.teach commands; I have already used ner.teach.
However, my data will already be annotated, so I don't need to. I just need a way to add it, directly in spaCy, to the model already trained on the first data. In that case, do I also need to use ner.correct and ner.teach?
What I have a problem with is: once I generate a spaCy model (outside of Prodigy), I don't see how to connect it to Prodigy to continue annotating and give it new data to improve its performance. Should I use project.yml?

Thanks !

If you want to load your spaCy model into Prodigy in a recipe, you can just use the path to it as the argument. So instead of en_core_web_sm or blank:en, you'd use the path to your trained model.

If you just want to update it with more examples, you can use prodigy train with all annotations or with the new annotations and your existing model as the base model. Or you can export your annotations from Prodigy and update your model with spaCy.


Hello Ines, :wave: :smile:

We are currently developing our spaCy model with the data scientist on our team.
We are starting to think about a new project: we want to automatically detect certain paragraphs in documents.
More precisely, we have documents in which we need to detect the presence of certain paragraphs (of different kinds, hence text classification). So we thought of using automatic text classification.
However, several questions come to mind: should our corpus contain only the paragraphs that the model should learn to recognize? Or should we create a corpus with the complete contracts, of which only certain paragraphs interest us?

Thanks a lot! :pray:

If your goal is to predict whether a paragraph is "relevant" or not, you definitely want to train it on examples of relevant paragraphs and paragraphs that are not relevant – otherwise, there's no way it can learn to make this distinction. If your runtime model will get to see all kinds of paragraphs but it was only trained on relevant paragraphs you pre-selected, it will struggle to predict anything meaningful for texts that are very different from what it was trained on.
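A minimal sketch of the second option, keeping every paragraph of each contract and labelling the non-interesting ones with a catch-all class (the labels and texts below are hypothetical):

```python
# Sketch: build text classification examples from FULL contracts,
# keeping both the paragraphs of interest and the others, so the
# model also sees negative examples. Labels and texts are made up.
contracts = [
    ["Article 1 : durée du contrat ...",
     "Article 2 : clause de non-concurrence ..."],
    ["Préambule ...",
     "Article 5 : clause de confidentialité ..."],
]
# Hypothetical annotation: which paragraphs are the relevant ones
relevant = {
    "Article 2 : clause de non-concurrence ...",
    "Article 5 : clause de confidentialité ...",
}

examples = [
    {"text": para, "label": "RELEVANT" if para in relevant else "OTHER"}
    for contract in contracts
    for para in contract
]
labels = [ex["label"] for ex in examples]
print(labels.count("RELEVANT"), labels.count("OTHER"))  # 2 2
```

The "OTHER" examples are exactly the negatives the model needs in order to learn the distinction described above.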
