Create training and testing data for triples

Hi, I am new to NLP, prodigy, and spaCy so please pardon my basic questions. Our team has been given a NER model which can identify 3 custom entities. Now, we need to find the relationship between these entities and generate triples (Subject, Object, Predicate) using spaCy and prodigy. We are using the dependency parse tree to find the relationships. I have 2 questions regarding this:

  1. We have a huge dataset of research papers. Is there any way we can train our model to improve the dependency parse tree? Can we achieve this using prodigy?
  2. For our complete pipeline, we would like to create testing data of triples. Any suggestions on how we can do this using prodigy?

Hi! Your workflow sounds good :slightly_smiling_face:

Prodigy comes with a workflow called dep.teach that lets you collect training data to improve an existing dependency parser in the loop. The model will suggest possible analyses of the given parse with the most uncertain scores and you can accept and reject them. This lets you work around the otherwise pretty tedious manual annotation process for dependencies, and can work well if what you mostly care about are the most common dependency labels like nsubj or dobj (which seems to be the case for your pipeline).

After you've collected the annotations, you can run the train recipe with --binary to update from binary annotations. I'd also recommend evaluating the models you're training on your full information extraction pipeline so you can see the impact of the parser updates on the full downstream task.

As for your second question: I'm not 100% sure I understand it correctly. Do you mean creating evaluation data for your model and annotating everything (entities and relationships) together?

Thank you so much for the prompt reply. This is very helpful and is exactly what we were looking for.

Yes, we would like to create some evaluation data. I was thinking of annotating some triples manually and using them to evaluate the output of our pipeline. I would love to hear your views on this, and whether I could make the process of creating evaluation data (triples) faster.
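For what it's worth, comparing manually annotated triples against the pipeline output could look roughly like the sketch below. It assumes exact matching after lowercasing and whitespace normalisation, which may be too strict for your data. The names `normalize` and `score_triples` are made up for illustration and are not part of spaCy or Prodigy.

```python
from typing import Dict, Iterable, Tuple

# A triple is (subject, predicate, object), all plain strings.
Triple = Tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lowercase each part and collapse whitespace so trivial differences don't count as errors."""
    return tuple(" ".join(part.lower().split()) for part in triple)


def score_triples(gold: Iterable[Triple], predicted: Iterable[Triple]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over two collections of triples."""
    gold_set = {normalize(t) for t in gold}
    pred_set = {normalize(t) for t in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [("Obama", "was born in", "Hawaii"), ("Obama", "studied in", "Chicago")]
predicted = [("obama", "was born in", "Hawaii"), ("Obama", "worked in", "Chicago")]
scores = score_triples(gold, predicted)
```

Scoring on sets means a triple predicted twice only counts once. If partial credit matters (e.g. right subject and object but a slightly different predicate span), the matching rule is the part to change.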

How many of these triples do you typically expect per sentence? Only one? And do you also want to label the original entities manually, or can you just use what's predicted by the model?

If it's typically one triple and you want to label everything manually, you could model the annotation task as a manual span labelling task (like ner.manual) with 3 labels, SUBJECT, OBJECT, PREDICATE (or similar)?
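To turn such span annotations into triples afterwards, something like this sketch might work. It assumes each annotated example is a dict with a "text" and a list of "spans" carrying character offsets ("start", "end") and a "label", and that the three labels are named exactly SUBJECT, PREDICATE and OBJECT. The helper name `spans_to_triple` is hypothetical.

```python
from typing import Optional, Tuple

REQUIRED_LABELS = ("SUBJECT", "PREDICATE", "OBJECT")


def spans_to_triple(task: dict) -> Optional[Tuple[str, str, str]]:
    """Build one (subject, predicate, object) triple from a span-annotated example.

    Returns None if any of the three labels is missing, and raises if a label
    occurs more than once, since this sketch only handles one triple per example.
    """
    text = task["text"]
    parts = {}
    for span in task.get("spans", []):
        label = span["label"]
        if label not in REQUIRED_LABELS:
            continue  # ignore any other labels in the example
        if label in parts:
            raise ValueError(f"Label {label} was annotated more than once")
        parts[label] = text[span["start"]:span["end"]]
    if any(label not in parts for label in REQUIRED_LABELS):
        return None
    return (parts["SUBJECT"], parts["PREDICATE"], parts["OBJECT"])


example = {
    "text": "Obama was born in Hawaii.",
    "spans": [
        {"start": 0, "end": 5, "label": "SUBJECT"},
        {"start": 6, "end": 17, "label": "PREDICATE"},
        {"start": 18, "end": 24, "label": "OBJECT"},
    ],
}
triple = spans_to_triple(example)
```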

Alternatively, if your NER model is supposed to identify the spans and you're only annotating the relationships, you could stream in one task per span, plus multiple choice options for subject / predicate / object (and maybe other / none). So you're always annotating the current span. You could even use a special label for the span you're focusing on and then set a custom color – so at any given point, you can see all entities predicted in the text and the one you're currently labelling highlighted in a special colour.
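A rough sketch of how one task per predicted span could be generated is below. It assumes the entity spans are already available as character offsets with labels, and that a choice-style task takes a list of "options" with an "id" and a "text". The function name `make_role_tasks`, the FOCUS label and the "meta" keys are placeholders I picked for illustration.

```python
from typing import Dict, Iterator, List

# Multiple choice options shown for the span currently in focus.
ROLE_OPTIONS = [
    {"id": "SUBJECT", "text": "subject"},
    {"id": "PREDICATE", "text": "predicate"},
    {"id": "OBJECT", "text": "object"},
    {"id": "NONE", "text": "other / none"},
]


def make_role_tasks(text: str, spans: List[Dict], focus_label: str = "FOCUS") -> Iterator[Dict]:
    """Yield one task per span, with that span relabelled so it can be highlighted differently."""
    for focus_index, focus in enumerate(spans):
        task_spans = []
        for index, span in enumerate(spans):
            span_copy = dict(span)  # copy so the caller's spans are not modified
            if index == focus_index:
                span_copy["label"] = focus_label
            task_spans.append(span_copy)
        yield {
            "text": text,
            "spans": task_spans,
            "options": [dict(option) for option in ROLE_OPTIONS],
            "meta": {
                "focus_text": text[focus["start"]:focus["end"]],
                "original_label": focus["label"],
            },
        }


text = "Obama was born in Hawaii."
predicted_spans = [
    {"start": 0, "end": 5, "label": "PERSON"},
    {"start": 18, "end": 24, "label": "GPE"},
]
tasks = list(make_role_tasks(text, predicted_spans))
```

Keeping the original entity label in "meta" means it isn't lost when the focused span is relabelled.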

Thank you for the recommendations. ner.manual seems to fit our use case better.

We typically have only one triple per line and can conveniently mark it using Prodigy. But about 25% of the time, we have multiple triples in a sentence.

E.g.: Obama was born in Hawaii and studied in Chicago.
So the triples we have are:
Obama => was born in => Hawaii
Obama => studied in => Chicago

Is there an easy way to handle such a case? One way is to duplicate such input lines so that we still end up marking one triple per sentence, but I was wondering if you have any other suggestions. I am also not sure how to duplicate lines easily once we have already loaded the source file and are annotating in the web tool.

Yes, I definitely see the problem here, hmm :thinking: One option would be to start by doing the 75% first and then deal with the remaining 25% with multiple relations separately.

In general, Prodigy avoids letting the annotator modify the actual examples to annotate (e.g. duplicate examples), because that's something that should typically be decided at the development level. But this case is kind of an exception. One option could be to set "instant_submit": true, which sends annotated examples back immediately. In your recipe, you could then keep sending the same example (e.g. with a while True loop) and only break and move on to the next one once you've received that example back with a "reject" answer. So once there are no more relations left to mark in the example, you reject it without highlighting anything and the server moves on to the next. (Just make sure to use distinct "_task_hash" values for the duplicates to prevent Prodigy from filtering them.)
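The looping logic could be sketched roughly like this, in plain Python and without the actual recipe wiring. It assumes a batch size of 1 and that the answer for a task reaches the update callback before the next task is requested from the stream. The class name `RepeatUntilRejected` and the hash scheme (index * 10000 + round) are made up for illustration. In a real recipe you would return `stream()` and `update` from the recipe function yourself.

```python
import copy
from typing import Dict, Iterable, Iterator, List


class RepeatUntilRejected:
    """Re-send each example until the annotator answers it with "reject"."""

    def __init__(self, examples: Iterable[Dict]):
        self.examples = list(examples)
        self.last_answer = None

    def update(self, answers: List[Dict]) -> None:
        """Remember the most recent answer; meant to be called with annotated examples."""
        for eg in answers:
            self.last_answer = eg.get("answer")

    def stream(self) -> Iterator[Dict]:
        for index, example in enumerate(self.examples):
            round_number = 0
            while True:
                task = copy.deepcopy(example)
                # Same input, but a distinct task hash per round so the
                # duplicates are not filtered out (hypothetical hash scheme).
                task["_input_hash"] = index
                task["_task_hash"] = index * 10000 + round_number
                self.last_answer = None
                yield task
                # Execution resumes here when the next task is requested.
                if self.last_answer == "reject":
                    break  # no more relations in this example, move on
                round_number += 1


examples = [
    {"text": "Obama was born in Hawaii and studied in Chicago."},
    {"text": "Second sentence."},
]
controller = RepeatUntilRejected(examples)
```

Each accepted copy then holds one triple, and the final rejected copy only acts as a "done" signal, so it can be dropped when exporting the data.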

Also, on a related note: This thread actually inspired me to do some experimentation for use cases like this to better support this type of relation annotation, while also keeping the workflow efficient. Still super experimental, but here's the one screenshot I remembered to take, complete with messy test data :sweat_smile:

The idea would be to allow streaming in pre-tokenized data with merged entities and phrases (and optionally disable all other tokens to only make the relevant spans selectable). Don't have an ETA yet for the beta, but I'll definitely share it on the forum once it's ready for testing :slightly_smiling_face:

Sorry for the late reply, Ines! I really appreciate your suggestions and it seems to fit our use case. Thanks a lot :smiley:

@aila1095 Update :tada: