Training a relation extraction component

Here is proof that there is still a major issue with the component or its configuration (this issue has been mentioned multiple times in this thread):

This annotation:

{"text":"The equipment, the Supplier and the processes applied must comply with AAAAAAA and any applicable documents","_input_hash":964968797,"_task_hash":-1368578864,"_is_binary":false,"spans":[{"start":4,"end":13,"token_start":1,"token_end":1,"label":"HARDWARE"},{"start":19,"end":27,"token_start":4,"token_end":4,"label":"ROLE"},{"start":36,"end":45,"token_start":7,"token_end":7,"label":"PROCESS"},{"start":71,"end":78,"token_start":12,"token_end":12,"label":"STANDARD"},{"start":87,"end":97,"token_start":15,"token_end":15,"label":"CONDITION"},{"start":98,"end":107,"token_start":16,"token_end":16,"label":"DOCUMENT"}],"tokens":[{"text":"The","start":0,"end":3,"id":0,"ws":true,"disabled":false},{"text":"equipment","start":4,"end":13,"id":1,"ws":false,"disabled":false},{"text":",","start":13,"end":14,"id":2,"ws":true,"disabled":false},{"text":"the","start":15,"end":18,"id":3,"ws":true,"disabled":false},{"text":"Supplier","start":19,"end":27,"id":4,"ws":true,"disabled":false},{"text":"and","start":28,"end":31,"id":5,"ws":true,"disabled":false},{"text":"the","start":32,"end":35,"id":6,"ws":true,"disabled":false},{"text":"processes","start":36,"end":45,"id":7,"ws":true,"disabled":false},{"text":"applied","start":46,"end":53,"id":8,"ws":true,"disabled":false},{"text":"must","start":54,"end":58,"id":9,"ws":true,"disabled":false},{"text":"comply","start":59,"end":65,"id":10,"ws":true,"disabled":false},{"text":"with","start":66,"end":70,"id":11,"ws":true,"disabled":false},{"text":"AAAAAAA","start":71,"end":78,"id":12,"ws":true,"disabled":false},{"text":"and","start":79,"end":82,"id":13,"ws":true,"disabled":false},{"text":"any","start":83,"end":86,"id":14,"ws":true,"disabled":false},{"text":"applicable","start":87,"end":97,"id":15,"ws":true,"disabled":false},{"text":"documents","start":98,"end":107,"id":16,"ws":false,"disabled":false}],"_view_id":"relations","relations":[{"head":16,"child":15,"head_span":{"start":98,"end":107,"token_start":16,"token_end":16,"label":"DOCUMENT"},
"child_span":{"start":87,"end":97,"token_start":15,"token_end":15,"label":"CONDITION"},"color":"#96e8ce","label":"IN_CONDITION"},{"head":1,"child":12,"head_span":{"start":4,"end":13,"token_start":1,"token_end":1,"label":"HARDWARE"},"child_span":{"start":71,"end":78,"token_start":12,"token_end":12,"label":"STANDARD"},"color":"#ffdaf9","label":"COMPLY_WITH"},{"head":4,"child":12,"head_span":{"start":19,"end":27,"token_start":4,"token_end":4,"label":"ROLE"},"child_span":{"start":71,"end":78,"token_start":12,"token_end":12,"label":"STANDARD"},"color":"#ffdaf9","label":"COMPLY_WITH"},{"head":7,"child":12,"head_span":{"start":36,"end":45,"token_start":7,"token_end":7,"label":"PROCESS"},"child_span":{"start":71,"end":78,"token_start":12,"token_end":12,"label":"STANDARD"},"color":"#ffdaf9","label":"COMPLY_WITH"},{"head":1,"child":16,"head_span":{"start":4,"end":13,"token_start":1,"token_end":1,"label":"HARDWARE"},"child_span":{"start":98,"end":107,"token_start":16,"token_end":16,"label":"DOCUMENT"},"color":"#ffdaf9","label":"COMPLY_WITH"},{"head":4,"child":16,"head_span":{"start":19,"end":27,"token_start":4,"token_end":4,"label":"ROLE"},"child_span":{"start":98,"end":107,"token_start":16,"token_end":16,"label":"DOCUMENT"},"color":"#ffdaf9","label":"COMPLY_WITH"},{"head":7,"child":16,"head_span":{"start":36,"end":45,"token_start":7,"token_end":7,"label":"PROCESS"},"child_span":{"start":98,"end":107,"token_start":16,"token_end":16,"label":"DOCUMENT"},"color":"#ffdaf9","label":"COMPLY_WITH"}],"answer":"accept","_timestamp":1687273794}

triggers this issue:

/venv/lib/python3.10/site-packages/thinc/layers/reduce_mean.py", line 19, in forward
    Y = model.ops.reduce_mean(cast(Floats2d, Xr.data), Xr.lengths)
  File "thinc/backends/numpy_ops.pyx", line 318, in thinc.backends.numpy_ops.NumpyOps.reduce_mean
AssertionError

as mentioned earlier in the thread, and it is still unresolved. As far as I can tell, there is no issue with the data itself.
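For reference, this is the kind of consistency check I ran before concluding the data looks fine. It's a quick script of my own (not an official Prodigy validation tool), shown here on a trimmed-down version of the record above: it verifies that every span's character offsets line up with its start and end tokens, and that every relation points at tokens that exist.

```python
import json

def check_record(record):
    """Sanity-check one Prodigy relations record: span offsets must match
    token offsets, and relations must reference existing token ids."""
    tokens = {t["id"]: t for t in record["tokens"]}
    problems = []
    for span in record["spans"]:
        head = tokens[span["token_start"]]
        tail = tokens[span["token_end"]]
        if span["start"] != head["start"] or span["end"] != tail["end"]:
            problems.append(f"span {span['label']} does not align with its tokens")
    for rel in record.get("relations", []):
        if rel["head"] not in tokens or rel["child"] not in tokens:
            problems.append(f"relation {rel['label']} points at a missing token")
    return problems

# trimmed-down version of the record above
record = json.loads("""{
    "tokens": [
        {"text": "The", "start": 0, "end": 3, "id": 0},
        {"text": "equipment", "start": 4, "end": 13, "id": 1}
    ],
    "spans": [
        {"start": 4, "end": 13, "token_start": 1, "token_end": 1, "label": "HARDWARE"}
    ],
    "relations": []
}""")
print(check_record(record) or "record looks internally consistent")
```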

Hi Stella,

We certainly appreciate the effort to report problems with the software, and of course it's not always easy to work through things and figure out whether the problem lies in the underlying implementation, the configuration, the data, or a step in between.

I want to set that aside for a second and address how this seems to have affected your satisfaction with the Prodigy software. I'm sorry if you feel that the marketing of the product doesn't match the capabilities. We'd certainly be happy to issue you a refund if you feel that the tool isn't able to do what you need.

However, I think the heart of this issue comes down to a question of scope. Prodigy is an annotation tool that integrates well with our open-source project spaCy. So we can talk about the relation annotation part (in Prodigy), and the training of the relation extraction component (in spaCy). My understanding is that it's the latter that you've been having trouble with, and in fact the relation extraction component is still marked experimental in spaCy. We're really eager to get it completed and have a smoother user experience around it, but the NLP field moves extremely quickly, and we have a lot to do to keep up with the latest developments. Some of our roadmap has slipped as a result. I hope you see the value in the overall software we provide, and can be sympathetic to that.

Regarding this part of your question:

(because why could you annotate data if it was not meant to train a model?)

Many Prodigy users export the annotations for use with other open-source software. You're free to use any of the other relation extraction projects that have been published as open-source software --- you don't have to use ours.

Hi Stella,

To help you more efficiently with the problems you've been having with training spaCy's experimental relation extraction component, I suggest that you

  1. Open a new thread on the spaCy discussion forum, focusing each thread there on one specific issue & error message at a time, so we can resolve issues more effectively one by one.
  2. For each error you run into, provide the full code, config & CLI command you used, as well as the full stack trace of the error message.

For the AssertionError that has now popped up again, it's extremely useful that you've been able to pinpoint a specific data record that causes it. Combined with the full code as described in (2), that will (hopefully) allow us to replicate the exact issue and help you debug your code.
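For some background on that particular error: the `reduce_mean` layer in your stack trace pools a batch of per-token vectors into one mean vector per sequence, and the low-level kernel asserts that its inputs are consistent before averaging. The sketch below is a pure-Python analogy of that pooling step (not thinc's actual Cython implementation) and makes explicit the two typical ways such an assertion can fire:

```python
def reduce_mean(rows, lengths):
    """Pure-Python sketch of per-sequence mean pooling, analogous to what
    thinc's reduce_mean layer computes (NOT the actual Cython kernel).
    rows: list of equal-width float lists; lengths: rows per sequence."""
    # The real kernel asserts on inconsistent input; here both failure
    # modes are spelled out explicitly.
    assert sum(lengths) == len(rows), "lengths must cover every input row"
    assert all(n > 0 for n in lengths), "a zero-length sequence has no mean"
    out, start = [], 0
    for n in lengths:
        chunk = rows[start:start + n]
        out.append([sum(col) / n for col in zip(*chunk)])
        start += n
    return out

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(reduce_mean(rows, [2, 1]))   # -> [[2.0, 3.0], [5.0, 6.0]]
# reduce_mean(rows, [2, 1, 0])     # zero-length group -> AssertionError
```

So if some instance ends up empty after preprocessing, e.g. a span that maps to zero tokens, the resulting zero length is exactly the kind of input this pooling step cannot average over, which is why having your exact record and code matters.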

Hi,

Thanks for your answer, Matthew.

I understand that solving my problem can be difficult, especially because I can't share my code or my full data. If it were possible to give someone access to my repository, fixing the issue would have been easier.

I've created a thread on spaCy's discussion forum, on your and Sofie's advice. But I'm afraid my error and environment are too specific to get an answer: the component updates were not pushed to Sofie's repository, so no outside user could easily reproduce my error and help me.

Furthermore, the data record I was able to isolate no longer seems to trigger the error. I've made some improvements to the code since then. The error now appears to be triggered only when evaluation is needed, i.e. when there are dev and test sentences. In some cases, depending on the size of the training data, the entities in the annotations.jsonl file can't be parsed.
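In case it helps anyone else narrow things down, this is roughly how I isolate an offending record (a sketch of my own; `fails` is a hypothetical stand-in for running the training step on a subset of the data, not a spaCy API):

```python
def bisect_failing_record(records, fails):
    """Binary-search a list of JSONL records for one that makes `fails`
    return True. `fails(subset)` is a hypothetical hook standing in for
    "run training on this subset and see whether it crashes"."""
    assert fails(records), "the full dataset must reproduce the failure"
    while len(records) > 1:
        mid = len(records) // 2
        left, right = records[:mid], records[mid:]
        # keep whichever half still reproduces the failure
        records = left if fails(left) else right
    return records[0]

# toy demo: pretend any subset containing record 7 crashes training
records = list(range(10))
print(bisect_failing_record(records, lambda sub: 7 in sub))   # -> 7
```

Note that this only works when a single record is responsible; in my current situation the error seems to depend on the overall size and split of the data, which makes this kind of bisection much less reliable.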

Regarding the refund: if I don't find a solution, it might be warranted, but I don't yet know which tool I could use to solve my initial problem, and my (small) team has spent four months on this, which is a considerable amount of time. You mentioned open-source projects for relation extraction with Prodigy annotations, but I couldn't find them. Could you be more specific?

I still think that a functional, easy-to-use relation extraction component is a very important feature, even if I understand that the NLP field moves quickly and that you have other things to tackle.

I hope I can find a quick solution to my issue.

I've added the updated code to this repository. To ensure data confidentiality, I've deleted the assets, data and training folders and removed my labels from SYMM_LABEL and DIRECTED_LABELS in the scripts/parse_data_generic.py file. If anyone can reproduce my error with their own data, please let me know.