Can Dependencies & Relations work after Span Categorization?

Hey all, my current project aims to extract USER RIGHTS and the related access from privacy policy text. For example, given the raw text "If you wish to exercise your right to access a free copy of your data, you can contact us. If you wish to update or rectify your data, you can do this directly by accessing your account settings", the ideal output would be {"access a free copy of your data": "contact us", "update or rectify your data": "accessing your account settings"}.
I planned to use Span Categorization to annotate the USER RIGHTS and the related access, since NER doesn't fit my task well given the long phrases. However, I haven't found any official Prodigy documentation on extracting relations between spans.
What do you think of using spans followed by relation extraction? Of course, any suggestions or discussion of my task are welcome :grinning:

Hi!

Interesting use-case :slight_smile:

I think your suggested approach of using a spancat to extract the "rights" makes sense, as this is indeed not a typical NER task. What isn't clear to me is how you're defining the actions like "contact us" or "access your account settings". Are these extracted with a spancat model as well?

With respect to relation extraction: There is built-in support in Prodigy to annotate both entities and relations at the same time with rel.manual. This does assume the entities are input for an NER algorithm, so it won't allow overlapping ones. However, for annotating your use-case, I think that's probably fine? You could parse the resulting NER annotations from Prodigy and create your own custom Doc objects storing the data in doc.spans, and then train a spancat. Let me know if you need further help creating the Doc objects and serializing them to .spacy files with a DocBin. Alternatively, you can use spans.manual and then mark the annotations in the input JSONL files you're feeding into rel.manual. In that scenario, you only need to focus on the actual relations in the second step.
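To give you an idea, here's a minimal sketch of that conversion step. The example text, the labels and the "sc" span key are just illustrative; match them to your own annotations and config:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# One annotated example; in practice you'd read these from Prodigy's
# db-out export. The text and labels here are illustrative.
text = (
    "If you wish to update or rectify your data, you can do this "
    "directly by accessing your account settings."
)
doc = nlp(text)

spans = []
for phrase, label in [
    ("update or rectify your data", "USER_RIGHT"),
    ("accessing your account settings", "ACTION"),
]:
    start = text.index(phrase)
    # char_span returns None if the offsets don't align with token boundaries
    span = doc.char_span(start, start + len(phrase), label=label)
    if span is not None:
        spans.append(span)

# spancat reads its training data from a named group in doc.spans;
# "sc" is the default key used by the spancat component.
doc.spans["sc"] = spans

# Serialize the annotated Docs to a binary .spacy training file
DocBin(docs=[doc]).to_disk("./train.spacy")
```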

Training a relation extraction method with the REL annotations isn't currently built-in functionality in spaCy or Prodigy, though you can have a look at this tutorial to get started with a custom implementation: projects/tutorials/rel_component at v3 · explosion/projects · GitHub. Note that this tutorial was written with actual named entities in mind, not spans in doc.spans, but it should be relatively straightforward to make those adjustments in the code.
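The main change would be in the candidate generation: wherever the tutorial pairs up entities from doc.ents, read from your span group instead. Roughly like this (the function name, the max_length parameter and the "sc" key are assumptions, so double-check against the tutorial's actual code):

```python
from typing import List, Tuple
from spacy.tokens import Doc, Span

def get_instances(doc: Doc, max_length: int = 100) -> List[Tuple[Span, Span]]:
    """Pair up candidate spans for relation classification,
    reading from doc.spans instead of doc.ents."""
    instances = []
    spans = doc.spans.get("sc", [])
    for span1 in spans:
        for span2 in spans:
            # Skip self-pairs and pairs that are too far apart
            if span1 != span2 and abs(span2.start - span1.start) <= max_length:
                instances.append((span1, span2))
    return instances
```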

Finally, thinking a bit outside the box, I wonder whether your challenge could be recast as a textcat challenge instead. Suppose that you have all the "user right" entities/spans marked up in your sentences. If you typically only have one such entity per sentence, you could try to extract the "action" as a textcat category, i.e. analysing the full sentence to determine the correct action. This would make the challenge somewhat simpler, as you wouldn't need to extract the exact offsets of the "action words" like "contact us". This might be more appropriate if those action phrases are long or not continuous in the sentence. But it'll depend on your dataset which approach works better.
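Concretely, a textcat training example could then look something like this (the label names are invented for illustration):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

text = (
    "If you wish to update or rectify your data, you can do this "
    "directly by accessing your account settings."
)
doc = nlp(text)
# Each training example needs a score for every label in the scheme
doc.cats = {"CONTACT_US": 0.0, "ACCOUNT_SETTINGS": 1.0}

DocBin(docs=[doc]).to_disk("./textcat_train.spacy")
```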

Hope that at least gives you some ideas to get started!


Awww, thank you very much for your detailed reply :grinning:.

The actions here are usually noun phrases or verb + noun phrases. I'm currently planning to use spancat or ner to annotate the noun phrases, with the verbs annotated as trigger words. How does that sound?

I will start annotating a small number of my policies this afternoon. Will let you know how it goes :raised_hands:.

Many thanks for this tutorial. It helps a lot.

Yes, that's exactly what my task is. But textcat is a somewhat new term to me :sweat_smile:. Will read up on it and get back to you later (if you don't mind).

Definitely, your suggestions have made my project much clearer. Hope to stay connected :raised_hands:.


Happy to help, and feel free to post any further issues here!


Awww, it’s so nice of you :grinning:.

I started with a survey of textcat, and came up with the following methodology for my task.
Step 1: use spancat to annotate User Rights, then train a SpancatUserRights model to find User Rights.
Step 2: use textcat to annotate Access (the labels could be 3 top-level categories, with around 3 second-level categories under each, 9 categories in total), then train hierarchical textcat models to find the access (about 9 binary classification models in total).

What do you think of this methodology?

I'm also thinking of another methodology that substitutes textcat for spancat in Step 1:
Step 1: use textcat to annotate User Rights, and train a multi-class classification model (around 7-8 labels in total) to find user rights.
Step 2: same as above.

I have several questions concerning it:
1. Annotation strategy: How would you compare these two methodologies? In which cases is it better to annotate with spancat vs. textcat?
2. Model strategy: How should I choose between hierarchical binary classification models and a single multi-class model, given that textcat will introduce multiple labels for my task?
3. Any other, more up-to-date solutions for this kind of task are welcome :raised_hands:.

Looking forward to your reply.

Hi!

It's difficult to say up-front which of the two approaches will yield the best results for your specific use-case. I'd recommend trying out the annotation scheme that feels most natural to you, and seeing whether it fits the example sentences as you're annotating. It's often necessary to iterate a little on the guidelines in the first few days of a project. You can also train a preliminary model on a first set of annotations to see how you go, then make adjustments as needed.

The main differences between spancat and textcat are these:

  • textcat uses more context (the whole sentence) to decide on the final class. If there are several (non-continuous) parts in your sentences that provide clues to the right class, this would be more appropriate.
  • spancat, on the other hand, requires you to annotate the exact offsets of the span. When this feels awkward or even impossible to do for some sentences, that's a good indication that textcat is more suitable. If, on the other hand, you're able to define these spans easily and could categorize them (as a human) without requiring much further context from the sentence, then spancat is a good fit.

With respect to hierarchical classification: Both for annotation as well as training models, it might make sense to work with top-level categories first, and then break each of them further down in a follow-up step. If you only have a handful of labels, a simple multi-class model might be fine too, though. Again, I'm afraid it's difficult to say up-front and without having worked with the data myself :wink:
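If you do go hierarchical, the second stage can simply be routed by the first: one top-level textcat model picks the main category, and each branch has its own model trained only on that branch's sub-labels. A sketch, with hypothetical model paths and label names:

```python
import spacy

# Top-level model picks one of the three main categories; each branch
# then has its own model trained only on that branch's sub-labels.
top_nlp = spacy.load("./models/top_level")
sub_nlps = {
    "DATA_ACCESS": spacy.load("./models/sub_data_access"),
    "DATA_CONTROL": spacy.load("./models/sub_data_control"),
    "COMMUNICATION": spacy.load("./models/sub_communication"),
}

def classify(text: str) -> str:
    top_doc = top_nlp(text)
    top_label = max(top_doc.cats, key=top_doc.cats.get)
    # Route to the branch-specific model for the fine-grained label
    sub_doc = sub_nlps[top_label](text)
    return max(sub_doc.cats, key=sub_doc.cats.get)
```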


Thanks a lot for this insightful comparison.

It would help a lot if you could share some tutorial links or examples on Doc objects, .spacy files, and DocBin. Many thanks in advance :raised_hands: :grinning:

In case you haven't seen it yet, the spaCy docs have a section on preparing training data and creating binary .spacy files: Training Pipelines & Models · spaCy Usage Documentation. It also points to the relevant API docs for the different objects.
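As a quick sanity check once you've written a file, you can also load it back and inspect the stored annotations. Assuming the train.spacy file and "sc" span key from the earlier sketch:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./train.spacy")
# get_docs reconstructs the Docs against a shared vocab
for doc in doc_bin.get_docs(nlp.vocab):
    print(doc.text)
    for span in doc.spans.get("sc", []):
        print(span.label_, "->", span.text)
```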


Many thanks. I'll give it a try and let you know if I run into any problems.