Identifying diverse types of references

Right now we are using regex to identify references. For example, in this sentence, the number 9.6 should be captured:

  • "in section 9.6 of this charter"
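For anyone following along, a minimal sketch of the kind of regex meant here might look like the following. The pattern and the name `find_section_numbers` are illustrative assumptions, not the exact expression we use in production.

```python
import re

# Assumed pattern: the word "section" followed by a dotted number such as 9.6.
SECTION_RE = re.compile(r"\bsection\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def find_section_numbers(text):
    """Return the dotted numbers that follow the word 'section'."""
    return SECTION_RE.findall(text)


print(find_section_numbers("in section 9.6 of this charter"))  # ['9.6']
```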

If possible, we want to train an ML model that will do this automatically. The trouble is, there are so many different kinds of references even within the same collection of documents.

For example:

  • LR 9.3.11 R does not apply
  • article 14(3) second paragraph
  • article 5(2)(c) of the
  • in the case of paragraphs (5) and (7)
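To show why a single expression does not scale, here is a rough sketch that covers only the four examples above with one labelled pattern each. The labels (`rule`, `numbered`, `list`) and the patterns themselves are assumptions made for illustration; real documents would need more of them.

```python
import re

# Hypothetical labelled patterns, one per reference style seen in the examples.
REFERENCE_PATTERNS = [
    # Rule-style reference, e.g. "LR 9.3.11 R": capitals, dotted number, one-letter suffix.
    ("rule", re.compile(r"\b[A-Z]{2,}\s+\d+(?:\.\d+)+\s+[A-Z]\b")),
    # Keyword plus number with optional bracketed levels, e.g. "article 5(2)(c)".
    ("numbered", re.compile(r"\b(?:article|section)\s+\d+(?:\.\d+)*(?:\(\w+\))*", re.IGNORECASE)),
    # Keyword plus a list of bracketed items, e.g. "paragraphs (5) and (7)".
    ("list", re.compile(r"\bparagraphs?\s+\(\w+\)(?:\s*(?:,|and|or)\s*\(\w+\))*", re.IGNORECASE)),
]


def find_references(text):
    """Return (label, matched_text) pairs for every pattern hit, in text order."""
    hits = []
    for label, pattern in REFERENCE_PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group(0)))
    hits.sort()
    return [(label, span) for _, label, span in hits]


for example in [
    "LR 9.3.11 R does not apply",
    "article 14(3) second paragraph",
    "article 5(2)(c) of the",
    "in the case of paragraphs (5) and (7)",
]:
    print(find_references(example))
```

Note that this sketch stops at "article 14(3)" and does not pick up the trailing "second paragraph", which is exactly the kind of variation that makes the rules hard to maintain.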

There are many more types that I am not even including. To a human it is usually easy to spot what is and is not a reference. But is there a way to teach the computer to do this automatically?
Thanks!

Does anyone have any suggestions for us here? Thank you!

Hi,

The default NER model in spaCy v3 might have a bit of trouble with this task, because these items won't tokenize very well. I think regular expressions or spaCy's matcher patterns really are a good fit for this problem, and you'll get more consistent extraction that way.

If you want to use contextual cues, one way to do it would be to classify the sentences according to whether they contain a reference. This will give you another dimension of information to use, and may help you double-check your regex performance: for instance, the classifier might catch cases your regex isn't covering yet. You can use the regex to "generate" the training data, so it should be quite quick.
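As a rough sketch of what "generating" the training data could mean, the snippet below labels each sentence by whether a regex fires on it. The regex is a stand-in for your real one, and the label names and the JSON Lines output with `text` and `label` keys are assumptions; adapt them to whatever your training tool expects.

```python
import json
import re

# Stand-in for the existing reference regex; the real pattern is assumed to be broader.
REFERENCE_RE = re.compile(
    r"\b(?:section|article)\s+\d+(?:\.\d+)*(?:\(\w+\))*", re.IGNORECASE
)


def label_sentences(sentences):
    """Label each sentence by whether the regex finds a reference in it."""
    examples = []
    for sentence in sentences:
        has_ref = REFERENCE_RE.search(sentence) is not None
        label = "REFERENCE" if has_ref else "NO_REFERENCE"
        examples.append({"text": sentence, "label": label})
    return examples


# The last sentence is an invented negative example for illustration.
sentences = [
    "in section 9.6 of this charter",
    "article 14(3) second paragraph",
    "the committee meets twice a year",
]
for example in label_sentences(sentences):
    print(json.dumps(example))
```

Labels produced this way inherit the blind spots of the regex, so the sentences where the trained classifier and the regex disagree are the ones worth reviewing by hand.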

Thanks for your reply! Sounds great.
Can I ask another question here, or should I make a separate post?
The question is:
I want to create a rule-based system for determining subsections of references.
For example: currently, from the string "in section 2(2) of the" we use regex to capture "section 2(2)". But what we really want is to return "section 2, subsection 2", since that is what is obviously meant here.
We do have a source indicating the hierarchy of section levels, e.g. document > part > section > subsection > paragraph > subparagraph.
Thanks for your insight!
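To make the goal concrete, here is a minimal sketch of the expansion, assuming that each bracketed item steps one level down the hierarchy starting from "section". The names `LEVELS` and `expand_reference` are made up for illustration, and only the "section" keyword is handled.

```python
import re

# Levels from "section" downwards, taken from the hierarchy above.
LEVELS = ["section", "subsection", "paragraph", "subparagraph"]

# Keyword, leading number, then zero or more bracketed items such as "(2)(c)".
REF_RE = re.compile(r"\b(section)\s+(\d+(?:\.\d+)*)((?:\(\w+\))*)", re.IGNORECASE)


def expand_reference(text):
    """Expand the first 'section N(a)(b)' style reference into named levels, or return None."""
    match = REF_RE.search(text)
    if match is None:
        return None
    start = LEVELS.index(match.group(1).lower())
    values = [match.group(2)] + re.findall(r"\((\w+)\)", match.group(3))
    names = LEVELS[start:start + len(values)]
    if len(names) < len(values):
        return None  # more bracketed levels than the hierarchy defines
    return ", ".join(f"{name} {value}" for name, value in zip(names, values))


print(expand_reference("in section 2(2) of the"))  # section 2, subsection 2
```

Whether the first bracket always means "subsection" depends on the document, so the mapping would need to come from the hierarchy source rather than being hard-coded like this.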

By the way, I don't recall getting an email when you responded to this message, even though I see the thread is marked as watched for me. Hmm, maybe I just missed it. Wondering if others have had that issue too... Thanks!

Strange about the email, but I can understand how it might go missing. Maybe it didn't send from Discourse?

Anyway. A rule-based solution makes sense for that problem I think. I've never done that myself though, so I don't have any specific advice, sorry!