Can the NER recognize groups of words? Should I use patterns?

Hi,

So far I have been unsuccessful at getting spaCy to recognize legal citations as separate entities. These are the entities I want it to find:

  • Arizonans for Official English v. Arizona, 520 U.S. 43 (1997)
  • Bell v. Wolfish, 441 U.S. 520 (1979)
  • City of L.A. v. Lyons, 461 U.S. 95 (1983)
  • City of Ontario, Cal. v. Quon, 130 S.Ct. 2619 (2010)

I’ve used NER manual so far. Would it be better to use something else? The classifier is not ideal because it wouldn’t pull the citations out of the text.

Where should I go from here? Here’s the result of training my model.

Loaded model en_core_web_sm
Using 50% of accept/reject examples (108) for evaluation
Using 100% of remaining examples (210) for training
Dropout: 0.2 Batch size: 5 Iterations: 10

BEFORE 0.000
Correct 0
Incorrect 18
Entities 407
Unknown 0

#    LOSS      RIGHT   WRONG   ENTS   SKIP   ACCURACY
01   102.558   0       18      1841   0      0.000
02   76.975    0       18      1865   0      0.000
03   66.574    0       18      2040   0      0.000
04   57.477    0       18      1988   0      0.000
05   50.854    0       18      1868   0      0.000
06   42.160    0       18      1979   0      0.000
07   32.963    0       18      1972   0      0.000
08   33.265    0       18      1854   0      0.000
09   30.458    0       18      1881   0      0.000
10   30.371    0       18      2006   0      0.000

Correct 0
Incorrect 18
Baseline 0.000
Accuracy 0.000

I suspect this type of entity will be difficult for the model to learn, as it’s quite long. You might find that patterns do better. That said, your dataset is very small (210 training examples and 108 for evaluation), so it’s difficult to conclude much from this experiment. The same approach might well succeed if you give it 10-20 times more data.
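
To make the patterns idea a bit more concrete, here is a minimal sketch of a rule-based matcher for citations shaped like the four examples in the question. It uses a plain regular expression from Python's standard library rather than spaCy's token-based patterns, and the names REPORTERS, PARTY and find_citations are made up for illustration. It assumes every citation follows the form "Party v. Party, volume reporter page (year)" and that only the two reporters from the examples occur, so treat it as a starting point, not a complete citation parser.

```python
import re

# Reporter abbreviations taken from the examples in the question.
# Assumption: extend this alternation for any other reporters in your data.
REPORTERS = r"(?:U\.S\.|S\.Ct\.)"

# One capitalized token, allowing abbreviations such as "L.A." or "Cal."
WORD = r"[A-Z][\w.]*"

# A party name: capitalized tokens, optionally separated by a comma and/or
# lowercase connectors such as "of" or "for" (e.g. "City of Ontario, Cal.").
PARTY = rf"{WORD}(?:,? (?:(?:of|for|the|and) )*{WORD})*"

# Full citation: "<party> v. <party>, <volume> <reporter> <page> (<year>)"
CITATION = re.compile(
    rf"\b{PARTY} v\. {PARTY}, \d+ {REPORTERS} \d+ \(\d{{4}}\)"
)


def find_citations(text):
    """Return (start, end, matched_text) for each citation-like span."""
    return [(m.start(), m.end(), m.group(0)) for m in CITATION.finditer(text)]


if __name__ == "__main__":
    sample = (
        "The court relied on Bell v. Wolfish, 441 U.S. 520 (1979), and later on "
        "City of Ontario, Cal. v. Quon, 130 S.Ct. 2619 (2010)."
    )
    for start, end, citation in find_citations(sample):
        print(start, end, citation)
```

One known limitation: a capitalized word directly before the first party (for example "See" or "In" at the start of a sentence) gets absorbed into the match, so the left boundary may need cleaning up. The character offsets can then be turned into entity spans, for instance with spaCy's doc.char_span(start, end, label="CITATION"), where the label name is only an example. That gives you pre-annotated spans to correct instead of labelling every citation by hand.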