Using Prodigy to create an NER entity for identification of references

I'm new to spaCy and Prodigy. I would like to create a model that can identify and extract the references in scientific papers.

Examples of such references - mostly at the end of a paper - are:

[11] Jayasingh BB, Patra MR, Mahesh DB (2016) Security Issues and Challenges of Big Data Analytics. IEEE 204–208.
[12] Karafiloski E, Mishev A (2017) Blockchain Solutions for Big Data Challenges A Literature Review. IEEE EUROCON 2017 6–8.
[13] Karame G, Ghassan (2016) On the Security and Scalability of Bitcoin’s Blockchain. Proc 2016 ACM SIGSAC Conf Comput Commun Secur - CCS’16 1861–1862. doi: 10.1145/2976749.2976756
[14] Khan MA, Salah K (2017) IoT security: Review, blockchain solutions,
and open challenges. Futur Gener Comput Syst. doi: 10.1016/j.future.2017.11.022

Is there a way to create a model using Prodigy to identify and extract the references? Ideally, it would first identify each reference as a single unit and then, in a second step, identify the author, title, and journal.

Do you have an idea where I can start?


Honestly I think regular expressions would be a better tool for this. You could probably get spaCy and Prodigy solving the problem, but it will take more effort to create the training data than you would probably spend on the regular expressions. The format of the citation fields will have been generated by some mark-up on the documents and some stylesheet — so there’s going to be a limited number of formats. spaCy and Prodigy are really set up for working with raw text — if the data is actually structured, it’s better to put the work in to parse the structure, even if the structure is fairly annoying to parse.
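As a rough sketch of the regular-expression approach, assuming every reference follows the "[N] Authors (Year) Title. Venue ..." pattern from the examples in the question: the function names `split_references` and `parse_reference` and the patterns themselves are illustrative, not part of spaCy or Prodigy, and other citation styles would need their own patterns.

```python
import re

# Assumption: each reference starts on a new line with a marker like "[11]".
REF_START = re.compile(r"^\[(\d+)\]\s*", re.MULTILINE)

# Assumption: "Authors (Year) Title. Rest", with the title ending at the
# first period followed by whitespace. Titles containing ". " will be cut short.
REF_FIELDS = re.compile(
    r"(?P<authors>.+?)\s*\((?P<year>\d{4})\)\s*(?P<title>.+?)\.\s+(?P<rest>.*)"
)
DOI = re.compile(r"doi:\s*(\S+)", re.IGNORECASE)

SAMPLE = """\
[11] Jayasingh BB, Patra MR, Mahesh DB (2016) Security Issues and Challenges of Big Data Analytics. IEEE 204–208.
[14] Khan MA, Salah K (2017) IoT security: Review, blockchain solutions,
and open challenges. Futur Gener Comput Syst. doi: 10.1016/j.future.2017.11.022
"""


def split_references(text):
    """Step 1: split a reference section into (number, body) pairs."""
    # With one capturing group, re.split returns
    # [preamble, num1, body1, num2, body2, ...].
    parts = REF_START.split(text)
    refs = []
    for num, body in zip(parts[1::2], parts[2::2]):
        # Collapse line breaks inside a reference into single spaces.
        refs.append((int(num), " ".join(body.split())))
    return refs


def parse_reference(body):
    """Step 2: pull authors, year, title and DOI out of one reference."""
    match = REF_FIELDS.match(body)
    if match is None:
        return None  # format not covered by the assumed pattern
    fields = match.groupdict()
    doi = DOI.search(fields["rest"])
    fields["doi"] = doi.group(1).rstrip(".") if doi else None
    return fields


if __name__ == "__main__":
    for number, body in split_references(SAMPLE):
        print(number, parse_reference(body))
```

References that return `None` are the ones whose format the pattern does not cover, so they show which additional formats still need handling.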