Generic or specific entities and multilabel text categorization

Hi there!

I am new to Prodigy and spaCy and I am amazed by the possibilities I have seen so far. Great work!
I am starting a new project that analyzes R&I projects. I would like to extract the relevant entities and label each project with its related subjects, but I have several questions before organizing the work:

  • Should I start by labeling generic entities for every field of R&I and then, once I have those generic entities, identify more specific entities from the generic ones? (I have seen in the docs that it is easier to train the models if the labels are more generic.)

  • In order to label the documents with the subjects of research, would it be useful to find the category at the paragraph level? I have used LDA for Topic Modelling but I would like to improve the results and I don’t know if text categorization could be used for that.


Hi Maria,

Glad to hear the tools look promising! You’re definitely asking good questions, but unfortunately it’s hard to give you very good answers.

Many of these decisions are basically about making a trade-off between how quickly things can be annotated, how directly the model’s target output would answer your application’s requirements, and how accurately the model can reproduce its target output. Finding the best balance between these trade-offs for a specific problem is “fact intensive” (to borrow a term from the legal industry).

For instance, you might find that it’s quite easy to annotate with a high level of detail, because you can use a word list to do much of the work. In another project, you might find that highly detailed annotations are much too costly to produce, and you need to use a more coarse-grained scheme. Sometimes making the model output more generic doesn’t complicate the application logic that uses the output, e.g. paragraph-level annotations are often good enough. In another application, paragraphs are less useful and you’d really like word spans, but then you find the word spans can’t be predicted accurately enough, so you settle on sentence annotations.
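To make the word-list idea a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration: the `preannotate` helper, the `METHOD` label and the term list are all made up for this example, and the output is a dict with a `"text"` and a list of character-offset `"spans"`, which is roughly the shape of a pre-annotated task (check the Prodigy docs for the exact fields your recipe expects).

```python
import re


def preannotate(text, terms, label):
    """Return a task dict with one span for every word-list hit in the text."""
    # Try longer terms first so "machine learning" wins over "learning".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    spans = [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in pattern.finditer(text)
    ]
    return {"text": text, "spans": spans}


# Hypothetical word list and label, for illustration only.
METHOD_TERMS = ["machine learning", "topic modelling", "learning"]

task = preannotate(
    "The project applies machine learning and topic modelling to grant data.",
    METHOD_TERMS,
    "METHOD",
)
matched = [task["text"][s["start"]:s["end"]] for s in task["spans"]]
```

The suggested spans would still need to be reviewed by hand; the point is just that a word list can make detailed annotation much cheaper than starting from scratch.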

The most general advice we can give is that you should spend some time experimenting with different ways to decompose your requirements into different models, so you can figure out the best trade-off for your problem. Avoid deciding on a single annotation strategy up-front: your initial decisions won’t be based on much evidence, so they’re unlikely to be the best ones.
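One cheap way to start experimenting is to cut the same documents into units of different sizes and try annotating a small batch of each. The sketch below is just one possible setup, with hypothetical helper names and a deliberately naive sentence splitter (a real project would probably use a proper sentence segmenter). It produces paragraph-level and sentence-level tasks as `{"text": ...}` dicts, which can be written out one JSON object per line.

```python
import json
import re


def paragraph_tasks(doc):
    """One task per paragraph; blank lines separate paragraphs."""
    return [{"text": p.strip()} for p in re.split(r"\n\s*\n", doc) if p.strip()]


def sentence_tasks(doc):
    """One task per sentence, using a naive split on ., ! or ? plus whitespace."""
    tasks = []
    for para in paragraph_tasks(doc):
        for sent in re.split(r"(?<=[.!?])\s+", para["text"]):
            if sent:
                tasks.append({"text": sent})
    return tasks


# Made-up example document.
doc = (
    "We study soil microbes. Samples come from three farms.\n\n"
    "Results inform irrigation policy."
)

para = paragraph_tasks(doc)
sents = sentence_tasks(doc)

# One JSON object per line, ready to save as a .jsonl file.
jsonl = "\n".join(json.dumps(t) for t in sents)
```

Annotating a few dozen examples at each granularity should give a first impression of how fast each one is to label and how useful the resulting categories would be for the application.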