80 Entities ner.manual

Hi guys,

We are working on a project that contains 80 entities. When we try to use ner.manual, we can only see the entity labels, not the text. Any idea how to solve this?

Thanks,
Victor

Hi Victor!

By 80 entities, do you mean 80 potential labels for an annotator to choose from, or that a single text example has 80 highlighted entity spans in the text?

Typically we'd recommend fewer than 20 potential entity labels (ideally fewer than 10). If you have 80 choices, it's likely there is some overlap between types, and this would be hard for a model to learn anyway. If you're able to share the list of entity labels, we could make a recommendation on how to restructure this into fewer options.

If a single example has 80 highlighted entity spans, we recommend annotating a single sentence at a time. This is the default behavior, so if that's the case, make sure the --unsegmented option is not set when you run the ner.manual recipe.
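For example, a minimal ner.manual call without that flag could look like this (the dataset name, model, file path and labels below are just placeholders, and the exact recipe arguments may differ between Prodigy versions):

```
prodigy ner.manual medical_ner en_core_web_sm ./texts.jsonl --label DRUG,DOSE,SYMPTOM
```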

Yeah, I agree: if your goal is to annotate named entities or similar spans and you've ended up with this many labels, we'd typically recommend rethinking your label scheme and structuring your task differently. You're making your life a lot harder this way; it'll be much more difficult to create consistently annotated data with enough coverage, and your model will be much less likely to learn from it effectively. Also see this thread for more background and suggestions:


Hi Kabir Khan,

Thanks for the reply. Yes, I mean 80 labels.

We trained a model with around 65 entities. It took us more than a year and two months to label more than 57k sentences with those 65 entities, and we got great F-score, precision and recall scores (we used 3 validation sets to make sure we were doing it correctly). For that project we used a different piece of software for labeling, and it was a very difficult task. A colleague recommended using Prodigy and its patterns file to annotate the text for this new project, as it will be way faster.

Prodigy seems to do a good job. However, when I use the --label option to set the labels, we cannot see the text. We can only see the labels in the interface.

The project that we are working on is very specific.
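For example, our patterns file will contain one JSON object per line, roughly like this (the labels and terms here are made-up examples, not our real scheme):

```
{"label": "DRUG", "pattern": [{"lower": "metformin"}]}
{"label": "DRUG", "pattern": [{"lower": "insulin"}, {"lower": "glargine"}]}
{"label": "SYMPTOM", "pattern": "shortness of breath"}
```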

Oh wow, that's a very solid annotation effort :sweat_smile: And definitely interesting that it works well at a large scale like this.

Ah, I think with this many labels, the default header (which is sticky) just ends up covering the text. But you can add a line of CSS via your global_css setting that gives the label container a maximum height and makes it scrollable.
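As a rough sketch, something like this in your prodigy.json should do it. Note that .prodigy-labels is just a placeholder selector here, so inspect the app in your browser to find the actual class name of the label container in your Prodigy version:

```
{
  "global_css": ".prodigy-labels { max-height: 200px; overflow-y: auto; }"
}
```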

Thanks, Ines!! We will try it!

Yes, we did rethink our label scheme and structure many, many times :sweat_smile:. The problem was that the broader we made it, the more problems we had with our end goal. This project is for a very specific area of the medical field, and the language is very specific too. Therefore, we needed to be more specific, and it works! We have an entity extraction script and it does great for our end goal.

You guys should be very proud of the work you have done with spaCy and Prodigy.

P.S. We hope to share all our work on the spaCy Universe in a few months.


Thanks so much, and yes, definitely keep us updated on the progress! Can't wait to check out your final project :raised_hands:

Yeah, this is definitely the #1 message and recommendation we have for people. 1 year of work and 50k+ annotations is a big commitment, so you definitely want to make sure you're on the right track before you get to work :sweat_smile:

There are a lot of use cases we've seen where a large label scheme like this just wasn't a good fit (hierarchical categories, distinctions that are very difficult for a model to learn), or where there was a more efficient solution that required less work for the same results. So our first reaction is usually: "Are you sure you don't want to structure your label scheme differently?" But of course, there are always exceptions, and I'll bookmark this thread as a reference in case this comes up again in the future. It also gives a good ballpark of the number of annotations you need in order to make a label scheme like this work (if the label scheme is designed well).