Thank you for such a GREAT tool! I dove right in with a text file and did 200+ annotations using ner.manual with about 5 labels. I have several files that are transcripts of recorded interviews between a researcher and a customer responding to questions. It was SO cool, and I trained a model and got super excited about the very good results. BUT, I then wondered if in fact I was using the wrong recipe, since I'm really tagging longer text spans - like questions and responses - that aren't named entities. I finally saw you recommend text classification for longer spans. My question: do I have to redo what I did with ner.manual? Should I switch over to textcat? I will be doing lots of annotation for interviews and want to learn the best way - THANK YOU!
I don't think there's a built-in method to "convert" the data you created with ner.manual into textcat-type annotations, but it should be relatively easy to implement this yourself if the data supports both views.
What I mean by that: if all your previous annotations contain just one named entity that you've labeled, and you feel this in fact represents the whole sentence/text snippet, then you could "extend" the label of the entity (NER) to the full text (textcat). However, if you have cases where you annotated multiple named entities in the same text snippet, then there won't be a trivial 1:1 mapping and you might need to do some manual work to resolve those.
Let's say you mainly have the first case. Have a look at what your data looks like in JSONL format when running prodigy db_out your_ner_database | head -1.
Now annotate one or two examples with the textcat interface and have a look at what those examples look like in JSON. You'll see it is largely the same, except for the spans part that you have for NER and not for textcat; for textcat you instead have options and an accept field that stores the textcat label that was chosen. Also note that the _view_id is different.
So basically you could write a script that converts the NER JSONL annotations into textcat JSONL annotations by setting the span label as the textcat accepted choice and tweaking the data so that it looks like it came from a textcat recipe. Then you could use prodigy db_in to load it all back into Prodigy, and you'll have a textcat dataset without having to relabel everything manually.
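To make the idea concrete, here's a rough sketch of such a conversion script. The field names (spans, options, accept, _view_id, answer) reflect the typical shape of Prodigy's db_out output as described above, but you should verify them against your own data; the label list and file names are just examples:

```python
import json

def ner_to_textcat(record, all_labels):
    """Convert one ner.manual record into a textcat-style ("choice") record.
    Returns None for records with zero or multiple spans, so you can
    resolve those ambiguous cases by hand."""
    spans = record.get("spans", [])
    if len(spans) != 1:
        return None
    chosen = spans[0]["label"]
    return {
        "text": record["text"],
        # Offer the full label set as options, as the textcat interface does
        "options": [{"id": lbl, "text": lbl} for lbl in all_labels],
        # The span's label becomes the accepted textcat choice
        "accept": [chosen],
        "_view_id": "choice",
        "answer": record.get("answer", "accept"),
    }

def convert_file(in_path, out_path, all_labels):
    """Read NER JSONL, write textcat JSONL, skipping ambiguous records."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            converted = ner_to_textcat(json.loads(line), all_labels)
            if converted is not None:
                fout.write(json.dumps(converted) + "\n")

if __name__ == "__main__":
    # Example label set - replace with your own
    labels = ["INTERVIEWER", "RESPONDENT", "QUESTION", "OTHER"]
    example = {
        "text": "How satisfied are you overall?",
        "spans": [{"start": 0, "end": 30, "label": "QUESTION"}],
        "answer": "accept",
    }
    print(json.dumps(ner_to_textcat(example, labels)))
```

You'd run this over the output of prodigy db_out and then load the converted file back in with prodigy db_in into a fresh dataset.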
Let me know if that works for you or if you need more help with this!
Thank you for the quick reply! Let me add a little more detail about what I did and see what you recommend. I fully annotated all pieces of each text using ner.manual, as if I were doing text annotation. So for example I have a label 'response', and I selected the entire span of text for the response in every case. I labeled the name of the questioner as INTERVIEWER and the responder as RESPONDENT. I labeled the span of the question as QUESTION. And everything else I labeled as OTHER. So every text is fully labeled. I say this because I think I took the approach of labeling text as opposed to named entities. I see you mentioned there's a different attribute for textcat that needs to be recognized - given the above, I'd appreciate your advice again on the best next step.
To be honest, that doesn't really sound like a textcat challenge to me. Text classification is more about assigning one or multiple labels to the entire text. From what you described, you're still interested in finding the actual token boundaries of your 5 types of entities, correct?
That said, you're right that this doesn't sound like a "traditional" NER challenge like recognizing properly capitalized names or cities. But you still need something similar to NER to detect the token boundaries and label them. The good news is that we've recently added support in spaCy for a new type of component called spancat that does exactly that. You can read a bit more here: https://spacy.io/usage/v3-1#spancategorizer
The spancat component has fewer restrictions than NER: it allows overlapping/nested spans, for instance. Traditionally, with NER, one token can belong to only one named entity, but with spancat a token can belong to multiple spans. You might not need this feature for your use case, but it might still be worth experimenting to see whether the spancat component gives you better results than ner when training.
The upcoming Prodigy 1.11 will provide built-in support for training a spancat component directly within Prodigy, but you can already experiment with spaCy 3.1 to check out this (experimental) functionality. I can't say up front whether or not it'll improve on your NER results, though. But it's worth a shot!
PS: I'm not sure what the use is of your "OTHER" label? It feels like an unnecessary burden during annotation, and I'm not sure it actually helps with performance.
Thank you SofieVL! This is my first annotation project, so I really appreciate the guidance. I see now that textcat is probably not the right approach and ner wasn't a bad decision. What caught me was something in the textcat documentation that said it was better for longer spans, like sentences and paragraphs. What I'm trying to do is train a classifier to predict what is question text versus response text in a document that has many rounds of questions/answers. Some interviews are transcripts that have transcript 'junk' in them, like timestamps, filenames, and other noise - which was the purpose of 'OTHER'. So I'm labeling all that stuff like you would named entities. So you're right, I'm definitely interested in marking NE boundaries - at least I think I am! The first training round did really well, but I had only annotated one document. I thought I would ask before I started doing more docs whether I should switch to textcat.
I will definitely check out spancat!! But it sounds like ner still isn't a bad approach after all. Thank you, thank you - any thoughts welcome, and I'll do some looking into spancat.