I have annotated a dataset with ner.manual, trained a model and annotated more data using ner.correct.
Is it now possible to also get the POS tags for the tokens using db-out?
Just to confirm, have you been using ner.manual and ner.correct to label part-of-speech tags that you'd now like to export? Or are you interested in exporting the labelled entities together with the POS information that spaCy might predict?
In the case of the former, db-out will be able to return all annotations.
In the case of the latter, you'll need to use a spaCy model to attach the POS information after exporting the data via db-out. Alternatively, you could also train a spaCy model based on en_core_web_md (assuming you use English), which can already detect POS tags. You can train a pipeline using it as a starting point to detect the entities that you've labelled, and then you'll have a model that can predict both.
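To make the first option more concrete, here is a minimal sketch of attaching predicted POS tags to an export. It assumes the data was written with something like `prodigy db-out <datasetname> > annotations.jsonl` (the file name is just a placeholder) and that each record has a "text" field, an "answer" field and "spans" with character offsets "start" and "end". The helper names `attach_pos`, `spacy_tokens` and `tag_export` are made up for this example and are not part of Prodigy or spaCy.

```python
import json


def attach_pos(example, tokens):
    """Return the example's entity spans with the POS tags of the tokens they cover.

    example: one record from the exported JSONL (uses "spans" with "start"/"end").
    tokens:  list of (start_char, end_char, pos) tuples for the same text.
    """
    enriched = []
    for span in example.get("spans", []):
        # A token belongs to the span if the two character ranges overlap.
        pos = [p for (s, e, p) in tokens if s < span["end"] and e > span["start"]]
        enriched.append({**span, "pos": pos})
    return enriched


def spacy_tokens(nlp, text):
    """Tokens as (start_char, end_char, pos), predicted by a loaded spaCy pipeline."""
    return [(t.idx, t.idx + len(t), t.pos_) for t in nlp(text)]


def tag_export(path, nlp):
    """Yield (text, enriched spans) for every accepted example in the export."""
    with open(path, encoding="utf8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            eg = json.loads(line)
            if eg.get("answer") == "accept":
                yield eg["text"], attach_pos(eg, spacy_tokens(nlp, eg["text"]))
```

With spaCy installed, you would load a pipeline that has a tagger, for example `nlp = spacy.load("en_core_web_md")`, and loop over `tag_export("annotations.jsonl", nlp)`. Matching on character offsets rather than token indices means the result does not depend on the two tokenizers agreeing exactly.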
I want to export the labeled entities including the POS tags. If I use en_core_web_md, then the default entities are also included, right? If I want to determine the POS tags of the entities afterwards, I have to make sure the tokenization is identical. I trained with blank:en and use the default tokenizer.
The English models should all be using the same tokenizer unless you've customised it.
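If you'd like to double-check the tokenization on your own texts, a small comparison along these lines might help. It is only a sketch: `tokenization_mismatches` is a hypothetical helper, and it assumes each tokenizer is a callable that returns objects with a `.text` attribute, which is what spaCy's `nlp.make_doc` gives you.

```python
def tokenization_mismatches(tokenize_a, tokenize_b, texts):
    """Return (text, tokens_a, tokens_b) for every text the two tokenizers split differently."""
    mismatches = []
    for text in texts:
        a = [t.text for t in tokenize_a(text)]
        b = [t.text for t in tokenize_b(text)]
        if a != b:
            mismatches.append((text, a, b))
    return mismatches
```

For instance, you could pass `spacy.blank("en").make_doc` and `spacy.load("en_core_web_md").make_doc` together with a sample of your annotated texts. An empty result means both pipelines produced the same tokens for that sample.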
Here's how I might go about training a model that predicts both.
prodigy train --ner <datasetname> --base-model en_core_web_md --lang en <folder-out>
I'm using the train command here with the en_core_web_md pipeline as a starting point. Once the model is done training, I can load it.
import spacy
# Load the trained model
nlp = spacy.load("<folder-out>/model-best/")
# Run the model
doc = nlp("do you speak Python")
# Confirm the POS
[t.pos_ for t in doc]
# ['AUX', 'PRON', 'VERB', 'PROPN']
# Confirm ents
doc.ents
# (Python,)
This way, you re-use the POS tagger from the base model while adding the NER component trained on your Prodigy labels. Note that the POS tags are statistical predictions, though, so they will be wrong once in a while.