Hi, I'm new to Prodigy and spaCy, but I'm a fast learner.
I need to train a model to recognise an entity type consisting of clusters of abbreviations (with spaces between them). The language isn't English, so I trained a basic parser/tagger model from the Universal Dependencies treebank for that language. The model has an NER pipeline, but it's empty.
So I started with manual tagging and annotated about 1,000 paragraphs. I trained a first temporary model and did a second round of manual tagging based on it (another 500 paragraphs).
Then I tried the teach recipe, but the second temporary model only catches the first abbreviation of the entity cluster (there are usually 2-3 abbreviations within a tagged entity).
During manual tagging, the full entity (the cluster of abbreviations) is coloured, so I'm sure I'm tagging what I want. However, the model only recognises (and auto-tags) up to the first space within the entity cluster. Is there something specific I need to do so that the model recognises the full phrase (the tagged cluster of abbreviations)?
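For what it's worth, this is how I've been checking that my annotated character offsets actually line up with spaCy's token boundaries (I've read that misaligned spans can stop a model from learning the full entity). The blank English pipeline, the sample text, and the offsets are just placeholders for my language and data:

```python
import spacy

# Stand-in for my actual language model; "ab." is a placeholder abbreviation
nlp = spacy.blank("en")

text = "see ab. 12, ab. 34 Title here"
doc = nlp(text)

# Hypothetical character offsets of one annotated entity cluster
span = doc.char_span(4, 24, label="MY_ENT")

# char_span returns None when the offsets don't match token boundaries,
# which would mean the annotation can't be mapped onto the tokenization
print(span)
```

If `span` comes back as `None`, I assume the annotation and the tokenizer disagree about where the entity starts or ends.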
The entity pattern is roughly like this: "aa. NN, aa. NN Aaaaaa", where "a" is a letter and "N" is a number. I'm reading the data from a txt file where I've already split the sentences, one per line. Is Prodigy trying to split them again, so that the dot is treated as the end of a sentence, and that's where it goes wrong?
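This is the little script I used to check whether a rule-based sentence splitter would break the pattern at the dots. Again, "ab." is a placeholder abbreviation and I'm using a blank English pipeline plus spaCy's `sentencizer` just for illustration, not my actual model:

```python
import spacy

nlp = spacy.blank("en")           # stand-in for my language's pipeline
nlp.add_pipe("sentencizer")       # rule-based sentence boundary detection

# Placeholder text mimicking the "aa. NN, aa. NN Aaaaaa" pattern
doc = nlp("ab. 12, ab. 34 Title")

print([t.text for t in doc])              # the dot is split into its own token
print([s.text for s in list(doc.sents)])  # the dots are treated as sentence ends
```

On my machine the abbreviation dots do get treated as sentence boundaries, which is what made me suspect sentence splitting in the first place.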
Also, I haven't used any word2vec vectors as a base. If I did, would that change the behaviour I describe above?