Hi @qu-genesis,
I think the uncertainty sampling with multiple labels in `textcat.teach` introduced more ambiguous cases that confused the model's previously learned patterns. This may be related to the fact that the sampling was suboptimal because it worked with multiple labels while updating only one label at a time (as discussed in my previous post).
I would definitely recommend working with one label at a time to gather more examples from the underrepresented classes. You can then merge those datasets into the format expected by spaCy's `textcat_multilabel` by exporting them with `data_to_spacy`.
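To illustrate the merging step, here is a minimal sketch (plain Python, no Prodigy API; the session variables and the `accept` flag are hypothetical stand-ins for what you'd export from separate one-label-at-a-time sessions) of how per-label decisions combine into the `cats` dictionaries that `textcat_multilabel` expects:

```python
from collections import defaultdict

# Hypothetical one-label-at-a-time sessions: (text, label, accepted?)
session_sports = [("Team wins final", "SPORTS", True),
                  ("Markets rally today", "SPORTS", False)]
session_finance = [("Team wins final", "FINANCE", False),
                   ("Markets rally today", "FINANCE", True)]

def merge_sessions(*sessions):
    """Merge single-label sessions into one cats dict per text,
    the non-exclusive label -> 0.0/1.0 shape used for multilabel textcat."""
    merged = defaultdict(dict)
    for session in sessions:
        for text, label, accepted in session:
            merged[text][label] = 1.0 if accepted else 0.0
    return dict(merged)

merged = merge_sessions(session_sports, session_finance)
print(merged["Team wins final"])  # {'SPORTS': 1.0, 'FINANCE': 0.0}
```

In practice `data_to_spacy` handles this conversion for you; the sketch just shows why annotating each label separately still composes into one multilabel dataset.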
Some users have tried implementing `textcat.teach` with multiple labels so that the model gets updated on all labels with each non-exclusive annotation, but I'm not sure how that would affect the effectiveness of the active learning. It might take longer to converge than focused one-label-at-a-time sessions.
Re: number of examples
Definitely more are needed; I would recommend following spaCy's general-purpose advice here. You can also run `spacy debug data` once you've converted your data to spaCy's `DocBin` format to get more structural insights into your dataset.
Re: document length & architecture
All the architectures use some sort of pooling over the token vector representations. This lets them process documents of arbitrary length, but you're definitely right in thinking that it can lead to context dilution. Some architectures are more prone to this than others: the bag-of-words (BOW) model just pools n-gram representations without taking token position into account, which makes it the least appropriate for long texts. For a CPU solution, the ensemble model would probably be better, as it uses attention in the `textcat` component; if you can work on a GPU, a transformer-based architecture would handle long-distance relationships in the text even better thanks to self-attention.
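For reference, switching to the ensemble architecture is done in the training config. Below is a hedged, partial excerpt (a sketch, not a complete config: the `tok2vec` sublayer and other settings are omitted, and you should generate the full config with `spacy init config`):

```ini
# Illustrative excerpt only: selects the CPU-friendly ensemble
# architecture for a multilabel text classifier.
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
```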
That said, splitting the text into paragraphs and preprocessing the input to select the most representative parts makes a lot of sense; it is also the spaCy developers' general advice, since it uses memory more efficiently.
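A minimal sketch of that preprocessing idea (the splitting and selection strategy here is deliberately naive and illustrative; choose whatever notion of "representative" fits your documents):

```python
def select_paragraphs(text, max_paragraphs=3, keyword=None):
    """Split a long document on blank lines and keep the first few
    paragraphs, optionally only those containing a keyword, before
    passing the result to the text classifier."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if keyword:
        paragraphs = [p for p in paragraphs if keyword.lower() in p.lower()]
    return paragraphs[:max_paragraphs]

doc = "Intro about finance.\n\nDetails on markets.\n\nUnrelated footer."
print(select_paragraphs(doc, max_paragraphs=2))
# ['Intro about finance.', 'Details on markets.']
```

Running the classifier on a few selected paragraphs instead of the whole document keeps the pooled representation focused and the memory footprint small.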