consolidating unsegmented and segmented annotations

I have annotations for multiple labels. Some were collected with the automatic sentence segmentation in ner.teach and others with it disabled. I would like to combine the annotations and train a single model on them. I'm guessing that just mushing them together will result in bad training and difficult evaluation.

Maybe mushing them together won’t be so bad? It’s okay if texts are different lengths. I’ve actually been meaning to play with this more as a data augmentation strategy.
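Here is a rough sketch of what "mushing them together" could look like, assuming both sets have been exported as lists of dicts with the usual text / spans / answer keys. The function name combine_annotations and the "segmented" key are made up for illustration and are not part of Prodigy. Tagging each example with its source keeps it possible to evaluate the two groups separately later.

```python
import random


def combine_annotations(segmented, unsegmented, seed=0):
    """Merge two lists of annotation dicts into one shuffled list.

    Each example is copied and tagged with a hypothetical "segmented"
    flag so the two sources can still be told apart at evaluation time.
    """
    combined = []
    for eg in segmented:
        combined.append({**eg, "segmented": True})
    for eg in unsegmented:
        combined.append({**eg, "segmented": False})
    # Shuffle with a fixed seed so the mixed order is reproducible.
    random.Random(seed).shuffle(combined)
    return combined
```

The input lists are not modified, so the original exports can still be used on their own.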

One unfortunate thing about the split_sentences() method is that it doesn't currently save the original input hash, which makes it difficult to reconstruct the original stream. We'll definitely be fixing this in the next version.
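To illustrate why the original hash matters, here is a small standalone sketch of a splitter that carries a parent hash and character offset along with each sentence. This is not how split_sentences() is implemented: the regex splitter, the md5 hash, and the keys "orig_input_hash" and "orig_start" are all assumptions made for the example.

```python
import hashlib
import re


def hash_text(text):
    # Stand-in hash for illustration only; not Prodigy's own hashing.
    return hashlib.md5(text.encode("utf8")).hexdigest()


def split_with_origin(example):
    """Naively split a text into sentences, keeping a link to the parent.

    Each yielded dict stores the hash of the full original text and the
    character offset of the sentence within it (hypothetical keys).
    """
    text = example["text"]
    parent_hash = hash_text(text)
    # Very rough sentence pattern: a run of non-terminators plus any
    # trailing terminator characters.
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        sentence = raw.strip()
        if not sentence:
            continue
        # Adjust the offset for whitespace removed from the left.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        yield {
            "text": sentence,
            "orig_input_hash": parent_hash,
            "orig_start": start,
        }
```

With something like this, sentences sharing the same orig_input_hash can be grouped and sorted by orig_start to line segmented annotations up with the unsegmented text they came from.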


I was exploring the --unsegmented argument and came across this thread. Is it now possible to combine annotations of the same dataset that were tagged both with and without -U?