Yes, what you describe is one of the main problems with the boundaries interface at the moment. We’ve been going back and forth on this, and it’s been difficult to find the right balance of trade-offs in terms of efficiency, user experience, annotation speed and so on.
If you look at the source of the `mark` function in `prodigy/recipes/ner.py`, you can adjust the token slice here, either by using a different length or by using smaller spans that overlap:

```python
for i in range(0, len(doc), 9):  # document slice
    span = doc[i:i + 9]  # focused, annotatable tokens within the slice
```
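To create overlaps, you could step forward by fewer tokens than the span length. Here's a minimal sketch of that idea using a plain list in place of the `Doc` (the slicing logic is the same; the function name and parameters are just for illustration):

```python
def overlapping_slices(tokens, size=9, stride=6):
    """Yield windows of `size` tokens, advancing `stride` tokens each
    step, so consecutive windows share `size - stride` tokens."""
    for i in range(0, len(tokens), stride):
        window = tokens[i:i + size]
        if window:
            yield window

tokens = [f"tok{i}" for i in range(20)]
windows = list(overlapping_slices(tokens, size=9, stride=6))
# consecutive windows share 3 tokens of context at the boundary
```

The shared tokens give the annotator context on both sides of a slice boundary, at the cost of seeing some tokens twice.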
In theory, the interface can support any number of tokens, and up to 30 if you want to use keyboard shortcuts: shift+num for tens and shift+alt+num for twenties (e.g. shift+5 for 15).
You can also remove the `split_sentences(nlp, stream)` pre-processor to disable splitting incoming texts into sentences. The texts will then be shown exactly as they come in, so you might need to do some pre-processing yourself to make them easier to work with or annotate. But it also gives you more control over how that's done.
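As a sketch of what that custom pre-processing could look like: since the stream is just a generator of dicts with a `"text"` key, you can wrap it in your own generator before passing it on. The paragraph-splitting rule below is only an example, not something built into Prodigy:

```python
def split_paragraphs(stream):
    """Split each incoming example on blank lines, so very long
    texts become several shorter, easier-to-annotate tasks."""
    for eg in stream:
        for paragraph in eg["text"].split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                # copy the task so other keys (e.g. "meta") are kept
                yield {**eg, "text": paragraph}

stream = [{"text": "First paragraph.\n\nSecond paragraph.", "meta": {"id": 1}}]
tasks = list(split_paragraphs(stream))
```

You'd apply this in the recipe wherever `split_sentences` was called, and you can swap in whatever splitting logic suits your texts.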