If your tokenizer is implemented using custom code, you'll also need to provide the path to the code to execute. In Prodigy, you can do this using the -F flag, e.g.
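For instance, a call could look something like this (the dataset, model and file names here are just placeholders; the file passed via -F is whatever script contains your custom tokenizer code):

```
prodigy ner.manual my_dataset ./my_custom_model ./data.jsonl --label PERSON -F ./custom_tokenizer.py
```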
By default, the spans.manual recipe will show you all matches, since it supports overlapping spans. That said, you could make a small modification to the recipe so the pattern matching behaves like it does for non-overlapping entities: here, the (first) longest span will be preferred. You can find the recipe in recipes/spans.py in your Prodigy installation (run prodigy stats to see the path to your installation). You can then look for allow_overlap=True and set it to False in that call.
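For reference, the "prefer the (first) longest span" behaviour can be sketched in plain Python. This is a standalone illustration of the filtering logic, not Prodigy's actual implementation (spaCy ships a similar helper as spacy.util.filter_spans):

```python
def filter_spans(spans):
    """Keep only non-overlapping spans, preferring the (first) longest.

    Each span is a (start, end) offset pair. Standalone sketch of the
    filtering described above, not Prodigy's internal code.
    """
    # Sort longest first; on ties, the earlier span comes first
    ordered = sorted(spans, key=lambda s: (s[0] - s[1], s[0]))
    kept = []
    covered = set()
    for start, end in ordered:
        # Skip spans that overlap an already-covered position
        if any(i in covered for i in range(start, end)):
            continue
        covered.update(range(start, end))
        kept.append((start, end))
    return sorted(kept)

print(filter_spans([(0, 5), (3, 8), (3, 10)]))  # → [(3, 10)]
print(filter_spans([(0, 3), (2, 6), (7, 9)]))   # → [(2, 6), (7, 9)]
```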
One thing to consider here is that Prodigy keeps the latest batch of examples on the client before sending the answers to the server and saving them to the database. This allows easy undoing without having to reconcile multiple conflicting annotations in the database. So you can only go back one batch, because the other examples have already been sent back to the server and saved in the database (or, if you're annotating with a model in the loop, used to update the model).
If you find that you often want to go back further, you can set a larger
history_size – just keep in mind that those examples have not been sent back to the server yet.
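For example, you could bump the setting in your prodigy.json (the value 25 here is just an illustration, not a recommendation):

```json
{
  "history_size": 25
}
```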
Are you saving the annotations to the same dataset? If you're saving to the same dataset or are using the same dataset with
--exclude, you should only be seeing texts that haven't been annotated yet.
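As an illustration, excluding a dataset's existing annotations could look like this (dataset, model and file names are placeholders):

```
prodigy ner.manual my_dataset en_core_web_sm ./data.jsonl --label PERSON --exclude my_dataset
```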
You could just make your input multi-sentence documents and maybe separate them with a newline token? You can always retain the original character offsets so it's easy to later split them into sentences again if you need to.
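The merge-and-split-back idea can be sketched in a few lines of Python. This is a standalone illustration (the function names are made up, and the sketch tracks character offsets only, without any Prodigy-specific fields):

```python
def merge_sentences(sentences):
    """Join sentences into one document, separated by a newline,
    recording each sentence's character offsets in the merged text."""
    text = ""
    offsets = []
    for i, sent in enumerate(sentences):
        if i > 0:
            text += "\n"
        start = len(text)
        text += sent
        offsets.append((start, len(text)))
    return text, offsets

def split_back(text, offsets):
    """Recover the original sentences from the retained offsets."""
    return [text[start:end] for start, end in offsets]

sents = ["First sentence.", "Second sentence.", "Third one."]
doc, offsets = merge_sentences(sents)
assert split_back(doc, offsets) == sents
```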