It's possible that this is related to the suggester function, which by default will use an ngram range covering all the span lengths available in the data. So if you have really long spans, you'll end up with a lot of potential candidates (e.g. all possible spans between 1 and 60 tokens, which can be a lot). If you run prodigy train with the --verbose flag, it should show you more detailed information on the suggester function used (see the Span Categorization docs on the Prodigy site).
One option to prevent this would be to use a config that defines different logic for generating span candidates via the suggester function (see the SpanCategorizer page in the spaCy API docs). How you set this up depends on your data, but there may be common patterns you can rely on instead of considering every possible combination.
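For example, here's a minimal sketch of restricting the built-in ngram suggester to short spans. The suggester name spacy.ngram_suggester.v1 and its sizes setting come from the spaCy docs; the blank pipeline and the specific sizes are just placeholders you'd adapt to your own setup:

```python
import spacy

# Minimal sketch: add a spancat component whose suggester only proposes
# 1-3 token ngrams, instead of every span length seen in the data.
# Adjust "sizes" (and the spans_key) to match your own data.
nlp = spacy.blank("en")
nlp.add_pipe(
    "spancat",
    config={
        "spans_key": "sc",
        "suggester": {
            "@misc": "spacy.ngram_suggester.v1",
            "sizes": [1, 2, 3],
        },
    },
)
```

The equivalent settings would go in the [components.spancat.suggester] block of a training config.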
Suggester functions can also be integrated with Prodigy during annotation, so you can ensure that only spans matching the suggester can be selected (again, see the Span Categorization docs on the Prodigy site).
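For reference, whether it's used at training time or to constrain annotation, a suggester is just a registered function that returns a Ragged array of (start, end) token offsets per doc, so you can encode whatever candidate logic makes sense for your data. This is only an illustrative sketch: the noun-chunk logic and the registry name are made up for the example, and it assumes the docs come from a pipeline that sets doc.noun_chunks (e.g. one with a parser):

```python
from typing import Callable, List, Optional

from spacy.tokens import Doc
from spacy.util import registry
from thinc.api import Ops, get_current_ops
from thinc.types import Ragged


@registry.misc("noun_chunk_suggester.v1")  # illustrative name
def build_noun_chunk_suggester() -> Callable:
    def suggester(docs: List[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []    # (start, end) token offsets of all candidates
        lengths = []  # number of candidates contributed by each doc
        for doc in docs:
            candidates = [(nc.start, nc.end) for nc in doc.noun_chunks]
            spans.extend(candidates)
            lengths.append(len(candidates))
        lengths_array = ops.asarray1i(lengths)
        if spans:
            return Ragged(ops.xp.asarray(spans, dtype="i"), lengths_array)
        # No candidates at all: return an empty Ragged with integer dtype
        return Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)

    return suggester
```

You'd then point the suggester block of your config at that registered name instead of the default ngram suggester.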