Split a ner.manual dataset into smaller texts

Hi @dave-espinosa!

Glad to help!

Actually, from the db-in documentation, I realized that the "answer" tag is optional. If it is missing, db-in will automatically add "answer": "accept" for you to any record that does not have an "answer" tag.

Because all examples in Prodigy need an "answer" value, "answer": "accept" is automatically added to all imported examples, unless specified otherwise in the data or via the --answer argument.

So while the output says Found and keeping existing "answer" in 0 examples, that only means 0 examples already had an "answer" tag to keep, so db-in added one to each of them for you. It is not an error, though I can see now how that message is a bit confusing.

What's important is that the line just above it in the output should show that the same number of annotations were still loaded into the database, with "answer" automatically set to "accept".
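To make that behaviour concrete, here is a minimal sketch in plain Python of what the documentation describes. This is not Prodigy's actual implementation; the function name `add_default_answer` and the sample records are made up for illustration.

```python
def add_default_answer(records, default="accept"):
    """Return copies of the records with "answer" filled in where missing.

    Also returns how many records already had an "answer" (these are kept).
    Hypothetical helper: it only mimics the documented db-in behaviour.
    """
    filled = []
    kept = 0
    for record in records:
        if "answer" in record:
            kept += 1
            filled.append(dict(record))  # keep the existing answer
        else:
            filled.append({**record, "answer": default})
    return filled, kept


# Made-up examples without any "answer" key, as in a raw ner.manual export.
examples = [
    {"text": "Apple hires new CEO", "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "No entities here"},
]

filled, kept = add_default_answer(examples)
print(f"Imported {len(filled)} examples, kept existing answer in {kept}")
```

With input like this, "kept" is 0 while every imported record still ends up with "answer": "accept", which matches the message you saw.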

Great question! I would suggest this helpful post about text cleaning / pre-processing philosophy in general in spaCy:

tl;dr - typically there's not a need to pre-process in spaCy.

Also:

The most important consideration with spaCy's models is that the input should resemble the training data.

The post does note that "One kind of preprocessing that can be helpful is normalizing spaces and punctuation", which may be more in line with what you're doing.
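As a rough illustration of that kind of normalization, here is a small sketch using only the standard library. The character map and the name `normalize_text` are my own assumptions, not something from the post or from spaCy, so adjust them to whatever your training data looks like.

```python
import re

# Assumed mapping of "fancy" punctuation to plain ASCII equivalents.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00a0": " ",                  # non-breaking space
})


def normalize_text(text):
    """Map fancy punctuation to ASCII and collapse runs of whitespace."""
    text = text.translate(PUNCT_MAP)
    text = re.sub(r"\s+", " ", text)  # tabs, newlines, repeated spaces -> one space
    return text.strip()


print(normalize_text("  \u201cTeam\u00a0X\u201d   wins \n the  cup "))
# prints: "Team X" wins the cup
```

Whatever you apply here should also be applied to the text the model sees at prediction time, so the input keeps resembling the training data.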

Alternatively, if you have some known problem cases, you could use some sort of matcher/replace step to swap them for an entity string you know will be processed correctly. For example, suppose you find that "team X" is not treated correctly as an entity but "team-X" is. Instead of cleaning with a global rule that adds "-" across some matcher rule set, you keep a set of specific replacement pairs, like changing "team X" to "team-X". The downside is that it may be time-consuming to compile and manage these replacement pairs. The upside is that there are no unintended consequences where a global rule alters other entities.
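Here is one way the replacement-pair idea could look. This sketch uses a plain regular expression rather than spaCy's Matcher, just to keep it dependency-free; the `REPLACEMENTS` table and the function names are hypothetical.

```python
import re

# Hypothetical table of known problem strings and their fixed forms.
REPLACEMENTS = {
    "team X": "team-X",
    "team Y": "team-Y",
}


def build_pattern(replacements):
    """Compile one pattern matching any key as a whole phrase (longest first)."""
    keys = sorted(replacements, key=len, reverse=True)
    alternatives = "|".join(re.escape(key) for key in keys)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def apply_replacements(text, replacements):
    """Replace only the exact, known phrases and leave everything else alone."""
    pattern = build_pattern(replacements)
    return pattern.sub(lambda match: replacements[match.group(0)], text)


print(apply_replacements("Our team X beat team Y, not team Xavier.", REPLACEMENTS))
# prints: Our team-X beat team-Y, not team Xavier.
```

Because only the listed phrases are touched, "team Xavier" is left unchanged, which is the "no unintended consequences" upside. The matching here is case-sensitive; that is a choice in this sketch, not a requirement.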

Let me know if you have more specific questions, as there may be other spaCy universe tools that could help (e.g., spaczz for fuzzy matching).