Split a ner.manual dataset, into smaller texts

ryanwesslen · June 24, 2022, 1:58pm

Glad to help!

Actually, from the db-in documentation, I realized that the "answer" tag is optional. If it is missing, db-in will automatically add "answer": "accept" for any record that does not have an "answer" tag.

Because all examples in Prodigy need an "answer" value, "answer": "accept" is automatically added to all imported examples, unless specified otherwise in the data or via the --answer argument.

So when it prints the message Found and keeping existing "answer" in 0 examples, it's saying that it kept an original "answer" tag in 0 examples, because it added them for you. It is an informational message rather than an error, and I can see now how its wording is a bit confusing.

What's important is the line just above it in the output: it should show that the same number of annotations were still loaded into the database, each automatically populated with "accept".
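To make that behavior concrete, here is a minimal sketch of the defaulting logic described in the docs. It is not Prodigy's actual code: `add_default_answer` and the sample records are hypothetical names I made up for illustration, and the sketch only mimics "add the default unless the record already has an answer".

```python
import json


def add_default_answer(examples, default="accept"):
    """Return copies of the examples with "answer" filled in where missing.

    Also returns how many examples already had an "answer" and kept it.
    Illustration only; this is NOT how db-in is implemented internally.
    """
    filled = []
    kept = 0
    for eg in examples:
        eg = dict(eg)  # shallow copy so the input records are left untouched
        if "answer" in eg:
            kept += 1
        else:
            eg["answer"] = default
        filled.append(eg)
    return filled, kept


# Hypothetical records: the first has no "answer", the second has one.
records = [
    {"text": "Apple hires a new CFO", "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "No entities here", "answer": "reject"},
]

filled, kept = add_default_answer(records)
for eg in filled:
    print(json.dumps(eg))
print(f"kept an existing answer in {kept} of {len(filled)} examples")
```

If none of your records has an "answer", the "kept" count is 0 even though every record still ends up in the result, which matches the message you saw.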

Great question! I would suggest this helpful post about text cleaning / pre-processing philosophy in general in spaCy:

tl;dr - typically there's not a need to pre-process in spaCy.

Also:

The most important consideration with spaCy's models is that the input should resemble the training data.

The post does note that "One kind of preprocessing that can be helpful is normalizing spaces and punctuation", which may be more in line with what you're doing.
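As a rough idea of what that kind of light normalization could look like, here is a small sketch using only the standard library. `normalize_text` and the character mapping are my own assumptions, not something from the post or from spaCy, so adjust the mapping to whatever actually shows up in your data.

```python
import re
import unicodedata

# Hypothetical mapping of typographic characters to plain ASCII equivalents.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00a0": " ",                  # non-breaking space
})


def normalize_text(text):
    """Normalize Unicode form, common punctuation variants, and whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_PUNCT_MAP)
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace, incl. newlines
    return text.strip()


print(normalize_text("team\u00a0X  said \u201chello\u201d\n"))
```

One caveat: span annotations are stored as character offsets, so changing the text after annotating will shift them. This kind of cleanup is safest on the raw text before it goes into ner.manual, and the same cleanup should then be applied at prediction time so the input still resembles the training data.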

Alternatively, if you have some known problem cases, you could use some sort of matcher/replace step to swap each one for a string you know will be processed correctly. For example, suppose you find that "team X" is not treated correctly as an entity but "team-X" is. Instead of cleaning with a global rule that adds "-" wherever some matcher rule set fires, you keep a set of specific replacement pairs, such as changing "team X" to "team-X". The downside is that it may be time consuming to compile and manage these replacement pairs. The upside is that there are no unintended consequences where a global rule alters other entities.
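A minimal sketch of the replacement-pairs idea, using plain regular expressions as a stand-in for a spaCy matcher. `REPLACEMENTS`, `build_pattern` and `apply_replacements` are hypothetical names, and the single pair is just the example from above.

```python
import re

# Hypothetical, hand-maintained pairs: known problem string -> known good string.
REPLACEMENTS = {
    "team X": "team-X",
}


def build_pattern(replacements):
    """Compile one regex that matches any key as a whole phrase."""
    # Longest keys first, so a longer phrase wins over a shorter prefix of it.
    keys = sorted(replacements, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    # \b assumes each key starts and ends with a word character.
    return re.compile(r"\b(?:" + alternation + r")\b")


def apply_replacements(text, replacements=REPLACEMENTS):
    """Replace only the exact phrases listed in `replacements`."""
    pattern = build_pattern(replacements)
    return pattern.sub(lambda m: replacements[m.group(0)], text)


print(apply_replacements("We met team X and team Xavier."))
```

Because only the listed phrases are touched, "team Xavier" is left alone here, which is the "no unintended consequences" upside. As with any text change, it shifts character offsets, so apply it before annotating rather than to an already annotated dataset.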

Let me know if you have more specific questions, as there may be other spaCy universe tools that could help (e.g., spaczz for fuzzy matching).
