Split a ner.manual dataset into smaller texts

Hello everyone,

I am using Prodigy to train a NER model with 4 different labels. In my first approach I did not obtain good-enough results, and after checking this flowchart, I realized it might be worth training another model, this time with smaller texts. I used Prodigy's ner.manual to build my original training dataset (with "large" texts), obtaining a JSONL file as a result, in which each sample is a dictionary with text, _input_hash, _task_hash, _is_binary, tokens, _view_id, spans (not present if the text was rejected or ignored), answer, _timestamp, and their corresponding values. I then built my own function (as I did not find one in either Prodigy or spaCy) to "split" the texts from that dataset into smaller ones (while adjusting the spans of the labeled entities accordingly). What I have now looks like this (code borrowed from your "Advanced NLP with spaCy" course, chapter 4):

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# New "smaller" Docs with updated entity spans

doc1 = nlp("Python and Java, are two very important assets in the role.")
doc1.ents = [Span(doc1, 0, 1, label="LABEL1"), Span(doc1, 2, 3, label="LABEL2")]

doc2 = nlp("As a data scientist, you are expected to do that and beyond.")
doc2.ents = [Span(doc2, 2 , 4 , label="LABEL3")]

doc3 = nlp("I need a new phone! Any tips?")
doc3.ents = []

docs = [doc1, doc2, doc3]  # With all the docs...

dataset = DocBin(docs=docs)
dataset.to_disk("./dataset.spacy")

But now, I am a bit lost, more specifically:

  1. How do I "convert" that dataset.spacy file into something compatible with Prodigy? Is there a more direct approach to what I am doing here? (I could convert what I currently have into ner or ner_manual format instead, but then I don't know how to use the generated JSONL with db-in; the documentation does not include any example, and the description is not quite self-explanatory.)
  2. Is there any "recommended ratio" between samples "with and without" entity spans?

Hi @dave-espinosa!

Thanks for your question!

I think I found a solution for #1:

I ran your code (FYI, you're missing from spacy.tokens import Span) and ignored the last two lines, since you can skip the intermediate step of saving the binary to disk.

import srsly

examples = []  # examples in Prodigy's format
for doc in docs: # loop through docs
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans, "answer": "accept"})

srsly.write_jsonl("data.jsonl", examples)

This is nearly identical to the link above -- but the one addition is that I added an "answer": "accept" for each example, assuming you want to accept all the examples (which may not be true). You can then load the file using db-in. Can you confirm whether this solves your problem?

I think you can shrink this code even more by skipping the srsly.write_jsonl and db-in steps and loading the examples directly into Prodigy's database using the database.add_examples method.
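For example, something like this should work (a minimal sketch, assuming the examples list from above; the dataset name "ner_smaller_texts" is just a placeholder, and set_hashes is used so Prodigy can detect duplicates later):

from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()  # connect to the database configured in prodigy.json
db.add_dataset("ner_smaller_texts")  # placeholder dataset name

# add input/task hashes so Prodigy can recognize duplicates
examples = [set_hashes(eg) for eg in examples]
db.add_examples(examples, datasets=["ner_smaller_texts"])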

Were you splitting your text by sentences? If so, did you see the split_sentences function in Prodigy?

from prodigy.components.preprocess import split_sentences
import spacy

nlp = spacy.load("en_core_web_sm")
stream = [{"text": "spaCy is a library. It is written in Python."}]
stream = split_sentences(nlp, stream, min_length=30)

This returns a generator (stream), so it is best used within a custom recipe, but you can call list() on it to convert it to a list. This may be helpful if you need to split sentences on your own. As you may have noticed, ner.manual doesn't do any sentence splitting; however, ner.correct and ner.teach do it automatically by using split_sentences(). Hope this helps for future purposes!
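For instance, just to inspect the results of the snippet above, you could materialize the generator like this (inside a custom recipe you'd normally keep it lazy):

examples = list(stream)  # consume the generator
for eg in examples:
    print(eg["text"])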

As for your question #2, unfortunately I'm not aware of a recommended ratio. Sounds like it would be an interesting experiment.

Hello @ryanwesslen , hope you're doing fine.

Some comments / questions from my side, based on your feedback:

Thanks! Code snippet updated :grin:. You got the point, that's what really matters :wink:

Yes, it does solve my problem indeed. (To future readers: the command I used to import "data.jsonl" into a dataset called "test_data" in SQLite was: prodigy db-in test_data ./data.jsonl).

I have a question here: as an experiment, I did NOT include the "answer" values, and what I got was the message: "Found and keeping existing "answer" in 0 examples". Does this mean that the dataset gets built, but all texts are considered to be rejected / ignored? Or what happens instead?

No, I did not see it, thanks! Something I did not find in the documentation, but which could be of use to future readers, is that you can include your own spans: split_sentences even takes care of the re-tokenization! It seems to build on the ner format (and I think it would also be valid for ner_manual). The code I ran was:

from prodigy.components.preprocess import split_sentences
import spacy

nlp = spacy.load("en_core_web_sm")
stream = [
    {
        "text": "spaCy is a library. It is written in Python.",
        "spans": [
            {"start": 0, "end": 5, "label": "FRAMEWORK"},
            {"start": 37, "end": 43, "label": "PROGLANG"}
        ]
    }
]
stream = split_sentences(nlp, stream, min_length=30)

After converting the generator into a list and printing it, the result was the following:

{'text': 'spaCy is a library.', 'spans': [{'start': 0, 'end': 5, 'label': 'FRAMEWORK', 'token_start': 0, 'token_end': 0}], 'tokens': [{'text': 'spaCy', 'start': 0, 'end': 5, 'id': 0, 'ws': True}, {'text': 'is', 'start': 6, 'end': 8, 'id': 1, 'ws': True}, {'text': 'a', 'start': 9, 'end': 10, 'id': 2, 'ws': True}, {'text': 'library', 'start': 11, 'end': 18, 'id': 3, 'ws': False}, {'text': '.', 'start': 18, 'end': 19, 'id': 4, 'ws': True}], '_input_hash': -1201033890, '_task_hash': -1720991562}
{'text': 'It is written in Python.', 'spans': [{'start': 17, 'end': 23, 'label': 'PROGLANG', 'token_start': 4, 'token_end': 4}], 'tokens': [{'text': 'It', 'start': 0, 'end': 2, 'id': 0, 'ws': True}, {'text': 'is', 'start': 3, 'end': 5, 'id': 1, 'ws': True}, {'text': 'written', 'start': 6, 'end': 13, 'id': 2, 'ws': True}, {'text': 'in', 'start': 14, 'end': 16, 'id': 3, 'ws': True}, {'text': 'Python', 'start': 17, 'end': 23, 'id': 4, 'ws': False}, {'text': '.', 'start': 23, 'end': 24, 'id': 5, 'ws': False}], '_input_hash': 1690421792, '_task_hash': 2027191328}

Nice!!!

However, I have another question: what if I want to perform some "additional text cleaning" before splitting? As you know, when cleaning data, some characters could be removed, which removes the related tokens and, as a result, the original entity span locations would no longer match.

Thank you very much, let me know your thoughts on the newer questions.

Hi @dave-espinosa!

Glad to help!

Actually, from the db-in documentation, I realized that the "answer" tag is optional. If it is missing, db-in will automatically add "answer": "accept" for you for records that do not have an "answer" tag.

Because all examples in Prodigy need an "answer" value, "answer": "accept" is automatically added to all imported examples, unless specified otherwise in the data or via the --answer argument.

So while it shows the message Found and keeping existing "answer" in 0 examples, it's saying that it kept your original "answer" tags for 0 examples, because there were no existing "answer" values to keep; the missing values were filled in as "accept" for you. I can see now how that message is a bit confusing.

What's important is that one line above in the output you should see that the same number of annotations was still loaded into the database and automatically populated as "accept".
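For instance (using the --answer argument mentioned in the docs, with the dataset name from your command above), you could also make the default explicit with something like:

prodigy db-in test_data ./data.jsonl --answer accept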

Great question! I would suggest this helpful post about text cleaning / pre-processing philosophy in general in spaCy:

tl;dr - typically there's not a need to pre-process in spaCy.

Also:

The most important consideration with spaCy's models is that the input should resemble the training data.

The post does note that "One kind of preprocessing that can be helpful is normalizing spaces and punctuation", which may be more in line with what you're doing.

Alternatively, if you have some known problem cases, perhaps you could use some sort of matcher/replace step to substitute known issues with an entity string you know will be processed correctly. For example, suppose you find that "team X" is not treated correctly as an entity but "team-X" is. Instead of cleaning with a global rule that adds "-" across some matcher rule set, you keep a set of replacement pairs for specific cases, like changing "team X" to "team-X". The downside is that it may be time-consuming to compile and manage these replacement pairs; the upside is that there are no unintended consequences where a global rule alters other entities. A minimal sketch of that idea follows below.
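Here's a rough sketch of what I mean (purely illustrative; the replacement pairs and the example text are made up, and this assumes you clean the text before annotating, so there are no existing span offsets to keep in sync):

# hypothetical replacement pairs: known "problem" surface form -> form that is handled correctly
REPLACEMENTS = {
    "team X": "team-X",
}

def clean_text(text):
    # apply only the specific, known replacements instead of a global rule
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text

raw_stream = [{"text": "We need someone to join team X as soon as possible."}]
stream = [{"text": clean_text(eg["text"])} for eg in raw_stream]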

Let me know if you have more specific questions as there may be other spacy universe tools that could help (e.g., spaczz for fuzzy matchers).