Hello everyone,
I am using Prodigy to train a NER model with 4 different labels. In my first approach I did not obtain good-enough results, and after checking this flowchart I realized it might be worth training another model on smaller texts. I used Prodigy's ner.manual to build my original training dataset (with "large" texts), which produced a JSONL file in which each sample is a dictionary with text, _input_hash, _task_hash, _is_binary, tokens, _view_id, spans (not present if the text was rejected or ignored), answer, _timestamp, and their corresponding values. I then built my own function (as I did not find one in either Prodigy or spaCy) to split the texts from that dataset into smaller ones, while adjusting the spans of the labelled entities accordingly (a simplified sketch of that helper follows the code below). What I have now looks like this (code borrowed from your spaCy advanced course, chapter 4):
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# New "smaller" Docs with updated entity spans
doc1 = nlp("Python and Java, are two very important assets in the role.")
doc1.ents = [Span(doc1, 0, 1, label="LABEL1"), Span(doc1, 2, 3, label="LABEL2")]
doc2 = nlp("As a data scientist, you are expected to do that and beyond.")
doc2.ents = [Span(doc2, 2, 4, label="LABEL3")]
doc3 = nlp("I need a new phone! Any tips?")
doc3.ents = []

docs = [doc1, doc2, doc3]  # With all the docs...
dataset = DocBin(docs=docs)
dataset.to_disk("./dataset.spacy")
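For context, the splitting helper I mentioned is roughly along these lines. This is only a simplified sketch: split_record, the naive ". " delimiter and the Prodigy-style record (a dict with "text" and character-offset "spans") stand in for what my real function does.

def split_record(record, nlp, delimiter=". "):
    # Split one Prodigy-style record into smaller Docs, shifting the
    # character offsets of its labelled spans into each new, shorter text.
    text = record["text"]
    spans = record.get("spans", [])
    docs = []
    offset = 0
    for piece in text.split(delimiter):
        if not piece.strip():
            offset += len(piece) + len(delimiter)
            continue
        doc = nlp(piece)
        ents = []
        for span in spans:
            start, end = span["start"] - offset, span["end"] - offset
            if 0 <= start and end <= len(piece):  # span falls inside this piece
                char_span = doc.char_span(start, end, label=span["label"])
                if char_span is not None:  # offsets line up with token boundaries
                    ents.append(char_span)
        doc.ents = ents
        docs.append(doc)
        offset += len(piece) + len(delimiter)
    return docs

Running that over every accepted record and collecting the resulting Docs into a DocBin, as in the snippet above, is how I end up with dataset.spacy.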
But now, I am a bit lost, more specifically:
- How do I "convert" that dataset.spacy file into something compatible with Prodigy? Is there a more direct approach than what I am doing here? I could convert what I currently have into ner or ner_manual format instead, but then I don't know how to use the generated JSONL with db-in (the documentation does not include any example, and the description is not quite self-explanatory).
- Is there any "recommended ratio" between samples with and without entity spans?