Sounds like this might be a good case for the text classification mode? Just feed in the sentences, and annotate them with the label RELEVANT
(or something like that). There's also a video tutorial we've recorded on this topic in which I'm training an insults classifier on Reddit data from scratch.
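For example, assuming your sentences are stored in a sentences.jsonl file (the dataset name, model and label here are just placeholders), a session with textcat.teach could look something like this:

prodigy textcat.teach relevant_dataset en_core_web_sm sentences.jsonl --label RELEVANT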
After collecting a bunch of relevant sentences, you can export the dataset to a .jsonl file. This file can then be loaded back into Prodigy, for example, to annotate entities within the text.
prodigy db-out relevant_dataset /output_path
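Each line of the exported file is one JSON task. Just as a rough sketch of what an accepted example might look like (the values here are made up, and Prodigy also adds hashes and other meta information):

{"text": "Some relevant sentence...", "label": "RELEVANT", "answer": "accept"}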
Yeah, this is where the NER model comes in. You need at least some model that will predict something, even if it's bad. When you call ner.teach, the model you load in will be used to predict entities in your text, and the candidates will then be shown in the app so you can decide whether they're correct or not.
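For example, something like the following (dataset, model and file names are placeholders – any loadable spaCy model with an NER component should work):

prodigy ner.teach my_dataset en_core_web_sm my_texts.jsonl --label PERSON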
We just published spaCy v2.0.0a18 with a new German model that supports NER (see spacy.io for details). We'll probably need to release an update of Prodigy that makes it compatible with the new spaCy version, so you can load in the new models.
In the meantime, you can also pre-process your data with the new German model, generate the spans from the doc.ents, and then annotate them one by one using Prodigy's mark recipe:
import json
import spacy

nlp = spacy.load('de_core_news_sm')

examples = []
for text in ['A document...', 'Another document...']:
    # if your texts are long, maybe iterate over the sentences, too?
    doc = nlp(text)
    # convert the model's predictions to Prodigy's span format
    spans = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
             for ent in doc.ents]
    examples.append({'text': doc.text, 'spans': spans})

# one JSON object per line, ready to be saved out as a .jsonl file
jsonl_data = '\n'.join(json.dumps(line) for line in examples)
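You can then write the data out to a file and load it with mark, using the NER interface so the pre-highlighted spans are shown (file and dataset names here are placeholders):

with open('german_ner.jsonl', 'w', encoding='utf8') as f:
    f.write(jsonl_data)

prodigy mark german_ner_dataset german_ner.jsonl --view-id ner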