Question 1
Can I create a one-sentence-per-line format of the input text files using spaCy / Prodigy?
There are some recipes that can split the text on your behalf. The ner.correct and ner.teach recipes both segment incoming text into sentences by default, and both carry an --unsegmented flag that turns this behavior off.
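For example, with made-up dataset, model, and file names, the first call below would present one sentence at a time, while the second keeps the original texts whole:

python -m prodigy ner.correct my_dataset en_core_web_md ./examples.jsonl --label PERSON,ORG
python -m prodigy ner.correct my_dataset en_core_web_md ./examples.jsonl --label PERSON,ORG --unsegmented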
More generally though, the most direct way of getting what you want is to adapt the examples.jsonl file that you pass to Prodigy. Nothing is stopping you from pre-processing this file as you see fit, which includes running spaCy beforehand to turn paragraphs into sentences. You'd use a script that looks something like the one below:
import spacy
import srsly

# Best practice: use the same model as you use in Prodigy
nlp = spacy.load("en_core_web_md")
examples = srsly.read_jsonl("path/to/examples.jsonl")

def to_sentences(examples):
    # each example is a dict, so parse its "text" value with spaCy
    for doc in nlp.pipe(ex["text"] for ex in examples):
        # loop over all the detected sentences
        for sent in doc.sents:
            yield {"text": sent.text}

# Create a new generator with sentence examples
new_examples = to_sentences(examples)

# Save them to disk
srsly.write_jsonl("path/to/sentence_examples.jsonl", new_examples)
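To make that concrete, a made-up input line such as:

{"text": "The report covers Q3. Revenue grew by 12%."}

would come out of the script as two lines:

{"text": "The report covers Q3."}
{"text": "Revenue grew by 12%."}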
One small caveat: spaCy's sentence-splitting heuristics can make mistakes. On social media data in particular, it can make errors when punctuation is all over the place.
Another caveat: it's hard to say upfront, but sometimes it's easier for the model to detect spans/entities when you keep neighboring sentences around. This'd mainly be a problem if you have very short sentences and you're trying to detect very long spans with somewhat fuzzy starts and ends. I don't think it'll be much of a problem for you, but I figured it'd be good to at least mention that this might have consequences for the trained model later.
Question 2
I would like to assign a "section_name" label to each sentence that is a part of the given section. Is that possible in Prodigy?
Sure, that sounds like you'd also want to train a text classifier, and Prodigy has support for that. How many categories are you talking about, though? Also, can these categories overlap?
You'd probably use an interface like textcat.manual to annotate some examples, and then when it's time to train a model you can tell Prodigy to use the two datasets for the two tasks. That command would look something like:
python -m prodigy train out_dir --ner ner_dataset_name --textcat clf_dataset_name --lang en --base-model en_core_web_md
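For the annotation step itself, the textcat.manual call could look something like this, with made-up dataset, file, and label names (add --exclusive if the categories can't overlap):

python -m prodigy textcat.manual clf_dataset_name ./sentence_examples.jsonl --label INTRO,METHODS,RESULTS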
Question 3
Is there any way I can export the annotations created using the ner.manual recipe in word|POS|BILUO format?
As long as you can write a custom Python script that does what you want, you can output the data into any format you like! This isn't something I can give much advice on though, mainly because I'm only familiar with the spaCy and scikit-learn ecosystems when it comes to file formats.
If you're going down the custom scripting route, you'll probably want to use this example as a starting point. In particular, you'd probably write something like:
from prodigy.components.db import connect

def turn_into_custom_format(example):
    # you'd need to implement this yourself
    pass

db = connect()  # uses settings from prodigy.json
dataset = db.get_dataset("name_of_dataset")  # retrieve a dataset
new_dataset = (turn_into_custom_format(e) for e in dataset)
The db.get_dataset method returns a list of dictionaries that you can edit any way you see fit.
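To give a concrete starting point, here's a rough sketch of what turn_into_custom_format could look like for a word|POS|BILUO output. It uses spaCy's offsets_to_biluo_tags helper and assumes your examples come from ner.manual, so each one carries the raw text plus a "spans" list with character offsets and a label; tokens that don't align with the model's tokenization will come out as "-".

import spacy
from spacy.training import offsets_to_biluo_tags
from prodigy.components.db import connect

# Same model as before; its tagger provides the POS column
nlp = spacy.load("en_core_web_md")

def turn_into_custom_format(example):
    # Parse the raw text, then convert the annotated character
    # offsets into one BILUO tag per token
    doc = nlp(example["text"])
    offsets = [(s["start"], s["end"], s["label"]) for s in example.get("spans", [])]
    tags = offsets_to_biluo_tags(doc, offsets)
    return "\n".join(f"{tok.text}|{tok.pos_}|{tag}" for tok, tag in zip(doc, tags))

db = connect()
dataset = db.get_dataset("name_of_dataset")
new_dataset = [turn_into_custom_format(e) for e in dataset]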