Segmentation and newlines in ner.manual

Hi! I hope I’m understanding your question correctly – so you want to load in texts from a “custom” format and separate them into annotation tasks according to your own logic, right?

One option would of course be to pre-process your data: read in the input file, split on \xa0 and then write out a JSONL file with one task per line, e.g. {"text": "contract 1"}.
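If you go the pre-processing route, a minimal sketch could look like the following. The function names (split_contracts, convert) and the file paths are placeholders I made up for illustration, and I'm assuming your input is one UTF-8 text file with the contracts separated by \xa0.

```python
import json
from pathlib import Path


def split_contracts(raw_text, separator="\xa0"):
    """Split one long string into stripped, non-empty segments."""
    return [part.strip() for part in raw_text.split(separator) if part.strip()]


def convert(input_path, output_path, separator="\xa0"):
    """Read a text file, split it and write one {"text": ...} object per line.

    Returns the number of tasks written.
    """
    raw_text = Path(input_path).read_text(encoding="utf-8")
    texts = split_contracts(raw_text, separator)
    with Path(output_path).open("w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes newlines, so each task stays on one line
            f.write(json.dumps({"text": text}) + "\n")
    return len(texts)
```

Calling something like convert("contracts.txt", "contracts.jsonl") (both file names are hypothetical) should then give you a JSONL file that you can pass to the recipe as the source argument instead of piping.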

You can also do this with a custom loader script in Python and then pipe its output forward to the recipe. If no source argument is set on the command line, it will default to stdin (i.e. the output of the previous process). I’m describing this in more detail in this thread.

Here’s an example:

import json

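# YOUR_LONG_TEXT is a placeholder for the text you loaded from your file
contracts = YOUR_LONG_TEXT.split('\xa0')

for contract in contracts:
    contract = contract.strip()  # remove surrounding whitespace
    if not contract:  # skip empty segments, e.g. after a trailing \xa0
        continue
    task = {'text': contract}
    print(json.dumps(task))  # one JSON task per line on stdout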

You can then pipe the tasks forward like this:

python your_script.py | prodigy ner.manual your_dataset en_core_web_sm --label SOME_LABEL