Segmentation and newlines in ner.manual

ines · April 17, 2018, 5:25pm

Hi! I hope I’m understanding your question correctly – so you want to load in texts from a “custom” format and separate them into annotation tasks according to your own logic, right?

One option would of course be to pre-process your data: read in the input file, split on \xa0 and then write out a JSONL file with one {"text": "contract 1"} object per line.
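As a rough sketch of that pre-processing route (the file names `contracts.txt` and `contracts.jsonl` and the helper names `split_contracts` and `write_jsonl` are made up for illustration, so adjust them to your setup):

```python
import json
from pathlib import Path

SEPARATOR = "\xa0"  # the delimiter between contracts in the raw input


def split_contracts(raw_text, separator=SEPARATOR):
    """Split the raw text on the separator and drop empty segments."""
    segments = (segment.strip() for segment in raw_text.split(separator))
    return [segment for segment in segments if segment]


def write_jsonl(texts, output_path):
    """Write one {"text": ...} object per line."""
    with Path(output_path).open("w", encoding="utf8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own paths.
    raw = Path("contracts.txt").read_text(encoding="utf8")
    write_jsonl(split_contracts(raw), "contracts.jsonl")
```

The resulting file should then be usable as the recipe's source argument. Stripping whitespace and skipping empty segments is my assumption about what you'd want here, so drop that if leading or trailing whitespace matters for your annotations.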

You can also do this with a custom loader script in Python and then pipe its output forward to the recipe. If no source argument is set on the command line, it will default to stdin (i.e. the output of the previous process). I’m describing this in more detail on this thread.

Here’s an example:

import json

contracts = YOUR_LONG_TEXT.split('\xa0')
# you might also want to do some stripping of whitespace etc. here

for contract in contracts:
    task = {'text': contract}
    print(json.dumps(task))  # output dumped JSON

You can then pipe the tasks forward like this:

python your_script.py | prodigy ner.manual your_dataset en_core_web_sm --label SOME_LABEL
