Datasets and using pre-annotated data

usage
(Ya) #1

I finally used your example data set and it works. I called the new dataset new_dataset – after annotation, where can I find it?

My main question: imagine I have pre-annotated data in pickle (or JSON) format – how can I use Prodigy to improve the annotation?

(Ines Montani) #2

The data you’re annotating will be saved to a dataset in the Prodigy database. To export the annotations, you can use the db-out command:

prodigy db-out new_dataset > annotations.jsonl

This depends on what you want to do and what your goal is. Do you want to train a machine learning model? Do you want to correct labelled data? If you want to improve the existing annotations and correct them, you can convert them to Prodigy’s format, load them in with a recipe like ner.manual and re-annotate them.
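For example, once the data is converted to Prodigy’s JSONL format, a re-annotation session could look something like this (the dataset name, model and labels here are just placeholders):

prodigy ner.manual improved_dataset en_core_web_sm converted_data.jsonl --label LABEL_ONE,LABEL_TWO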

You might also want to check out the PRODIGY_README.html, which is available for download with Prodigy. It includes the detailed documentation, as well as an overview of the JSON format that Prodigy reads and creates.

(Ya) #3

We have data that was annotated with regexes. We want to add some new labels to it and then use the newly annotated data to train a spaCy model.

I read some of your comments, but I still need to know which format I should provide to feed into your interface.

I currently have access to this format:

a pickle file of the annotated text

Can you help me a bit with adding new labels to the pre-annotated data, and also with improving the annotations produced by the regexes?

best

(Ines Montani) #4

If you look at the “Annotation task formats” section in your PRODIGY_README.html, you’ll find the exact JSON format that Prodigy expects for pre-annotated data for the different annotation types (NER, text classification etc.). The format should be pretty straightforward: for each example, you usually have a "text" and then either a "label" or "spans", depending on what you’re annotating. You can then convert your pre-annotated data accordingly. For example, for named entity recognition, you’ll need the text and the start/end character offsets and labels for the entities in that text.
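To make that concrete, a single pre-annotated NER example in that format could look like this (the text and label here are made up):

{"text": "The tower is 300 metres tall", "spans": [{"start": 13, "end": 16, "label": "NUM"}]}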

(Ya) #5

Many thanks for your responses. As a test, I read my raw data into Prodigy and added arbitrary labels (defined via --label in ner.manual) to it. This shows I’m able to annotate raw data with an arbitrary set of labels. Here I used a JSON file containing the sentences of my data, converted it to JSONL, and everything worked fine.

Now back to my question: since I did special tokenization on my data with regexes and also annotated the data with regexes, I have the annotated data in this format in Python:

[[('Therefore', 'None'),
('CD', 'GEOM'),
('being', 'None'),
('dropped', 'None'),
('perpendicular', 'None'),
('to', 'None'),
('AB', 'GEOM'),
('where', 'None'),
('AD', 'GEOM'),
('which', 'None'),
('is', 'None'),
('half', 'None'),
('AB', 'GEOM'),
('is', 'None'),
('1000', 'NUM'),
('AC', 'GEOM'),
('will', 'None'),
('be', 'None'),
('3333⅓', 'NUM')],
[('Looking', 'None'),
('this', 'None'),
('up', 'None'),
('in', 'None'),
('a', 'None'),
('table', 'None'),
('of', 'None'),
('secants', 'None'),
('we', 'None'),
('find', 'None'),
('the', 'None'),
('angles', 'None'),
('CAD', 'GEOM'),
('and', 'None'),
('CBD', 'GEOM'),
('to', 'None'),
('be', 'None'),
("72° 33'", 'COORD')],
[('So', 'None'),
('also', 'None'),
('at', 'None'),
('16°', 'ANG'),
('or', 'None'),
('17°', 'ANG'),
('Aquarius', 'None'),
('with', 'None'),
('AB', 'GEOM'),
('1000', 'NUM'),
('AC', 'GEOM'),
('is', 'None'),
('1375', 'NUM'),
('so', 'None'),
('if', 'None'),
('AD', 'GEOM'),
('1000', 'NUM'),
('AC', 'GEOM'),
('is', 'None'),
('2750', 'NUM'),
('showing', 'None'),
("68° 40'", 'COORD'),
('in', 'None'),
('the', 'None'),
('table', 'None'),
('of', 'None'),
('secants', 'None')]]

Do you have any suggestions on how I can proceed from here? Probably I should produce the same format you mentioned. Is there any way I can use Prodigy for this?

many thanks

(Ines Montani) #6

Yes, this looks good – now you can write a small function that takes your tokens and outputs them as a dictionary with a "text", "tokens" and "spans". Do you still have the original text with whitespace? Otherwise, you’ll have to reconstruct that by concatenating the token texts.

Everyone’s raw data is different, so there’s no converter that takes exactly what you have and outputs JSON. But Prodigy standardises on a pretty straightforward format, so hopefully it shouldn’t be too difficult to write a function that converts your annotations in Python or any other programming language you like.
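As a rough illustration, a conversion function could look something like the sketch below. It assumes single spaces between tokens (since the original whitespace is gone) and that the string 'None' marks unlabelled tokens; the function name is just a placeholder:

def tokens_to_task(token_label_pairs):
    # Turn one sentence of (token, label) pairs into a Prodigy task dict
    # with "text", "tokens" and "spans".
    tokens, spans = [], []
    offset = 0
    for i, (tok, label) in enumerate(token_label_pairs):
        start, end = offset, offset + len(tok)
        tokens.append({"text": tok, "start": start, "end": end, "id": i})
        if label != "None":  # the string "None" marks unlabelled tokens
            spans.append({"start": start, "end": end, "token_start": i,
                          "token_end": i, "label": label})
        offset = end + 1  # assume a single space after each token
    text = " ".join(tok for tok, _ in token_label_pairs)
    return {"text": text, "tokens": tokens, "spans": spans}

Running this over each of your sentences and writing the results to a JSONL file (e.g. with write_jsonl from prodigy.util, as in the script below) should give you something you can load straight into ner.manual.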

(Ines Montani) #8

If there’s something you can automate (for example, with regex), you definitely want to take advantage of that! The more you can automate or pre-select, the better. This saves you time and reduces the potential for human error :slightly_smiling_face:

Here’s a quick example of a conversion script in Python – I haven’t tested it, but something like this should work. You take a bunch of regular expressions, match them against all your texts, get the start and end character indices and format them as "spans" in Prodigy’s format. At the end, you can export the data to a file data.jsonl.

import re
from prodigy.util import write_jsonl

label = "LABEL"   # whatever label you want to use
texts = []  # a list of your texts
regex_patterns = [
    # your expressions – whatever you need
    re.compile(r"(?:[0-9a-fA-F]{2}[-:]){5}(?:[0-9a-fA-F]{2})")
]

examples = []
for text in texts:
    # collect the matches of *all* patterns for this text
    spans = []
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            span = {"start": start, "end": end, "label": label}
            spans.append(span)
    # one task per text, containing all matched spans
    task = {"text": text, "spans": spans}
    examples.append(task)

write_jsonl("data.jsonl", examples)