prodigy use case for annotation having pre-annotated text

(Luca) #1

I’m struggling to understand how Prodigy fits into the picture here. Let’s say I want to use it as an annotation tool, but for pre-annotated text.

In the documentation I saw that it’s possible to 1) annotate your text and 2) import your annotations, but I’m missing some pieces here.

Let’s take an example. I have the following text:

{'text':'This is a text talking about the strange behaviour of superconductor material Br2A2, which has Tc (Critical Temperature) = 23K'}

And the following annotations

{'text': 'Br2A2', 'label': 'supercon', 'spans': [{'start': 123, 'end': 456, 'label': 'supercon'}], 'meta': {'source': 'Semi-automatic generation supercon'}}
{'text': 'Tc (Critical Temperature)', 'label': 'abbreviation', 'spans': [{'start': 1000, 'end': 1020, 'label': 'abbreviation'}], 'meta': {'source': 'Semi-automatic generation abbreviation'}}

(FYI the offsets are just random)

Let’s suppose we have multiple sets of texts with their corresponding annotations.

How can this be loaded into Prodigy so that a user can correct it?

Here are my questions:

  1. The mark recipe cannot work because I also need to load annotations.
  2. The part explaining how to load annotations is not clear, because it assumes there is a spaCy model we can use (https://prodi.gy/docs/workflow-first-steps#load-data). This is not my case.
  3. On the same page, https://prodi.gy/docs/workflow-first-steps#import-annotations: if I have multiple sets of annotations (for example, coming from different files), how can I relate each text to its respective batch of annotations? Should I use the ‘meta’ / ‘source’? If so, how? With a recipe?
  4. I’d like to create a recipe and add it to the ‘registered’ ones, but again, this doesn’t seem to be the way it has been designed, so I can only use a recipe for annotation, correct?

Thank you very much
Luca


(Ines Montani) #2

I think the main issues around loading your data should be resolved by fixing your JSON, as described here:

https://support.prodi.gy/t/loading-annotations/1272/3

The example commands in this section show two of the most commonly used built-in recipes: ner.teach and textcat.teach. Both of those use a spaCy model in the loop because that’s part of that particular annotation workflow. None of this has anything to do with loading data.

Loading data into Prodigy should be as straightforward as passing the path to the JSONL file to a given recipe as the source argument. For example, you can use the mark recipe if all you want to do is collect feedback on every example that comes in:

python -m prodigy mark your_dataset /path/to/file.jsonl

This will load the examples from file.jsonl, ask you to accept/reject each example and save the resulting annotations to the dataset your_dataset.
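
If the file does not load, it can help to check that every line is valid JSON first. The following is only a minimal sketch, not part of Prodigy: `normalize_line` is a hypothetical helper that parses one record and, as an assumption about what went wrong, falls back to Python literal syntax (single-quoted dicts) before re-serialising the record as proper JSON.

```python
import ast
import json


def normalize_line(line):
    """Return one record as a valid JSON string, or raise ValueError."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax, e.g. single-quoted dicts.
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError) as err:
            raise ValueError(f"Neither JSON nor a Python literal: {line!r}") from err
    if not isinstance(record, dict) or "text" not in record:
        raise ValueError(f"Record needs a 'text' key: {line!r}")
    return json.dumps(record, ensure_ascii=False)


# Example: a single-quoted record that json.loads() would reject.
raw = "{'text': 'Br2A2 has Tc = 23K', 'meta': {'source': 'supercon'}}"
fixed = normalize_line(raw)
print(fixed)
```

Running each line of a file through a helper like this and writing the results out, one per line, gives a JSONL file that can be passed to the recipe as the source.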

If you want to manually correct the examples (e.g. delete or add spans), you can use the ner.manual recipe, which also respects pre-defined spans in your data. Note that in this case, you do need a spaCy model for tokenization only to allow token-based highlighting.

python -m prodigy ner.manual your_dataset en_core_web_sm /path/to/file.jsonl --label supercon

You can find more details on this in the docs, specifically in the PRODIGY_README.html, available for download with Prodigy.

Yes, using the meta is a good option. You can do this when you create your JSONL files – how you do that is up to you. You can do it by hand, write a Python script, or use any other programming language or tool you’re comfortable with. In the end, you should have a file containing your data.
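
As a rough illustration of such a script, here is a sketch in Python. Everything in it is an assumption made for the example: `build_task` is a hypothetical helper, the annotations are assumed to arrive as (text, label, source) records, and the character offsets are computed by looking up the first occurrence of the annotated string, which will not be right if the same string occurs more than once.

```python
import json


def build_task(text, annotations):
    """Merge one text with annotations from several sources into one task."""
    spans = []
    sources = set()
    for ann in annotations:
        start = text.find(ann["text"])  # first occurrence only
        if start == -1:
            continue  # annotated string not present in this text
        spans.append({
            "start": start,
            "end": start + len(ann["text"]),
            "label": ann["label"],
        })
        sources.add(ann["source"])
    spans.sort(key=lambda span: span["start"])
    # Record where the spans came from in the task's meta.
    return {"text": text, "spans": spans, "meta": {"source": ", ".join(sorted(sources))}}


text = ("This is a text talking about the strange behaviour of superconductor "
        "material Br2A2, which has Tc (Critical Temperature) = 23K")
annotations = [
    {"text": "Br2A2", "label": "supercon",
     "source": "Semi-automatic generation supercon"},
    {"text": "Tc (Critical Temperature)", "label": "abbreviation",
     "source": "Semi-automatic generation abbreviation"},
]
task = build_task(text, annotations)
print(json.dumps(task))  # one line of the JSONL file
```

Writing one such line per text produces a single file in which each text carries all of its spans, whichever file they originally came from.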

I’m not 100% sure I understand your question. Recipes in Prodigy are Python scripts that let you set up annotation workflows. They can be executed from the command line, and if they return a dictionary of components, they start the annotation server. If you’ve written your own recipe, you can use it in Prodigy just like the other recipes – you just need to set -F path/to/recipe.py at the end of the command to point Prodigy to the file with your code, so it can load it.


(Luca) #3

Thanks again @ines for the quick responses. I’ll try to answer each point.

Yes, that was the other issue. Now I finally understand (it was in the README provided with the licence, not in the online docs, unless I missed it) that I could load the data in the following format:

{"text":"This is a text talking about the strange behavior of superconductor material Br2A2, which has Tc (Critical Temperature) = 23K", "spans": [{"start": 1000, "end": 1020, "label": "abbreviation"}, {"start": 123, "end": 456, "label": "supercon"}]}

and then using the command

prodigy ner.manual your_dataset en_core_web_sm /path/to/file.jsonl --label label1,label2

This part is confusing because it is the only part of the documentation that mentions how to load the data.

OK, so far so good.
In my case I want to use my own tokenisation, and I’ve noticed that the data can be provided already tokenised using the tokens property. :+1: In that case I should define my own recipe and handle the data there, right?

OK, I thought I could register the recipe like the built-in ones. Never mind.


(Ines Montani) #4

Glad it’s working now! (And it really sounded like you did all the complicated stuff correctly and in the end, it was really just the quotes. I can definitely relate, stuff like that happens to me all the time :sweat_smile:)

Yes, exactly! You can find an example of the tokens format in the readme. You can do the data transformation in a recipe, but you don’t have to – you can also add the "tokens" in a separate process, save the result out as JSONL and then load that into the ner.manual recipe.
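
A separate pre-processing step could look roughly like the sketch below. It rests on assumptions that should be checked against the readme: that each token is a dict with "text", "start", "end" and "id", and that spans additionally carry "token_start" and "token_end" as inclusive token indices. The regex tokenizer and the `add_tokens` helper are placeholders for your own tokenisation.

```python
import json
import re

# Placeholder tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def add_tokens(task):
    """Return a copy of the task with "tokens" added and spans aligned to them."""
    tokens = [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(TOKEN_RE.finditer(task["text"]))
    ]
    start_to_id = {tok["start"]: tok["id"] for tok in tokens}
    end_to_id = {tok["end"]: tok["id"] for tok in tokens}
    spans = []
    for span in task.get("spans", []):
        if span["start"] not in start_to_id or span["end"] not in end_to_id:
            raise ValueError(f"Span does not align with token boundaries: {span}")
        spans.append({
            **span,
            "token_start": start_to_id[span["start"]],
            "token_end": end_to_id[span["end"]],  # assumed inclusive
        })
    return {**task, "tokens": tokens, "spans": spans}


task = {
    "text": "Material Br2A2 has Tc = 23K",
    "spans": [{"start": 9, "end": 14, "label": "supercon"}],
}
tokenized = add_tokens(task)
print(json.dumps(tokenized))  # one line of the JSONL file
```

Raising on misaligned spans, rather than dropping them, makes it obvious when the tokenisation and the pre-annotations disagree.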

It is possible using Python entry points – see the “Entry points” section in the docs for details. But it might be a little overkill during development, because it means you have to wrap your recipes in a Python package. However, once you have a set of custom recipes that you like to use, it’s pretty convenient because the recipes are registered automatically if your package is installed in the same environment as Prodigy.


(Luca) #5

Thanks @ines,

I’ve got a few more questions on the same subject. I’m trying to define a “custom” workflow: let’s say I’d like to load pre-annotated data into the database (I have the correct file format) and then, using the generated session ID, correct it.

  1. I’ve noticed that after the import I cannot see the tasks. Is that due to the ‘accepted’ flag?
  2. Sometimes the annotation guidelines are updated, so it would be nice to be able to go back to previously annotated text and review it. Any idea whether this would be possible with a custom recipe?
  3. Any suggestions on how to calculate the progress in advance for the tasks loaded in the current session?

Thank you
Luca


(Ines Montani) #6

What do you mean by “cannot see the tasks”?

Yes, and you probably don’t even need a custom recipe for that. The format of the annotations you export is the same as the input format. So you can export an already annotated dataset and load the resulting file back in to correct it. For example:

prodigy db-out ner_dataset > ner_dataset.jsonl
prodigy ner.manual new_ner_dataset en_core_web_sm ner_dataset.jsonl --label SOME_LABEL

I’m not 100% sure I understand the question. Do you have an example? Do you mean, display the progress on the current input data in the sidebar? In that case, you might want to check out the progress recipe setting in the custom recipe docs. It lets you pass in a custom function to calculate the progress that’s displayed in the UI.


(Luca) #7

For reference, here is my recipe: https://gist.github.com/lfoppiano/f052de094f5920136f511cf67a1e0d08

Here is what I do:

(pyscript) Lucas-MacBook-Pro:prodigy-preprocessing lfoppiano$ prodigy dataset  supercon1

  ✨  Successfully added 'supercon1' to database SQLite.

(pyscript) Lucas-MacBook-Pro:prodigy-preprocessing lfoppiano$ prodigy db-in supercon1 output.jsonl

  ✨  Imported 24 annotations for 'supercon1' to database SQLite
  Added 'accept' answer to 24 annotations
  Session ID: 2019-03-07_08-11-31

(pyscript) Lucas-MacBook-Pro:prodigy-preprocessing lfoppiano$ prodigy superconductor-material-recipe supercon1 output.jsonl  -F superconductor-material-recipe.py --label supercon,abbreviation

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

(pyscript) Lucas-MacBook-Pro:prodigy-preprocessing lfoppiano$ 

and then I get “no tasks available”.

I’m passing the recipe the same file I imported with db-in, so I thought that because these examples were flagged as “accepted” they didn’t show up in the annotation UI.

OK, so the workflow implies that the source is taken from an input file. Can I not load the annotations from the SQLite database using a session ID?

Yes, I’ve tried that, but I might have misunderstood what “total” means: is it the total number of tasks, or the total number of tasks annotated? Right now it starts from 0.


The example given was

    def progress(session, total):
        return total / 10000

which I copied, but then I wanted to show the progress as total tasks done / total number of tasks (todo + done).

I’m not sure what total means there.


(Ines Montani) #8

Ah yes, that’s correct – the answer doesn’t even matter. By default, Prodigy will exclude examples that were already answered in the current dataset. This usually makes sense, because you only want to see a question once.
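
To illustrate the idea only: the sketch below is not Prodigy's actual implementation. The `task_key` and `filter_answered` helpers are hypothetical, and hashing the text and spans is just an assumption about what makes two examples "the same question". It shows why importing a file into the dataset and then streaming in that same file leaves nothing to annotate.

```python
import hashlib
import json


def task_key(task):
    """Identify a task by its text and spans, ignoring the answer and meta."""
    payload = json.dumps(
        {"text": task["text"], "spans": task.get("spans", [])},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def filter_answered(stream, answered):
    """Yield only the tasks that are not already present in the dataset."""
    seen = {task_key(task) for task in answered}
    for task in stream:
        if task_key(task) not in seen:
            yield task


# The dataset already holds the imported example, whatever its answer was.
imported = [{"text": "Br2A2 has Tc = 23K", "answer": "accept"}]
incoming = [{"text": "Br2A2 has Tc = 23K"}, {"text": "A second sentence"}]
remaining = list(filter_answered(incoming, imported))
print(remaining)
```

Note that the "answer" value plays no part in the comparison, which matches the behaviour described above.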

Is there a specific reason you’ve imported the examples to supercon1 before annotating? The dataset is used to store the annotations that are streamed in from the file – so you usually don’t need to import anything beforehand.

By default, yes. You can load annotations from a dataset in Python, too, but I’m not 100% sure if this is really more convenient in your case?

But just for completeness, here’s how you would do it:

from prodigy.components.db import connect

db = connect()  # uses settings from your prodigy.json
examples = db.get_dataset("ner_dataset")

total is the total number of annotations saved in the current dataset. For example, let’s say you start an annotation session and label 100 examples. Then you save, quit and the next day you start annotating again with the same dataset. The total will then be at 100 (because the set already has 100 annotations) and “This session” will be 0.

So if your input file has 5000 examples that you want to annotate, the progress would be total / 5000.
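
A sketch of how that could be wired up, assuming the progress function takes the (session, total) arguments shown in the example earlier in this thread. `count_examples` and `make_progress` are hypothetical helper names, and the callback signature may differ between Prodigy versions, so check the custom recipe docs.

```python
def count_examples(path):
    """Count the non-empty lines (examples) in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def make_progress(n_examples):
    """Build a progress function for an input of n_examples tasks."""
    def progress(session, total):
        # total: annotations saved in the dataset so far (see above)
        if n_examples == 0:
            return 0.0
        return min(total / n_examples, 1.0)
    return progress


# In a recipe, n_examples would come from count_examples("/path/to/file.jsonl").
progress = make_progress(5000)
print(progress(0, 100))
```

Capping the value at 1.0 keeps the bar from overflowing if the dataset already contains more annotations than the input file has examples.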


(Luca) #9

Thanks @ines. I’ll try that, and if I have more questions I’ll come back to you.

Cheers
Luca