Best approach for using ner.manual and mark

I have a question on a use case. I am annotating text and using ner.manual to label phrases in longer texts.

My next step would be to assign a sentiment value to each of the extracted labels based on their associated text. My desired output is to have a phrase annotated with a label and a sentiment. The question is: what's the best way to do that?

My thoughts were to create a custom recipe which uses the export of ner.manual, extracts the text (can I reuse some existing components for this?) and label, and then for each label creates an input stream for the mark recipe.

Another idea would be to simply convert the JSONL export of ner.manual into a JSONL file for the mark recipe.

I would love to hear how you would approach this use case. Thanks for sharing.

Hi! I think your ideas sound very reasonable. How complex is your sentiment label scheme? Is it mostly binary, i.e. positive or negative? If so, you could set this up as a binary annotation task and use the classification interface to accept / reject a label applied to the whole text.

You probably also want to only focus on one highlighted span in context at a time, right? In that case, you could do something like this and create a new example for each highlighted span:

from prodigy.components.db import connect
from prodigy.util import set_hashes, write_jsonl
import copy

db = connect()  # use the settings from your prodigy.json
examples = db.get_dataset('your_ner_manual_dataset')

new_examples = []  # export this later

for eg in examples:
    for span in eg.get('spans', []):  # iterate over the spans
        new_eg = copy.deepcopy(eg)  # copy example for each span
        new_eg['spans'] = [span]  # create example with only one span
        # optional: add a label to the whole example
        new_eg['label'] = 'POSITIVE'
        new_eg = set_hashes(new_eg)  # set new hashes, just in case
        new_examples.append(new_eg)

# export the new examples
write_jsonl('/path/to/data.jsonl', new_examples)

Your data should now have one highlighted span per example and an added label. If you load this into the classification interface, you’ll be able to collect binary feedback on whether the label POSITIVE applies to the text with the highlighted labelled span:

prodigy mark your_sentiment_dataset /path/to/data.jsonl --view-id classification

If you want to annotate more sentiment labels than just positive / negative, you could create one example per label and make multiple passes over the data. In some cases, this can actually be more efficient than doing it all at once, because you get to focus on one concept at a time. In other cases, you might want to use the choice interface with a range of options instead. You can find example code for this in the custom recipes workflow. For your use case, you’d just have to edit the script above to add new_eg['options'] instead and then run the mark recipe with --view-id choice.
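For example, here's a minimal sketch of that change (the option IDs below are just placeholders):

# in the loop above, instead of new_eg['label'] = 'POSITIVE':
new_eg['options'] = [
    {'id': 'POSITIVE', 'text': 'Positive'},
    {'id': 'NEGATIVE', 'text': 'Negative'},
    {'id': 'NEUTRAL', 'text': 'Neutral'}
]

prodigy mark your_sentiment_dataset /path/to/data.jsonl --view-id choice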

Hi Ines,

Thanks for the thoughtful and prompt response. I’ll try your solution and will let you know how I get on.

Best,

Bob

Thanks again. Everything worked out really nicely.

Final question… :slight_smile:

I would now like to use the output in a FastText model, and I'll need to extract the words for each span and the associated span labels and create a txt file. What is the best way to do that? I was thinking of using something like underscore and then extracting the text based on the token IDs. Or is there a function in displaCy (Named Entity Visualizer) I could reuse, perhaps?

Thank you.

Yay, that’s nice to hear! :slightly_smiling_face:

What exactly do you need for the tokens? Just the text, or also the offsets? If you’ve used the ner.manual recipe to mark the spans, you’ll already have most of that info in your data. A simple example with dummy data:

{
    "text": "Hello Google Home",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Google", "start": 6, "end": 12, "id": 1},
        {"text": "Google", "start": 13, "end": 17, "id": 2},
    ],
    "spans": [
        {"start": 6, "end": 17, "label": "PRODUCT", "token_start": 1, "token_end": 2}
    ]
}

Here, the span refers to text[6:17] (“Google Home”) and includes tokens 1 to 2, i.e. tokens[1:2 + 1] (the last index is inclusive). The offsets of those two tokens are text[6:12] and text[13:17], respectively. So based on this information, you should be able to create the txt file straight from your training data.
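For instance, here's a minimal sketch of that extraction, assuming the standard FastText __label__ prefix (the dataset name and output path are placeholders):

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset('your_ner_manual_dataset')  # placeholder name

lines = []
for eg in examples:
    for span in eg.get('spans', []):
        span_text = eg['text'][span['start']:span['end']]  # character offsets
        lines.append('__label__{} {}'.format(span['label'], span_text))

with open('/path/to/fasttext_data.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(lines))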

Thanks for your comments. I understand better now.

I do have an issue, however. When I started annotating the sentiment, I found that while a new JSON object was created for each span, I could not annotate all of them. I have 2500 spans to annotate, but I can only annotate 246 of them. I think what is happening is that the sentiment for each span is applied to all spans of a review.

When I inspected the input file, I found that the input and task hashes for each span in the same review are identical. Might this cause spans with the same hashes to be treated as one? Below is the code I use to generate the JSONL and the recipe command I use.

# cmd
prodigy mark my_new_data export/sentiment_data.jsonl --view-id choice --memorize

# code for generating jsonl
import copy
import prodigy
from prodigy.components.db import connect
from prodigy.util import set_hashes, write_jsonl

db = connect()  # use the settings from your prodigy.json
examples = db.get_dataset('my_dataset')

new_examples = []  # export this later

for eg in examples:
    for span in eg.get('spans', []):  # iterate over the spans
        new_eg = copy.deepcopy(eg)  # copy example for each span
        new_eg['spans'] = [span]  # create example with only one span
        # optional: add a label to the whole example
        new_eg['options'] = [
                             {'id': 'very_positive', 'text': 'Very Positive'},
                             {'id': 'positive', 'text': 'Positive'},
                             {'id': 'neutral', 'text': 'Neutral'},
                             {'id': 'negative', 'text': 'Negative'},
                             {'id': 'very_negative', 'text': 'Very Negative'}
                             ]
        new_eg = set_hashes(new_eg)  # set new hashes, just in case
        new_examples.append(new_eg)

# export the new examples
write_jsonl('export/sentiment_data.jsonl', new_examples)

Yes, I think your analysis is correct – by default, Prodigy will filter out duplicate examples with the same task hash (which makes sense, because it expects them to be identical).

I think I found the problem – sorry, this was mostly in my code example:

new_eg = set_hashes(new_eg, overwrite=True)

By default, the set_hashes method will only add hashes if they don’t yet exist. But since we’re copying existing examples that already have hashes, we need to set overwrite=True to overwrite the existing hashes and create new ones based on the new task data.

Yes, that did it. Thanks for helping me out.

Hi Ines,

I am using a multi-label model per sentence (from StarSpace). This requires a format where each sentence and its associated labels are on one line. I thus need to figure out which span belongs to which sentence. What would you recommend for obtaining this information? Should I use split_sentences and match the tokens from the label to the sentence? Or is there a smarter way to work out in which sentence of the text the label(s) reside?

Thanks for your help.

@Bob I’m not 100% sure I understand the requirements and the exact format – do you have an example by any chance? :slightly_smiling_face:

Hi Ines,

Thx for the quick response. Here is a stylized example. I need to match the labels with the sentences. So in the example below GEN_EVALUATION and PROD_PICTURE_QUALITY belong to sentence 1 and PROD_BUILD_QUALITY and PROD_VIDEO belong to sentence 2.

[{GEN_EVALUATION, PROD_PICTURE_QUALITY, The Fuji Finepix 4700 is a great camera for its age, the picture quality is very good at the high pix setting},
 {PROD_BUILD_QUALITY, PROD_VIDEO, The camera is well built ,the video quality is just good, there are better video cameras out there}]

Hope this makes some sense,

B.

{"_input_hash":-839706930,"spans":[{"token_start":0,"end":51,"label":"GEN_EVALUATION","start":0,"token_end":10},{"token_start":11,"end":109,"label":"PROD_PICTURE_QUALITY","start":51,"token_end":22},{"token_start":24,"end":134,"label":"PROD_BUILD_QUALITY","start":110,"token_end":28},{"token_start":30,"end":208,"label":"PROD_VIDEO","start":136,"token_end":43},{"token_start":45,"end":292,"label":"PROD_BUILD_QUALITY","start":209,"token_end":62},{"token_start":63,"end":465,"label":"PROD_MEMORY_CARD","start":292,"token_end":96},{"token_start":97,"end":503,"label":"GEN_FEATURES","start":465,"token_end":106}],"text":"The Fuji Finepix 4700 is a great camera for its age, the picture quality is very good at the high pix setting.The camera is well built ,the video quality is just good, there are better video cameras out there.This camera is built to last and if taken care of will last for several more years.The only [[ASIN:B00004TH2X Fujifilm FinePix 4700 2.4MP  Digital Camera w/ 3x Optical Zoom]]problem is trying to find another smartmedia card,even best Buy doesn't have them.All in all this is a very good camera.","_task_hash":-1970613402}

Ahh okay, I think I understand. Your idea makes sense, or you could also do it in spaCy directly, using the same model you’ve used for tokenization when you annotated. One of the nice things about spaCy’s data structures like the Doc, Span and Token is that you never lose any information, and can always map individual tokens and spans back to their positions in other spans (like sentences).

So you can process the text with spaCy and get all spans via their token_start and token_end – for example, the first span would be doc[0 : 10 + 1] (the token_end index is inclusive, just like above). You can then match this up with the spans in doc.sents. Each sentence exposes a .start and .end index, which is the start and end of the sentence span in the original document. So by comparing span.start / span.end and sent.start / sent.end, you can find out whether a sentence contains a span.
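Here's a minimal sketch of that comparison, assuming eg is one annotated example from your dataset and en_core_web_sm is the model you used for tokenization:

import spacy

nlp = spacy.load('en_core_web_sm')  # same model used for tokenization
doc = nlp(eg['text'])               # eg is one annotated example

for span in eg.get('spans', []):
    for sent in doc.sents:
        # the sentence contains the span if its token boundaries enclose it
        # (the span's token_end is inclusive, sent.end is exclusive)
        if sent.start <= span['token_start'] and span['token_end'] < sent.end:
            print(span['label'], '->', sent.text)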

The only part where this gets tricky is if your annotations span across sentence boundaries – but it sounds like this doesn’t usually happen, because this would mess with the whole concept of your annotation scheme, right?

Thanks for the quick response. Great that no info is lost and that the sentences in doc.sents contain start and end attributes. Matching these should work fine. Thanks, these responses save me a lot of time!

Hi Ines,

I managed to include the sentence for each label using the code below (any tips/tricks on how to do this better are always welcome). I am having trouble reconstructing the text (which is part of the sentence) for each label from the tokens.

How do I extract and paste the tokens into a string from a span object?

For example I have

spans: [
{
"label": "GEN_EVALUATION",
"sentence": "The Fuji Finepix 4700 is a great camera for its age, the picture quality is very good at the high pix setting.",
"sentence_token_end": 24,
"token_end": 10,
"sentence_token_start": 0,
"start": 0,
"end": 51,
"token_start": 0
}
]

And I want to add an element to this object which contains the phrase (start = 0, end = 51, i.e. “The Fuji Finepix 4700 is a great camera for its age,”) that relates to the label GEN_EVALUATION.

Thanks

Bob


Getting sentences

import copy
import spacy
from prodigy.components.db import connect

db = connect()  # use the settings from your prodigy.json
examples = db.get_dataset('get_data_set')
nlp = spacy.load('en_core_web_sm')

new_examples = []  # export this later

# iterate over data points
for eg in examples:
    new_eg = copy.deepcopy(eg)  # make a deep copy
    doc = nlp(eg['text'])       # parse original text
    sentence_arr = []

    # iterate over sentences
    for sent in doc.sents:
        sentence_struct = {'sentence_token_start': sent.start,
                           'sentence_token_end': sent.end,
                           'sentence_text': sent.text}
        sentence_arr.append(sentence_struct)  # collect sentence info

    # assign sentences to the example
    new_eg['sentence'] = sentence_arr

    # iterate over spans and assign the containing sentence to each span
    spans = copy.deepcopy(eg.get('spans', []))
    for span in spans:
        # a sentence contains the span if its token boundaries enclose it
        # (the span's token_end is inclusive, sent.end is exclusive)
        sentence = next((x for x in sentence_arr
                         if x['sentence_token_start'] <= span['token_start']
                         and span['token_end'] < x['sentence_token_end']), None)
        if sentence is not None:
            span['sentence'] = sentence['sentence_text']
            span['sentence_token_start'] = sentence['sentence_token_start']
            span['sentence_token_end'] = sentence['sentence_token_end']
    new_eg['spans'] = spans

    # append new object
    new_examples.append(new_eg)

I have a project where I need to extract cyber entities such as malware names and threat actors. I was thinking of using an NER model to extract a very generic entity with a CYBER label, and then using a classifier to classify each entity. One sentence can contain both types of entities. My question is: how does the classification work? Does it classify the whole sentence or each extracted entity?

I think your plan sounds good: keeping the NER labels a bit more generic, and using the text classification to do the labelling, is often a good approach, especially if you want one span to have multiple labels.

To answer your question: yes, the text classifier will be looking at the whole sentence, not just the span.

When I use the mark recipe, it shows all labels, whether I accepted or rejected them. How can I filter out the rejected choices, so the classification only shows labels that I accepted before?

I'm not sure I understand your question, sorry! If you want to filter out accepted or rejected annotations after collecting the data, you can use db-out to export the data as JSONL (or load examples from the database in a Python script), and then filter all examples that have "answer": "accept", "answer": "reject" or "answer": "ignore", depending on what you need.
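For example, a minimal sketch in Python (the dataset name is a placeholder):

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset('your_dataset')  # placeholder name

# keep only the annotations you accepted
accepted = [eg for eg in examples if eg.get('answer') == 'accept']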

I classified 1800 sentences and got Accuracy: 0.87 and F-score: 0.92. Then I loaded the model into spaCy and tested it on a couple of sentences, using the doc.cats attribute to get a score for each category. For most of the test sentences I get {'x_class': 0.9993245601654053, 'y_class': 0.9999545812606812, 'z_class': 0.9999545812606812}.
All of the scores are very close. Am I missing a step here? I used prodigy textcat.batch-train for my multi-label classification.

It looks like your model has learned "all classes always apply". Do you have enough negative examples in your training and evaluation data? An easy way to check this is to write a function that takes the data from your dataset and counts up the occurrences of "answer": "accept" / "answer": "reject" for each "label".
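For example, something like this minimal sketch (the dataset name is a placeholder):

from collections import Counter
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset('your_textcat_dataset')  # placeholder name

counts = Counter()
for eg in examples:
    counts[(eg.get('label'), eg.get('answer'))] += 1

for (label, answer), count in counts.most_common():
    print(label, answer, count)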