Avoid restarting from zero...

Hi,
I have noticed that ner.match starts the annotation from the beginning at every annotation session. Why? I see the same samples I have already annotated.

When I reboot my PC and start the session again with:

python3 -m prodigy ner.match email it_core_news_sm cv.jsonl --patterns patterns.jsonl

Prodigy starts from the first sample of my cv.jsonl data source.

I have around 80k sentences and I would like to continue with new sentences. Is that possible?

Thanks

By default, Prodigy makes as few assumptions about your stream as possible. But you can tell it to exclude annotations from one or more datasets via the --exclude option, e.g. --exclude email. You can exclude one or more dataset names (comma-separated list), including the current dataset. This is also very useful for evaluation data, because you want to make sure that no training examples end up in your evaluation set, and vice versa.
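
Using the dataset and file names from your command, that would be:

python3 -m prodigy ner.match email it_core_news_sm cv.jsonl --patterns patterns.jsonl --exclude email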

@ines thank you for your quick reply, but I still have the same problem.

I have created a custom recipe that is 95% the same as ner.match. Basically, this recipe truncates the text around the entity I should accept/reject. It is quite useful because, as I already told you, I am dealing with long texts. The code is:

def my_generator(stream, window):
    for _, eg in stream:
        # Clip the window to the text boundaries
        start = max(0, eg['spans'][0]['start'] - window)
        end = min(len(eg['text']), eg['spans'][0]['end'] + window)
        # Truncate the text and shift the span offsets accordingly
        eg['text'] = eg['text'][start:end]
        eg['spans'][0]['start'] -= start
        eg['spans'][0]['end'] -= start
        yield eg

@recipe('custom.ner.match',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        patterns=recipe_args['patterns'],
        source=recipe_args['source'],
        window=plac.Annotation("window", type=int),
        api=recipe_args['api'],
        loader=recipe_args['loader'],        
        exclude=recipe_args['exclude']) 
def custom_ner_match(dataset, spacy_model, patterns, source, window=100,
                     api=None, loader=None, exclude=None):                         
    """
    Suggest phrases that match a given patterns file, and mark whether they
    are examples of the entity you're interested in. The patterns file can
    include exact strings, regular expressions, or token patterns for use with
    spaCy's `Matcher` class.
    """
    log("RECIPE: Starting recipe custom.ner.match", locals())
    DB = connect()
    # Create the model, using a pre-trained spaCy model.
    model = PatternMatcher(spacy.load(spacy_model)).from_disk(patterns)
    log("RECIPE: Created PatternMatcher using model {}".format(spacy_model))
    if dataset is not None and dataset in DB:
        existing = DB.get_dataset(dataset)
        log("RECIPE: Updating PatternMatcher with {} examples from dataset {}"
            .format(len(existing), dataset))
        model.update(existing)
    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')

    return {
        'view_id': 'ner',
        'dataset': dataset,
        'stream': my_generator(model(stream), window),
        'exclude': exclude
    }

Running it via

python3 -m prodigy custom.ner.match email it_core_news_sm cv.jsonl --patterns patterns.jsonl -F /home/damiano/recipe.py --exclude email

I still see the first sentence that I have annotated (I double-checked with db-out and the annotation is already there).

However, ner.match gives me the same problem. Any workaround?
Thanks

Just to confirm: You’re using the latest version, v1.4.1, right?

Yes @ines

damiano@damiano:~$ python3 -m prodigy stats

  ✨  Prodigy stats

  Version            1.4.1              
  Total Sessions     77                 
  Prodigy Home       /home/damiano/.prodigy 
  Location           /home/damiano/.local/lib/python3.5/site-packages/prodigy 
  Platform           Linux-4.13.0-37-generic-x86_64-with-Ubuntu-16.04-xenial 
  Total Datasets     3                  
  Database Name      SQLite             
  Python Version     3.5.2              
  Database Id        sqlite

@ines I have noticed that the problem only occurs when I open the browser the first time.
Let me explain… I run this command:

python -m prodigy ner.match test it_core_news_sm cv.jsonl --patterns patterns.jsonl --exclude test

Then I see the normal Prodigy output:

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

When I open the browser I am ABLE to annotate the same (already annotated) sentences, BUT if I call http://localhost:8080 a second time I see: No tasks available.

So basically, a Ctrl-R fixes the problem for the moment :smiley:

Hope this helps.

Thanks for updating! And that’s interesting – this would indicate that some examples are not filtered out correctly in the first batch. Basically, what happens when you reload the web app is that it makes a new request to /get_questions, which asks Prodigy for a new batch of tasks.

Another thing you could try is rehashing the tasks manually in your my_generator function:

from prodigy import set_hashes

# your other code here
yield set_hashes(eg)

This will ensure that each example receives unique hashes that let Prodigy determine whether a question has been asked before. If this produces different hashes than the ones Prodigy created before, you might still see the same examples one more time – but once the hashes are in the dataset, the --exclude should work as expected.
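
Embedded in your generator, that could look like this (a minimal sketch – the truncation logic is elided and stays exactly as you have it):

from prodigy import set_hashes

def my_generator(stream, window):
    for _, eg in stream:
        # ... truncate the text and adjust the span offsets as before ...
        # Re-hash the modified task so the hashes reflect the new text
        yield set_hashes(eg)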


@ines I am looking at ner.py and I see:

return {
    'view_id': 'ner',
    'dataset': dataset,
    'stream': (eg for _, eg in model(stream)),
    'exclude': exclude
}

in ner.match, so do I also need to put set_hashes there?

No, the hashing should actually be taken care of automatically. Adding the manual set_hashes call to your custom my_generator function was just an idea to try, to see if it helps us debug the problem.

But I can’t think of a reason why the filtering wouldn’t work, since the hashes created for the same match should be identical. (You could also check this by printing each example in your stream, and comparing the _task_hash with the one you already have in your dataset.)
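
A rough sketch of that check, assuming your dataset is called email and using the same connect helper as in your recipe:

from prodigy.components.db import connect

DB = connect()
# Task hashes already stored in the dataset ('email' is an example name)
annotated = {eg['_task_hash'] for eg in DB.get_dataset('email')}

def debug_stream(stream):
    for eg in stream:
        # Print each incoming task hash and whether it was seen before
        print(eg['_task_hash'], eg['_task_hash'] in annotated)
        yield eg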

As you have seen, my recipe truncates the text, but if the hash is created after the return… yes, it looks strange. Is the hash a simple MD5 of the text?

OK, I can do a test to see if the _task_hash changes.

By default, Prodigy’s hashes are 32-bit signed integers and we use mmh3. They can take various task properties into account – like the text or image for the input hash, and the input plus the spans, options etc. for the task hash. The set_hashes docs in the PRODIGY_README.html have some more info on this.

That said, this is just the default solution – if you ever feel like you want to implement your own custom hashing solution, that’s also possible. Prodigy will respect already pre-set hashes on the data and doesn’t really care about how they’re generated. (For example, depending on your data, you might want to set the same hashes regardless of the capitalisation.)
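
For instance, a sketch of capitalisation-insensitive input hashing with mmh3 (the wrapper name and approach are just illustrative):

import mmh3

def caseless_hashes(stream):
    for eg in stream:
        # Pre-set the input hash from the lowercased text, so
        # capitalisation variants receive the same input hash
        eg['_input_hash'] = mmh3.hash(eg['text'].lower())
        yield eg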

@ines the _task_hash is the same:

damiano@damiano:~$ python3 -m prodigy custom.ner.match test it_core_news_sm cv.jsonl --patterns patterns.jsonl -F /home/damiano/recipe.py

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

{'text': 'testo di prova', '_input_hash': 397300493, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, '_task_hash': 1271078817, 'spans': [{'score': 0.7142857313156128, 'text': 'prova', 'pattern': 0, 'end': 14, 'label': 'TEST', 'start': 9, 'priority': 0.7142857313156128}]}
######################################
{'text': 'testo di er tret ert prova ert ert ert er', '_input_hash': -466006437, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, '_task_hash': 1234389887, 'spans': [{'score': 0.7142857313156128, 'text': 'prova', 'pattern': 0, 'end': 26, 'label': 'TEST', 'start': 21, 'priority': 0.7142857313156128}]}
######################################
{'text': 'testo di prova prova sdwef wefwef wef wefwe323', '_input_hash': -1827486154, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, '_task_hash': 1488445579, 'spans': [{'score': 0.7142857313156128, 'text': 'prova', 'pattern': 0, 'end': 14, 'label': 'TEST', 'start': 9, 'priority': 0.7142857313156128}]}
######################################
{'text': 'testo di prova prova sdwef wefwef wef wefwe323', '_input_hash': -1827486154, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, '_task_hash': 1572291564, 'spans': [{'score': 0.7142857313156128, 'text': 'prova', 'pattern': 0, 'end': 20, 'label': 'TEST', 'start': 15, 'priority': 0.7142857313156128}]}
######################################
{'text': 'testo di ddd d ddd d dd dprova ew ewew weprova prova wewef f 4 43', '_input_hash': 395734250, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, '_task_hash': -6107930, 'spans': [{'score': 0.7142857313156128, 'text': 'prova', 'pattern': 0, 'end': 52, 'label': 'TEST', 'start': 47, 'priority': 0.7142857313156128}]}
######################################

I have annotated those tasks, then after reloading with:

damiano@damiano:~$ python3 -m prodigy custom.ner.match test it_core_news_sm cv.jsonl --patterns patterns.jsonl -F /home/damiano/recipe.py --exclude test

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

{'_input_hash': 397300493, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, 'spans': [{'priority': 0.7142857313156128, 'end': 14, 'text': 'prova', 'score': 0.7142857313156128, 'start': 9, 'pattern': 0, 'label': 'TEST'}], '_task_hash': 1271078817, 'text': 'testo di prova'}
######################################
{'_input_hash': -466006437, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, 'spans': [{'priority': 0.7142857313156128, 'end': 26, 'text': 'prova', 'score': 0.7142857313156128, 'start': 21, 'pattern': 0, 'label': 'TEST'}], '_task_hash': 1234389887, 'text': 'testo di er tret ert prova ert ert ert er'}
######################################
{'_input_hash': -1827486154, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, 'spans': [{'priority': 0.7142857313156128, 'end': 14, 'text': 'prova', 'score': 0.7142857313156128, 'start': 9, 'pattern': 0, 'label': 'TEST'}], '_task_hash': 1488445579, 'text': 'testo di prova prova sdwef wefwef wef wefwe323'}
######################################
{'_input_hash': -1827486154, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, 'spans': [{'priority': 0.7142857313156128, 'end': 20, 'text': 'prova', 'score': 0.7142857313156128, 'start': 15, 'pattern': 0, 'label': 'TEST'}], '_task_hash': 1572291564, 'text': 'testo di prova prova sdwef wefwef wef wefwe323'}
######################################
{'_input_hash': 395734250, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, 'spans': [{'priority': 0.7142857313156128, 'end': 52, 'text': 'prova', 'score': 0.7142857313156128, 'start': 47, 'pattern': 0, 'label': 'TEST'}], '_task_hash': -6107930, 'text': 'testo di ddd d ddd d dd dprova ew ewew weprova prova wewef f 4 43'}
######################################

As you can see, the hashes are the same, so I wonder how a re-hash could help here.
After a Ctrl-R I see "No tasks available".

Should I follow this approach: Continue to annotate same data in new session?

Unfortunately, set_hashes did not fix the problem. I still see already annotated tasks.

SOLVED
@ines sorry for stressing you out so much! :slight_smile:
I have changed my generator in this way:

import mmh3

def my_generator(stream, window):
    for _, eg in stream:
        # Clip the window to the text boundaries
        start = max(0, eg['spans'][0]['start'] - window)
        end = min(len(eg['text']), eg['spans'][0]['end'] + window)
        # Truncate the text and shift the span offsets accordingly
        eg['text'] = eg['text'][start:end]
        eg['spans'][0]['start'] -= start
        eg['spans'][0]['end'] -= start
        # Force a new task hash based on the truncated text
        eg['_task_hash'] = mmh3.hash(eg['text'])
        yield eg

forcing a new _task_hash based on the text.
Hope this helps someone.

Happy Easter :slight_smile:

I’m trying to use the --exclude flag, but I don’t think it’s working how I would expect. From researching, it seems to be based on the _task_hash, which is generated from the text and other properties. Which properties is it based on? For example, I’m seeing for a specific example that the priority/score of the span are different, whereas everything else (the full text and the span text/start/end/pattern/label) is the same. Is the _task_hash based on priority and/or score? If so, how do all of these things get combined to be hashed by mmh3? If they are included, then (if I’m understanding right) perhaps I could write my own generator that sets my own _task_hash based on everything but those properties?

Thanks!

(I’m using version 1.4.1, by the way)

Yes, you can definitely customise how the hashes are generated – check out the docs for the set_hashes helper function. The input_keys and task_keys let you define the string names of the keys you want to include in the respective hashes. The input keys default to ('text', 'image', 'html', 'input') and the task keys to ('spans', 'label', 'options'). A custom generator could then look like this:

stream = (set_hashes(eg, input_keys=('text', 'custom_text')) for eg in stream)

Thanks so much! Sorry, I neglected to look at the documentation on this before. I still have a question about the task hash, though. By default it’s based on ('spans', 'label', 'options'), but the issue I’m trying to figure out is this: while I like that it’s based on the spans (so that it can differentiate the different things it might try to highlight in the same fragment), the spans have a score which seems to change, and thus changes the _task_hash unnecessarily (at least for what I’m trying to do). Here is an example to hopefully make it clearer:

{  
   "text":"This matter was originally scheduled for May 29, 2015, but it was properly noticed for this session in order for this matter to be heard.",
   "_input_hash":-603642722,
   "_task_hash":391498376,
   "spans":[  
      {  
         "start":45,
         "end":47,
         "text":"29",
         "rank":4,
         "label":"WEAPON",
         "score":0.036862158,
         "source":"en_core_web_sm",
         "input_hash":-603642722
      }
   ],
   "meta":{  
      "score":0.036862158
   },
   "answer":"reject"
}
{  
   "text":"This matter was originally scheduled for May 29, 2015, but it was properly noticed for this session in order for this matter to be heard.",
   "_input_hash":-603642722,
   "_task_hash":557712048,
   "spans":[  
      {  
         "start":45,
         "end":47,
         "text":"29",
         "rank":3,
         "label":"WEAPON",
         "score":0.1024216291,
         "source":"en_core_web_sm",
         "input_hash":-603642722
      }
   ],
   "meta":{  
      "score":0.1024216291
   },
   "answer":"reject"
}   

The input_hash is the same for both, but the task_hash is different because the rank and score were different (I assume, at least), and I’m not sure why. Ideally I’d like to not see this same sentence and span again.

@cheerfulstoic Thanks a lot for the update – this is interesting and a point we hadn’t really considered before. This might also explain some of the inconsistencies in the past.

The score is the model’s prediction, which varies depending on where you’re at in the annotation process. If the model in the loop was already updated with various annotations, it will score an entity differently than the base model that hasn’t learned anything new yet. That’s why you end up with different scores – and different hashes.

We need to think of a good solution for this. Of course, the hashing shouldn’t just ignore random attributes, because that’d make it unreliable… on the other hand, the score shouldn’t influence the hash, because that makes it a lot less useful :thinking:

In the meantime, if you’re okay with filtering on the input_hash, you could just add a wrapper around your stream that gets all input hashes from the database (db.get_input_hashes) and then filters examples with the same hash from the stream. You can also use the built-in filter prodigy.components.filters.filter_inputs, which does exactly that – see the docs for the API details.
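
For example, a minimal sketch (the dataset name is a placeholder for whatever you called yours):

from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs

db = connect()
# All input hashes already stored for the given dataset(s)
input_hashes = db.get_input_hashes('your_dataset')
# Drop incoming examples whose input hash is already in the dataset
stream = filter_inputs(stream, input_hashes)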