Avoid restarting from zero...

Hi,
I have noticed that ner.match starts the annotation from the beginning at every annotation session. Why? I see the same samples I have already annotated.

When I reboot my PC and start the session again with:

python3 -m prodigy ner.match email it_core_news_sm cv.jsonl --patterns patterns.jsonl

Prodigy starts from the first sample of my cv.jsonl data source.

I have around 80k sentences and I would like to continue with new sentences. Is that possible?

Thanks

By default, Prodigy makes as few assumptions about your stream as possible. But you can tell it to exclude annotations from one or more datasets via the --exclude option, e.g. --exclude email. You can exclude one or more dataset names (comma-separated list), including the current dataset. This is also very useful for evaluation data, because you want to make sure that no training examples end up in your evaluation set, and vice versa.
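
Using the dataset and file names from your command, that would be:

python3 -m prodigy ner.match email it_core_news_sm cv.jsonl --patterns patterns.jsonl --exclude email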

@ines thank you for your quick reply, but I still have the same problem.

I have created a custom recipe that is 95% the same as ner.match. Basically, this recipe truncates the text around the entity I should accept/reject. It is quite useful because, as I already told you, I am dealing with long texts. The code is:

def my_generator(stream, window):
    for _, eg in stream:
        # Clip the window to the text boundaries
        start = max(0, eg['spans'][0]['start'] - window)
        end = min(len(eg['text']), eg['spans'][0]['end'] + window)
        # Truncate the text and shift the span offsets accordingly
        eg['text'] = eg['text'][start:end]
        eg['spans'][0]['start'] -= start
        eg['spans'][0]['end'] -= start
        yield eg

@recipe('custom.ner.match',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        patterns=recipe_args['patterns'],
        source=recipe_args['source'],
        window=plac.Annotation("window", type=int),
        api=recipe_args['api'],
        loader=recipe_args['loader'],        
        exclude=recipe_args['exclude']) 
def custom_ner_match(dataset, spacy_model, patterns, source, window=100,
                     api=None, loader=None, exclude=None):                         
    """
    Suggest phrases that match a given patterns file, and mark whether they
    are examples of the entity you're interested in. The patterns file can
    include exact strings, regular expressions, or token patterns for use with
    spaCy's `Matcher` class.
    """
    log("RECIPE: Starting recipe custom.ner.match", locals())
    DB = connect()
    # Create the model, using a pre-trained spaCy model.
    model = PatternMatcher(spacy.load(spacy_model)).from_disk(patterns)
    log("RECIPE: Created PatternMatcher using model {}".format(spacy_model))
    if dataset is not None and dataset in DB:
        existing = DB.get_dataset(dataset)
        log("RECIPE: Updating PatternMatcher with {} examples from dataset {}"
            .format(len(existing), dataset))
        model.update(existing)
    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')

    return {
        'view_id': 'ner',
        'dataset': dataset,
        'stream': my_generator(model(stream), window),
        'exclude': exclude
    }

Running it via

python3 -m prodigy custom.ner.match email it_core_news_sm cv.jsonl --patterns patterns.jsonl -F /home/damiano/recipe.py --exclude email

I still see the first sentence that I have annotated (I double-checked with db-out and the annotation is already there).

However, ner.match gives me the same problem. Any workaround?
Thanks

Just to confirm: You’re using the latest version, v1.4.1, right?

Yes @ines

damiano@damiano:~$ python3 -m prodigy stats

  ✨  Prodigy stats

  Version            1.4.1              
  Total Sessions     77                 
  Prodigy Home       /home/damiano/.prodigy 
  Location           /home/damiano/.local/lib/python3.5/site-packages/prodigy 
  Platform           Linux-4.13.0-37-generic-x86_64-with-Ubuntu-16.04-xenial 
  Total Datasets     3                  
  Database Name      SQLite             
  Python Version     3.5.2              
  Database Id        sqlite

@ines I have noticed that the problem only occurs when I open the browser the first time.
Let me explain… I run this command:

python -m prodigy ner.match test it_core_news_sm cv.jsonl --patterns patterns.jsonl --exclude test

Then I see the normal Prodigy output:

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

When I open the browser I am ABLE to annotate the same (already annotated) sentences, BUT if I call http://localhost:8080 a second time I see: No tasks available.

So basically, a Ctrl-R fixes the problem for the moment :smiley:

Hope this helps.

Thanks for updating! And that’s interesting – this would indicate that some examples are not filtered out correctly in the first batch. Basically, what happens when you reload the web app is that it makes a new request to /get_questions, which asks Prodigy for a new batch of tasks.

Another thing you could try is rehashing the tasks manually in your my_generator function:

from prodigy import set_hashes

# your other code here
yield set_hashes(eg)

This will ensure that each example receives unique hashes that let Prodigy determine whether a question has been asked before. If this produces different hashes than the ones Prodigy created before, you might still see the same examples one more time – but once the hashes are in the dataset, the --exclude should work as expected.
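
Embedded in your generator, that could look like this (a minimal sketch – the truncation logic is elided and stays exactly as you have it):

from prodigy import set_hashes

def my_generator(stream, window):
    for _, eg in stream:
        # ... truncate the text and adjust the span offsets as before ...
        # Re-hash the modified task so the hashes reflect the new text
        yield set_hashes(eg)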


@ines I am looking at ner.py and I see:

return {
    'view_id': 'ner',
    'dataset': dataset,
    'stream': (eg for _, eg in model(stream)),
    'exclude': exclude
}

in ner.match, so do I also need to put set_hashes there?

No, the hashing should actually be taken care of automatically. Adding the manual set_hashes call to your custom my_generator function was just an idea to try, to see if it helps us debug the problem.

But I can’t think of a reason why the filtering wouldn’t work, since the hashes created for the same match should be identical. (You could also check this by printing each example in your stream, and comparing the _task_hash with the one you already have in your dataset.)
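
A rough sketch of that check, assuming your dataset is called email and using the same connect helper as in your recipe:

from prodigy.components.db import connect

DB = connect()
# Task hashes already stored in the dataset ('email' is an example name)
annotated = {eg['_task_hash'] for eg in DB.get_dataset('email')}

def debug_stream(stream):
    for eg in stream:
        # Print each incoming task hash and whether it was seen before
        print(eg['_task_hash'], eg['_task_hash'] in annotated)
        yield eg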

As you have seen, my recipe truncates the text, but if the hash is created after the return… yes, it looks strange. Is the hash a simple MD5 of the text?

OK, I can do a test to see if the _task_hash changes.

By default, Prodigy’s hashes are 32-bit signed integers and we use mmh3. They can take various task properties into account – like the text or image for the input hash, and the input plus the spans, options etc. for the task hash. The set_hashes docs in the PRODIGY_README.html have some more info on this.

That said, this is just the default solution – if you ever feel like you want to implement your own custom hashing solution, that’s also possible. Prodigy will respect already pre-set hashes on the data and doesn’t really care about how they’re generated. (For example, depending on your data, you might want to set the same hashes regardless of the capitalisation.)
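
For instance, a sketch of capitalisation-insensitive input hashing with mmh3 (the wrapper name and approach are just illustrative):

import mmh3

def caseless_hashes(stream):
    for eg in stream:
        # Pre-set the input hash from the lowercased text, so
        # capitalisation variants receive the same input hash
        eg['_input_hash'] = mmh3.hash(eg['text'].lower())
        yield eg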

@ines the _task_hash is the same:

damiano@damiano:~$ python3 -m prodigy custom.ner.match test it_core_news_sm cv.jsonl --patterns patterns.jsonl -F /home/damiano/recipe.py

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

{'text': 'testo di prova', '_input_hash': 397300493, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, '_task_hash': 1271078817, 'spans': [{'score': 0.7142857313156128, 'text': 'prova', 'pattern': 0, 'end': 14, 'label': 'TEST', 'start': 9, 'priority': 0.7142857313156128}]}
######################################
{'text': 'testo di er tret ert prova ert ert ert er', '_input_hash': -466006437, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, '_task_hash': 1234389887, 'spans': [{'score': 0.7142857313156128, 'text': 'prova', 'pattern': 0, 'end': 26, 'label': 'TEST', 'start': 21, 'priority': 0.7142857313156128}]}
######################################
{'text': 'testo di prova prova sdwef wefwef wef wefwe323', '_input_hash': -1827486154, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, '_task_hash': 1488445579, 'spans': [{'score': 0.7142857313156128, 'text': 'prova', 'pattern': 0, 'end': 14, 'label': 'TEST', 'start': 9, 'priority': 0.7142857313156128}]}
######################################
{'text': 'testo di prova prova sdwef wefwef wef wefwe323', '_input_hash': -1827486154, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, '_task_hash': 1572291564, 'spans': [{'score': 0.7142857313156128, 'text': 'prova', 'pattern': 0, 'end': 20, 'label': 'TEST', 'start': 15, 'priority': 0.7142857313156128}]}
######################################
{'text': 'testo di ddd d ddd d dd dprova ew ewew weprova prova wewef f 4 43', '_input_hash': 395734250, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, '_task_hash': -6107930, 'spans': [{'score': 0.7142857313156128, 'text': 'prova', 'pattern': 0, 'end': 52, 'label': 'TEST', 'start': 47, 'priority': 0.7142857313156128}]}
######################################

I have annotated those tasks, then after reloading with:

damiano@damiano:~$ python3 -m prodigy custom.ner.match test it_core_news_sm cv.jsonl --patterns patterns.jsonl -F /home/damiano/recipe.py --exclude test

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

{'_input_hash': 397300493, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, 'spans': [{'priority': 0.7142857313156128, 'end': 14, 'text': 'prova', 'score': 0.7142857313156128, 'start': 9, 'pattern': 0, 'label': 'TEST'}], '_task_hash': 1271078817, 'text': 'testo di prova'}
######################################
{'_input_hash': -466006437, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, 'spans': [{'priority': 0.7142857313156128, 'end': 26, 'text': 'prova', 'score': 0.7142857313156128, 'start': 21, 'pattern': 0, 'label': 'TEST'}], '_task_hash': 1234389887, 'text': 'testo di er tret ert prova ert ert ert er'}
######################################
{'_input_hash': -1827486154, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, 'spans': [{'priority': 0.7142857313156128, 'end': 14, 'text': 'prova', 'score': 0.7142857313156128, 'start': 9, 'pattern': 0, 'label': 'TEST'}], '_task_hash': 1488445579, 'text': 'testo di prova prova sdwef wefwef wef wefwe323'}
######################################
{'_input_hash': -1827486154, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, 'spans': [{'priority': 0.7142857313156128, 'end': 20, 'text': 'prova', 'score': 0.7142857313156128, 'start': 15, 'pattern': 0, 'label': 'TEST'}], '_task_hash': 1572291564, 'text': 'testo di prova prova sdwef wefwef wef wefwe323'}
######################################
{'_input_hash': 395734250, 'meta': {'score': 0.7142857313156128, 'pattern': 0, 'source': 'Tuoagente'}, 'spans': [{'priority': 0.7142857313156128, 'end': 52, 'text': 'prova', 'score': 0.7142857313156128, 'start': 47, 'pattern': 0, 'label': 'TEST'}], '_task_hash': -6107930, 'text': 'testo di ddd d ddd d dd dprova ew ewew weprova prova wewef f 4 43'}
######################################

As you can see, the hashes are the same, so I wonder how a re-hash could help here.
After a Ctrl-R I see "No tasks available".

Should I follow this approach: Continue to annotate same data in new session?

Unfortunately, set_hashes did not fix the problem. I still see already annotated tasks.

SOLVED
@ines sorry for stressing you out so much! :slight_smile:
I have changed my generator in this way:

import mmh3

def my_generator(stream, window):
    for _, eg in stream:
        # Clip the window to the text boundaries
        start = max(0, eg['spans'][0]['start'] - window)
        end = min(len(eg['text']), eg['spans'][0]['end'] + window)
        # Truncate the text and shift the span offsets accordingly
        eg['text'] = eg['text'][start:end]
        eg['spans'][0]['start'] -= start
        eg['spans'][0]['end'] -= start
        # Force a new task hash based on the truncated text
        eg['_task_hash'] = mmh3.hash(eg['text'])
        yield eg

forcing a new _task_hash based on the text.
Hope this helps someone.

Happy Easter :slight_smile:

I’m trying to use the --exclude flag, but I don’t think it’s working how I would expect. From researching, it seems to be based on the _task_hash, which is generated from the text and other properties. Which properties is it based on? For example, I’m seeing for a specific example that the priority/score of the span are different, whereas everything else (the full text and the span text/start/end/pattern/label) is the same. Is the _task_hash based on priority and/or score? If so, how do all of these things get combined to be hashed by mmh3? If they are included, then (if I’m understanding right) perhaps I could write my own generator that sets my own _task_hash based on everything but those properties?

Thanks!

(I’m using version 1.4.1, by the way)

Yes, you can definitely customise how the hashes are generated – check out the docs for the set_hashes helper function. The input_keys and task_keys let you define the string names of the keys you want to include in the respective hashes. The input keys default to ('text', 'image', 'html', 'input') and the task keys to ('spans', 'label', 'options'). A custom generator could then look like this:

stream = (set_hashes(eg, input_keys=('text', 'custom_text')) for eg in stream)

Thanks so much! Sorry, I neglected to look at the documentation on this before. I still have a question about the task hash, though. By default it’s based on ('spans', 'label', 'options'), but the issue I’m trying to figure out is this: while I like that it’s based on the spans (so that it can differentiate the different things it might try to highlight in the same fragment), the spans have a score which seems to change, and thus changes the _task_hash unnecessarily (at least for what I’m trying to do). Here is an example to hopefully make it clearer:

{  
   "text":"This matter was originally scheduled for May 29, 2015, but it was properly noticed for this session in order for this matter to be heard.",
   "_input_hash":-603642722,
   "_task_hash":391498376,
   "spans":[  
      {  
         "start":45,
         "end":47,
         "text":"29",
         "rank":4,
         "label":"WEAPON",
         "score":0.036862158,
         "source":"en_core_web_sm",
         "input_hash":-603642722
      }
   ],
   "meta":{  
      "score":0.036862158
   },
   "answer":"reject"
}
{  
   "text":"This matter was originally scheduled for May 29, 2015, but it was properly noticed for this session in order for this matter to be heard.",
   "_input_hash":-603642722,
   "_task_hash":557712048,
   "spans":[  
      {  
         "start":45,
         "end":47,
         "text":"29",
         "rank":3,
         "label":"WEAPON",
         "score":0.1024216291,
         "source":"en_core_web_sm",
         "input_hash":-603642722
      }
   ],
   "meta":{  
      "score":0.1024216291
   },
   "answer":"reject"
}   

The input_hash is the same for both, but the task_hash is different because the rank and score were different (I assume, at least), and I’m not sure why. Ideally I’d like to not see this same sentence and span again.

@cheerfulstoic Thanks a lot for the update – this is interesting and a point we hadn’t really considered before. This might also explain some of the inconsistencies in the past.

The score is the model’s prediction, which varies depending on where you’re at in the annotation process. If the model in the loop was already updated with various annotations, it will score an entity differently than the base model that hasn’t learned anything new yet. That’s why you end up with different scores – and different hashes.

We need to think of a good solution for this. Of course, the hashing shouldn’t just ignore random attributes, because that’d make it unreliable… on the other hand, the score shouldn’t influence the hash, because that makes it a lot less useful :thinking:

In the meantime, if you’re okay with filtering on the input_hash, you could just add a wrapper around your stream that gets all input hashes from the database (db.get_input_hashes) and then filters examples with the same hash from the stream. You can also use the built-in filter prodigy.components.filters.filter_inputs, which does exactly that – see the docs for the API details.
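
For example, a minimal sketch (the dataset name is a placeholder for whatever you called yours):

from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs

db = connect()
# All input hashes already stored for the given dataset(s)
input_hashes = db.get_input_hashes('your_dataset')
# Drop incoming examples whose input hash is already in the dataset
stream = filter_inputs(stream, input_hashes)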