First, I'd like to say congratulations on producing a fantastic tool.
Second, I have a large file that I would like to annotate: is it possible to start annotating it in one session, end that session, then return in a later session and continue annotating from the point where the previous session ended? Currently, annotation appears to start again from the beginning of the file when I start a new session.
Thanks for the question and sorry about the late reply.
We've actually been going back and forth on whether to make this a feature, or even the default behaviour (which is probably not a good idea, as it would make re-annotating harder and could lead to longer loading times for large datasets). The setting also needs to be recipe-specific: for some recipes within the same workflow you might want to use it, and for others you might want to turn it off.
The good news is, all the underlying logic needed is already there: as the examples come in, Prodigy assigns two hashes to them – an input hash, based on the task text or image, and a task hash, based on the input hash plus additional features like the spans or label.
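For illustration, here's a minimal sketch of what that looks like on a single example, using prodigy.util.set_hashes (which comes up again later in this thread) – the task itself is made up:

from prodigy.util import set_hashes

# a made-up example task, for illustration only
eg = {'text': 'hello world', 'spans': [{'start': 0, 'end': 5, 'label': 'GREETING'}]}
eg = set_hashes(eg)
print(eg['_input_hash'])  # based on the input, e.g. the text
print(eg['_task_hash'])   # based on the input hash plus features like spans or label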
So I'd suggest we add a filter function that filters out tasks whose task hash already exists in the current dataset, and that can be toggled on a per-recipe basis. In the meantime, you can also implement this functionality yourself via a custom recipe, by adding an on_load callback that gets the task hashes from the database and modifies the stream so that tasks with the same task hash are filtered out:
import prodigy
from prodigy.recipes.ner import teach    # import the ner.teach recipe
from prodigy.util import TASK_HASH_ATTR  # the task hash attr constant ('_task_hash')
# you could also hard-code this, but using the constant is cleaner

@prodigy.recipe('custom-recipe')
def custom_recipe(dataset, model, source, loader):  # etc.
    # recipes are Python functions that return a dictionary of components,
    # so you can just call them and receive back a dict to return or overwrite
    components = teach(dataset, model, source=source, loader=loader)

    def on_load(ctrl):
        # this function is called on load and gives you access to the
        # controller, which includes the database
        task_hashes = ctrl.db.get_task_hashes(dataset)  # get task hashes
        # overwrite the stream and filter out examples with task hashes
        # that already exist in the dataset
        stream = components['stream']
        components['stream'] = (eg for eg in stream
                                if eg[TASK_HASH_ATTR] not in task_hashes)

    components['on_load'] = on_load  # set an on_load option in the components
    return components                # return the dict of components
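To run the wrapped recipe, you'd load it via the -F flag – a sketch with placeholder names, assuming the recipe above lives in a file called recipe.py and that all four recipe arguments are passed positionally:

prodigy custom-recipe my_dataset en_core_web_sm my_data.jsonl jsonl -F recipe.py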
Of course, the above approach would also work within a fully custom recipe. For more details on how to wrap built-in recipes, see this comment.
Thanks for the reply; however, I still can't get it to work.
To get your code to run, I defined stream as components['stream'] and replaced eg[TASK_HASH] with eg[TASK_HASH_ATTR] – is this correct?
If I understand the documentation (and other examples) correctly, the task hash should be constant for the same spans and label?
I investigated the output after using prodigy.util.set_hashes:

import prodigy.util

task_keys = ('spans', 'label')
stream = (prodigy.util.set_hashes(eg, task_keys=task_keys) for eg in stream)

and in different annotation sessions the task_hash is different, even though the 'spans' and 'label' (and the input_hash) are the same. Is this the behaviour you would expect?
Ah sorry, this was a typo. Yes, it's TASK_HASH_ATTR.
This is definitely not expected behaviour.
The input hash is based on the text, image etc., and the task hash on the input hash plus the spans and/or label. The values are dumped into a string, sorted and hashed using mmh3. If no task hash keys are available in the example, the dumped JSON of the full task is used instead. By default, the stream is rehashed – but if the values are identical, the hashes should always match. I don't remember if this is currently documented, but there's also a prodigy.util.get_hash function that takes a task, the keys and an optional prefix, and returns a hash:
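A rough sketch of how that might be called, going only by the signature described above (the task contents, argument order and exact argument handling here are assumptions):

from prodigy.util import get_hash

# a made-up task – hash it based on the given keys; the prefix
# distinguishes hash types (exact call signature is an assumption)
task = {'text': 'hello world', 'label': 'GREETING'}
task_hash = get_hash(task, ('text', 'label'), 'task')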
Do you have an example task and the generated, mismatched hashes, so we can try to reproduce this issue? (Thanks for looking into this btw, much appreciated!)
On a positive note, we re-enabled and improved the --exclude flag for the upcoming version and it seems to work fine so far (assuming there's no deeper bug in the hashing). So when you run a recipe that supports streaming in data, you'll be able to specify --exclude current_dataset and it will filter already annotated examples from the stream. This should also be helpful for creating more reliable evaluation sets.
Sorry if this isn't in the best format for you (I wasn't sure how to get the task), but here are two example outputs after using set_hashes:
{'_input_hash': 861280820, 'text': 'they only last for a few minutes but it still concerns me.', '_task_hash': -598293614, 'spans': [{'start': 5, 'label': 'TIME', 'text': 'only', 'source': 'core_web_sm', 'input_hash': 861280820, 'rank': 2, 'end': 9, 'score': 0.018179588478833608}], 'meta': {'score': 0.018179588478833608}}
{'spans': [{'end': 9, 'input_hash': 861280820, 'rank': 2, 'text': 'only', 'score': 0.018179588478833608, 'source': 'core_web_sm', 'label': 'TIME', 'start': 5}], 'meta': {'score': 0.018179588478833608}, 'text': 'they only last for a few minutes but it still concerns me.', '_input_hash': 861280820, '_task_hash': 1925984069}
Could the issue be that 'spans' has the same contents every time but the order is randomised?
I was thinking this, too – but in theory, it shouldn't matter, because get_hash uses json.dumps with sort_keys=True.
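For reference, a quick way to convince yourself that key order shouldn't matter once sort_keys=True is applied:

import json

a = {'start': 5, 'end': 9, 'label': 'TIME'}
b = {'label': 'TIME', 'end': 9, 'start': 5}
# identical contents in a different order serialise to the same string
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)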
BUT – after a long debugging session, going through the rehashing across the application and wondering wtf is going on here – I found the problem! It turns out sort_keys wasn't applied correctly because of a mismatched bracket... arghhh.
Just fixed it and will add better hashing tests, so this problem should be resolved in the next release.
It's such a stupid bug that it's hard to work around in the meantime... but as a hacky solution for now, you could simply filter based on the INPUT_HASH_ATTR and accept that it'll also exclude other entities on the same input texts. If your input texts are all fairly short, this should probably be fine. (Alternatively, you could rehash the stream by setting overwrite=True on set_hashes, using only "meta" as the task key, and hope that the likelihood of two entities being assigned the exact same score on the same input text is pretty much zero. But this just sounds too hacky to be a good idea...)
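A minimal sketch of that input-hash workaround, assuming the same wrapped-recipe setup as in the example above (with components and dataset in scope, and get_input_hashes mirroring get_task_hashes):

from prodigy.util import INPUT_HASH_ATTR

def on_load(ctrl):
    # filter on the input hash instead of the task hash – coarser,
    # but unaffected by the task hashing bug described above
    input_hashes = ctrl.db.get_input_hashes(dataset)
    stream = components['stream']
    components['stream'] = (eg for eg in stream
                            if eg[INPUT_HASH_ATTR] not in input_hashes)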
@ines
Hey,
If I want to annotate an already existing stream again (say I want to annotate every single element of the existing stream three times), should I remove the input and task hashes and add each sentence to the stream three times?
My doubt is: will it automatically assign different input and task hashes to all three of them if I add a variable field to each one?
If I want to use this statement: task_hashes = ctrl.db.get_task_hashes(dataset)
without the on_load function, how do I avoid this error: ctrl is not defined?
The variable we've called ctrl in our code examples is the controller, which is passed to the on_load and on_exit callbacks for convenience and easy access to the annotation session details. In your case, you want to call a database method – so you can also access the database directly:
from prodigy.components.db import connect
db = connect() # uses the prodigy.json settings
task_hashes = db.get_task_hashes('your_dataset_name')
It looks like you might have come across a bug that was present in an older version of Prodigy. Could you check which version you're using and upgrade to v1.4.1?
When you load in a stream of examples, Prodigy will assign the hashes automatically, and also filter out identical questions. If you've already annotated examples in a dataset, you can use the --exclude option to avoid repetition of examples that are already in one or more datasets. For example:
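Something like this, with placeholder dataset, model and source names:

prodigy ner.teach my_dataset en_core_web_sm my_data.jsonl --exclude my_dataset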
This is also not working. I think it's the version issue, right?
I will first upgrade Prodigy and then try this method.
Thanks a lot! I will get back to you in case of any problems.
Ah, yes, the example I used refers to the latest version of Prodigy (which now has an additional argument, unsegmented – but you could also leave that out). In any case, you should definitely upgrade – the latest version includes various fixes to the hashing and exclude logic that should be very relevant to what you're doing.