terms.teach : seeds not added to dataset (hence not in patterns)

In the terms.teach recipe, the seed terms are not added to the dataset.
It follows that the seed terms do not end up in the list of patterns generated by terms.to-patterns.

If I understood the intended behavior of this recipe correctly,

DB.add_examples(seed_tasks)

should be:

DB.add_examples(seed_tasks, datasets=[dataset])

Thanks for the report – I think you’re right! Sorry about that. Already fixed it and we’ll ship it with the upcoming release!

Great, thanks @ines !
Do you already have a tentative date for the next release?

A broader question about the intended use of this recipe: Do you plan to update terms.teach so it lends itself to successive sessions of annotation?
The current recipe does not have --exclude <dataset>, which means the same terms can be annotated in different sessions.
An even better solution could be a --resume parameter that would also (re)build accept_doc and reject_doc from the already-annotated examples, instead of starting with an empty reject_doc and only the seed terms in accept_doc.
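To make the suggestion concrete, here is a rough sketch of what resuming could do: partition a dataset's existing annotations by answer and use them to re-seed the accept/reject term lists. The record shape ({"text": ..., "answer": ...}) mirrors Prodigy's, but rebuild_targets itself is a hypothetical helper, not Prodigy's actual API.

```python
def rebuild_targets(examples, seeds=()):
    """Split prior annotations into accepted/rejected term sets.

    `examples` is a list of annotation dicts as stored in the dataset;
    `seeds` are the terms passed on the command line (if any).
    """
    accepted = set(seeds)
    rejected = set()
    for eg in examples:
        if eg.get("answer") == "accept":
            accepted.add(eg["text"])
        elif eg.get("answer") == "reject":
            rejected.add(eg["text"])
    return accepted, rejected

# Example: resume from two prior annotations plus one new seed term.
annotations = [
    {"text": "fraud", "answer": "accept"},
    {"text": "banana", "answer": "reject"},
]
accept_terms, reject_terms = rebuild_targets(annotations, seeds=["scam"])
```

The actual recipe would then build accept_doc and reject_doc from these term sets rather than from the seeds alone.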

Ah, yes, I really like those ideas! This should be easy to implement, so we might be able to get this fix into the next version as well.

We were actually going to release it last weekend already – but we kept making really good progress on some of the outstanding issues and features, so the next release will be v1.5.0 and include stream/task validation, entry points to plug in custom loaders and databases, various other small fixes and maybe even the manual annotation interface :slightly_smiling_face: I hope I can get the first internal test build running tonight!


Just released v1.5.0, which includes a fix for this, and also introduces the --resume flag :tada:


I saw that and quickly tested it, and it works just fine. Thanks a lot for your responsiveness (and congrats on the new release)!

One caveat for terms.teach: two consecutive runs with the same seeds result in duplicate entries in the database. We should probably check and only add seeds that are not already in the DB.

As a side note, the changelog mentions a --resume option for ner.match, but there's no such option. If I read the code correctly, that recipe always resumes work when it operates on an existing dataset :slight_smile: (In fact, terms.teach should probably resume by default too; I can't think of any good reason not to.)

Sure, that's a good idea! (Then again, the recipe does allow starting with no seed terms, so if you resume, I guess it makes sense not to specify seed terms.)

Ah damn, for some reason, that fix didn't make it into the release? I'll double check again tomorrow and fix the release notes. (I squeezed this update in kinda last minute in response to a thread on here.)

About the defaults: This is kinda tricky and we thought about this a lot... at the moment, Prodigy tends to assume as little as possible about what the datasets "mean" and how existing annotations should be related back to your current annotation session. That's also why all of this functionality is currently opt-in only.

@ines:
Sure, that’s a good idea! (Then again, the recipe does allow starting with no seed terms, so if you resume, I guess it makes sense not to specify seed terms.)

I fully agree with you, but I see at least two situations where the user can currently add seed terms more than once inadvertently:

  1. The first invocation(s) of terms.teach with seeds fail because the vectors argument does not point to an existing model or word2vec file. (It happened to me yesterday when I installed the latest version of prodigy in a new conda env.) This could be prevented by loading the vectors before the seed terms are added to the DB.
  2. The user maintains an external list of seed terms that grows over time (e.g. following emerging trends in various feeds they monitor and annotate). So, periodically, they resume terms.teach with an updated list of seed terms. Maybe another Prodigy recipe or a different workflow would be better suited, but I think most users in this situation would just do that (and introduce duplicates).
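For situation 1, the fix is mostly a matter of ordering: validate and load the vectors before writing anything to the database, so a bad --vectors argument fails fast with no side effects. A minimal sketch, with hypothetical function names standing in for the recipe's loading and DB steps:

```python
def start_session(load_vectors, add_seeds, vectors_path, seed_tasks):
    """Load the vectors first; only add seed tasks if that succeeds."""
    model = load_vectors(vectors_path)  # raises early on a bad path/model
    add_seeds(seed_tasks)               # database is only touched on success
    return model
```

With this ordering, a failed invocation leaves the dataset untouched, so re-running with a corrected vectors path doesn't duplicate the seeds.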

@ines:
About the defaults: This is kinda tricky and we thought about this a lot… at the moment, Prodigy tends to assume as little as possible about what the datasets “mean” and how existing annotations should be related back to your current annotation session. That’s also why all of this functionality is currently opt-in only.

I'm impressed by Prodigy's UX, so I fully trust you to make the right decisions including ignoring my suggestions when you know better :slight_smile: