Training a new entity type with Prodigy


(Byron James) #1

Still having issues getting this right.

(Ines Montani) #2

Hi, could you share some more details on what you’re trying to do?

(Byron James) #3

I’ve got a taxonomy of roughly 45k terms: root terms, synonyms, etc., with a complete ontology for each. It exists as a JSON dataset and a service (paid, unfortunately, and from a 3rd party). I’ve been taking runs at bringing this taxonomy in (ingesting it?) and either creating a vector from it or adding it to one. I know my way around taxonomies and I’m not new to developing code (albeit in C#), so Python isn’t a huge leap - but this is Python plus Prodigy plus (eventually) spaCy - let’s just say it’s one step forward, one step backwards and slightly to one side.

45,000 terms is a lot to consume, and the goal is to come up with a clean, efficient way to turn these terms and their ontology into a vector (or a useful language model). Oh, and to make it as stress-free on the humans as is possible and dictated by the UNHCR.


(Matthew Honnibal) #4

I think one of the things that’s often tricky when getting started is to sort out all the concepts. We’ve tried to make this easy, but there’s definitely a lot of ground to cover, and it’s always going to be hard to match up what we’re trying to explain to the specifics of what you’re trying to do.

I think one thing that’s worth pointing out is that both spaCy and Prodigy are really focused around training models that can predict annotations on running text. The idea is that you want to create some tool that takes in a bunch of running sentences, and then the tool will tell you which words in the sentences are names, or which sentences fall into certain categories, etc.

Do you have running text, as well as your terminology list? If all you have are terms — without any examples of the terms in context — there’s actually not really much that NLP can do for you. A list of words without any context around them isn’t actually language, after all. Probably the only things that are helpful in spaCy for that situation are the word vectors, which can help you find words that are similar to each other. The terms.teach tool in Prodigy is helpful for that as well.
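
To make the "similar words" idea concrete, here is a minimal sketch of ranking terms by cosine similarity between their vectors. It is plain Python rather than the spaCy API, and the three-dimensional vectors in `TOY_VECTORS` are made-up values for illustration only; real word vectors would come from a trained model and have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, vectors, n=3):
    """Return the n terms whose vectors are closest to the query term's vector."""
    q = vectors[query]
    scored = [
        (cosine(q, vec), term)
        for term, vec in vectors.items()
        if term != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [term for _, term in scored[:n]]


# Made-up toy vectors, NOT taken from any real model.
TOY_VECTORS = {
    "refugee": [0.9, 0.1, 0.0],
    "asylum": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

print(most_similar("refugee", TOY_VECTORS, n=1))
```

With these toy values, "asylum" ranks above "invoice" for the query "refugee", which is the kind of neighbourhood lookup that word vectors make possible for a bare term list.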

If what you’re trying to do is recognise entities in running text, and you have an ontology with terms that you know are ways some entities can be written, there are a few ways you can try to use your terminology lists in spaCy and Prodigy. But I just want to make sure I understand the goal before I go into too much detail about that.
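
As a rough illustration of one such way: Prodigy recipes such as ner.teach can take a patterns file, which is JSONL with one `label` and one `pattern` per line. The sketch below turns a terminology list into token-based patterns. The input shape (`term` and `synonyms` keys) and the `TERM` label are assumptions made for the example, not the actual structure of the taxonomy discussed here, and splitting on whitespace only approximates spaCy’s tokenizer (hyphens and punctuation would need more care).

```python
import json


def taxonomy_to_patterns(entries, label):
    """Build Prodigy-style match patterns from taxonomy entries.

    `entries` is assumed (hypothetically) to look like:
        [{"term": "Refugee Camp", "synonyms": ["camp"]}, ...]
    """
    patterns = []
    seen = set()
    for entry in entries:
        for term in [entry["term"], *entry.get("synonyms", [])]:
            # Naive whitespace tokenization; an approximation of spaCy's tokenizer.
            tokens = term.lower().split()
            key = tuple(tokens)
            if not tokens or key in seen:
                continue  # skip empty terms and case-insensitive duplicates
            seen.add(key)
            patterns.append(
                {"label": label, "pattern": [{"lower": tok} for tok in tokens]}
            )
    return patterns


def to_jsonl(patterns):
    """Serialize patterns as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns)


# Hypothetical sample entry, for illustration only.
sample = [{"term": "Refugee Camp", "synonyms": ["camp", "refugee camp"]}]
print(to_jsonl(taxonomy_to_patterns(sample, "TERM")))
```

Matching on the lowercase form keeps the patterns case-insensitive, and the duplicate check stops a large term list from producing the same pattern twice.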

(Byron James) #5

Hi, I agree. It’s a lot to line up. I do have running text: rather large volumes of material that come into a DMP from a number of sources.

I’m wondering if perhaps a more flexible idea might be to train/teach a vector that contains the terms and ontology specific to the domain, a subset of the larger taxonomy, rather than tackle the entire thing.