Understanding textcat.teach from PyData Berlin 2018 talk

textcat
solved

(Geoff) #1

Hello Matthew, I have a problem similar to the one you talked about at PyData Berlin 2018, and I’m trying to replicate the example you showed on the slide.

The first command, prodigy textcat.teach crime_dataset /data.jsonl --label CRIME, doesn’t work for me because it wants me to specify a spaCy model as the second argument. So I’m wondering how it worked in your example. The same question applies to the second command: prodigy ner.teach ner_dataset /data.json --label PERSON, LOCATION

Could you please clarify how to replicate the example you showed there?

Thank you


(Matthew Honnibal) #2

Hi @geoff,

I made a typo when I was putting together the slide; you’re right that the command is wrong there. It should be fixed in the slideshare, but it’s hard to fix the video :p. It should work if you specify the spaCy model. Something like en_core_web_md should be fine.
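
For anyone following along, here is a minimal sketch of the corrected argument order, written as a small Python helper that only assembles and prints the commands (it does not run Prodigy). The helper name build_teach_command is hypothetical. The dataset names, paths, and labels are copied from the thread, and the sketch assumes that multiple labels are passed as a single comma-separated value with no spaces.

```python
import shlex


def build_teach_command(recipe, dataset, spacy_model, source, labels):
    """Assemble a teach command with the spaCy model as the second argument.

    Hypothetical helper for illustration: it only builds the argument list.
    """
    return [
        "prodigy", recipe, dataset, spacy_model, source,
        "--label", ",".join(labels),  # assumption: one comma-separated value
    ]


# Values below are taken from the commands quoted in the thread.
textcat_cmd = build_teach_command(
    "textcat.teach", "crime_dataset", "en_core_web_md", "/data.jsonl", ["CRIME"]
)
ner_cmd = build_teach_command(
    "ner.teach", "ner_dataset", "en_core_web_md", "/data.json", ["PERSON", "LOCATION"]
)

# Print shell-ready command lines for copying into a terminal.
print(shlex.join(textcat_cmd))
print(shlex.join(ner_cmd))
```

The only change relative to the slide is the model name inserted between the dataset and the source file.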


(Geoff) #3

Thank you for the quick reply. I have a follow-up question. I see that in prodigy textcat.teach crime_dataset en_core_web_md /data.jsonl --label CRIME you don’t specify any initial training data, I mean --seed or --patterns. Is it fine to just start annotating without this initial information? To give a little background on what I’m trying to solve: I want a model to tell me whether there is an address present in the text.


(Matthew Honnibal) #4

If you’re labelling an entity that the model already predicts, you can use the current state of the model as a starting point. But if you’re annotating a new entity, you do need to do something else to add the initial entities.

I would suggest starting with a round of ner.manual annotation, to train an initial model. After that you can use ner.teach to improve its predictions.
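
A rough sketch of that two-stage workflow, again as Python that only prints the commands rather than running them. The dataset name, the ADDRESS label, and the trained-model path are hypothetical placeholders, and the sketch assumes ner.manual takes the same dataset, model, and source arguments as ner.teach. The training step between the two stages is left as a comment because the recipe for it depends on the Prodigy version.

```python
import shlex

DATASET = "address_dataset"        # hypothetical dataset name
SOURCE = "/data.jsonl"             # source path reused from the thread
LABEL = "ADDRESS"                  # hypothetical label for the new entity
BASE_MODEL = "en_core_web_md"      # model suggested earlier in the thread
TRAINED_MODEL = "./address_model"  # hypothetical path to the model trained in between

steps = [
    # Stage 1: mark the new entity by hand to collect initial examples.
    ("manual annotation",
     ["prodigy", "ner.manual", DATASET, BASE_MODEL, SOURCE, "--label", LABEL]),
    # (Train an initial model from the stage 1 annotations here; the exact
    # training recipe is version-dependent and not spelled out in this sketch.)
    # Stage 2: let the trained model suggest entities and accept or reject them.
    ("model-in-the-loop annotation",
     ["prodigy", "ner.teach", DATASET, TRAINED_MODEL, SOURCE, "--label", LABEL]),
]

for description, argv in steps:
    print(f"# {description}")
    print(shlex.join(argv))
```

The second stage points at the trained model rather than the base model, since the base model does not yet know the new label.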