Hi Ines & Matt,
I'm not sure whether my current workflow makes sense, so I'd like your opinion on the following problem:
As you might recall from my older posts, I'm highly engaged in using the
EntityRuler (in fact, a whole pipeline of chained rulers that combine and split entities). It started as a construct to boost NER annotation by giving predictions and pre-selecting entities when looking at company imprints.
Well, this worked so well that for many of my desired entities I will stick to this rule-based model. But for some entities, these rules are more of a guess, only omitting obvious rejections. An example regarding the address of a company (the most interesting case for me):
In many cases, one will find the owner (company name) directly above or in front of the street name of the full address. Therefore I mark this line with the
EntityRuler. Because this is only a vague guess, I want to post-process it with a model, to gain further insight into whether this rule-extracted entity is plausible.
For this, I created a dataset with each of these rule-guessed entities as a separate entry and fed it to the
textcat.teach recipe (using your pretrained
de_core_news_sm model). These entities are spans mostly consisting of 1-8 tokens.
Is textcat a good approach here? I found Ines' video tutorial and liked the idea of getting a probability at the end and having a fast (because only binary) annotation process. Because of the rule-based preselection, there will be only very few cases where an entity selection INSIDE of the given span (as in
ner.manual) would be needed.
It really breaks down to the question: "Could this span be a companyname or not?"
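To illustrate, the dataset I feed to textcat.teach is plain JSONL with one rule-guessed span per line, roughly built like this (the span texts here are made-up examples, not my real data):

```python
import json

# Hypothetical rule-guessed spans extracted by the EntityRuler pipeline
# (made-up examples, mostly 1-8 tokens each).
candidate_spans = [
    "Musterfirma GmbH & Co. KG",
    "Hauptstrasse 12",
    "Max Mustermann Softwareentwicklung",
]

# Prodigy's JSONL loader expects one task per line with a "text" key.
with open("firstline_candidates.jsonl", "w", encoding="utf8") as f:
    for span in candidate_spans:
        f.write(json.dumps({"text": span}, ensure_ascii=False) + "\n")
```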
After annotating 1200 examples and batch-training (20% eval split) I get a moderate F-score of 0.82 and an accuracy of 0.80. I then ran
textcat.train-curve and got:
Starting with model de_core_news_sm
Dropout: 0.2 Batch size: 10 Iterations: 10 Samples: 4
25% 0.63 +0.63
50% 0.72 +0.09
75% 0.74 +0.01
100% 0.72 -0.02
Did I do something wrong, or why do I get an overall lower (and even decreasing) accuracy?
Would it be better in this case to start with a blank model instead of your pretrained one?
Is there a possibility to enhance the speed of this model by deactivating some pipeline components that are not needed for textcat?
I know this is (again) a whole bunch of questions, but I hope you can help me out here.
I think using textcat here does make sense, and it's in line with the advice we're often giving people. You might try improving the accuracy by starting with word vectors. You can try the
de_core_news_md model for that, but you might have better luck with the larger FastText vectors:
python -m spacy init-model de_vectors_web_lg --vectors cc.de.300.vec.gz
It does sound like more data isn't improving the model. You'll probably want to create a dedicated evaluation set before you try out other ways of improving the score, so that you are sure you're running a stable comparison. Once you've split off your evaluation data into a new dataset, I would probably look at the examples the model is getting wrong. This might give you some more insight into how to improve things. For instance, you might find that there are some annotation errors, or you might get a better sense for what distinctions the model isn't learning.
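Splitting off a dedicated evaluation set can be as simple as shuffling once with a fixed seed and keeping the held-out portion stable across all experiments. A rough sketch (the `split_eval` helper and the dummy annotations are just for illustration; in practice you'd export your annotations with `prodigy db-out`):

```python
import random

def split_eval(examples, eval_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and split off a held-out
    evaluation portion, so every comparison uses the same eval set."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_eval = int(len(examples) * eval_fraction)
    return examples[n_eval:], examples[:n_eval]

# Dummy annotations standing in for exported Prodigy examples.
annotations = [{"text": f"span {i}", "answer": "accept"} for i in range(10)]
train, eval_set = split_eval(annotations)
print(len(train), len(eval_set))  # 8 2
```

Because the seed is fixed, re-running the split always yields the same evaluation set, which is what makes score comparisons between training runs meaningful.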
Thank you for your FastText vector suggestion.
I used this command and initialized a new model. Batch-training this model did indeed lead to higher accuracies.
So I got bold and tried to use this batch-trained, vector-boosted model with the
ner.teach recipe. It takes some time to load the model and start the web server, but it finally succeeds ... or not: the Prodigy-generated web page is stuck at "Loading..." without showing any examples.
Am I doing something wrong here?
This sounds like Prodigy isn't able to put together a batch of suggestions to send out. When you set
PRODIGY_LOGGING=basic, is there anything in the logs that looks suspicious? And are the labels you're setting on the command line all present in the model?
Thank you for the hint, I found the problem. In my stupidity, I forgot to pass the JSONL dataset I want to process (I only gave the model, the dataset the annotations should be saved in, and the label).
But I was surprised that this didn't lead to an error earlier on. FYI, the logging:
Z:\NLP_data\models>python -m prodigy ner.teach firstline_train comp_imp_model --label COMPANY_IMP
08:29:05 - APP: Using Hug endpoints (deprecated)
08:29:06 - RECIPE: Calling recipe 'ner.teach'
Using 1 labels: COMPANY_IMP
08:29:06 - RECIPE: Starting recipe ner.teach
08:29:06 - LOADER: Loading stream from jsonl
08:29:06 - LOADER: Reading stream from sys.stdin
08:29:06 - LOADER: Rehashing stream
08:30:14 - RECIPE: Creating EntityRecognizer using model comp_imp_model
08:30:34 - RECIPE: Making sure all labels are in the model
08:30:34 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
08:30:34 - CONTROLLER: Initialising from recipe
08:30:34 - VALIDATE: Creating validator for view ID 'ner'
08:30:34 - DB: Initialising database SQLite
08:30:34 - DB: Connecting to database SQLite
08:30:34 - DB: Loading dataset 'firstline_train' (657 examples)
08:30:34 - DB: Creating dataset '2019-08-27_08-30-34'
08:30:34 - DatasetFilter: Getting hashes for excluded examples
08:30:34 - DatasetFilter: Excluding 657 tasks from datasets: firstline_train
08:30:34 - CONTROLLER: Initialising from recipe
08:30:34 - CORS: initialize wildcard "*" CORS origins
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!
08:30:44 - GET: /project
Task queue depth is 1
08:30:44 - POST: /get_session_questions
08:30:44 - FEED: Finding next batch of questions in stream
08:30:44 - CONTROLLER: Validating the first batch for session: firstline_train-default
08:30:44 - PREPROCESS: Splitting sentences
08:30:44 - FILTER: Filtering duplicates from stream
08:30:44 - FILTER: Filtering out empty examples for key 'text'
... and there it's stuck. As mentioned, giving the correct input file solves the problem!
Glad you got it working! The reason it "worked" is that Prodigy can also read from standard input, so you can leave the source argument out and pipe data forward. For instance, like this:
cat your_data.jsonl | prodigy ner.teach dataset en_core_web_sm
So in your case, it was waiting for something to come in, but nothing arrived. In future versions, we'll probably change it to explicitly take
- for reading from standard input (just like a lot of other command-line tools). But that'd obviously be a breaking change.
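For illustration, the usual `-` convention looks something like this in a loader (a generic sketch, not Prodigy's actual implementation — `read_jsonl` is a hypothetical helper):

```python
import sys
import json

def read_jsonl(source):
    """Read JSONL tasks from a file path, or from standard input
    if the source is "-" (the common command-line convention)."""
    stream = sys.stdin if source == "-" else open(source, encoding="utf8")
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)
```

With an explicit `-`, a missing source argument could raise an error immediately instead of silently blocking on stdin.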
Ah ok, I forgot about this possibility. Again, thank you very much for this additional explanation!