Dear all,
I’ve just started playing around with spaCy/Prodigy. The documentation is nice, but there doesn’t seem to be an obvious way to modify the default labels/entities. I want to start from one of the default English models and fine-tune it for the kind of texts I will be processing. Specifically, I need to keep some of the default labels (DATE, GPE, etc.) and add others (COMMODITY, AGENT, etc.).
- How could I do this?
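For context, I imagined something along these lines in spaCy (just a rough sketch based on my reading of the docs, so I'm not sure this is the recommended way; COMMODITY and AGENT are my own labels, and the output path is a placeholder):

import spacy

# Start from a default English model that already knows DATE, GPE, etc.
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")

# Register my new entity types alongside the existing ones
for label in ("COMMODITY", "AGENT"):
    ner.add_label(label)

# ...then fine-tune on my own annotations and save the result
nlp.to_disk("./models/en_commodities")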
I also have a lot of text where these new labels/entities appear, so I need to highlight them using Prodigy.
- What would the workflow look like?
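My guess is that the annotation step would be something like the command below (ner.manual with my custom labels; the dataset name and source file are placeholders), but I'd like to confirm this is the right approach:

prodigy ner.manual my_dataset en_core_web_sm ~/Shared/my_corpus.jsonl --label COMMODITY,AGENT,DATE,GPE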
EDIT 1:
I tried to follow the example here (French NER) to initialize a model with no labels for English, but it’s not working.
This is what I did:
python3 -m spacy init-model en ./lang/en_vectors_comm --vectors ~/Downloads/crawl-300d-2M.vec
Reading vectors from /home/user/Downloads/crawl-300d-2M.vec
Open loc
1999995it [03:14, 10274.76it/s]
Creating model…
0it [00:00, ?it/s]
Sucessfully compiled vocab
1999715 entries, 1999995 vectors
Unfortunately, it fails when I try to create the gold standard:
prodigy ner.make-gold my_dataset ./lang/en_vectors_comm/ ~/Shared/my_corpus.jsonl --label ~/Documents/my_labels
Using 7 labels from /home/user/Documents/my_labels
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!
12:31:28 - Exception when serving /get_questions
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/waitress/channel.py", line 338, in service
task.service()
File "/usr/local/lib/python3.6/dist-packages/waitress/task.py", line 169, in service
self.execute()
File "/usr/local/lib/python3.6/dist-packages/waitress/task.py", line 399, in execute
app_iter = self.channel.server.application(env, start_response)
File "/usr/local/lib/python3.6/dist-packages/hug/api.py", line 423, in api_auto_instantiate
return module.__hug_wsgi__(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/falcon/api.py", line 244, in __call__
responder(req, resp, **params)
File "/usr/local/lib/python3.6/dist-packages/hug/interface.py", line 793, in __call__
raise exception
File "/usr/local/lib/python3.6/dist-packages/hug/interface.py", line 766, in __call__
self.render_content(self.call_function(input_parameters), context, request, response, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/hug/interface.py", line 703, in call_function
return self.interface(**parameters)
File "/usr/local/lib/python3.6/dist-packages/hug/interface.py", line 100, in __call__
return __hug_internal_self._function(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/prodigy/app.py", line 105, in get_questions
tasks = controller.get_questions()
File "cython_src/prodigy/core.pyx", line 109, in prodigy.core.Controller.get_questions
File "cython_src/prodigy/components/feeds.pyx", line 56, in prodigy.components.feeds.SharedFeed.get_questions
File "cython_src/prodigy/components/feeds.pyx", line 61, in prodigy.components.feeds.SharedFeed.get_next_batch
File "cython_src/prodigy/components/feeds.pyx", line 130, in prodigy.components.feeds.SessionFeed.get_session_stream
File "/home/user/.local/lib/python3.6/site-packages/toolz/itertoolz.py", line 368, in first
return next(iter(seq))
File "/usr/local/lib/python3.6/dist-packages/prodigy/recipes/ner.py", line 209, in make_tasks
for doc, eg in nlp.pipe(texts, as_tuples=True):
File "/home/user/.local/lib/python3.6/site-packages/spacy/language.py", line 548, in pipe
for doc, context in izip(docs, contexts):
File "/home/user/.local/lib/python3.6/site-packages/spacy/language.py", line 572, in pipe
for doc in docs:
File "/home/user/.local/lib/python3.6/site-packages/spacy/language.py", line 551, in <genexpr>
docs = (self.make_doc(text) for text in texts)
File "/home/user/.local/lib/python3.6/site-packages/spacy/language.py", line 544, in <genexpr>
texts = (tc[0] for tc in text_context1)
File "/usr/local/lib/python3.6/dist-packages/prodigy/recipes/ner.py", line 208, in <genexpr>
texts = ((eg['text'], eg) for eg in stream)
File "cython_src/prodigy/components/preprocess.pyx", line 118, in add_tokens
File "cython_src/prodigy/components/preprocess.pyx", line 42, in split_sentences
File "doc.pyx", line 535, in __get__
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.
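From the error, I assume the vectors-only model simply has no component that sets sentence boundaries (no parser). Would the fix be to add the rule-based sentencizer to the model and save it again, something like this sketch (I haven't confirmed this is the intended workflow)?

import spacy

nlp = spacy.load("./lang/en_vectors_comm")
# The vectors-only model has no parser, so nothing sets sentence boundaries;
# the error message suggests adding the rule-based sentencizer instead.
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.to_disk("./lang/en_vectors_comm")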