NER document labeling

That looks correct! Could you run the command with PRODIGY_LOGGING=basic and share the output? So basically:

PRODIGY_LOGGING=basic python -m prodigy ner.manual company_details_dataset en_core_web_sm your_converted_data.jsonl --label COMPANY_TYPE,COMPANY_INFORMATION,COMPANY_NAME,COMPANY_DEPARTMENT,COMPANY_ADDRESS,COMPANY_COUNTRY_USA,EMAIL,NAME

'PRODIGY_LOGGING' is not recognized as an internal or external command,
operable program or batch file.

What operating system are you on? In any case, you have to set the environment variable PRODIGY_LOGGING to basic – so if you google “set environment variable” plus your OS / environment, it should tell you how to do it 🙂

My OS: Windows.
I set the environment variables below. It's still not recognized. I restarted the machine too.
PRODIGY_HOME=C:\Users\aaa.bbb\.prodigy
PRODIGY_LOGGING=basic

I think you might have to call set? See here:
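
For example, in a Windows cmd.exe session you would run something like this (the second line just reuses your earlier command; in PowerShell the syntax is $env:PRODIGY_LOGGING = "basic" instead):

set PRODIGY_LOGGING=basic
python -m prodigy ner.manual company_details_dataset en_core_web_sm your_converted_data.jsonl --label COMPANY_TYPE,COMPANY_INFORMATION,COMPANY_NAME,COMPANY_DEPARTMENT,COMPANY_ADDRESS,COMPANY_COUNTRY_USA,EMAIL,NAME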

No, I meant I did set it properly with set in the first instance (before your message). Everything is in the environment variables. It's still not recognized. Sorry for bothering you.

12:01:15 - GET: /project
Task queue depth is 1
Task queue depth is 1
12:01:15 - POST: /get_session_questions
12:01:15 - FEED: Finding next batch of questions in stream
12:01:15 - CONTROLLER: Validating the first batch for session: data_100-default
12:01:15 - PREPROCESS: Tokenizing examples
12:01:15 - FILTER: Filtering duplicates from stream
12:01:15 - FILTER: Filtering out empty examples for key 'text'
Exception when serving /get_session_questions

I solved this by upgrading murmurhash.

[screenshot: annotated text with newline symbols visible at the end of each line]

1) There are newline symbols at the end of each line. Is this common in Prodigy?

2) Also, there is a paragraph with newlines that I need to tag as COMPANY_INFORMATION, but there is contact information inside the paragraph, so I need to nest one label inside another. The UI is not allowing me to do that. Is it possible to configure Prodigy somewhere to allow that option?

3)
555
Bloemfontein
South Africa
Can I label those three lines with one label called COMPANY_ADDRESS, or do they need to be on one line?

There are several things here: Yes, knowing where a newline is is usually very important when you're annotating named entities. Newlines are tokens, and you never want to accidentally highlight them (and without the symbols, they'd be pretty much invisible). You can hide them by setting "hide_true_newline_tokens": true in your prodigy.json, but I usually wouldn't recommend it, because it can easily lead to inconsistent annotations.
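
For reference, a minimal prodigy.json with this setting would look like this (assuming you do want to hide the symbols despite the caveat above):

{
  "hide_true_newline_tokens": true
}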

Alternatively, you might also consider preprocessing that normalises the whitespace. If you're training a model later on, just make sure to also pre-process your inputs at runtime so they match the training data.
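
For example, a minimal sketch of such a preprocessing step in Python (the function name and the exact replacement rules are assumptions – adapt them to your data):

import re

def normalise_whitespace(text):
    # turn tabs and carriage returns into plain spaces
    text = text.replace("\t", " ").replace("\r", " ")
    # collapse runs of blank lines into a single newline
    text = re.sub(r"\n\s*\n+", "\n", text)
    # squeeze repeated spaces
    text = re.sub(r" {2,}", " ", text)
    return text.strip()

You'd apply this to the "text" of each example before annotation, and to raw documents at prediction time.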

If you want to train a named entity recognition model (especially with spaCy), it can't be trained to predict overlapping spans: by definition, a token can only be part of one entity. That's also why you can't highlight overlapping spans in the UI. You can always make several passes over the data to capture nested spans, but I'm not sure that's the best solution here. It really depends on what you want to do with the data later on and what statistical model you want to train.
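
If you do try several passes, one possible sketch (the dataset names here are placeholders) is to annotate the outer and inner labels into separate datasets over the same input file, so each dataset holds one non-overlapping layer of spans:

python -m prodigy ner.manual outer_spans_dataset en_core_web_sm your_converted_data.jsonl --label COMPANY_INFORMATION
python -m prodigy ner.manual inner_spans_dataset en_core_web_sm your_converted_data.jsonl --label EMAIL,NAME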

Thanks for your input. I was thinking of replacing newlines with spaces. Then the content would be one big paragraph to label. Do you think that's a good idea? I also need to label company information, which is sometimes multi-line and hard to label.

@mystuff You don’t necessarily have to replace all newlines – you’d just have to make sure that the tokenizer produces separate tokens for newlines. For example, by replacing double newlines with single newlines. Or you could add a custom tokenization rule that always splits on \n.
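
A minimal sketch of that second option in spaCy (standard tokenizer customisation, not Prodigy-specific – the example text is just for illustration):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
# extend the default infix patterns with a rule that matches newlines
infixes = list(nlp.Defaults.infixes) + [r"\n"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# "\n" now always comes out as its own token
print([t.text for t in nlp("555\nBloemfontein\nSouth Africa")])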

I have annotated 150 HTML documents; the raw text is separated by newlines. Now I need to run "prodigy ner.batch-train", am I right?

python -m prodigy ner.batch-train company_details_dataset en_core_web_sm --output company_model --label COMPANY_TYPE,COMPANY_INFORMATION,COMPANY_NAME,COMPANY_DEPARTMENT,COMPANY_ADDRESS,COMPANY_COUNTRY_USA,EMAIL,NAME

When I run the above train command, I get the error below:
File "transition_system.pyx", line 148, in spacy.syntax.transition_system.TransitionSystem.set_costs
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?

It seems there is some whitespace issue, so I followed the post below to fix it.

Still getting the same error. I can see many of these in the dataset: {"text":"\n","start":3359,"end":3360,"id":594}. Do you think it's a tokenization issue? If so, how do I pass a newline tokenizer when running "prodigy ner.batch-train"?

Tokens containing \n are totally fine – it's only a problem if labelled entity spans in the "spans" property start or end with a newline token, or consist only of newline tokens. This is an explicit change to the entity recognizer in spaCy v2.1 to make it more accurate and to prevent it from predicting entity spans like this, which are almost never what you want.

So if your data contains "spans" entries that are invalid like that, you should be able to just remove them and then re-import the edited data into a new dataset.
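
A minimal sketch of that cleanup (file and dataset names are placeholders). First export the annotations:

python -m prodigy db-out company_details_dataset > annotations.jsonl

Then drop the invalid spans with a small script:

import json

def span_is_valid(text, span):
    ent = text[span["start"]:span["end"]]
    # reject spans that are empty or start/end with whitespace –
    # str.strip() covers " ", "\n", "\r" and "\t"
    return bool(ent.strip()) and ent == ent.strip()

with open("annotations.jsonl", encoding="utf8") as f_in, \
     open("annotations_clean.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        eg = json.loads(line)
        eg["spans"] = [s for s in eg.get("spans", [])
                       if span_is_valid(eg["text"], s)]
        f_out.write(json.dumps(eg) + "\n")

Finally, re-import the cleaned file into a new dataset and train from that:

python -m prodigy db-in company_details_dataset_v2 annotations_clean.jsonl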

Thanks for your reply. I found 3 spans with whitespace and \n, removed them, and reloaded into a completely new dataset. Still getting the same error.

Just for testing, I trained on the top4_ner dataset and didn't get any error.

python -m prodigy ner.batch-train top4_ner en_core_web_sm --output top4-model --eval-split 0.2 --n-iter 6 --dropout 0.2
The output for top4:
17:09:14 - MODEL: Merging entity spans of 0 examples
17:09:14 - MODEL: Using 0 examples (without 'ignore')
17:09:14 - MODEL: Evaluated 0 examples
06 1884.572 0 0 0 0 0.000

Correct 0
Incorrect 0
Baseline 0.000
Accuracy 0.000

17:09:14 - RECIPE: Restoring disabled pipes: ['tagger', 'parser']

Model: C:\top4-model
Training data: C:\top4-model\training.jsonl

Does it look alright to you? If so, can I redo the entire annotation for those 100?

I finally fixed my issue by adding all whitespace chars like \r and \t, not just " ". When I ran ner.batch-train, the output was as below. I used the default batch size. Also, there are no duplicates in the data.

Correct 420
Incorrect 419
Baseline 0.000
Accuracy 0.501

How do I improve accuracy? Is it by adding more data (currently it has 150 examples)?

Yes, adding data should definitely be the first step. 150 examples is very low, so you won’t be seeing very reliable results.

I thought so, but I just wanted to confirm. Thanks for the reply. It means a lot; you gave me confidence that I am heading in the right direction.

Hello @ines, I have increased the dataset from 150 to 300 using ner.manual.
I annotated 150 new examples and merged them with the previous 150.
python -m prodigy ner.batch-train dataset_300 en_core_web_sm --output model_300 --label ........
The accuracy only increased from 0.501 to 0.582. May I know what I am doing wrong? Is there a way to debug the accuracy?

dataset_150:
Correct 420
Incorrect 419
Baseline 0.000
Accuracy 0.501

dataset_300:
Correct 831
Incorrect 597
Baseline 0.000
Accuracy 0.582

300 examples is still a very low number of examples. To really be able to trust your results, you typically want a lot more - maybe like 1000 or 2000.

If you haven't seen it yet, check out my NER flowchart for some more tips:

Thanks for the detailed flowchart. In that flowchart, it says 1000 sentences, not 1000 documents, am I right? I have more than 4000 sentences in my dataset.