NER Document Labeling

I finally fixed my issue by including all whitespace characters, such as \r and \t, not just " ". When I ran ner.batch-train, I got the output below. I used the default batch size, and there are no duplicates in the data.

Correct 420
Incorrect 419
Baseline 0.000
Accuracy 0.501

How do I improve the accuracy? Is it by adding more data (the dataset currently has 150 examples)?

Yes, adding data should definitely be the first step. 150 examples is very low, so you won’t be seeing very reliable results.

I thought so, but I just wanted to confirm. Thanks for the reply. It means a lot and gave me confidence that I am heading in the right direction.

Hello @ines, I have increased the dataset from 150 to 300 examples using ner.manual.
I annotated 150 new examples and merged them with the previous 150.
python -m prodigy ner.batch-train dataset_300 en_core_web_sm --output model_300 --label ........
The accuracy only increased by about 0.08 (from 0.501 to 0.582). May I know what I am doing wrong? Is there a way to debug the accuracy?

dataset_150:
Correct 420
Incorrect 419
Baseline 0.000
Accuracy 0.501

dataset_300:
Correct 831
Incorrect 597
Baseline 0.000
Accuracy 0.582
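
As a quick sanity check on the two runs above, here is a minimal sketch that recomputes the accuracy from the Correct and Incorrect counts. It assumes the reported accuracy is simply correct / (correct + incorrect), which matches the numbers shown but is not confirmed anywhere in this thread. The function name accuracy and the runs dictionary are made up for illustration and are not part of Prodigy.

```python
def accuracy(correct: int, incorrect: int) -> float:
    """Fraction of evaluated predictions that were correct."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("no evaluated predictions")
    return correct / total


# (Correct, Incorrect) counts copied from the two ner.batch-train outputs.
runs = {
    "dataset_150": (420, 419),
    "dataset_300": (831, 597),
}

scores = {name: accuracy(c, i) for name, (c, i) in runs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")

# Absolute difference between the two runs, on the same 0-1 scale.
gain = scores["dataset_300"] - scores["dataset_150"]
print(f"gain: {gain:.3f}")
```

Under that assumption, the gap between the two runs is roughly 0.08 on the 0-1 scale, so about 8 percentage points. Note also that the number of evaluated predictions changes between the runs (839 vs. 1428), so the two scores are not measured on the same evaluation set.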

300 examples is still a very low number of examples. To really be able to trust your results, you typically want a lot more - maybe like 1000 or 2000.

If you haven't seen it yet, check out my NER flowchart for some more tips:

Thanks for the detailed flowchart. It says 1000 sentences, not 1000 documents. Am I right? I have more than 4000 sentences in my dataset.