NER document Labeling

Hi, I am new to Prodigy and I have used Dataturks a lot for labeling.
I need to extract the organization name, location, email and contact name from the "Contact us" page of a given company's HTML file. I am thinking of a workflow like this:
-> download around 50 HTML sources
-> remove noise in the HTML source, e.g. remove the footer, input, img, script and style elements
-> extract the remaining text data and store it in a text file with the data on separate lines (see the sketch below)
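The cleanup step I have in mind would look roughly like this (just a sketch using BeautifulSoup; the tag list is what I plan to strip):

from bs4 import BeautifulSoup

def clean_html(html):
    # Strip the noisy elements and keep only the visible text, one line per text block.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["footer", "input", "img", "script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)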

Each cleaned HTML text file contains multiple sentences separated by newlines. Now I need to label each text file. I want to use ner.manual to label the data. Can someone clarify a few things for me?

  1. How do I uniquely identify and label each document?
  2. I need to convert the text files into JSON or JSONL. Do I need to dump one cleaned file into one JSON file, or keep all 50 HTMLs' data in one big JSON file like below?
    "data": [
        {
            "text": "Apple Online Store
                        Visit the Apple Online Store to purchase Apple hardware, software and third-party accessories. To purchase by phone, please call 0800 048 0408. Lines are open Monday-Friday 08:00-20:00 and Saturday-Sunday 09:00-18:00.
.....................................
................................",
            "text": "Helpline & Contact | Samsung UK
                          By ticking this box, I accept Samsung Service Updates, including : samsung.com Services and marketing information, new product and service announcements as well as special offers, events and newsletters
MOBILE: 24 HOURS, 7 DAYS A WEEK</p><p>ALL OTHER: M-F 8–12AM/S-S 9AM–11PM,  APPLIANCES 6PM ET"
          
        }
    ]
  3. Can I feed the dataset to a non-spaCy API as well?

Hi, this is my thought process..

I am assuming you have 50 docs in total. So you can simply write a function that takes each HTML text file and tags it with a generic name such as "1000.txt", "1001.txt", "1002.txt", "1003.txt", ...
Each sentence (separated by newlines) will still have the same doc label as long as it falls under the same doc. Q2 answers the format that your data set needs to be in.

Yes, you would dump all 50 cleaned HTML files into one JSONL file. As mentioned above, each HTML file is differentiated from the others by the "source" key it comes from. So it should be in the following format...
{"text": " XXXXXXX ", "meta": {"source" : "1000.txt"}}

Hope it helps.


Yes, @jsnleong's solution for converting the data should work :slightly_smiling_face:

This sounds like you definitely want to frame this as an NER task: label spans of text in your data for the different labels, and then train a model to reproduce this decision. The most straightforward way would be to run ner.manual with your labels:

prodigy ner.manual your_dataset en_core_web_sm your_converted_data.jsonl --label ORG,LOCATION,EMAIL,NAME

Prodigy also encourages you to find more clever ways to automate the annotation so you have to do less work manually. For instance, once you have a pre-trained model that predicts something, you can have the model pre-highlight the entities. That's what workflows like ner.make-gold are designed for.
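Once you have a first trained model saved to disk, that would look something along these lines (double-check the exact arguments with prodigy ner.make-gold --help):

prodigy ner.make-gold your_dataset your_trained_model your_converted_data.jsonl --label ORG,LOCATION,EMAIL,NAME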

Sure! You can run the db-out command to export your annotations to a JSONL file, and then use that to train pretty much any model using any framework. Prodigy uses a pretty straightforward JSONL format for the created annotations that should hopefully be very easy to use and work with. Here's an example of an annotated text with an entity:

{
    "text": "Hello Apple",
    "tokens": [
        { "text": "Hello", "start": 0, "end": 5, "id": 0 },
        { "text": "Apple", "start": 6, "end": 11, "id": 1 }
    ],
    "spans": [{ "start": 6, "end": 11, "label": "ORG", "token_start": 1, "token_end": 1 }]
}
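If you want to use the annotations outside of spaCy, a rough sketch of reading that export back in could look like this (assuming you've saved the db-out output as annotations.jsonl, which is just a placeholder name):

import json

examples = []
with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        record = json.loads(line)
        # Keep accepted examples as (text, [(start, end, label), ...]) pairs.
        if record.get("answer") != "accept":
            continue
        entities = [(s["start"], s["end"], s["label"]) for s in record.get("spans", [])]
        examples.append((record["text"], entities))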

Thank you so much for your prompt reply. Now I have created a JSONL file with 100 JSON records. I am about to annotate by running the manual recipe.

python -m prodigy ner.manual company_details_dataset en_core_web_sm your_converted_data.jsonl --label COMPANY_TYPE,COMPANY_INFORMATION,COMPANY_NAME,COMPANY_DEPARTMENT,COMPANY_ADDRESS,COMPANY_COUNTRY_USA,EMAIL,NAME

But I am getting the following error in the web UI. I think I need to add my labels to the model. May I know how to add my own labels to the model?

ERROR: can’t fetch tasks. Make sure the server is running correctly.
Oops, something went wrong :frowning:

I have followed "No tasks available" for any text source I give for ner.teach recipe but i can not use built-in labels for my task as i have to identify type of the company, categorize different addresses and also different sections of company name like Departments.

LOG:
File "cython_src\prodigy\core.pyx", line 130, in prodigy.core.Controller.get_questions
File "cython_src\prodigy\components\feeds.pyx", line 58, in prodigy.components.feeds.SharedFeed.get_questions
File "cython_src\prodigy\components\feeds.pyx", line 63, in prodigy.components.feeds.SharedFeed.get_next_batch
File "cython_src\prodigy\components\feeds.pyx", line 147, in prodigy.components.feeds.SessionFeed.get_session_stream
ValueError: Error while validating stream: no first example. This likely means that your stream is empty.
Task queue depth is 1
Exception when serving /get_session_questions
Traceback (most recent call last):
File "cython_src\prodigy\components\feeds.pyx", line 140, in prodigy.components.feeds.SessionFeed.get_session_stream
File "C:\anaconda3\lib\site-packages\toolz\itertoolz.py", line 368, in first
return next(iter(seq))
StopIteration
…

When you see the error “Error while validating stream: no first example. This likely means that your stream is empty.”, this usually means that there’s nothing valid to load and that the incoming stream of examples is empty. What does your_converted_data.jsonl look like? It should be a valid JSONL file with every record containing a "text". For example:

{"text": "hello world"}
{"text": "this is a text"}

{"text": "xxxxxxxxx", "meta": {"source": "company1.txt"}}
{"text": "yyyyyyyyy", "meta": {"source": "company2.txt"}}
{"text": "zzzzzzzzz", "meta": {"source": "company3.txt"}}
{"text": "aaaaaaaaa", "meta": {"source": "company4.txt"}}

That looks correct! Could you run the command with PRODIGY_LOGGING=basic and share the output? So basically:

PRODIGY_LOGGING=basic python -m prodigy ner.manual company_details_dataset en_core_web_sm your_converted_data.jsonl --label COMPANY_TYPE,COMPANY_INFORMATION,COMPANY_NAME,COMPANY_DEPARTMENT,COMPANY_ADDRESS,COMPANY_COUNTRY_USA,EMAIL,NAME

'PRODIGY_LOGGING' is not recognized as an internal or external command,
operable program or batch file.

What operating system are you on? In any case, you have to set the environment variable PRODIGY_LOGGING to basic – so if you google “set environment variable” plus your OS / environment, it should tell you how to do it :slightly_smiling_face:

My OS: Windows.
I set the environment variables below. It's still not recognized. I restarted the machine too.
PRODIGY_HOME=C:\Users\aaa.bbb\.prodigy
PRODIGY_LOGGING=basic

I think you might have to call set? See here:
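In cmd.exe, the usual pattern is to run set on its own line and then the command, e.g.:

set PRODIGY_LOGGING=basic
python -m prodigy ner.manual company_details_dataset en_core_web_sm your_converted_data.jsonl --label COMPANY_TYPE,COMPANY_INFORMATION,COMPANY_NAME,COMPANY_DEPARTMENT,COMPANY_ADDRESS,COMPANY_COUNTRY_USA,EMAIL,NAME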

No, I meant I did set it properly with set in the first instance (before your message). Everything is in the environment variables. It's still not recognized. Sorry for bothering you.

12:01:15 - GET: /project
Task queue depth is 1
Task queue depth is 1
12:01:15 - POST: /get_session_questions
12:01:15 - FEED: Finding next batch of questions in stream
12:01:15 - CONTROLLER: Validating the first batch for session: data_100-default
12:01:15 - PREPROCESS: Tokenizing examples
12:01:15 - FILTER: Filtering duplicates from stream
12:01:15 - FILTER: Filtering out empty examples for key 'text'
Exception when serving /get_session_questions

I solved this by upgrading murmurhash.



1) There are newline symbols at the end of each line. Is that common in Prodigy?

2) Also, if there is a paragraph with newlines that I need to tag as COMPANY_INFORMATION, but there is contact information inside the paragraph, I would need to nest one label inside another. But the UI is not allowing me to do that. Is it possible to configure Prodigy somewhere to allow that option?

3) 555
Bloemfontein
South Africa
Can I label those three lines with one label called COMPANY_ADDRESS? Or does it need to be on one line?

There are several things here: Yes, knowing where a newline is is usually very important when you're annotating named entities. Newlines are tokens and you never want to accidentally highlight them (and without the symbols, they'd be pretty much invisible). You can hide them by setting "hide_true_newline_tokens": true in your prodigy.json, but I usually wouldn't recommend it, because it can easily lead to inconsistent annotations.

Alternatively, you might also consider preprocessing that normalises the whitespace. If you're training a model later on, just make sure to also pre-process your inputs at runtime to make sure it matches the training data.

If you want to train a named entity recognition model (especially with spaCy), training it to predict overlapping spans isn't possible. By definition, a token can only be part of one entity. That's also why you can't highlight overlapping spans. You can always make several passes over the data to capture nested spans, but I'm not sure that's the best solution here. It really depends on what you want to do with the data later on and what statistical model you want to train.

Thanks for your input. I was thinking of replacing newlines with spaces. Then the content will be one big chunk of paragraph to label. Do you think that's a good idea? I also need to label the company information, which is sometimes multi-line and hard to label.

@mystuff You don’t necessarily have to replace all newlines – you’d just have to make sure that the tokenizer produces separate tokens for newlines. For example, by replacing double newlines with single newlines. Or you could add a custom tokenization rule that always splits on \n.
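A small preprocessing sketch along those lines might look like this (just a sketch; adjust the rules to your data, and remember to apply the same step at runtime):

import re

def normalise_whitespace(text):
    # Drop trailing spaces/tabs before newlines, then collapse runs of blank lines into one newline.
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()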

I have annotated 150 HTMLs; the raw text is separated by newlines. Now I need to run "prodigy ner.batch-train". Am I right?

python -m prodigy ner.batch-train company_details_dataset en_core_web_sm --output company_model --label COMPANY_TYPE,COMPANY_INFORMATION,COMPANY_NAME,COMPANY_DEPARTMENT,COMPANY_ADDRESS,COMPANY_COUNTRY_USA,EMAIL,NAME

When I run the above train command, I am getting the error below:
File "transition_system.pyx", line 148, in spacy.syntax.transition_system.TransitionSystem.set_costs
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?

It seems that there is some whitespace issue, so I followed the post below to fix it.

I am still getting the same error. I can see many of these in the dataset: {"text":"\n","start":3359,"end":3360,"id":594}. Do you think it's a tokenization issue? If so, how do I pass a newline tokenizer while running "prodigy ner.batch-train"?

Tokens containing \n are totally fine – it's only a problem if labelled entity spans in the "spans" start or end with a newline token, or consist only of newline tokens. This is an explicit change to the entity recognizer in spaCy v2.1 to make it more accurate and to prevent it from predicting entity spans like this, which are usually never what you want.

So if your data contains entries in the "spans" that are invalid like that, you should be able to just remove them and then re-import the edited data to a new dataset.
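A quick filtering sketch could look something like this (the file names exported.jsonl and cleaned.jsonl are placeholders; export the dataset with db-out first and re-import the cleaned file with db-in):

import json

with open("exported.jsonl", encoding="utf8") as f, open("cleaned.jsonl", "w", encoding="utf8") as out:
    for line in f:
        record = json.loads(line)
        # Drop spans that start or end on whitespace, or consist only of whitespace.
        spans = []
        for span in record.get("spans", []):
            span_text = record["text"][span["start"]:span["end"]]
            if span_text.strip() and span_text == span_text.strip():
                spans.append(span)
        record["spans"] = spans
        out.write(json.dumps(record) + "\n")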

Thanks for your reply. I found 3 spans with whitespace and \n, removed them and reloaded the data into a completely new dataset. I am still getting the same error.

Just for testing, I tried it with the top4 dataset and didn't get any error.

python -m prodigy ner.batch-train top4_ner en_core_web_sm --output top4-model --eval-split 0.2 --n-iter 6 --dropout 0.2
The output for top4:
17:09:14 - MODEL: Merging entity spans of 0 examples
17:09:14 - MODEL: Using 0 examples (without 'ignore')
17:09:14 - MODEL: Evaluated 0 examples
06 1884.572 0 0 0 0 0.000

Correct 0
Incorrect 0
Baseline 0.000
Accuracy 0.000

17:09:14 - RECIPE: Restoring disabled pipes: ['tagger', 'parser']

Model: C:\top4-model
Training data: C:\top4-model\training.jsonl

Does that look alright to you? If so, should I redo the entire annotation for those 100?