Hi, I am new to Prodigy and I have used DataTurks a lot for labeling.
I need to extract the organization name, location, email, and contact name from the contact-us page of a given company's HTML file. I am thinking of a workflow like this:
-> download around 50 HTML sources
-> remove noise in the HTML source, e.g. footer, input, img, script, and style elements
-> extract the remaining text and store it in a text file, one line per piece of data (a rough sketch of these two steps is below)
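Something like this sketch with BeautifulSoup is what I have in mind (the tag list and paths are just illustrative):

from bs4 import BeautifulSoup

def clean_html(path):
    # Parse the downloaded HTML source
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # Drop the noisy elements before extracting text
    for tag in soup(["footer", "input", "img", "script", "style"]):
        tag.decompose()
    # Keep one non-empty line of text per block
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)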
Each cleaned HTML text file contains multiple sentences separated by newlines. Now I need to label each text file. I want to use ner.manual to label the data. Can someone clarify a few things for me?
1) How do I uniquely identify and label each document?
2) I need to convert the text files into JSON or JSONL. Do I need to dump each cleaned file into its own JSON file, or keep all 50 HTML files' data in one big JSON file like below?
"data": [
{
"text": "Apple Online Store
Visit the Apple Online Store to purchase Apple hardware, software and third-party accessories. To purchase by phone, please call 0800 048 0408. Lines are open Monday-Friday 08:00-20:00 and Saturday-Sunday 09:00-18:00.
.....................................
................................",
"text": "Helpline & Contact | Samsung UK
By ticking this box, I accept Samsung Service Updates, including : samsung.com Services and marketing information, new product and service announcements as well as special offers, events and newsletters
MOBILE: 24 HOURS, 7 DAYS A WEEK</p><p>ALL OTHER: M-F 8â12AM/S-S 9AMâ11PM, APPLIANCES 6PM ET"
}
]
I am assuming you have 50 docs in total, so you can simply write a function that takes each HTML text file and tags it with a generic name such as "1000.txt", "1001.txt", "1002.txt", "1003.txt", ...
Each sentence (separated by newlines) will still have the same doc label, so long as it falls under the same doc. The answer to Q2 covers the format your data set needs to be in.
Yes, you will dump all 50 cleaned HTML files into one JSONL file. As mentioned above, each HTML file is differentiated from the others by the "source" key it comes from. So it should be in the following format...
{"text": " XXXXXXX ", "meta": {"source" : "1000.txt"}}
Yes, @jsnleong's solution for converting the data should work.
This sounds like you definitely want to frame this as an NER task: label spans of text in your data for the different labels, and then train a model to reproduce this decision. The most straightforward way would be to run ner.manual with your labels:
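For example (the dataset name, base model and label set here are just placeholders):

prodigy ner.manual contact_ner en_core_web_sm ./contact_pages.jsonl --label ORG,LOCATION,EMAIL,PERSON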
Prodigy also encourages you to find more clever ways to automate the annotation so you have to do less work manually. For instance, once you have a pre-trained model that predicts something, you can have the model pre-highlight the entities. That's what workflows like ner.make-gold are designed for.
Sure! You can run the db-out command to export your annotations to a JSONL file, and then use that to train pretty much any model using any framework. Prodigy uses a pretty straightforward JSONL format for the created annotations that should hopefully be very easy to use and work with. Here's an example of an annotated text with an entity:
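The exact values here are illustrative, but an accepted example with one entity span looks roughly like this:

{"text": "Visit the Apple Online Store", "spans": [{"start": 10, "end": 15, "label": "ORG"}], "answer": "accept"}

So something like prodigy db-out your_dataset > annotations.jsonl (names are placeholders) gives you one such record per line.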
LOG:
File "cython_src\prodigy\core.pyx", line 130, in prodigy.core.Controller.get_questions
File "cython_src\prodigy\components\feeds.pyx", line 58, in prodigy.components.feeds.SharedFeed.get_questions
File "cython_src\prodigy\components\feeds.pyx", line 63, in prodigy.components.feeds.SharedFeed.get_next_batch
File "cython_src\prodigy\components\feeds.pyx", line 147, in prodigy.components.feeds.SessionFeed.get_session_stream
ValueError: Error while validating stream: no first example. This likely means that your stream is empty.
Task queue depth is 1
Exception when serving /get_session_questions
Traceback (most recent call last):
File "cython_src\prodigy\components\feeds.pyx", line 140, in prodigy.components.feeds.SessionFeed.get_session_stream
File "C:\anaconda3\lib\site-packages\toolz\itertoolz.py", line 368, in first
return next(iter(seq))
StopIteration
…
When you see the error "Error while validating stream: no first example. This likely means that your stream is empty.", this usually means that there's nothing valid to load and that the incoming stream of examples is empty. What does your_converted_data.jsonl look like? It should be a valid JSONL file with every record containing a "text". For example:
{"text": "hello world"}
{"text": "this is a text"}
What operating system are you on? In any case, you have to set the environment variable PRODIGY_LOGGING to basic, so if you google "set environment variable" plus your OS / environment, it should tell you how to do it.
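For example, in a Windows command prompt that would be:

set PRODIGY_LOGGING=basic

(this only applies to that window, so start Prodigy from the same session), and in a macOS/Linux shell:

export PRODIGY_LOGGING=basic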
My OS: Windows
I set the environment variables below, but they are still not recognized. I restarted the machine too.
PRODIGY_HOME=C:\Users\aaa.bbb\.prodigy
PRODIGY_LOGGING=basic
No, I meant I did set them properly with set in the first place (before your message). Everything is in the environment variables. Still not recognized. Sorry for bothering you.
12:01:15 - GET: /project
Task queue depth is 1
Task queue depth is 1
12:01:15 - POST: /get_session_questions
12:01:15 - FEED: Finding next batch of questions in stream
12:01:15 - CONTROLLER: Validating the first batch for session: data_100-default
12:01:15 - PREPROCESS: Tokenizing examples
12:01:15 - FILTER: Filtering duplicates from stream
12:01:15 - FILTER: Filtering out empty examples for key "text"
Exception when serving /get_session_questions
1) There are newline symbols at the end of each line. Is this common in Prodigy?
2) Also, if there is a paragraph with newlines that I need to tag as COMPANY_INFORMATION, but there is contact information inside the paragraph, then I need to nest one label inside another. The UI is not allowing me to do that. Is it possible to configure Prodigy somewhere to allow that option?
3) Take these three lines:
555
Bloemfontein
South Africa
Can I label those three lines as one label called COMPANY_ADDRESS, or does the address need to be on one line?
There are several things here: Yes, knowing where the newlines are is usually very important when you're annotating named entities. Newlines are tokens, and you never want to accidentally highlight them (without the symbols, they'd be pretty much invisible). You can hide them by setting "hide_true_newline_tokens": true in your prodigy.json, but I usually wouldn't recommend it, because it can easily lead to inconsistent annotations.
Alternatively, you might also consider preprocessing that normalises the whitespace. If you're training a model later on, just make sure to also pre-process your inputs at runtime to make sure it matches the training data.
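A minimal sketch of what such a normalisation could look like (the exact rules are up to you):

import re

def normalise_ws(text):
    # Collapse runs of spaces/tabs into one space, and blank lines into one
    # newline, so the model sees the same whitespace at training and runtime
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n", text)
    return text.strip()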
If you want to train a named entity recognition model (especially with spaCy), training it to predict overlapping spans isn't possible. By definition, a token can only be part of one entity. That's also why you can't highlight overlapping spans. You can always make several passes over the data to capture nested spans, but I'm not sure that's the best solution here. It really depends on what you want to do with the data later on and what statistical model you want to train.
Thanks for your input. I was thinking of replacing newlines with spaces, so the content becomes one big paragraph to label. Do you think that's a good idea? I also need to label company information, which is sometimes multi-line and hard to label.
@mystuff You don't necessarily have to replace all newlines; you'd just have to make sure that the tokenizer produces separate tokens for newlines. For example, by replacing double newlines with single newlines. Or you could add a custom tokenization rule that always splits on \n.
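A sketch of the second option, using spaCy's infix mechanism (the exact rule you need may differ for your data):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
# Add \n as an infix pattern so no token ever spans a newline
infixes = list(nlp.Defaults.infixes) + [r"\n"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer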
When I run the above train command, I get the error below:
File "transition_system.pyx", line 148, in spacy.syntax.transition_system.TransitionSystem.set_costs
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?
It seems there is some whitespace issue, so I followed the post below to fix it.
Still getting the same error. I can see many of these in the dataset: {"text":"\n","start":3359,"end":3360,"id":594}. Do you think it's a tokenization issue? If so, how do I pass a newline tokenizer when running "prodigy ner.batch-train"?
Tokens containing \n are totally fine. It's only a problem if labelled entity spans in the "spans" start or end with a newline token, or consist only of newline tokens. This is an explicit change to the entity recognizer in spaCy v2.1 to make it more accurate and to prevent it from predicting entity spans like this, which are usually never what you want.
So if your data contains entries in the "spans" that are invalid like that, you should be able to just remove them and then re-import the edited data to a new dataset.
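A minimal sketch of that clean-up, assuming an export called export.jsonl (file names are illustrative):

import json

def span_ok(eg, span):
    # Keep a span only if its text is non-empty and doesn't
    # start or end on whitespace (including newlines)
    text = eg["text"][span["start"]:span["end"]]
    return bool(text.strip()) and text == text.strip()

with open("export.jsonl", encoding="utf-8") as f, open("fixed.jsonl", "w", encoding="utf-8") as out:
    for line in f:
        eg = json.loads(line)
        eg["spans"] = [s for s in eg.get("spans", []) if span_ok(eg, s)]
        out.write(json.dumps(eg) + "\n")

Then you can re-import the fixed file to a new dataset with prodigy db-in new_dataset fixed.jsonl.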
Thanks for your reply. I found 3 spans with whitespace and \n, removed them, and reloaded the data into a completely new dataset. Still getting the same error.
Just for testing, I tried with only the top 4 examples and didn't get any error.