Text classification

Hi guys,

After successfully training an NER model with the wonderful Prodigy, I want to classify content in my dataset as observational or not (I have sentences that I want to either annotate as observational sentences, or use to edit my existing test labels).

I have created the JSONL data using this script:

from prodigy.util import write_jsonl  # Prodigy's JSONL writer

texts = sents  # list of sentence strings
options = [
    {"id": 0, "text": "negative"},
    {"id": 1, "text": "positive"},
]

examples = []
for text in texts:
    task = {"text": text, "options": options}
    examples.append(task)

write_jsonl("dfObs_01.jsonl", examples)

I want to either annotate the data in your interface or edit existing annotations. For the former, I used this command:

! python -m prodigy db-in dfObs dfObs_01.jsonl

Then, to classify and see the labels, I ran this:

!python -m prodigy textcat.teach df_obs en_core_web_sm --label Observation

but the interface just shows "Loading..." and never displays the sentences. Am I on the right track?
If I want to use my existing labels (y) and then modify them, how can I do that?
I am familiar with NER and want to do something similar with text classification (edit labels, train a model). I would appreciate your response.

Hi! I think the problem here is that you've imported your data into a Prodigy dataset, which holds the collected annotations. What you want to do instead is take your JSONL file and load it in as the source you're annotating in textcat.teach, e.g. as the third argument:

!python -m prodigy textcat.teach new_dataset en_core_web_sm dfObs_01.jsonl --label Observation

What do you want to do with your options? The textcat.teach recipe will only render the text with a given label, so if you want multiple-choice options, it sounds like the textcat.manual workflow is a better fit.

Also make sure you're using a new dataset to save the annotations to (otherwise, they'll be added to your previous data, which makes things messy). The reason you only saw "Loading..." btw is that Prodigy supports leaving out the source argument and piping data forward from a previous process. Because no source was set, it was basically waiting to read data from standard input.
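To make that concrete – a minimal sketch, not Prodigy's actual loader code – a JSONL loader with no source argument essentially does this, blocking on standard input:

```python
import json
import sys

def stream_tasks(source):
    # Read one JSON task per line, skipping blank lines. With no
    # source argument on the command line, `source` is sys.stdin,
    # so the server sits on "Loading..." until data is piped in.
    for line in source:
        line = line.strip()
        if line:
            yield json.loads(line)

# Typical use: stream_tasks(sys.stdin), or stream_tasks(open("data.jsonl"))
```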

I tried to start building the workflow as follows.

First, work on my JSONL data (without labels) with this:

python -m prodigy textcat.manual dfObsV0003 en_core_web_sm dfObsV03.jsonl --label Observation
Using 1 labels: Observation

but I ran into this error:

✨  ERROR: Invalid task format for view ID 'classification'
'label' is a required property

{'text': 'Chapter 1', '_input_hash': 1891558552, '_task_hash': -2011871074, '_session_id': 'dfObsV0003-default', '_view_id': 'classification'}

As you mentioned here,

that is a bug. I added the edited version of the script (since the original had an indentation error):

def add_label_to_stream(stream, label):
    # Attach the single label to every incoming example
    for eg in stream:
        eg["label"] = label[0]
        yield eg

if has_options:
    stream = add_label_options(stream, label)
else:
    stream = add_label_to_stream(stream, label)

to the end of textcat.manual, but it does not work. I also added it to the end of the
manual recipe... again, it does not work!
Am I making a mistake? Could it be related to my JSONL data? Can you give me a step-by-step way to manually annotate my data and then train a model based on my annotations?

Hi @ines
Morning! Good news! I managed to run
textcat.manual on raw text without labels in order to gather labels, as follows:

python -m prodigy textcat.manual dfobsv02 en_core_web_sm dfObsV02.jsonl --label Observational,Nonobservational --exclusive

Now the annotator can manually separate observational sentences from nonobservational sentences.

My question is: if I have predefined labels (y values, basically 0 and 1), how can I run textcat.manual to edit them (similar to ner.manual)?
I think I need to change the format of my data, since it currently looks like this:

{"text":"my text."}
{"text":"my test 02."}
{"text":"my test 03."}
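One way to do this – a sketch with made-up label names and file names – is to convert each y value into a pre-selected option via the task's "accept" field, so the choice interface shows the existing label already ticked and the annotator only has to correct it:

```python
import json

# Hypothetical mapping from the binary y values to readable label names
LABELS = {0: "Nonobservational", 1: "Observational"}

def make_tasks(texts, y_values):
    options = [{"id": name, "text": name} for name in LABELS.values()]
    for text, y in zip(texts, y_values):
        yield {
            "text": text,
            "options": options,
            # Pre-select the predefined label; the annotator can change it
            "accept": [LABELS[y]],
        }

tasks = list(make_tasks(["my text.", "my test 02."], [1, 0]))
with open("dfObs_labelled.jsonl", "w", encoding="utf8") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")
```

You would then load the resulting file as the source for a choice-based recipe.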

My second question is: can I see my NER labels while I classify the sentences in textcat.manual? How can I use my NER labels to help improve classification?

And my last question (maybe it is not related to this thread): how can I add relations between entities?
I have read this

still, I don't know how to use my NER data annotated with Prodigy to create a training set for relation extraction.

I really like Prodigy; I have learned a lot here.
Best

Hi,

I would be very thankful if you could answer the questions in my last comment.

many many thanks

I'm a bit confused by what you're trying to do. Do you want to assign both top-level categories and highlight spans in the text? If so, you probably want to do this in two different steps.

Do you mean, use the predicted entity spans as features in the text classifier? Not by default, if you're using spaCy's text classification implementation – you'd probably have to build something custom. That said, the entities being present in the text will likely still have an impact: if texts containing entity X are typically about Y, the text classifier can pick up on that based on the words occurring in the text.

This depends on what data you need to train your relation extraction model. One approach could be to stream in pairs of entities that are close in the text and then annotate their relations using the choice interface, as described in the thread you linked.

"
I'm a bit confused by what you're trying to do. Do you want to assign both top-level categories and highlight spans in the text? If so, you probably want to do this in two different steps.
"

I want to use the classification interface (textcat) and also see the results of my NER on screen. Is that possible?

"
This depends on what data you need to train your relation extraction model. One approach could be to stream in pairs of entities that are close in the text and then annotate their relations using the choice interface, as described in the thread you linked.
"

How can I start this choice interface?


Another, different idea:

I want to create large relation-labelled datasets (including causal relations) for supervised machine learning, in the form NP1-VERB-NP2.

I have extracted the verbs following the reverb instructions.

Then I want to build a dataset like this:

a FUNCTION relation between airplane and transportation in “the airplane is used for transportation”

a PART-WHOLE relation in “the car has an engine”

an ACQUISITION relation between named entities in “Yahoo has made a definitive agreement to acquire Flickr”

Here, one piece is missing: finding the closest noun phrases to the right and left of the pattern that gave me the verb. Then I want to annotate each pair in the sentences with a relation such as "causal relation", "FUNCTION relation", or others.

For this, I first need to find the noun phrases to the right and left of the pattern (the verbs).

Then I have something like:

"text", "e1", "e2", "relation"

I want to assign these relations in Prodigy, then train a model on them :slight_smile:

My question is: how can I get started in Prodigy?

Do you have any hints on how to find the closest noun phrases to the right and left of the verb?

Have you tried streaming in data that contains the "spans"? I think textcat.teach might reset the existing spans, because it also uses spans to pre-highlight pattern matches. But if you're not using a model in the loop and are just labelling with multiple-choice options, you can leave the pre-annotated spans in the data, add the options, and you should see them highlighted in the interface.

Check out the documentation on custom recipes and interfaces, for example, starting here: https://prodi.gy/docs/workflow-custom-recipes#example-choice You might also want to check out the PRODIGY_README.html, which includes the detailed documentation of the components.

You could start by labelling all the noun phrases you're interested in, either by hand or by using spaCy to pre-select them for you (e.g. via the Doc.noun_chunks, or just by extracting noun tokens). This will give you the spans of tokens and their positions in the text. You could then use spaCy to extract the verbs attached to the nouns – e.g. using the dependency parse. Next, you can stream this information into Prodigy and select the relation using the choice interface. You might have to experiment a bit to see what works best.
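As a rough sketch of that last step (the relation names and the helper are illustrative, not a built-in recipe): once you have an entity pair per sentence, each choice task only needs the text, the two spans, and the candidate relations as options:

```python
import json

# Illustrative relation inventory – replace with your own label scheme
RELATIONS = ["CAUSAL", "FUNCTION", "PART-WHOLE", "ACQUISITION", "NO-RELATION"]

def make_relation_task(text, e1, e2):
    # e1 and e2 are (start, end) character offsets, e.g. taken from
    # spaCy noun chunks via (chunk.start_char, chunk.end_char)
    spans = [
        {"start": e1[0], "end": e1[1], "label": "E1"},
        {"start": e2[0], "end": e2[1], "label": "E2"},
    ]
    options = [{"id": rel, "text": rel} for rel in RELATIONS]
    return {"text": text, "spans": spans, "options": options}

task = make_relation_task("the car has an engine", (0, 7), (15, 21))
print(json.dumps(task))  # one line of your JSONL source
```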
