Match Pattern Converter: Dataframe to JSON

JanF · May 16, 2021, 7:19am

Hi there,

Sorry, another newbie question.

Most likely you already have a working solution for the following task, but I couldn't find one on your homepage or through Google.

I have a simple dataframe with single and multi-word terms that I want to use as match patterns for seeding my annotation process.

So I need to convert the dataframe to the appropriate JSON format.
Just using ".to_json" won't work, since I need the multi-word terms split into separate nested "lower" entries (I really need them).

So, sticking to your homepage example: I have a dataframe with two columns and maaaany seed words:

label pattern
FASHION_BRAND Ann Taylor
FASHION_BRAND Naked And Famous
FASHION_BRAND TOPSHOP
.....(Thousands more)

that I need converted to your JSON pattern format (no vectorisation or anything):

{"label": "FASHION_BRAND", "pattern": [{"lower": "ann"}, {"lower": "taylor"}]}
{"label": "FASHION_BRAND", "pattern": [{"lower": "topshop"}]}
{"label": "FASHION_BRAND", "pattern": [{"lower": "naked"}, {"lower": "and"}, {"lower": "famous"}]}

I guess you have a working function/solution for this task that you could point me to?

Thanks,
Jan

SofieVL · May 16, 2021, 8:34pm

Hi Jan,

The key here will be to make sure the tokens in your patterns match the tokenization method you'll use during annotation. So, let's say you'll be using an English model, then you can rely on the same model to split your seed terms into tokens and construct the pattern dictionary.

Something like this:

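from spacy.lang.en import English

nlp = English()
# itertuples(index=False) yields one (label, term) tuple per row;
# iterating the dataframe directly would only yield its column names.
# This assumes the dataframe has exactly two columns: label, then pattern.
for col1, col2 in dataframe.itertuples(index=False):
    # Tokenize the seed term with the same tokenizer used during annotation
    tokens = [t.text for t in nlp(col2)]
    pattern = [{"lower": token.lower()} for token in tokens]
    pattern_dict = {"label": col1, "pattern": pattern}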

Then you can write that to file.

Depending on how you're loading your data, you could also run multiple seed words through your nlp object at once with nlp.pipe, which will typically be faster (if speed turns out to be an issue for this conversion).
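A minimal sketch of the nlp.pipe variant, assuming the seed terms are available as (label, term) pairs. The helper name rows_to_patterns and the sample rows are made up for illustration; as_tuples=True lets each label travel alongside its text as context.

```python
from spacy.lang.en import English


def rows_to_patterns(rows, nlp):
    """Yield one pattern dict per (label, term) row, tokenizing in batches."""
    # nlp.pipe expects (text, context) pairs when as_tuples=True,
    # so the order is swapped: the term is the text, the label is the context.
    pairs = ((term, label) for label, term in rows)
    for doc, label in nlp.pipe(pairs, as_tuples=True):
        yield {"label": label, "pattern": [{"lower": t.lower_} for t in doc]}


nlp = English()  # tokenizer only, no trained components
# Hypothetical sample rows; with pandas these could come from
# dataframe.itertuples(index=False)
rows = [("FASHION_BRAND", "Ann Taylor"), ("FASHION_BRAND", "TOPSHOP")]
patterns = list(rows_to_patterns(rows, nlp))
for p in patterns:
    print(p)
```

For a few thousand short terms the difference is probably small, so this is mainly worth it for much larger term lists.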

Is that what you're looking for?

JanF · May 24, 2021, 8:54am

Thank you so much, Sofie, that worked.
I only needed to make a slight adjustment to the code (see below, in case anyone else ever reads/needs this).

import json
import codecs
# In this case, German text is processed
from spacy.lang.de import German

nlp = German()
# Write one JSON object per line, UTF-8 encoded
with codecs.open('match_pattern.csv', 'w', 'utf-8') as fp:
    for col1, col2 in dataframe[['Label', 'Name']].itertuples(index=False):
        tokens = [t.text for t in nlp(col2)]
        pattern = [{"lower": token.lower()} for token in tokens]
        pattern_dict = {"label": col1, "pattern": pattern}
        # ensure_ascii=False keeps umlauts and other non-ASCII characters readable
        fp.write(json.dumps(pattern_dict, ensure_ascii=False)+'\n')

This generates one match pattern per line and writes them to the output file (the content is newline-delimited JSON, even though the file is named .csv).
The tokenization applied to the match patterns seems to be pretty much word-based.

But that brings up another newbie question, sorry:

In my particular case, I would like to train (>>for now<<) transformer-based models to recognise a new NER type, "Skill" (for my master's thesis).

For that, I scraped raw natural-language data that I would like to pre-annotate with the match patterns generated above and save as my >>model-neutral<< baseline dataset.

Afterwards I'd like to feed that training/test data to different NER frameworks (e.g. AutoNLP by Hugging Face or the T-NER framework) to train all different kinds of (for now) transformer-based models automatically.

So, as with other standard datasets like CoNLL 2003, I assume these are platform-neutral and use something like a word-based tokenization that every other model can easily build on?
Otherwise there would always have to be something like a mapping layer to reformat the data for the tokenization strategy of each language model.

So is it also possible to first build a dataset that is pre-annotated as neutrally as possible, save it in a platform-independent format (e.g. CoNLL 2003), and then use it as training input, adapting to the specific tokenization strategy of each model "on the fly"?

SofieVL · May 26, 2021, 10:21am

If you annotate spans with Prodigy, you need to define your tokenization method because the spans will "snap" to token boundaries. This is extremely efficient for doing the actual annotation: you just double-click a word and the correct span including that token is highlighted.

Also for defining the patterns, like I said, they need to match the tokenization.

If you run Prodigy and define a "blank" model like blank:en, it'll pull up the tokenization as defined in the corresponding spaCy Language class, for instance https://github.com/explosion/spaCy/blob/master/spacy/lang/en/__init__.py defines the tokenizer exceptions, infixes etc. You're right that this is mostly based on separating on spaces, though there are some special cases, too.
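To illustrate those special cases, here is a small sketch comparing a plain whitespace split with the blank English tokenizer. The helper name spacy_tokens and the sample terms are made up for illustration; English() is assumed to apply the same tokenization rules that blank:en would use.

```python
from spacy.lang.en import English

nlp = English()  # blank English pipeline: tokenizer only, no trained model


def spacy_tokens(text):
    """Lowercased token texts as produced by the blank English tokenizer."""
    return [t.lower_ for t in nlp(text)]


# Hypothetical seed terms: the first splits on spaces only,
# the second triggers a tokenizer exception for the contraction.
for term in ["Naked And Famous", "Don't Ask"]:
    print(term, "| split():", term.lower().split(), "| spaCy:", spacy_tokens(term))
```

A pattern built from str.split() would contain "don't" as a single token and so would never match text that the tokenizer splits into "do" and "n't", which is why the patterns are best generated with the tokenizer itself.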

The good news is, the output will always contain the results of that tokenization. If you use prodigy db-out my_dataset you'll get the annotations in JSONL format (https://prodi.gy/docs/recipes#db-out) and you'll see something like this as part of each annotation (added newlines for readability):

{"text":"Berlin is so much nicer than London.",
"tokens":[{"text":"Berlin","start":0,"end":6,"id":0,"ws":true},
{"text":"is","start":7,"end":9,"id":1,"ws":true},
{"text":"so","start":10,"end":12,"id":2,"ws":true},
{"text":"much","start":13,"end":17,"id":3,"ws":true},
{"text":"nicer","start":18,"end":23,"id":4,"ws":true},
{"text":"than","start":24,"end":28,"id":5,"ws":true},
{"text":"London","start":29,"end":35,"id":6,"ws":false},
{"text":".","start":35,"end":36,"id":7,"ws":false}],...}

This means that you're able to reconstruct the original tokens as defined by the tokenizer, and you should be able to use that for conversion to any other tool/tokenizer as well.
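As a rough sketch of such a conversion, the token offsets can be turned into CoNLL-style BIO tags with plain Python. The "tokens" entries below are copied from the example above; the "spans" entries with character offsets "start"/"end" and a "label" are an assumption about what an annotated record contains, and the GPE label is only illustrative.

```python
def rebuild_text(tokens):
    """Join token texts, adding a space wherever "ws" is true."""
    return "".join(t["text"] + (" " if t["ws"] else "") for t in tokens)


def to_bio(tokens, spans):
    """Return (token_text, tag) pairs in BIO format."""
    tags = ["O"] * len(tokens)
    for span in spans:
        # A token belongs to a span if its character range lies inside the span
        covered = [i for i, t in enumerate(tokens)
                   if t["start"] >= span["start"] and t["end"] <= span["end"]]
        for n, i in enumerate(covered):
            tags[i] = ("B-" if n == 0 else "I-") + span["label"]
    return [(t["text"], tag) for t, tag in zip(tokens, tags)]


tokens = [
    {"text": "Berlin", "start": 0, "end": 6, "id": 0, "ws": True},
    {"text": "is", "start": 7, "end": 9, "id": 1, "ws": True},
    {"text": "so", "start": 10, "end": 12, "id": 2, "ws": True},
    {"text": "much", "start": 13, "end": 17, "id": 3, "ws": True},
    {"text": "nicer", "start": 18, "end": 23, "id": 4, "ws": True},
    {"text": "than", "start": 24, "end": 28, "id": 5, "ws": True},
    {"text": "London", "start": 29, "end": 35, "id": 6, "ws": False},
    {"text": ".", "start": 35, "end": 36, "id": 7, "ws": False},
]
# Assumed span format (character offsets plus label), for illustration only
spans = [{"start": 0, "end": 6, "label": "GPE"},
         {"start": 29, "end": 35, "label": "GPE"}]

print(rebuild_text(tokens))
for text, tag in to_bio(tokens, spans):
    print(f"{text}\t{tag}")  # one token and tag per line, CoNLL-style
```

Because the tags are derived from character offsets rather than token indices, the same spans could be projected onto a different tokenizer's output as long as that tokenizer also reports character offsets.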

Within spaCy, we provide built-in support for aligning different tokenizations through the Example class, cf. for instance https://spacy.io/api/example#get_aligned.

I can't help you with other libraries off the top of my head, but if you run into any specific issues with spaCy and tokenization alignment, feel free to open a discussion thread here: Discussions · explosion/spaCy · GitHub

JanF · June 2, 2021, 12:03pm

Thank you again for your help, Sofie.
It's very much appreciated and needed ^^.
One last question (for now :D):

It seems spaCy v3 offers advanced transformer handling & pipelines. I guess I might benefit from this functionality in my attempts to train a transformer model.
I have to run spaCy v2.2 for Prodigy, since I am not part of the nightly release program. Is there a way to get hold of a v3-compatible version, or do you suggest I stick with v2.2 for the time being?

SofieVL · June 2, 2021, 9:07pm

Hi Jan,

The transformer-based pipelines were rewritten significantly for spaCy 3. You can still use the old version of spacy-transformers (0.x), but I do think it has improved significantly with 1.x for spaCy 3.x.

Have you tried applying for the Prodigy nightly program? It's open to all users of the latest version, v1.10.x.

JanF · June 3, 2021, 8:28am

Yes, I didn't get an email though, so I guess I wasn't accepted?

SofieVL · June 3, 2021, 1:54pm

I just checked internally with the team, and you should have gotten an email with an invite some time ago (in reply to your original request). We also resent it to you today.

If you haven't gotten it - can you double check your spam folder?

JanF · June 4, 2021, 7:02am

Thanks! Got it now!!
