.txt Source Loader for ner.teach

@ines, Hi Ines. I have a super basic question about ner.teach and another question about a permission issue.

I am annotating a dataset about industrial equipment. I created the patterns through terms.teach. Now I am at a stage where I need to use ner.teach on a basic dataset, descriptions.txt.

At the moment this source file has one sentence at every line, e.g. "gearbox for plant". Do I need to do any special formatting other than have one sentence per row/line?

The other issue I am running into: when I run the ner.teach command
python -m prodigy ner.teach rotating_ner en_core_web_lg Train --loader txt --label Rotating --patterns rotating_patterns.jsonl
I have the text file in the Train folder, I have full admin rights, and I run the CLI as admin on a Windows machine. But I keep getting the error PermissionError: [Errno 13] Permission denied: Train
Any insights and help on this will be greatly appreciated.


I solved the permission issue. For some reason I assumed that when I specify the source I only point to the directory (in this case Train), but the command should actually point to the file:
`python -m prodigy ner.teach rotating_ner en_core_web_lg Train/mysourcedata --loader txt --label etc...`

Insights on txt file format will be greatly appreciated.

Yes, the format sounds fine – the .txt file should have one text per line :slightly_smiling_face: At some point later on, you might want to consider using a more flexible format like .jsonl (newline-delimited JSON). This can be read in line by line as well (so you can work with larger files), but it gives you the flexibility of JSON to store arbitrary metadata with the texts.
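A quick sketch of what that could look like, in case it helps: each line of a .jsonl file is one JSON object with a "text" key, plus whatever metadata you want to carry along (the file name and meta fields below are made up for illustration). Because it's one object per line, it can be read back in a streaming fashion:

```python
import json
import tempfile
from pathlib import Path

# Illustrative examples only -- the source file name and meta fields
# are invented for this sketch.
examples = [
    {"text": "gearbox for plant", "meta": {"source": "descriptions.txt", "line": 1}},
    {"text": "rotary compressor unit", "meta": {"source": "descriptions.txt", "line": 2}},
]

path = Path(tempfile.gettempdir()) / "descriptions.jsonl"
with path.open("w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Reading happens one line at a time, so large files stream fine.
with path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["text"])  # gearbox for plant
```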

The loader expects a path to a single file, not a directory. So that might be the problem here. If the problem still occurs with the file, maybe also check that the path is correct and that its permissions are set correctly and you have read access etc.?
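For anyone hitting similar errors, a small diagnostic sketch along those lines (the `check_source` helper is hypothetical, not part of Prodigy, and the demo uses a throwaway directory standing in for Train/):

```python
import os
import tempfile
from pathlib import Path

# Hypothetical helper (not part of Prodigy): diagnose why a loader might
# fail on a source path -- it expects a readable file, not a directory.
def check_source(source: str) -> str:
    path = Path(source)
    if not path.exists():
        return "path does not exist"
    if path.is_dir():
        return "path is a directory, not a file"
    if not os.access(path, os.R_OK):
        return "file exists but is not readable"
    return "ok"

# Demonstration with a throwaway directory standing in for Train/:
train_dir = Path(tempfile.mkdtemp())
(train_dir / "mysourcedata.txt").write_text("gearbox for plant\n")

print(check_source(str(train_dir)))                       # path is a directory, not a file
print(check_source(str(train_dir / "mysourcedata.txt")))  # ok
```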

@ines, Thank you for the feedback. I managed to almost make it all work. However, the UI gives me one sentence to annotate (I have about 4000), and after that one it says no tasks available. What do you think is driving that?
is the text file format something like the below:

sky is blue
sky is cloudy

Or should the format be anything different? The same question applies to a newline-delimited JSONL file: what would a typical structure look like?

What's in your data and how does it look? And which labels are you annotating? The ner.teach recipe will try and find the most relevant examples for annotation based on the labels you choose. So it may skip examples in favour of others. If the model never predicts the given label, you may also not see any suggestions.

If you just want to go through all examples in your data and label them, you might want to try the ner.manual or ner.correct workflows instead. See here for details: https://prodi.gy/docs/named-entity-recognition#manual

Thanks again for replying @ines. The issue turned out to be in how the pattern file was generated from terms.teach. It defaulted to something like this:
{"label":"Rotating","pattern":"Compressors"}, but in the dataset I was doing ner.teach on, most words were in CAPS, so once I added lower it took care of that.
Is there a compelling reason why terms.teach does not generate patterns that automatically look like this: {"label":"Rotating","pattern":[{"lower":"compressors"}]}?

Glad it's working now :slightly_smiling_face:

terms.teach will just use whatever is in the vectors. When you create the patterns using terms.to-patterns, you can specify a spaCy model that's used for tokenization and by default, those patterns will then be case-insensitive and use the token.lower_ property. So instead of "pattern": "Compressors", it'd produce "pattern": [{"lower": "compressors"}].
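The conversion can be sketched in plain Python. This helper is purely illustrative (it's not Prodigy's implementation, and it splits on whitespace as a stand-in for real spaCy tokenization), but it shows the mapping from a string pattern to a case-insensitive token pattern:

```python
import json

# Illustrative helper (not part of Prodigy): rewrite a string pattern into
# a case-insensitive token pattern. Whitespace splitting stands in for
# real spaCy tokenization here.
def to_lower_token_pattern(entry: dict) -> dict:
    tokens = entry["pattern"].split()
    return {
        "label": entry["label"],
        "pattern": [{"lower": token.lower()} for token in tokens],
    }

result = to_lower_token_pattern({"label": "Rotating", "pattern": "Compressors"})
print(json.dumps(result))
# {"label": "Rotating", "pattern": [{"lower": "compressors"}]}
```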

Thank you again. I wonder, from a design perspective, whether it makes sense to add the lower pattern automatically when a user does terms.teach and selects a spaCy model there, like I did. That way, when one does terms.to-patterns, it would by default use the spaCy model set up through terms.teach unless the user selects something different.
I suppose that's a feature request.