.txt Source Loader for ner.teach

@ines, Hi Ines. I have a super basic question about ner.teach and another question about a permission issue.

I am annotating a dataset about industrial equipment. I created the patterns through terms.teach. Now I am at a stage where I need to use ner.teach on a basic dataset, descriptions.txt.

At the moment this source file has one sentence at every line, e.g. "gearbox for plant". Do I need to do any special formatting other than have one sentence per row/line?

The other issue I am running into: when I run the ner.teach command
python -m prodigy ner.teach rotating_ner en_core_web_lg Train --loader txt --label Rotating --patterns rotating_patterns.jsonl
I have the text file in the Train folder, I have full admin rights, and I run the CLI as admin on a Windows machine. But I keep getting the error PermissionError: [Errno 13] Permission denied: Train
Any insights and help on this will be greatly appreciated.


I solved the permission issue. For some reason I assumed that when I specify the source I only point to the directory (in this case Train), but the command should actually point to the file:
`python -m prodigy ner.teach rotating_ner en_core_web_lg Train/mysourcedata --loader txt --label etc...`

Insights on txt file format will be greatly appreciated.

Yes, the format sounds fine – the .txt file should have one text per line :slightly_smiling_face: At some point later on, you might want to consider using a more flexible format like .jsonl (newline-delimited JSON). This can be read in line by line as well (so you can work with larger files), but it gives you the flexibility of JSON to store arbitrary metadata with the texts.
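A quick sketch of what that could look like, in case it helps: each line of a .jsonl file is one JSON object with a "text" key, plus whatever metadata you want to carry along (the file name and meta fields below are made up for illustration). Because it's one object per line, it can be read back in a streaming fashion:

```python
import json
import tempfile
from pathlib import Path

# Illustrative examples only -- the source file name and meta fields
# are invented for this sketch.
examples = [
    {"text": "gearbox for plant", "meta": {"source": "descriptions.txt", "line": 1}},
    {"text": "rotary compressor unit", "meta": {"source": "descriptions.txt", "line": 2}},
]

path = Path(tempfile.gettempdir()) / "descriptions.jsonl"
with path.open("w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Reading happens one line at a time, so large files stream fine.
with path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["text"])  # gearbox for plant
```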

The loader expects a path to a single file, not a directory. So that might be the problem here. If the problem still occurs with the file, maybe also check that the path is correct and that its permissions are set correctly and you have read access etc.?
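For anyone hitting similar errors, a small diagnostic sketch along those lines (the `check_source` helper is hypothetical, not part of Prodigy, and the demo uses a throwaway directory standing in for Train/):

```python
import os
import tempfile
from pathlib import Path

# Hypothetical helper (not part of Prodigy): diagnose why a loader might
# fail on a source path -- it expects a readable file, not a directory.
def check_source(source: str) -> str:
    path = Path(source)
    if not path.exists():
        return "path does not exist"
    if path.is_dir():
        return "path is a directory, not a file"
    if not os.access(path, os.R_OK):
        return "file exists but is not readable"
    return "ok"

# Demonstration with a throwaway directory standing in for Train/:
train_dir = Path(tempfile.mkdtemp())
(train_dir / "mysourcedata.txt").write_text("gearbox for plant\n")

print(check_source(str(train_dir)))                       # path is a directory, not a file
print(check_source(str(train_dir / "mysourcedata.txt")))  # ok
```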

@ines, Thank you for the feedback. I managed to almost make it all work. However, the UI gives me one sentence to annotate (I have about 4000), and after that one it says no tasks available. What do you think is driving that?
is the text file format something like the below:

sky is blue
sky is cloudy

Or should the format be anything different? The same question applies to a newline-delimited JSONL file: what would a typical structure look like?

What's in your data and how does it look? And which labels are you annotating? The ner.teach recipe will try and find the most relevant examples for annotation based on the labels you choose. So it may skip examples in favour of others. If the model never predicts the given label, you may also not see any suggestions.

If you just want to go through all examples in your data and label them, you might want to try the ner.manual or ner.correct workflows instead. See here for details: https://prodi.gy/docs/named-entity-recognition#manual

Thanks again for replying @ines. The issue turned out to be in how the pattern file was generated from terms.teach. It defaulted to something like this:
{"label":"Rotating","pattern":"Compressors"}, but in the dataset I was doing ner.teach on, most words were in CAPS, so once I added lower it took care of that.
Is there a compelling reason why terms.teach does not generate patterns that automatically look like this: {"label":"Rotating","pattern":[{"lower":"compressors"}]}?

Glad it's working now :slightly_smiling_face:

terms.teach will just use whatever is in the vectors. When you create the patterns using terms.to-patterns, you can specify a spaCy model that's used for tokenization and by default, those patterns will then be case-insensitive and use the token.lower_ property. So instead of "pattern": "Compressors", it'd produce "pattern": [{"lower": "compressors"}].
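The conversion can be sketched in plain Python. This helper is purely illustrative (it's not Prodigy's implementation, and it splits on whitespace as a stand-in for real spaCy tokenization), but it shows the mapping from a string pattern to a case-insensitive token pattern:

```python
import json

# Illustrative helper (not part of Prodigy): rewrite a string pattern into
# a case-insensitive token pattern. Whitespace splitting stands in for
# real spaCy tokenization here.
def to_lower_token_pattern(entry: dict) -> dict:
    tokens = entry["pattern"].split()
    return {
        "label": entry["label"],
        "pattern": [{"lower": token.lower()} for token in tokens],
    }

result = to_lower_token_pattern({"label": "Rotating", "pattern": "Compressors"})
print(json.dumps(result))
# {"label": "Rotating", "pattern": [{"lower": "compressors"}]}
```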

Thank you again. I wonder, from a design perspective, whether it makes sense to add the lower pattern automatically when a user does terms.teach and selects a spaCy model there, like I did. That way, when one does terms.to-patterns, it would by default use the spaCy model set up through terms.teach unless the user selects something different.
I suppose that's a feature request.