Text corpus .txt file to json/spacy format file

I am not sure whether someone else has already asked this question.
I am new to Prodigy and spaCy. I am using the Prodigy nightly because of its compatibility with spaCy 3.x. I want to train an NER model that generalizes and detects operational parameters such as temperature and pressure.
Is the first step to convert the training text corpus (available in .txt format) into a .json file, or is it possible to go directly to the .spacy format, since spaCy 3.x uses the new .spacy format?
Could you please help me with an easy way to perform this step? I would also like each span to be tokenized automatically.
Expected result:
"text": "This is a text about Facebook.",
"spans": [{"start": 21, "end": 29, "label": "ORG"}]

Hi! The solution here kinda depends on how your .txt file is structured – is it just plain text that you want to annotate? If so, you can just load it into Prodigy directly, and it will be read in line-by-line. You can then annotate the data and export it for training with spaCy.

If your .txt file includes the annotations, you'd need to extract the text and annotations from it – how you do that depends on how it's formatted. If your existing data only contains token-based BILUO tags instead of character offsets, you can use spaCy's helper functions to convert them to offsets: https://prodi.gy/docs/named-entity-recognition#tip-biluo-offsets
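As a rough sketch of that BILUO conversion, assuming spaCy 3.x (where the helper lives in spacy.training): the words, tags and the TEMPERATURE label below are made-up illustration data, not taken from your file.

```python
import spacy
from spacy.tokens import Doc
from spacy.training import biluo_tags_to_offsets

nlp = spacy.blank("en")

# Hypothetical pre-tokenized example with token-level BILUO tags
words = ["Heat", "to", "350", "K", "."]
spaces = [True, True, True, False, False]  # whether each token is followed by a space
tags = ["O", "O", "B-TEMPERATURE", "L-TEMPERATURE", "O"]

# Build a Doc from the given tokens so that character offsets can be computed
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Convert the token-based tags to (start_char, end_char, label) tuples
offsets = biluo_tags_to_offsets(doc, tags)

# Reshape into the "text" / "spans" format shown in the question
example = {
    "text": doc.text,
    "spans": [{"start": s, "end": e, "label": label} for s, e, label in offsets],
}
print(example)
```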

If you want to convert your annotations to a .spacy file for training with spaCy directly, check out this docs section: https://spacy.io/usage/training#data-convert Under the hood, the .spacy file is a serialized DocBin, a collection of spaCy Doc objects. So you'd just need to create one Doc object for each example in your data, and set the annotations you want to use.
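Here is a minimal sketch of what that could look like, assuming spaCy 3.x and examples that are already in the "text" / "spans" format from the question. The function name examples_to_docbin and the output path ./train.spacy are just placeholders I picked for illustration.

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical input data, in the format shown in the question
examples = [
    {
        "text": "This is a text about Facebook.",
        "spans": [{"start": 21, "end": 29, "label": "ORG"}],
    },
]


def examples_to_docbin(examples, nlp):
    """Create one Doc per example, set its entities and collect them in a DocBin."""
    doc_bin = DocBin()
    for eg in examples:
        doc = nlp.make_doc(eg["text"])  # tokenize only, no pipeline components needed
        ents = []
        for span in eg.get("spans", []):
            ent = doc.char_span(span["start"], span["end"], label=span["label"])
            if ent is None:
                # The character offsets don't line up with token boundaries
                print("Skipping misaligned span:", span)
                continue
            ents.append(ent)
        doc.ents = ents
        doc_bin.add(doc)
    return doc_bin


nlp = spacy.blank("en")
doc_bin = examples_to_docbin(examples, nlp)
doc_bin.to_disk("./train.spacy")  # placeholder output path
```

The resulting file can then be referenced as the training or development data in your spaCy training config. Spans that don't align with the tokenization are skipped here, so it's worth checking how many get dropped.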

Thank you for your prompt response.
Yes, the input data is in .txt format without annotations. I am glad to know that Prodigy can load a .txt file directly for annotation. I will go through the links you provided and get back to you if I have any questions.

Many thanks

As you mentioned, Prodigy can annotate plain text from a .txt file. Running the following command in the terminal:
prodigy ner.manual annotateddata blank:en ./data1.txt --label temperature, pressure
Gives me
(result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 565: invalid start byte

Any suggestions?

Best regards
Zaid Kamil

This sounds like an encoding issue with your file. If you google something like "change file encoding utf8" plus your operating system, you should find instructions to check whether your file is UTF-8 and how to change the encoding if it's not.
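If you'd rather do it in Python, here is a small sketch that tries a few candidate encodings and writes a UTF-8 copy of the file. The function name convert_to_utf8 and the list of candidates are my own assumptions: I don't know which encoding your file actually uses, and cp1252 is only a common guess for files created on Windows. Note that latin-1 can decode any byte sequence, so it always "succeeds" as a last resort, and you should check the output for odd characters.

```python
from pathlib import Path


def convert_to_utf8(src, dst, candidates=("utf-8", "cp1252", "latin-1")):
    """Decode src with the first candidate encoding that works and write dst as UTF-8.

    Returns the name of the encoding that was used to read the file.
    """
    raw = Path(src).read_bytes()
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate can't decode the file, try the next one
        # Write bytes directly so that line endings are left untouched
        Path(dst).write_bytes(text.encode("utf-8"))
        return encoding
    raise ValueError(f"None of the candidate encodings could decode {src}")
```

You could then call something like convert_to_utf8("./data1.txt", "./data1_utf8.txt") and point Prodigy at the new file.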

Thank you, kindly.