Text corpus .txt file to json/spacy format file

zaidkamil · June 26, 2021, 3:36pm

Hi,
I am not sure if someone else has already asked this question.
I am new to Prodigy and spaCy. I have the Prodigy nightly because of its compatibility with spaCy 3.x. I want to train an NER model to generalize and detect operational parameters like temperature and pressure.
The first step is to convert the training text corpus (available in .txt format) into a .json file. Or is it possible to jump directly to the .spacy format, since spaCy 3.x uses the new .spacy format?
Would you please help me with how I can easily perform this step? I also want each span to be tokenized automatically.
Expected result:
{
"text": "This is a text about Facebook.",
"spans": [{"start": 21, "end": 29, "label": "ORG"}]
}

ines · June 28, 2021, 2:13am

Hi! The solution here kinda depends on how your .txt file is structured – is it just plain text that you want to annotate? If so, you can just load it into Prodigy directly, and it will be read in line-by-line. You can then annotate the data and export it for training with spaCy.

If your .txt file includes the annotations, you'd need to extract the text and annotations from it – how you do that depends on how it's formatted. If your existing data only contains token-based BILUO tags instead of character offsets, you can use spaCy's helper functions to convert them to offsets: https://prodi.gy/docs/named-entity-recognition#tip-biluo-offsets
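As a minimal sketch of that conversion, assuming spaCy 3.x is installed and that the tags line up one-to-one with spaCy's tokenization: the sentence and the `ORG` tag below are just the example from the question, and the tag list is made up for illustration.

```python
import spacy
from spacy.training import biluo_tags_to_offsets

# A blank English pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")
doc = nlp("This is a text about Facebook.")

# One BILUO tag per token (illustrative data, not from a real corpus).
# Tokens: This / is / a / text / about / Facebook / .
tags = ["O", "O", "O", "O", "O", "U-ORG", "O"]

# Returns a list of (start_char, end_char, label) tuples.
offsets = biluo_tags_to_offsets(doc, tags)

# Reshape into the "text" + "spans" format shown in the question.
example = {
    "text": doc.text,
    "spans": [{"start": s, "end": e, "label": label} for s, e, label in offsets],
}
print(example)
```

If your tags were produced with a different tokenizer, the token count may not match and you'd have to align the tokens first.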

If you want to convert your annotations to a .spacy file for training with spaCy directly, check out this docs section: https://spacy.io/usage/training#data-convert Under the hood, the .spacy file is a serialized DocBin, a collection of spaCy Doc objects. So you'd just need to create one Doc object for each example in your data, and set the annotations you want to use.
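A rough sketch of what that could look like, assuming spaCy 3.x and examples in the `"text"` / `"spans"` format from the question. The function name `examples_to_docbin` and the output path are hypothetical, chosen only for this illustration.

```python
import spacy
from spacy.tokens import DocBin


def examples_to_docbin(examples, nlp):
    """Build a DocBin from dicts with "text" and character-offset "spans".

    Hypothetical helper for illustration, not part of spaCy or Prodigy.
    """
    doc_bin = DocBin()
    for eg in examples:
        # make_doc only tokenizes, which is all we need for gold data.
        doc = nlp.make_doc(eg["text"])
        ents = []
        for span in eg.get("spans", []):
            ent = doc.char_span(span["start"], span["end"], label=span["label"])
            # char_span returns None if the offsets don't match token
            # boundaries; this sketch just skips those spans.
            if ent is not None:
                ents.append(ent)
        doc.ents = ents
        doc_bin.add(doc)
    return doc_bin


nlp = spacy.blank("en")
examples = [
    {
        "text": "This is a text about Facebook.",
        "spans": [{"start": 21, "end": 29, "label": "ORG"}],
    }
]
doc_bin = examples_to_docbin(examples, nlp)
# To write the file for training, uncomment (the path is an assumption):
# doc_bin.to_disk("./train.spacy")
```

Silently skipping misaligned spans is a simplification; in practice you'd probably want to log them, since they usually point to offset or tokenization problems in the data.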

zaidkamil · June 28, 2021, 3:23pm

Thank you for your prompt response.
Yes, the input data is in .txt format without annotations. I am glad to know Prodigy can take a .txt file directly as input for annotation. I will go through the provided links and let you know if I have any questions.

Many thanks
Zaid

zaidkamil · July 1, 2021, 2:24pm

Hi,
As you mentioned, Prodigy allows annotating plain text in .txt format. Running the following command in the terminal:
prodigy ner.manual annotateddata blank:en ./data1.txt --label temperature, pressure
Gives me
(result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 565: invalid start byte

Any suggestions?

Best regards
Zaid Kamil

ines · July 2, 2021, 1:02am

This sounds like an encoding issue with your file. If you google something like "change file encoding utf8" plus your operating system, you should find instructions to check whether your file is UTF-8 and how to change the encoding if it's not.
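If you'd rather do it in Python, here is a minimal sketch using only the standard library. It tries UTF-8 first and then falls back to a list of candidate encodings. The candidates (`cp1252`, `latin-1`) are only a guess, since the thread doesn't say what the file's real encoding is, and the function names and output path are hypothetical.

```python
from pathlib import Path


def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate encoding that works.

    The candidate list is an assumption; adjust it to your data.
    Note that latin-1 accepts any byte sequence, so it acts as a last resort
    and may produce wrong characters if the real encoding is something else.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("None of the candidate encodings could decode the data")


def convert_file_to_utf8(src, dst):
    """Read src with a guessed encoding and write dst as UTF-8."""
    text, used = decode_with_fallback(Path(src).read_bytes())
    Path(dst).write_text(text, encoding="utf-8")
    return used


# Example usage (paths are assumptions based on the command above):
# used = convert_file_to_utf8("./data1.txt", "./data1_utf8.txt")
# print("Decoded as", used)
```

After converting, it's worth opening the new file and checking that special characters look right, because a wrong guess decodes without an error but gives you garbled text.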

zaidkamil · July 2, 2021, 6:30pm

Thank you, kindly.
