Training the annotated data

I have 5 machines where I have already annotated data and saved it in databases. Now I need to train on the annotated data and see the results on a single machine. How should I proceed?
Thank you

Actually, I'm also looking for the answer to this. Please mention me if you find it.

It may be better to have a single machine where you collect the data, but allow multiple users to annotate via the web interface. This would make data collection less error-prone, and it would also make training models a lot simpler.

That said, given your situation, you might choose to first export all the data on each machine via `db-out`. Then you'd need to collect all five files onto a single machine so that you can import them via `db-in`.
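For reference, that export/import round trip might look like the following on the command line. The dataset and file names here are placeholders, so substitute your own:

```shell
# On each of the five annotation machines: export the dataset as JSONL
python -m prodigy db-out my_dataset > machine1_annotations.jsonl

# Copy the exported files to the training machine, then import each one
# into a single dataset there
python -m prodigy db-in merged_dataset machine1_annotations.jsonl
python -m prodigy db-in merged_dataset machine2_annotations.jsonl
# ... repeat for the remaining files
```

Since `db-in` adds examples to an existing dataset, importing all five files into the same dataset also takes care of the merging.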

Does this help?

So I should `db-out` from every machine, collect the files, and merge them?

```
Traceback (most recent call last):
  File "/home/kushal/Documents/spacyprodigy/demofirt/demo/scriptt.py", line 11, in <module>
    for obj in reader:
  File "/home/kushal/Documents/spacyprodigy/.venv/lib/python3.10/site-packages/jsonlines/jsonlines.py", line 416, in __iter__
    yield self.read(
  File "/home/kushal/Documents/spacyprodigy/.venv/lib/python3.10/site-packages/jsonlines/jsonlines.py", line 289, in read
    lineno, line = next(self._line_iter)
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/lib/python3.10/encodings/utf_8_sig.py", line 69, in _buffer_decode
    return codecs.utf_8_decode(input, errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```

This error shows up while using this script:

```python
import jsonlines
import os

path = "/home/kushal/Documents/spacyprodigy/annotated_data"
filenames = [f for f in os.listdir(path) if f.endswith(".jsonl")]

with jsonlines.open("resume_all.jsonl", mode='w') as writer:
    for filename in filenames:
        file_path = os.path.join(path, filename)
        with jsonlines.open(file_path) as reader:
            for obj in reader:
                writer.write(obj)
```

and after importing via `db-in` and training, it works.

You can use markdown syntax to highlight your code segments, which makes them easier to read/copy/paste on this forum. I'm mentioning this because it seems you've used a quote (`>`) instead of a code block to show your code.
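For example, wrapping a snippet in triple backticks (with an optional language name after the opening fence) renders it as a code block:

````markdown
```python
import jsonlines
```
````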

It seems that you're using a library, `jsonlines`, that has trouble reading in your data. Could you try again using the `srsly` library? That's one that we maintain.

```python
import pathlib
import srsly

full_dataset = []
for path in pathlib.Path("path/to/folder").glob("*.jsonl"):
    for ex in srsly.read_jsonl(path):
        full_dataset.append(ex)

srsly.write_jsonl("path/to/folder/name.jsonl", full_dataset)
```
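As an aside, the `0xff` in position 0 from your traceback is usually the start of a UTF-16 byte-order mark, which suggests at least one of your files wasn't saved as UTF-8. If `srsly` also complains, a stdlib-only sketch like this could re-encode such files first (the folder path is a placeholder):

```python
import pathlib

def reencode_utf16_to_utf8(path: pathlib.Path) -> bool:
    """Rewrite a UTF-16 encoded file as UTF-8. Returns True if converted."""
    raw = path.read_bytes()
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):  # UTF-16 LE/BE byte-order mark
        text = raw.decode("utf-16")  # the codec detects endianness from the BOM
        path.write_text(text, encoding="utf-8")
        return True
    return False

data_dir = pathlib.Path("annotated_data")  # placeholder folder
if data_dir.is_dir():
    for p in data_dir.glob("*.jsonl"):
        if reencode_utf16_to_utf8(p):
            print(f"re-encoded {p} as UTF-8")
```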

Today I got a bit confused. I have hundreds of text files that I needed to convert to JSONL for NER, and I converted them successfully. Now I have all the text files converted into a single JSONL file.

```
!python -m prodigy ner.manual resume blank:en /home/kushal/Documents/spacyprodigy/output.jsonl --label ACADEMIC,EMAIL,ADDRESS,INSTITUTION,URL,PROFILE_NAME,PROFILE_CITY,PROFILE_ZIP,EXPERIENCE,HARDSKILLS,SOFTSKILLS,DATE,DEGREE,ROLE,DOMAIN,COMPANY
```

My main question is: am I doing the right thing? And does all of this need to be annotated by a single person, or can I distribute it to multiple people so that they can annotate in parallel?

Hi Kushal,

it's a bit unclear to me what problem you're facing. Does the aforementioned code block help you merge multiple files together? That seemed to be your main problem, but your new response suggests that you're dealing with files that are unlabelled. Is that correct?

If you want to distribute the annotation workload to multiple people, you can do that by allowing annotators to pass their name into the session variable. The docs explain this in more detail here:
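For example, with this setting in your prodigy.json each example is sent out to only one session rather than to everyone:

```json
{
  "feed_overlap": false
}
```

Each annotator then identifies themselves by opening the app with their own named session in the URL, e.g. `http://your-host:8080/?session=alice` (host and name here are examples), so their annotations are stored under their own session name.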

This helps me merge all the JSONL files and prepare a corpus; it is helping me a lot.

To explain: I pass the same prepared corpus to the recipe, and using it with multiple annotators works fine, but only on my machine. That is, I opened different tabs and each session gets different data, which is what I want, but I need the same to work for a group of people in my office. Is there a way to create a private link that my colleagues can access so they can start annotating? I think you get what I mean. Can you share some tutorials for this so that I can work on it? Thank you.

You would need to host Prodigy on an IP address that your colleagues can reach. That means either provisioning a server or using a service like Ngrok. Be aware that if you're using Ngrok, you are exposing the annotation interface to the whole internet, and anybody with the link is able to push annotations to you.

If you're interested in a Ngrok tutorial, you may appreciate this course from calmcode:

Thank you for this tutorial. I set `"feed_overlap": false` in prodigy.json, and when I create a private link using Ngrok it works, so I shared the link with my fellow annotators. But when they open the link at the same time, both annotators get the same set of data to work on. What should I do in this case?

Have you seen our docs on multi user annotation?