Training the annotated data

I have 5 machines where I have already annotated data and saved it in databases. Now I need to train on the annotated data and see the results on a single machine. How should I proceed?
Thank you

Actually, I'm also looking for the answer to this. Please mention me if you find it.

It may be better to have a single machine where you collect the data, but allow multiple users to annotate via the web interface. This would make data collection less error-prone, and it would also make training models a lot simpler.

That said, given your situation, you might choose to first export all the data on each machine via `db-out`. Then you'd need to collect all five files onto a single machine so that you can import them via `db-in`.
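For reference, that export/import round trip might look like the following on the command line. The dataset and file names here are placeholders, so substitute your own:

```shell
# On each of the five annotation machines: export the dataset as JSONL
python -m prodigy db-out my_dataset > machine1_annotations.jsonl

# Copy the exported files to the training machine, then import each one
# into a single dataset there
python -m prodigy db-in merged_dataset machine1_annotations.jsonl
python -m prodigy db-in merged_dataset machine2_annotations.jsonl
# ... repeat for the remaining files
```

Since `db-in` adds examples to an existing dataset, importing all five files into the same dataset also takes care of the merging.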

Does this help?

So I should `db-out` from every machine, collect the files, and merge them?

```
Traceback (most recent call last):
  File "/home/kushal/Documents/spacyprodigy/demofirt/demo/scriptt.py", line 11, in <module>
    for obj in reader:
  File "/home/kushal/Documents/spacyprodigy/.venv/lib/python3.10/site-packages/jsonlines/jsonlines.py", line 416, in __iter__
    yield self.read(
  File "/home/kushal/Documents/spacyprodigy/.venv/lib/python3.10/site-packages/jsonlines/jsonlines.py", line 289, in read
    lineno, line = next(self._line_iter)
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/lib/python3.10/encodings/utf_8_sig.py", line 69, in _buffer_decode
    return codecs.utf_8_decode(input, errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```

This error shows up while using this script:

```python
import jsonlines
import os

path = "/home/kushal/Documents/spacyprodigy/annotated_data"
filenames = [f for f in os.listdir(path) if f.endswith(".jsonl")]

with jsonlines.open("resume_all.jsonl", mode='w') as writer:
    for filename in filenames:
        file_path = os.path.join(path, filename)
        with jsonlines.open(file_path) as reader:
            for obj in reader:
                writer.write(obj)
```

and after importing via `db-in` and training, it works.

You can use markdown syntax to highlight your code segments, which makes them easier to read/copy/paste on this forum. I'm mentioning this because it seems you've used a quote (`>`) instead of a code block to show your code.
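For example, wrapping a snippet in triple backticks (with an optional language name after the opening fence) renders it as a code block:

````markdown
```python
import jsonlines
```
````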

It seems that you're using a library, `jsonlines`, that has trouble reading in your data. Could you try again using the `srsly` library? That's one that we maintain.

```python
import pathlib
import srsly

full_dataset = []
for path in pathlib.Path("path/to/folder").glob("*.jsonl"):
    for ex in srsly.read_jsonl(path):
        full_dataset.append(ex)

srsly.write_jsonl("path/to/folder/name.jsonl", full_dataset)
```
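As an aside, the `0xff` in position 0 from your traceback is usually the start of a UTF-16 byte-order mark, which suggests at least one of your files wasn't saved as UTF-8. If `srsly` also complains, a stdlib-only sketch like this could re-encode such files first (the folder path is a placeholder):

```python
import pathlib

def reencode_utf16_to_utf8(path: pathlib.Path) -> bool:
    """Rewrite a UTF-16 encoded file as UTF-8. Returns True if converted."""
    raw = path.read_bytes()
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):  # UTF-16 LE/BE byte-order mark
        text = raw.decode("utf-16")  # the codec detects endianness from the BOM
        path.write_text(text, encoding="utf-8")
        return True
    return False

data_dir = pathlib.Path("annotated_data")  # placeholder folder
if data_dir.is_dir():
    for p in data_dir.glob("*.jsonl"):
        if reencode_utf16_to_utf8(p):
            print(f"re-encoded {p} as UTF-8")
```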

Today I got a bit confused. I have hundreds of text files that I needed to convert to JSONL for NER, and I converted them successfully. Now I have all the text files converted into a single JSONL file.

```
!python -m prodigy ner.manual resume blank:en /home/kushal/Documents/spacyprodigy/output.jsonl --label ACADEMIC,EMAIL,ADDRESS,INSTITUTION,URL,PROFILE_NAME,PROFILE_CITY,PROFILE_ZIP,EXPERIENCE,HARDSKILLS,SOFTSKILLS,DATE,DEGREE,ROLE,DOMAIN,COMPANY
```

My main question is: am I doing the right thing? And does all of this need to be annotated by a single person, or can I distribute it to multiple people so that they can annotate in parallel?

Hi Kushal,

it's a bit unclear to me what problem you're facing. Does the aforementioned code block help you merge multiple files together? That seemed to be your main problem, but your new response suggests that you're dealing with files that are unlabelled. Is that correct?

If you want to distribute the annotation workload to multiple people, you can do that by allowing annotators to pass their name into the session variable. The docs explain this in more detail here:
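For example, with this setting in your prodigy.json each example is sent out to only one session rather than to everyone:

```json
{
  "feed_overlap": false
}
```

Each annotator then identifies themselves by opening the app with their own named session in the URL, e.g. `http://your-host:8080/?session=alice` (host and name here are examples), so their annotations are stored under their own session name.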

This helps me merge all the JSONL files and prepare a corpus; it is helping me a lot.

To explain: I pass the same prepared corpus to the recipe, and using it with multiple annotators works fine, but only on my machine. That is, I opened different tabs and each session gets different data, which is what I want, but I need the same to work for a group of people in my office. Is there a way to create a private link that my colleagues can access so they can start annotating? I think you get what I mean. Can you share some tutorials for this so that I can work on it? Thank you.

You would need to host Prodigy on an IP address that your colleagues can reach. That means either provisioning a server or using a service like Ngrok. Be aware that if you're using Ngrok, you are exposing the annotation interface to the whole internet, and anybody with the link is able to push annotations to you.

If you're interested in a Ngrok tutorial, you may appreciate this course from calmcode:

Thank you for this tutorial. I set `"feed_overlap": false` in prodigy.json, and when I create a private link using Ngrok it works, so I shared the link with my fellow annotators. But when they open the link at the same time, both annotators get the same set of data to work on. What should I do in this case?

Have you seen our docs on multi user annotation?