Document length and annotation time

I attempted to upload a sample of texts from a CSV file (with a limited amount of metadata) through a Jupyter notebook and annotated them with the following command:

! prodigy ner.manual my_set blank:en ./random_directives.csv --label ENTITY

The documents vary in length, but some of them are thousands of words long. Unfortunately, after just 10-15 documents, Prodigy starts to slow down and even briefly interrupts its activity. Is the issue due to the excessive document length? How can I solve the problem, besides reducing the length and cleaning the documents as much as possible?

Prodigy doesn't require any data to be uploaded: when you start the server, the data is streamed in, and annotations are saved to the database as they come back.

A few thousand words shouldn't be a problem in terms of size – after all, it's just JSON being sent across a REST API. That said, you might want to set the PRODIGY_LOGGING=basic environment variable to see more logging info; maybe this will give you some clues about what's taking so long.

That said, are you sure you want to annotate examples that are thousands of words long? It just makes annotation more difficult because your annotators have to read everything before they can submit a single answer and it takes longer to collect individual datapoints. There's also not really an advantage in annotating really long documents for NER, because your model's context window will always be much smaller. Also see here for background: https://prodi.gy/docs/named-entity-recognition#long-text
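If you do split the documents up, a rough sketch of what that preprocessing could look like (assuming plain text with sections separated by blank lines; the split_document helper and the min_chars threshold are just illustrative, not a Prodigy API):

```python
import json

def split_document(doc_id, text, min_chars=50):
    """Split a long text into smaller annotation tasks, one per paragraph."""
    tasks = []
    for i, para in enumerate(text.split("\n\n")):
        para = para.strip()
        if len(para) < min_chars:
            continue  # drop very short fragments
        # keep a reference to the source document in the meta
        tasks.append({"text": para, "meta": {"source": doc_id, "para": i}})
    return tasks

doc = "First article of the law.\n\nSecond, much longer article of the law."
tasks = split_document("law-1", doc, min_chars=10)
with open("random_directives_split.jsonl", "w") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")
```

The resulting JSONL file can then be passed to ner.manual directly, and the meta lets you trace each annotation back to its source document.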

Thanks for your answer, Ines! Following your tip, we cut down the length of the documents to annotate. In our case, that meant scaling down a little, from entire laws to articles. Unfortunately, only one member of our three-person team is managing to have a regular annotation process. Is it a matter of computing capacity? That team member is also the only one not using a MacBook: does that matter?

I tried using the PRODIGY_LOGGING=basic setting, but it does not provide useful info about delays in the annotation process. What it does tell me is that Prodigy skips some documents along the way. I tried to analyze whether the skipped documents have something in common that distinguishes them from those that did not get skipped (weird characters, wrong format in the metadata, and so on), but I cannot identify a pattern. Why might that happen?

Thanks a lot in advance

The type of computer and operating system shouldn't matter, but depending on the size of documents and the models you're working with, it's possible to run out of memory. Do you know how much RAM the other computers have?

We typically recommend using a machine with at least 8GB and ideally 16GB of RAM for working with spaCy (and for doing data science in Python in general, I guess).

What exactly did it say in the logs here? Prodigy will skip examples that are already present and annotated in the current dataset, or lines that can't be parsed and contain invalid JSON (if you're loading from a .jsonl file). You can easily test this yourself by loading in your data and calling json.loads on each line. If that fails, the line contains invalid JSON.

The type of computer and operating system shouldn't matter, but depending on the size of documents and the models you're working with, it's possible to run out of memory. Do you know how much RAM the other computers have?

Not really. Mine has 8GB of RAM, and it processes the documents better than one of my colleagues' laptops but worse than the other one. If this is the issue, I guess we should either further reduce the length of the documents or work as much as possible on the most efficient computer in the team.

What exactly did it say in the logs here? Prodigy will skip examples that are already present and annotated in the current dataset, or lines that can't be parsed and contain invalid JSON (if you're loading from a .jsonl file). You can easily test this yourself by loading in your data and calling json.loads on each line. If that fails, the line contains invalid JSON.

This is an example of a skipped text as Prodigy reports it:
16:48:32: FEED: skipped: -903833756 this regulation shall enter into force on 4 march 2003.it shall apply from 5 to 18 march 2003.this regulation shall be binding in its entirety and directly applicable in all member states.

I do wonder what the number before the text (-903833756 in this case) means.

I successfully parsed the whole file (which is in JSONL format) with the following code, so there does not seem to be invalid JSON in any of its lines.

import json

data = []
with open('path_to_file') as f:
    for line in f:
        data.append(json.loads(line))

Thanks again for the support

Yeah, that sounds reasonable. Another idea could be to set up a VM (cloud or on your local network) with Prodigy installed that your team members can SSH into. But that's a bit more work to set up.

This is the _task_hash of the example, i.e. its unique identifier. The fact that it's skipped in the feed here indicates that it was actually filtered out because it already exists in the current dataset. If you look at the examples in the dataset you're using, can you find an example with the same ID?
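Conceptually, the filtering works like this (a simplified sketch only: Prodigy computes its hashes with its own scheme over the task content, so the task_hash function below is just an illustration, not the real algorithm):

```python
import hashlib
import json

def task_hash(task):
    """Derive a stable signed integer from the task's content (illustrative only)."""
    payload = json.dumps(task, sort_keys=True).encode("utf-8")
    return int.from_bytes(hashlib.sha1(payload).digest()[:4], "big", signed=True)

def filter_seen(stream, seen_hashes):
    """Yield only tasks whose hash isn't already in the dataset."""
    for task in stream:
        h = task_hash(task)
        if h in seen_hashes:
            continue  # dropped, like the "FEED: skipped" log line
        seen_hashes.add(h)
        yield task

stream = [{"text": "first doc"}, {"text": "second doc"}, {"text": "first doc"}]
out = list(filter_seen(iter(stream), set()))
```

Because the hash is derived from the content, two tasks with identical text produce the same hash, and the second one is silently skipped.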

If you look at the examples in the dataset you're using, can you find an example with the same ID?

No, if by ID you mean the _task_hash. All the documents seem to have different ones, and the same goes for their _input_hash too. We are also always creating new datasets when running the various tests, so as to avoid this sort of issue.

Sorry and thanks for your patience.

Could you share the command you're running and the full output of your logs with PRODIGY_LOGGING=basic enabled?

Sure!

This is the command:

! PRODIGY_LOGGING=basic prodigy ner.manual test_2_11 blank:en ./random_sample.jsonl --label ENTITY

In the following screenshots, you can see the output:

I have no issues with sharing the dataset too, if that could be useful to spot the problem.

Can you check if the task hash is somewhere else in the database, maybe in a dataset associated as a session with the current one? That's really the most likely explanation. You can use the db.get_examples method with a list of only that task hash:

print(db.get_examples([-903833756]))

Or, an even more thorough script that goes through all datasets and sessions:

task_hash = -903833756
for set_name in [*db.datasets, *db.sessions]:
    task_hashes = db.get_task_hashes(set_name)
    if task_hash in task_hashes:
        print(set_name)

You were right! The task hash was associated with existing sessions/datasets. I dropped all the datasets containing the task hash from the SQLite database using the db.drop_dataset command, but the problem is still there, unfortunately. I guess it is linked to the sessions too. How can I solve it? Is there a command to clean the sessions in the database too? Thanks!

Yeah, it sounds like you somehow ended up with stale links in the database. If you want to be safe and make sure it's all removed, one option would be to use SQLite directly to check for the ID and then remove it from the links and examples tables, for example using a tool like this: https://sqlitebrowser.org/


Alright, thanks! Is there a default name under which the SQLite database used by Prodigy gets stored on the computer? I can't find it in my home folder called "Prodigy", the one containing the recipes and components among other things.

By default, it will be stored in a .prodigy directory in your user home, the same place as the prodigy.json – you can run prodigy stats on the command line and it should show you the full path. The default database file is called prodigy.db. (You can of course change all of that in the settings if you want.)