Confusion when loading a raw dataset for textcat.teach — all answers marked as "accept"?

I'm planning to run a binary textcat.teach on a corpus of raw texts, and I'm a bit confused by the process.

$ prodigy dataset social-texts

  ✨  Successfully added 'social-texts' to database SQLite.

$ prodigy db-in social-texts ./data/social_text_data_1.jsonl

  ✨  Imported 10000 annotations for 'social-texts' to database SQLite
  Added 'accept' answer to 10000 annotations
  Session ID: 2019-11-21_13-14-54

These are raw texts that don't have any annotations yet, where one line of social_text_data_1.jsonl is like:

{"text": "i can't believe the service on American Airlines! It's so terrible @aa #badflights"}
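For a file in this shape, a quick sanity check can catch formatting problems before annotation starts. This is a minimal sketch (the helper name and the per-line check are my own, not part of Prodigy): it only verifies that each non-empty line is a JSON object with a "text" key.

```python
import json

def validate_jsonl(path):
    """Sketch: assert every non-empty line is a JSON object with a "text" key,
    the minimal format expected for raw text input."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)  # raises if the line isn't valid JSON
            assert "text" in record, f"line {i} is missing a 'text' key"
```

Running it on the file above (e.g. `validate_jsonl("./data/social_text_data_1.jsonl")`) raises on the first malformed line and returns silently otherwise.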

I'm confused by this message upon loading the dataset: Added 'accept' answer to 10000 annotations

Is there a different way to load a corpus of raw texts for annotation that doesn't assume the examples are all 'Accept'?

Hi! I think the solution might be a lot simpler :slightly_smiling_face: Prodigy doesn't require you to upload any data before you start annotating – so you can pass your social_text_data_1.jsonl to the textcat.teach recipe as the source argument and it'll load the data from a file.

The datasets in the database only store the collected annotations. So the db-in command to import data is mostly intended to load in already annotated examples. That's also why it adds the answer by default.
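Concretely, skipping db-in, the workflow looks something like this (the model and label here are placeholders, not from the thread above):

```shell
# Create the dataset that will store the collected annotations
prodigy dataset social-texts

# Pass the raw JSONL file directly as the source argument —
# no db-in import is needed for unannotated texts.
# en_core_web_sm and the NEGATIVE label are assumptions for illustration.
prodigy textcat.teach social-texts en_core_web_sm ./data/social_text_data_1.jsonl --label NEGATIVE
```

The accept/reject decisions you make in the UI are then saved into the social-texts dataset, which stays empty until you annotate.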


Thank you! That makes sense.

EDIT: Moving my follow-up question to another thread.