Best Practices for text classifier annotations

Yes, exactly, the last step between 75% and 100%. In your case, this actually looks very good – a 13% improvement is pretty significant. So there's a high chance that you will keep seeing improvements if you collect more annotations similar to the ones you already have. It also confirms that your approach is very feasible, and that it makes sense to invest more and explore it further :blush:

(Really glad to see this is working well so far btw – this type of analysis and being able to verify an idea quickly is one of the central use cases we've had in mind for Prodigy!)

Thanks for the info, this is really helpful. I wrote the Twitter loader before Twitter introduced the 280-character limit, and we actually launched Prodigy shortly after they announced the new API offerings. I've been meaning to look into this and see if it'll finally make the Twitter API less frustrating to work with. I'd love to drop the external dependency and use a more straightforward implementation.

Loaders are pretty simple, though, so you can also just write your own. All a loader needs to do is query the API and reformat the response to match Prodigy's JSON format:

import requests

def custom_loader():
    page = 0  # if the API is paged, keep a counter
    while True:
        r = requests.get('http://some-api', params={'page': page})
        response = r.json()
        results = response['results']  # or however the response is structured
        if not results:  # stop once the API has no more results
            break
        for item in results:
            yield {'text': item['text']}  # plus any other fields you need
        page += 1  # after a page is exhausted, move on to the next one
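
To try this out without hooking it into a recipe, you could also dump the stream to a JSONL file and load that instead. (The file name and the limit of 500 examples below are just placeholders for this sketch.)

import itertools
import json

# write the first 500 tasks to a JSONL file that a recipe can then load
with open('tweets.jsonl', 'w', encoding='utf8') as f:
    for task in itertools.islice(custom_loader(), 500):
        f.write(json.dumps(task) + '\n')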

Once you move past the experimental stage, you might also want to consider scraping the tweets manually instead of streaming them in via the live API. This way, you're also less dependent on arbitrary limitations and what Twitter decides to show you or not show you – assuming you're using the free API. (After all, the docs explicitly state that the API is focused on "relevance and not completeness".)

Sorry – this should probably be documented better. split_evals takes an already shuffled list of annotated examples and an optional eval_split (set via the command line). If no eval_split is set, it defaults to 0.5 for datasets of under 1000 examples, and 0.2 for larger sets. The function then splits the examples accordingly and returns a (train_examples, eval_examples, eval_split) triple.
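
In pseudo-code, the behaviour is roughly like this (just a sketch of the logic described above, not Prodigy's actual source):

def split_evals(examples, eval_split=None):
    # examples are assumed to be shuffled already
    if eval_split is None:
        eval_split = 0.5 if len(examples) < 1000 else 0.2
    n_eval = int(len(examples) * eval_split)
    eval_examples = examples[:n_eval]
    train_examples = examples[n_eval:]
    return train_examples, eval_examples, eval_split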

The ignore examples have no further purpose except for being ignored. They're still stored in the database in case you want to use or analyse them later. For example, if you're working with external annotators, you might want to look at the examples they ended up ignoring to find out what was most unclear, and whether there were any problems with your data. In some cases, you might also want to go back and re-annotate them later on.

Now that I think about it, Prodigy should probably filter out the ignore examples before doing the shuffling and splitting. Will fix this in the next release :+1: Maybe we could also add an option to the prodigy drop command that lets you delete examples with a certain answer, if that's helpful? For example, you could do prodigy drop my_dataset --answer ignore and it'd remove all ignores from the set.
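
In the meantime, one way you could already do this yourself is to export the dataset with prodigy db-out, filter out the ignores in Python and re-import the result with prodigy db-in under a new dataset name. The file names below are just placeholders:

import json

# filter ignored answers out of an exported dataset, e.g. created via:
# prodigy db-out my_dataset > my_dataset.jsonl
with open('my_dataset.jsonl', encoding='utf8') as f:
    examples = [json.loads(line) for line in f]

kept = [eg for eg in examples if eg.get('answer') != 'ignore']

with open('my_dataset_no_ignores.jsonl', 'w', encoding='utf8') as f:
    for eg in kept:
        f.write(json.dumps(eg) + '\n')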
