ner.manual order of texts

Apologies if this is a duplicate of another question-- I've searched the forum and can't seem to find an answer that applies to this situation!

We're using Prodigy's ner.manual recipe to annotate a set of 17,000 texts (some a few sentences, some a few paragraphs) for a new label with a blank:en model. We've had a couple of fits and starts-- perfectly normal-- as we adjust our annotation strategy, patterns, and the set of texts in our JSONL file. As a result, we've done a bit of db-in-ing, db-out-ing, and db-deleting, and we're ready to get going again.

Our use case involves annotating sets of texts in a specific sequence, but after regenerating the JSONL with the texts grouped in the desired order, it appears that Prodigy always starts at a random line. Some of our db futzing was an attempt to clear what we thought might be a cached index from which Prodigy begins serving texts.

  1. Is this how the JSONL loader is intended to behave (starting at what we perceive to be a random line)?
  2. I've tried a couple of times to modify the ner.manual recipe to turn the generator into a list, but that hangs at startup-- I assume because Prodigy/spaCy are trying to tokenize everything at once-- so I don't think that approach would solve the problem, or even be practicable. Would using a different loader make a difference to the order in which texts are presented?

Thanks, as always, for your excellent work!


Edit: I used my special reading eyes on the loader documentation and will see if I can whip up a custom loader to do what I need!

Hi! The JSONL loader will load the file line by line and send out the examples in order. There's no magic going on here, and it really just iterates over the lines.

Prodigy will skip duplicates and examples that are already in the dataset – so if you're using the same dataset for all your experiments and you annotate the first 20 texts in the first run and start the server again, Prodigy will resume at text 21. That's typically the desired behaviour, since you want to start where you left off and not repeat any examples. If you want to start at the beginning again, the easiest and cleanest solution would be to use a fresh dataset.
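
Just to illustrate how simple this is, a custom loader that preserves file order could be a minimal sketch like the one below (the file path and function name are placeholders). srsly.read_jsonl returns a generator, so lines come out lazily and in file order:

import srsly

def jsonl_stream(file_path):
    # each line is parsed and yielded lazily, in the order it appears in the file
    for eg in srsly.read_jsonl(file_path):
        yield {"text": eg["text"]}

stream = jsonl_stream("texts.jsonl")  # placeholder path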

If you want to make sure that Prodigy only sends out a new batch of questions when all previous questions have been answered, you can also set "force_stream_order": true in your prodigy.json. By default, if you open the app twice on two different devices, you'd get two different batches of examples: the first one and the second one. Prodigy will then wait to receive the answers. With "force_stream_order": true, Prodigy will keep sending the first batch until it has received the answers, and then move on to batch 2. This can be relevant if the order of questions matters a lot and you don't want it to be disrupted if the user refreshes the app. Just make sure you only have one user per session then – otherwise, you'll end up with duplicates.
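
For example, a minimal prodigy.json with just that setting would look like this (merge it into whatever other settings you already have):

{
  "force_stream_order": true
}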

Yeah, if you call list(), you're evaluating the whole generator and essentially tokenizing and loading 17k texts into memory. That's one of the main reasons we chose JSONL as the default file format: it can be read in line by line, and using generators, you can process the incoming texts in smaller batches, perform potentially expensive preprocessing and respond to outside state (like an updated model).
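
As a rough sketch of what that buys you (building on the loader sketch above, and using the add_tokens preprocessor as a stand-in for what the recipe does): the texts are tokenized in small batches as the stream is consumed, instead of all 17k up front.

import spacy
from prodigy.components.preprocess import add_tokens

nlp = spacy.blank("en")
stream = jsonl_stream("texts.jsonl")  # the generator from the sketch above

# add_tokens wraps the generator and tokenizes examples in batches as they're
# requested, so memory use stays flat even with 17k texts
stream = add_tokens(nlp, stream)

# Calling list(stream) here would force tokenization of every text up front,
# which is why the server appears to hang at startup.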

Thanks, @ines! I've done some further testing. Here's a reproducible example.

Using this jsonl with Prodigy 1.9.4:

{"text": "Bippity Boppity Boop School District"}
{"text": "Lippity Loppity Loop County Office"}
{"text": "Our team has supported school district leaders and school staff."}

  • recipe: ner.manual
  • fresh dataset
  • blank:en for tokenizing
  • and a set of patterns that validate

I am getting:

✘ Error while validating stream: no first example
This likely means that your stream is empty.

I have one hypothesis that I'd love your take on:

The basic logging is showing:

11:24:48: FEED: Finding next batch of questions in stream
11:24:48: CONTROLLER: Validating the first batch for session: None
11:24:48: PREPROCESS: Tokenizing examples
11:24:48: FILTER: Filtering duplicates from stream
11:24:48: FILTER: Filtering out empty examples for key 'text'

Is the controller supposed to be running before the preprocessing step? Is fastapi... too fast?

The controller is created when you execute the recipe function, and it puts together the stream, config, etc., which is then used to start the server. So this shouldn't be the problem.

I just tried it locally and I think what's happening here is this: The new ner.manual with --patterns currently only shows examples containing matches, not all examples in the data with optional matches (like in ner.correct). So if there are no matches in a text, it's skipped, and the first examples you see are the ones with matches. And if there are no matches at all, you'll get an empty stream.

For the next release, we'll add a setting to the pattern matcher that lets it output all texts, with and without matches. We can then enable it in the ner.manual recipe, because I think that'd be a much more reasonable default.

Tremendous— thanks as always for the prompt help and guidance!

While we await the next release, we may go ahead and continue annotating the pattern-matched examples. Just to confirm, can annotations collected in this way be incorporated into future manual runs? Should we set exclude_by to “input”, or will the task hashes be the same either way? Apologies for my lack of clarity re: hashing.

Yes, I think the best solution for now would be to do two runs: one with --patterns to annotate all matches, and one without --patterns to annotate all examples that don't contain matches.

The ner.manual recipe excludes examples based on their input hashes, so two examples will be considered identical if their input (the text) is the same, even if they contain different suggestions (highlighted pattern matches etc.).

So if you do one annotation run with --patterns, annotate all examples containing matches and then restart the server without --patterns, Prodigy will skip all texts in the stream that are already present in the dataset. (In contrast, if you chose to exclude by task hash, you'd see all texts again, because text with pattern suggestion != text without pattern suggestion.)
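
If you want to sanity-check that behaviour, here's a minimal sketch using set_hashes (the text, span offsets and ORG label are just placeholders): the same text always gets the same input hash, while an added pattern suggestion changes the task hash.

from prodigy import set_hashes

plain = set_hashes({"text": "Bippity Boppity Boop School District"})
with_suggestion = set_hashes({
    "text": "Bippity Boppity Boop School District",
    "spans": [{"start": 0, "end": 36, "label": "ORG"}],  # hypothetical pattern match
})

# same text -> same input hash, so excluding by "input" treats them as duplicates
assert plain["_input_hash"] == with_suggestion["_input_hash"]
# different suggestions -> different task hashes, so excluding by task would not
assert plain["_task_hash"] != with_suggestion["_task_hash"]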


Super clear, thanks so much!

An update for posterity! Nothing below is urgent, so hope you all have a great holiday.

For one new label and using domain-specific texts:

  • Created and reviewed 625 annotations using about 17k patterns (generated from some publicly available data) against about 17k texts. This exhausted the pattern-matched examples, so now we're on to the unmatched ones.
  • Decided to treat myself by checking out the training curve, and I'm glad I did: using those 625 examples and the en_vectors_web_lg starter model, we're at 89.24 accuracy (F-score?) and set to improve with further annotations. This is astounding.

Reflections:

  • Really happy we went with manual annotation-- I reasoned that even if we didn't end up with a useful model, we'd at least have a source of truth for the current set of texts. But it looks like we will end up with a model.
  • The accuracy score is high enough to be both astounding and to make me a bit skeptical, so I'll be doing some digging. Certainly, we'll be sanity-checking this model against additional examples, and I'm looking forward to seeing how further manual, non-pattern-matched annotations affect the accuracy.
  • I'm not convinced that I have a source of text that'd be worth pretraining against, nor do I have the ready GPU access to do some quick tests. So my instinct is to hold off on that and continue annotation.

Some new questions:

  • Related to the GloVe vectors: is en_vectors_web_lg derived from the 840B-token Common Crawl set here? I couldn't tell for sure from the spaCy docs or repo.
  • I've had some success using the xx_ent_wiki_sm model for NER in the past and think wiki vectors might work better for my use case. If one wanted to train against a different GloVe download, such as the wiki vectors, should one still use this script for conversion and then drop in the path to the output at the command line? (h/t and thanks @justindujardin). I'm happy to test it out but am just curious, off the top of your head, whether spaCy internals/IO for vectors have shifted in the meantime.
  • I found a comment on this forum to be a helpful way to think about ner.teach: it's a great annotator-helper. With that in mind, might we use ner.teach with a model trained on the pattern-matched annotations to annotate the texts that didn't have matching patterns, or is it best to soldier on with ner.manual? As I said earlier, my instinct is to stick with what's working and get to some ground truth for use down the line, but I'm interested in your take.

Consider this validation of your encouragement, here and elsewhere, that ner.manual is a great place to start when you have a new label, and hats off to the s2v blog post for lighting the way!

Thanks very much for all you and the team do, and Happy New Year.

Edit: to the final bulleted question re: ner.teach, I'm happily using ner.correct at the moment. The former "make gold" terminology now makes more sense!

Thanks for the detailed report! Always super valuable to get this sort of feedback.

If the problem isn't too difficult, an 89.2 F-score might already be an accurate representation. I think it makes sense to switch over to the ner.correct recipe, so you're using the model to help you annotate rather than just the patterns: https://prodi.gy/docs/recipes#ner-correct

You could also try ner.teach: with a model already at 90% accuracy, it can be a good time-saver. I would save the ner.teach annotations into a different dataset to keep them separate (since the binary data is different, as it's incomplete information). Then you would run the ner.silver-to-gold recipe to convert the incomplete annotations into complete ones, which you can merge into the rest of your dataset.

About the GloVe vectors: yes, it's the 840B Common Crawl vectors. But regarding the xx_ent_wiki_sm model, I think it's probably more effective to focus on the annotation at this point, rather than changing things like the source of vectors or the model you're using. Those will probably make a relatively small difference compared to just getting more data.

Thanks, @honnibal! I'll keep the thread updated through the end of this annotation and training run in case others find it useful when starting from zero, but we're moving along swiftly, happily, and with no game-stopping questions. Credit to you and the team.

Getting to an 89 F-score off of a few hours of annotations far surpassed my expectations, so good to know it's not beyond the pale. I think my trepidation was more about anticipating questions about how/why the results are so good and being able to answer them-- so, thanks for the documentation.

Thanks, that was my instinct as well. At this point, the questions about different inputs to the model and options for training are a welcome distraction when annotation eyes get tired, but I'm unlikely to change the base approach from ner.correct-- rather, might spin up an ner.train session or try a new set of base vectors as a treat at the end of the day.

Just one thought and one question re: end-of-day treats:

  • I've been trying to determine whether to re-train an existing model when we add a bunch of new annotations to the gold set vs. training a new model from scratch. I suspect this will end up being a subjective call, and that Prodigy enables experimentation with new vs. existing models via A/B comparisons, but if there are recommended best practices for creating and updating models as we add annotations, I'd love to read through them. (I'm enjoying seeing the comparison to the existing model's baseline when updating, though, and fwiw/probably to be expected, the re-trained model is showing smaller loss and an F-score about a point higher after increasing the number of annotations from ~600 to ~1,500. We'll probably do a "winner vs. new challenger" tournament when new annotations are added.)
  • EDIT: this piece of the documentation guides me toward retraining from scratch, though in the context of ner.teach: "To prevent unintended side-effects, you typically want to train the base model from scratch using all annotations every time you train – for example, you want to update en_core_web_sm with all annotations from one or more datasets and not update batch-trained-model , save the result, update that again and so on." So I think that answers the question.
  • Optimizing for recall: the models are exhibiting higher precision than recall (our best F-score of 90.7 comes with ~94.5 precision and ~87.2 recall), but for our use case we might prefer more false positives if we had to trade off between the two. Is there a way to operationalize this preference within the flow of Prodigy/spaCy? This is a low priority given our happiness with the results, but it might come up when fine-tuning.

Many, many thanks.

Adam