Hi! There's no one correct answer and how you structure your projects depends on your specific use case. When you're starting a new project that involves training a model and collecting new data, you'll usually have two distinct phases:
Development phase. During this phase, you validate ideas for the model. It typically requires the data scientists/developers to work very closely with the annotators, and often even annotate small samples themselves to test what works best. Even if the label scheme sounds reasonable in theory, it often turns out to be quite difficult to annotate (annotators don't agree, no clear boundaries) or difficult to train (unclear distinctions, model struggles to learn, task is a much better fit for textcat instead of NER etc.). These are all issues that you ideally want to resolve before you scale things up (not after you've already spent hours labelling data with a bad label scheme).
Data collection phase. That's when you focus on actually labelling the data and creating a corpus that's large enough to achieve good results. If multiple people are annotating, you often want to introduce some overlap, so you can check whether your annotators agree. You also want to make sure you're not asking your annotators to do anything that can easily be automated, as this just introduces more opportunities for human mistakes and lower data quality. That's also where some of Prodigy's semi-automated workflows come in handy: you can use a model to pre-highlight suggestions, like in ner.correct, or use --patterns to pre-select suggestions from a dictionary. Also, don't forget to label enough data so you have a dedicated evaluation set!
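To give a concrete idea of what a patterns file looks like: it's a JSONL file with one pattern per line, each pairing a label with either a token-based description or an exact string. A minimal sketch (the labels and terms here are just invented examples for a product/sales project):

```python
import json

# Hypothetical match patterns for a --patterns file: one JSON object
# per line, each with a label and a pattern.
patterns = [
    {"label": "PRODUCT", "pattern": [{"lower": "iphone"}]},  # token-based match
    {"label": "PRODUCT", "pattern": "MacBook Pro"},          # exact string match
]

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```

Token-based patterns use the same syntax as spaCy's Matcher, so they also match case variants like "iPhone" or "IPHONE" here.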
During development, you typically want to run frequent training experiments and compare the results. The train-curve workflow is also very useful to check whether more data is improving the model and to detect potential problems early on.
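To give a rough idea of what to look for: train-curve trains on increasing portions of your data, and the most telling signal is the accuracy change in the last segment. A toy sketch of that decision rule (the numbers are invented):

```python
# Accuracy after training on 25%, 50%, 75% and 100% of the data
# (invented numbers for illustration)
scores = [0.62, 0.71, 0.76, 0.79]

# If accuracy still improves in the last segment, collecting more data
# is likely to help; if it's flat or dropping, more of the same
# annotations probably won't fix the problem.
last_improvement = scores[-1] - scores[-2]
print(f"improvement in last segment: {last_improvement:+.2f}")
if last_improvement > 0:
    print("accuracy still climbing - more data will likely help")
```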
Using a single shared database is totally fine. When you start the Prodigy server, you can specify the name of the dataset to save annotations to. You can think of a "dataset" as a "single unit of work". In the beginning, while you're experimenting with different workflows, it's probably a good idea to use separate datasets for everything you're doing: if you make a mistake, you can just delete the dataset and start over. Merging is easy, and you can always train from multiple datasets at once.
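As a sketch of why merging stays easy: annotations are just records, so combining datasets is essentially concatenation plus de-duplication on the input. (This uses plain dicts as stand-ins, not Prodigy's actual database API.)

```python
# Toy stand-ins for two annotation datasets (invented examples)
dataset_a = [{"text": "Apple sold 10M iPhones", "answer": "accept"}]
dataset_b = [
    {"text": "Apple sold 10M iPhones", "answer": "accept"},  # duplicate input
    {"text": "Samsung revenue fell", "answer": "reject"},
]

# Merge and de-duplicate by the input text
seen, merged = set(), []
for eg in dataset_a + dataset_b:
    if eg["text"] not in seen:
        seen.add(eg["text"])
        merged.append(eg)

print(len(merged))  # the duplicate is dropped
```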
You can also use the review recipe (also see here for UI examples) to double-check annotations collected on the same data by multiple people / in multiple datasets. Even without running complex annotator-agreement metrics, significant problems usually become obvious pretty quickly: if all datasets disagree constantly, maybe the label scheme is unclear; if one dataset always disagrees with everyone else, maybe that annotator misunderstood the objective.
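Even a crude pairwise agreement number can surface the problems described above. A minimal sketch, assuming each annotator's accept/reject decisions are keyed by example ID (all data here is invented):

```python
from itertools import combinations

# Invented accept/reject decisions per annotator, keyed by example id
answers = {
    "alice": {"e1": "accept", "e2": "accept", "e3": "reject"},
    "bob":   {"e1": "accept", "e2": "reject", "e3": "reject"},
    "carol": {"e1": "accept", "e2": "accept", "e3": "reject"},
}

# Percentage of shared examples each pair labelled identically
for a, b in combinations(answers, 2):
    shared = answers[a].keys() & answers[b].keys()
    agree = sum(answers[a][e] == answers[b][e] for e in shared)
    print(f"{a}/{b}: {agree / len(shared):.0%}")
```

If one annotator's row is consistently low against everyone else, that's the "misunderstood the objective" case; if every pair is low, suspect the label scheme.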
For completeness, Prodigy also supports named multi-user sessions, so you can have multiple named annotators working on the same instance. However, it does add a level of complexity that's not needed for most use cases, IMO. If you can run several separate instances, that's often more flexible and gives you more control.
ner.teach is indeed a bit special in this way, since it updates a model in the loop. It's not the best idea to have multiple people annotate within the same session, because if they disagree, they might be moving the model in different directions, resulting in less useful suggestions and less useful data. That said, it's also the workflow that's fastest to annotate and that typically needs fewer examples, since you're improving an already existing pretrained model. So there's also less need to have multiple people on it at the same time.
How you divide up the work in your team also depends on the team members: if everyone is a developer, that's pretty cool, because you can try out more ideas at the same time. For example, let's say your goal is to extract information about sales of certain products. There are lots of ways to get there, and what works depends on many factors, including your data, the modelling approach and so on. To find out which works best, you need to try them, so you can divide the different experiments up in your team. For example:
Experiment 1: How easy is it to update an existing pretrained model and fine-tune it on the data? Try using ner.correct to collect gold-standard data for labels like MONEY with an existing model and update it.
Experiment 2: Maybe training from scratch works better, because we don't have to constantly "fight" the existing weights? Try collecting more annotations, train again and see if it can beat the results of experiment 1.
Experiment 3: Maybe PRODUCT is too vague. Is it viable to use more fine-grained labels instead? Can the model learn that distinction? Try annotating the same data with a different label scheme and compare the results.
Experiment 4: Can we train a text classifier to predict whether a sentence is about a sale? In combination with the generic MONEY entities, this would let us extract the info we need.
Experiment 5: Does the text classifier actually perform better than a simple keyword search for "trigger words" like "sold" or "bought"? That's another important data point you need in order to make decisions.
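Experiment 5's baseline is trivial to build, which is exactly why it's worth running first: if the textcat model can't beat it, that tells you something. A sketch of such a baseline (trigger words and sentences are invented):

```python
# Naive keyword baseline: flag a sentence as "about a sale" if it
# contains a trigger word. Trigger words are invented examples.
TRIGGERS = {"sold", "bought", "acquired", "purchase"}

def is_sale(sentence: str) -> bool:
    # Lowercase and split on whitespace, then check for any overlap
    return bool(TRIGGERS & set(sentence.lower().split()))

print(is_sale("Apple sold 10 million iPhones last quarter"))  # True
print(is_sale("The new MacBook ships in March"))              # False
```

Evaluating this on the same held-out set as the classifier gives you a fair apples-to-apples comparison.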
(Matt has a talk that explains the iterative data collection philosophy behind this in more detail, if you're interested.)