Relevant Text Highlighting, Ports, multi-threading, reproducibility of results, cross validation

Dear Prodigy support,

I am working on text classification and have the following questions:

1. Relevant Text Highlighting:
For active learning sessions (when the long text option is selected), Prodigy highlights the portion of the text that the model considers relevant to the topic, and this is recorded in the dataset JSON export under a ‘spans’ key.
Is there a way to get the same highlighting for predictions from the spaCy model produced by Prodigy’s textcat.batch-train?

2. Multiple Ports / Resuming Active Learning session:
I am using Prodigy to train an Engineering Text Classification Model. For this task, I schedule active learning sessions with experts from different engineering domains. The domain experts are only available for a short time, which is not enough to complete an active learning session. I cannot start another active learning session (for the next scheduled training with another domain expert) without closing the previous one. If I interrupt an active learning session, I am unable to resume from where I stopped (instead, the active learning starts from the beginning).
I would like to know if there is an option that lets me run different active learning sessions on different ports (like 8080, 8081, 8082) so that I never have to interrupt a session in the middle.
Also, I would like to know if there is an option in Prodigy that allows me to pause and resume the active learning later.

3. Multi-Processing:
Currently, I am training a medium language model on 4000+ annotations, which takes around 15 to 20 minutes to complete. I would like to know if adding more CPU power, such as multiple processors, will increase the training speed. Is there an option in Prodigy similar to n_jobs (the number of parallel processes to start) in scikit-learn?

4. Reproducibility of Results:
If I train two models with the same parameters (batch size, number of iterations, evaluation split), I get two models that differ from each other. Is there a parameter in Prodigy equivalent to random_state in scikit-learn that allows us to reproduce the same results?

5. Cross-Validation:
Is it possible to perform cross-validation to avoid the above problem? Maybe using scikit-learn with the selected spaCy model?

Thank you,
Kind regards

Claudio Nespoli

Yes. The long-text mode cuts the document into sentences, and then makes the predictions on the per-sentence level. The sentence predictions are then aggregated to make the document prediction. To get the same approach, you'll want to pre-process your text into sentences, and then turn them into Doc objects before passing them into the text classifier. It's probably worth subclassing the spacy.pipeline.TextCategorizer class to implement this. You'll probably want to use spaCy's extension attributes to hold the per-sentence predictions, so you can access them at run-time.
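The approach described above can be sketched without any library, just to show the shape of the logic. This is a minimal illustration under stated assumptions, not how Prodigy or spaCy implement it: the regex sentence splitter, the keyword-based `score_sentence` function (a stand-in for the trained text classifier), and the choice of mean for the document score and the top-scoring sentence for the highlight are all hypothetical. In a real pipeline the sentences would be spaCy Doc objects and the per-sentence scores would live in extension attributes.

```python
import re
from statistics import mean


def split_sentences(text):
    """Naive splitter returning (start, end) character offsets.
    Stand-in for a real sentence segmenter."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S[^.!?]*[.!?]*", text)]


def score_sentence(sentence, keywords=("pump", "valve")):
    """Hypothetical scorer standing in for the trained classifier:
    the fraction of words that are topic keywords."""
    words = re.findall(r"\w+", sentence.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in keywords) / len(words)


def classify_document(text):
    """Score each sentence, then aggregate into a document-level result."""
    offsets = split_sentences(text)
    if not offsets:
        return {"doc_score": 0.0, "sentence_scores": [], "spans": []}
    scores = [score_sentence(text[s:e]) for s, e in offsets]
    # Assumption: highlight the single highest-scoring sentence.
    best = max(range(len(scores)), key=scores.__getitem__)
    start, end = offsets[best]
    return {
        # Assumption: the document score is the mean of the sentence scores.
        "doc_score": mean(scores),
        "sentence_scores": scores,
        "spans": [{"start": start, "end": end}],
    }
```

Calling `classify_document(text)` returns the per-sentence scores alongside the document score, so the offsets of the most relevant sentence are available at prediction time.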

You can change the port using the PRODIGY_PORT environment variable. You might be interested in using a reverse proxy such as Traefik (www.traefik.io) to map the services to better URLs. We have an extension product coming out later this year that will make this much easier: it will provide a web app that lets you start and stop the annotation tasks, and allocate them to annotators.
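As a rough sketch of how one session per port could be prepared from Python: only the PRODIGY_PORT variable comes from the answer above. The dataset names, the model name `en_core_web_md`, the source file `texts.jsonl`, and the label `RELEVANT` are hypothetical placeholders, and the `textcat.teach` argument order should be checked against your Prodigy version.

```python
import os
import subprocess


def session_env(port, base_env=None):
    """Copy the environment and set PRODIGY_PORT for one session."""
    env = dict(os.environ if base_env is None else base_env)
    env["PRODIGY_PORT"] = str(port)
    return env


def build_command(dataset, model="en_core_web_md",
                  source="texts.jsonl", label="RELEVANT"):
    """Assemble the command line for one annotation session.
    All argument values here are hypothetical placeholders."""
    return ["prodigy", "textcat.teach", dataset, model, source, "--label", label]


# Hypothetical mapping: one dataset and one port per domain expert.
SESSIONS = {"mechanical": 8080, "electrical": 8081, "chemical": 8082}


def launch_all(sessions=SESSIONS):
    """Start every session as a separate process, each on its own port."""
    return [
        subprocess.Popen(build_command(dataset), env=session_env(port))
        for dataset, port in sessions.items()
    ]
```

Each process gets its own copy of the environment, so the sessions do not interfere with each other's port setting.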

The alternative we would suggest is to use the batch-train command to learn from the current annotations, and then use the resulting model as the basis for the next annotation session. This should start you from a better position, because the model is able to train to convergence on the annotations. Online learning (as is done during the annotation session) is a tricky constraint; it's much easier if the whole dataset is available.

Thank you. Actually, I need to classify subparts of the document while taking into account the content of the other parts of the document, which could provide more context.

I do not need to make a prediction for each sentence independently, but rather a prediction for a sentence or piece of text based on the contents of the whole document. Is that possible?

Thank you in advance

You should be able to highlight a span of text to display within the text classification view by setting a "spans" key on the task. However, the statistical model in spaCy that Prodigy uses won’t be able to take advantage of features from outside of the text being classified.
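For illustration, a task with a highlighted span might be built as below. This is a sketch assuming the task is a JSON object with "text" and "label" fields and that highlights are given as character offsets in a "spans" list, matching the ‘spans’ key seen in the dataset export mentioned earlier in the thread. The helper name `make_task` and the example text and label are hypothetical.

```python
import json


def make_task(text, highlight, label):
    """Build one classification task whose highlight covers the
    first occurrence of `highlight` in `text` (character offsets)."""
    start = text.find(highlight)
    if start == -1:
        raise ValueError("highlight not found in text")
    return {
        "text": text,
        "label": label,
        "spans": [{"start": start, "end": start + len(highlight)}],
    }


# One JSON object per line, as in a JSONL source file.
task = make_task("The pump failed.", "pump", "MECHANICAL")
line = json.dumps(task)
```

Note that the whole document is still what appears in "text"; the span only controls what is highlighted, not what the model sees.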

Are you sure the surrounding document context is critical? In general it’s pretty hard to design machine learning models that easily learn this type of relationship. It’s possible, but they’ll require a lot of training data.