Training Data after Using spans.manual

hmousa961 · August 16, 2021, 7:14am

Hello,
First of all, thanks for the amazing spans.manual tool in the nightly version. I've used spans.manual to annotate overlapping labels. Now I want to train a model on these annotations to predict overlapping labels in other datasets. Is there a way to do this in Prodigy nightly, or do I have to use the SpanCategorizer in spaCy?

Secondly, do I need to train the model on a lot of examples before it can start predicting, or would a few examples (~60) be enough? I tried training a model with --spancat, but when I tested it on some data, it didn't predict anything.

Thanks for the help.

hmousa961 · August 16, 2021, 7:56am

Also, after I trained a model using --spancat, if I give the model, for example, 'John Smith Sports Center', it recognizes it as an organization but doesn't recognize John Smith as a person, even though I have annotated almost 30 similar examples with the labels ORG and PER in the training dataset.

lnatprodigy · August 16, 2021, 8:40am

This sounds somewhat related to spancat - only one label being learned · Discussion #8967 · explosion/spaCy · GitHub

I'm having a similar problem but posted it over there, since it doesn't seem related to Prodigy.

hmousa961 · August 16, 2021, 9:36am

I checked the fix about changing

   kwargs.setdefault("multi_label", True)

   kwargs.setdefault("allow_overlap", True)

But it didn't work for me either. It still gives a score of zero for one label. Do you think training on more examples might solve it?

lnatprodigy · August 16, 2021, 9:45am

Try this: spancat - only one label being learned · Discussion #8967 · explosion/spaCy · GitHub

hmousa961 · August 16, 2021, 6:54pm

First, thank you for replying. With my annotations, I tried copying train.spacy to dev.spacy, but it didn't work for me. I even checked by running the model on the same dataset: it predicts correctly for examples without overlap (such as "John Smith" labelled as a person). But if the example is "John Smith Sports Center", doc.spans gives me only one span, labelled ORG.

Do you think the ngram from the suggester affects the prediction?

SofieVL · August 17, 2021, 7:25am

Hi @hmousa961 and @lnatprodigy !

I just wanted to ask both of you which version of spaCy you're using, and whether you're using a pip-installed release or compiling from the master branch. Last week, a bug crept into the spancat, resulting in incomplete predictions. So that might explain things if you've been compiling from the master branch. The fix is here: Fix making span_group by svlandeg · Pull Request #8975 · explosion/spaCy · GitHub

If you've been working with a pip-installed release, that bug shouldn't be present though. In that case, what happens when you run training and you specify the same set as train AND dev set? Does it correctly overfit?
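One way to read the result of such an overfitting check is to compare gold and predicted spans label by label. The following is only a minimal sketch in plain Python, not spaCy's scorer: the function name `recall_per_label` and the `(start_token, end_token, label)` tuples are assumptions made for illustration, and the spans are made-up values mirroring the 'John Smith Sports Center' example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical span representation: (start_token, end_token, label), end exclusive.
Span = Tuple[int, int, str]


def recall_per_label(gold: Iterable[Span], predicted: Iterable[Span]) -> Dict[str, float]:
    """Fraction of gold spans per label that were predicted with exact boundaries."""
    predicted_set: Set[Span] = set(predicted)
    total = defaultdict(int)
    found = defaultdict(int)
    for span in gold:
        label = span[2]
        total[label] += 1
        if span in predicted_set:
            found[label] += 1
    return {label: found[label] / total[label] for label in total}


# Made-up example: both spans are annotated, only the ORG span is predicted.
gold = [(0, 2, "PER"), (0, 4, "ORG")]
predicted = [(0, 4, "ORG")]
recall = recall_per_label(gold, predicted)
print(recall)
```

If a label stays at zero even when the model is evaluated on its own training data, more annotations alone are unlikely to be the explanation.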

hmousa961 · August 17, 2021, 8:00am

Hi @SofieVL
I am using spaCy version 3.1.1 from a pip-installed release. When I specify the same set as both train and dev set, it does overfit, but it shows me only one label (out of 2). For any overlapping labels, it always dismisses one of them. I also tried increasing my annotated examples up to 140, but nothing has changed regarding the overlapping predictions.

lnatprodigy · August 17, 2021, 8:24am

I'm using pip-installed 3.1.1 and I do get almost all the labels when predicting on train data. I think my problem really comes down to requiring A TON of data.

SofieVL · August 17, 2021, 8:34am

Hi @hmousa961 , thanks for checking! The default spancat configuration uses an ngram suggester with n=[1,2,3]. Have you verified that the other entity type is covered by these ngrams? If not, perhaps you need to specify longer ngrams?
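As a rough way to check this coverage, the sketch below enumerates the spans that an n-gram suggester with given sizes could propose and lists the gold spans that fall outside them. It is plain Python rather than spaCy's own suggester; the helper names, the tokens, and the labels are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical gold annotation: (start_token, end_token, label), end exclusive.
GoldSpan = Tuple[int, int, str]


def ngram_candidates(n_tokens: int, sizes: List[int]) -> Set[Tuple[int, int]]:
    """Enumerate every (start, end) token span whose length is in `sizes`."""
    candidates = set()
    for size in sizes:
        for start in range(0, n_tokens - size + 1):
            candidates.add((start, start + size))
    return candidates


def uncovered_spans(n_tokens: int, gold: List[GoldSpan], sizes: List[int]) -> List[GoldSpan]:
    """Return the gold spans that no n-gram of the given sizes can propose."""
    candidates = ngram_candidates(n_tokens, sizes)
    return [span for span in gold if (span[0], span[1]) not in candidates]


# Made-up example with a 3-token span nested inside a 5-token span.
tokens = ["New", "York", "City", "Marathon", "Committee"]
gold = [(0, 3, "LOC"), (0, 5, "ORG")]

missing_default = uncovered_spans(len(tokens), gold, sizes=[1, 2, 3])
missing_longer = uncovered_spans(len(tokens), gold, sizes=[1, 2, 3, 4, 5])
print(missing_default)  # spans longer than 3 tokens are never proposed
print(missing_longer)
```

A gold span that is never suggested can never be predicted, regardless of how much data is annotated, so any span reported as missing here would point to the suggester sizes rather than to the model.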

hmousa961 · August 17, 2021, 8:50am

@SofieVL Thank you. I'll check the ngrams and will increase them and train the model again.
@lnatprodigy Is it possible for you to share the make_span_group part from the spancat.py file you have? Thanks a lot

lnatprodigy · August 17, 2021, 8:54am

I assume you mean _make_span_group()?

    def _make_span_group(
        self, doc: Doc, indices: Ints2d, scores: Floats2d, labels: List[str]
    ) -> SpanGroup:
        spans = SpanGroup(doc, name=self.key)
        max_positive = self.cfg["max_positive"]
        threshold = self.cfg["threshold"]
        for i in range(indices.shape[0]):
            start = int(indices[i, 0])
            end = int(indices[i, 1])
            positives = []
            for j, score in enumerate(scores[i]):
                if score >= threshold:
                    positives.append((score, start, end, labels[j]))
            positives.sort(reverse=True)
            if max_positive:
                positives = positives[:max_positive]
            for score, start, end, label in positives:
                spans.append(Span(doc, start, end, label=label))
        return spans
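To make the selection logic above easier to follow, here is a simplified sketch of the same idea in plain Python, without spaCy or Thinc types. The function name `select_spans`, the default values, and the scores are assumptions made for illustration; they are not taken from the library.

```python
from typing import List, Optional, Tuple

Prediction = Tuple[int, int, str]  # (start_token, end_token, label)


def select_spans(
    indices: List[Tuple[int, int]],
    scores: List[List[float]],
    labels: List[str],
    threshold: float = 0.5,  # assumed value for this sketch
    max_positive: Optional[int] = None,
) -> List[Prediction]:
    """Keep every (span, label) pair whose score reaches the threshold."""
    selected: List[Prediction] = []
    for (start, end), row in zip(indices, scores):
        positives = [(score, label) for score, label in zip(row, labels) if score >= threshold]
        positives.sort(reverse=True)
        if max_positive:
            # Limits the number of labels per candidate span, not overlap between spans.
            positives = positives[:max_positive]
        selected.extend((start, end, label) for _, label in positives)
    return selected


labels = ["PER", "ORG"]
indices = [(0, 2), (0, 4), (2, 4)]  # candidate spans, two of them overlapping
scores = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.3]]  # made-up scores, one row per candidate
predicted = select_spans(indices, scores, labels)
print(predicted)
```

Each candidate span is scored on its own, so in this sketch two overlapping candidates can both be kept as long as each has a label at or above the threshold.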

hmousa961 · August 17, 2021, 8:55am

Yeah, thank you so much

hmousa961 · August 19, 2021, 7:28am

Hi @SofieVL
The other entity type is covered by the ngrams. I even specified longer ngrams, but the issue remains. Do you think I should try training on a larger set of data?

Thanks for the help

SofieVL · August 19, 2021, 7:55am

Hi @hmousa961,

Thanks for checking. The spancat feature is relatively new in spaCy, and I'll be doing some sanity checks in the next couple of weeks to ensure it's working correctly for all use cases, as it does sound like there may be a problem here. I'll keep you up to date here if I find anything!

lnatprodigy · August 19, 2021, 9:31am

Thank you @SofieVL !
Since spans.correct got implemented, I was able to annotate about 1k samples for my use case. All labels are learning decently (though not as well as with ner), but the overlapping ones stay flat at 0, regardless of how much I annotate (despite the spans representing names, which sound extremely learnable). It is starting to feel like there is something wrong here.

SofieVL · August 19, 2021, 10:28am

Thanks for the confirmation. I've done some preliminary experiments and something definitely seems off with predicting overlapping entities. I'll investigate the spaCy code further.

SofieVL · August 20, 2021, 1:15pm

Hi all,

Thanks for your patience on this. We've just released a new version of spaCy, v3.1.2, which should hopefully fix the issues you've been experiencing with training & evaluating the spancat. If you get a chance to test it, let us know how you go!

lnatprodigy · August 20, 2021, 5:08pm

Just tried it and the numbers have vastly improved!
Thank you for fixing this so quickly.

hmousa961 · August 20, 2021, 8:47pm

Hi @SofieVL
Thank you so much. I just tried the new version of spaCy and now it shows the overlapping labels. It needs more training data to become accurate, but now it's working.

Thanks again, really appreciate your help
