Old Error: Only cupy arrays can be concatenated

I am currently playing around with Prodigy to get used to its basic functionality.
Most likely I am doing something wrong here.

Running on:

spaCy 3.3.0
Prodigy 1.11.7

Calling

python -m prodigy train --ner db_skill_init -m de_dep_news_trf -es 0.2

works just fine.

Calling

python -m prodigy train --ner db_skill_init -m de_dep_news_trf -es 0.2 --gpu-id 0

throws the error "Only cupy arrays can be concatenated", which was marked as solved 3 years ago (see: TextCategorizer: TypeError: Only cupy arrays can be concatenated on v2.1.0a10 · Issue #3355 · explosion/spaCy · GitHub) :confused:

I made a fresh install of spaCy, Prodigy & thinc, but I still get the error :frowning:

> [2022-05-16 07:51:28,371] [INFO] Set up nlp object from config
> Components: ner
> Merging training and evaluation data for 1 components
>   - [ner] Training: 115 | Evaluation: 28 (20% split)
> Training: 115 | Evaluation: 28
> Labels: ner (2)
> [2022-05-16 07:51:28,577] [INFO] Pipeline: ['tok2vec', 'transformer', 'tagger', 'morphologizer', 'parser', 'lemmatizer', 'attribute_ruler', 'ner']
> [2022-05-16 07:51:28,577] [INFO] Resuming training for: ['transformer']
> [2022-05-16 07:51:28,585] [INFO] Created vocabulary
> [2022-05-16 07:51:28,586] [INFO] Finished initializing nlp object
> Traceback (most recent call last):
>   File "C:\Users\Jan\anaconda3\lib\runpy.py", line 197, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "C:\Users\Jan\anaconda3\lib\runpy.py", line 87, in _run_code
>     exec(code, run_globals)
>   File "C:\Users\Jan\AppData\Roaming\Python\Python39\site-packages\prodigy\__main__.py", line 61, in <module>
>     controller = recipe(*args, use_plac=True)
>   File "cython_src\prodigy\core.pyx", line 329, in prodigy.core.recipe.recipe_decorator.recipe_proxy
>   File "C:\Users\Jan\AppData\Roaming\Python\Python39\site-packages\plac_core.py", line 367, in call
>     cmd, result = parser.consume(arglist)
>   File "C:\Users\Jan\AppData\Roaming\Python\Python39\site-packages\plac_core.py", line 232, in consume
>     return cmd, self.func(*(args + varargs + extraopts), **kwargs)
>   File "C:\Users\Jan\AppData\Roaming\Python\Python39\site-packages\prodigy\recipes\train.py", line 278, in train
>     return _train(
>   File "C:\Users\Jan\AppData\Roaming\Python\Python39\site-packages\prodigy\recipes\train.py", line 190, in _train
>     nlp = spacy_init_nlp(config, use_gpu=gpu_id)
>   File "C:\Users\Jan\anaconda3\lib\site-packages\spacy\training\initialize.py", line 84, in init_nlp
>     nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
>   File "C:\Users\Jan\anaconda3\lib\site-packages\spacy\language.py", line 1309, in initialize
>     proc.initialize(get_examples, nlp=self, **p_settings)
>   File "C:\Users\Jan\anaconda3\lib\site-packages\spacy\pipeline\tok2vec.py", line 220, in initialize
>     self.model.initialize(X=doc_sample)
>   File "C:\Users\Jan\anaconda3\lib\site-packages\thinc\model.py", line 299, in initialize
>     self.init(self, X=X, Y=Y)
>   File "C:\Users\Jan\anaconda3\lib\site-packages\thinc\layers\chain.py", line 90, in init
>     curr_input = layer.predict(curr_input)
>   File "C:\Users\Jan\anaconda3\lib\site-packages\thinc\model.py", line 315, in predict
>     return self._func(self, X, is_train=False)[0]
>   File "C:\Users\Jan\anaconda3\lib\site-packages\thinc\layers\with_array.py", line 40, in forward
>     return _list_forward(cast(Model[List2d, List2d], model), Xseq, is_train)
>   File "C:\Users\Jan\anaconda3\lib\site-packages\thinc\layers\with_array.py", line 75, in _list_forward
>     Xf = layer.ops.flatten(Xs, pad=pad)  # type: ignore
>   File "C:\Users\Jan\anaconda3\lib\site-packages\thinc\backends\ops.py", line 250, in flatten
>     result = xp.concatenate(X)
>   File "<__array_function__ internals>", line 5, in concatenate
>   File "cupy\_core\core.pyx", line 1613, in cupy._core.core.ndarray.__array_function__
>   File "C:\Users\Jan\anaconda3\lib\site-packages\cupy\_manipulation\join.py", line 60, in concatenate
>     return _core.concatenate_method(tup, axis, out, dtype, casting)
>   File "cupy\_core\_routines_manipulation.pyx", line 534, in cupy._core._routines_manipulation.concatenate_method
>   File "cupy\_core\_routines_manipulation.pyx", line 553, in cupy._core._routines_manipulation.concatenate_method
> TypeError: Only cupy arrays can be concatenated

Can anyone help me with this? I can't get rid of this error.

We're having a bit of trouble reproducing this error locally, but I do have some follow-up questions.

Is there a reason why you're interested in using the transformer model as a base model instead of de_core_news_md or de_core_news_lg? The error you're experiencing seems to be related to the transformers part of the codebase, which those models don't need. You can also just use the blank:de model. Have you tried that?
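
For example, keeping your dataset and eval split the same, the train command with a non-transformer base model would look something like this (an untested sketch):

python -m prodigy train --ner db_skill_init -m de_core_news_lg -es 0.2 --gpu-id 0
python -m prodigy train --ner db_skill_init -m blank:de -es 0.2 --gpu-id 0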

hi, thanks for the reply. :slight_smile:

I want to train a transformer model for NER on the Prodigy-labeled data afterwards.
To get the spans for the annotations to fit the transformer's tokenizer correctly, I was told that I'd have to use the appropriate tokenization from the beginning.
(When I initially wanted to start last year, I hit missing tokens/gaps in the spaCy model, so this apparently had some rare limitations.) Isn't that necessary anymore?

My original idea/expectation was that I could just label the data using Prodigy and a word-ish tokenizer/embedding space like de_core_news_lg, and once finished there would be some tool available within Prodigy to generate a "neutral" export of the data and the annotations, e.g. the BIO format in the case of NER.

Now that I've already used the transformer tokenizer, I assume all the data stored in the DB will be in this format. So can I just change the model used "within"?

Another question / feature request :scream: if I may:

I use ner.manual and a JSON patterns file. I've labeled around 200 sets of data (looong texts), and it turns out to be a real pain in the ass that I have to manually relabel, again and again, terms that are not in the patterns dict.

So my idea is: wouldn't it be handier if there were an auto-additive ner.manual that I could feed my initial patterns JSON (storing it in the database as well), and after the annotations are made and submitted, these terms get (automatically) added to the patterns DB, so that in the next text they will be auto-marked as well?
This would speed up ner.manual and help foster pattern collections for different purposes.
Alternatively, a functionality to harvest already-made annotations and export them to the patterns JSON format would also be helpful...

I think this goes in the direction of ner.teach (which supposedly doesn't work properly with transformers (too big)?). Good/bad idea?

Let's answer both questions then. :smile:

If you have a look at the docs on NER and transformers you can see this comment.

> New in Prodigy v1.11 and spaCy v3
> spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run train and provide the config via the --config argument.

So the token alignment should be taken care of, assuming that you're using a modern version of Prodigy and spaCy v3.
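
In terms of commands, that workflow would look roughly like this (an untested sketch; transformer_config.cfg is just a placeholder for a transformer-based config you'd generate yourself, e.g. via spaCy's quickstart):

python -m prodigy data-to-spacy ./corpus --ner db_skill_init --eval-split 0.2
python -m prodigy train ./output --ner db_skill_init --config transformer_config.cfg --gpu-id 0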

Now, to answer your question on pattern files: you can write a script that re-creates the patterns file from the labelled entities.

This will require a little bit of custom code, but it's a trick that I have used in the past. The code I used was a script similar to this:

import srsly
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("<dataset name>")

patterns = list(srsly.read_jsonl("orig_patterns.jsonl"))

for e in examples:
    for span in e['spans']:
        relevant_text = e['text'][span['start']:span['end']]
        pattern = {"pattern": relevant_text, "label": span['label']}
        if pattern not in patterns:
            patterns.append(pattern)
            print(pattern)

srsly.write_jsonl("new_patterns.jsonl", patterns)

This should append the new strings as patterns in the format Prodigy expects, and you can then pass new_patterns.jsonl to Prodigy. This technique will require you to restart your Prodigy server once in a while though, because the patterns file is only loaded once at startup.
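
For example, you could then restart ner.manual with the extended patterns file along these lines (a sketch; examples.jsonl and the SKILL label are placeholders for your own source file and label):

python -m prodigy ner.manual db_skill_init de_core_news_lg ./examples.jsonl --label SKILL --patterns new_patterns.jsonl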

Just to check, have you explored the Vectors and Terminology section of the docs? There are some great techniques shown there that can help grow your patterns file as well.
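
As a sketch from memory (so double-check the recipe docs for the exact arguments; the seed terms, the skill_terms dataset and the SKILL label are just placeholders), building up a term list from word vectors and turning it into patterns could look like:

python -m prodigy terms.teach skill_terms de_core_news_lg --seeds "Python, Java, SQL"
python -m prodigy terms.to-patterns skill_terms ./term_patterns.jsonl --label SKILL --spacy-model de_core_news_lg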

Thanks for the reply!

> New in Prodigy v1.11 and spaCy v3
> spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run train and provide the config via the --config argument.

Yes, but isn't that actually the problem here?
I am using a spaCy transformer-based pipeline, de_dep_news_trf, in Prodigy, as stated in the documentation above.
It's working just fine with ner.manual, but it crashed with ner.teach - only when trying to use a GPU.

But that's no problem if I can just use de_core_news_lg for annotation purposes and thus avoid the transformer pipeline, since it doesn't seem to have any major advantages for annotation anyway?

import srsly
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("<dataset name>")

patterns = list(srsly.read_jsonl("orig_patterns.jsonl"))

for e in examples:
    for span in e['spans']:
        relevant_text = e['text'][span['start']:span['end']]
        pattern = {"pattern": relevant_text, "label": span['label']}
        if pattern not in patterns:
            patterns.append(pattern)
            print(pattern)

srsly.write_jsonl("new_patterns.jsonl", patterns)

Thank you so much! :smiley: :smiley:

The srsly.read_jsonl and srsly.write_jsonl functions are doing some magic :slight_smile:
They replace German special characters like the umlauts "äöü etc." in the UTF-8 text with their Unicode escapes (I guess for some reason).
Can this be suppressed with an option somehow, so the text still stays readable for humans for curating purposes?

Once annotation is done (which will be sped up remarkably with the extended patterns file now), I won't be using spaCy, however. So data-to-spacy may not be the first choice.

I will just need to export the annotated data from the DB in an application-neutral export format like BIO or CoNLL to feed it directly into some NER frameworks (not supported by spaCy).
I guess this will be as easy to generate as the snippet you provided above?

First, on the topic of srsly: the way the data is stored on disk may indeed look different from how it appears once it's back in memory. But let's check what it looks like when we load it back in.

import srsly

# This example contains text that we're interested in.
example = {"text": "äöü"}
srsly.write_jsonl("example.jsonl", [example])

next(srsly.read_jsonl("example.jsonl"))
# {'text': 'äöü'}
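
If you do want the file on disk itself to stay human-readable (e.g. for curating it outside Python), one option, sketched here with the standard library rather than srsly, is to write the JSONL yourself with ensure_ascii=False:

import json

example = {"text": "äöü"}
# ensure_ascii=False keeps umlauts as literal characters instead of \u escapes
with open("example_readable.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")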

I'm not familiar with other export formats like BIO or CoNLL, but if you can script it in Python, you can get the data in any format you like. I am curious to hear why you're considering doing the actual NER modelling outside of the spaCy stack, though. Could you expand on that? Did you try running a model from scratch or with de_core_news_lg as a starting point?

Yes, but I need to edit the text document outside of a Python environment for curating reasons.
So the question was whether srsly can also export a UTF-8 encoded raw-text file without the umlauts being converted to their Unicode escapes? (In the built-in json module, e.g., ensure_ascii=False can be set.)
If I convert a document to UTF-8 encoding in an editor (e.g. Notepad++), the special characters are still stored/shown in human-readable form.

I intend to use the TNER framework to train on the annotated data - it offers everything I need within a single function call. So I don't seem to need spaCy at all.

I'd just have to convert the annotated data from Prodigy directly into the very simple IOB file format, which is

one token per line followed by its entity tag:

I O
like O
Prodigy B-ORG
! O

It appears that spacy convert has an IOB export function (example file), and there's also a module called CoNLL spaCy. But that's all within spaCy, not Prodigy.

I managed to do an IOB export out of the DB with the code below.
I am not sure if it's correct though (are the spans correct, the tokenization?).
Also, I'd have to do a proper train/test/valid split on the whole set of annotated sentences beforehand. :sleepy:

import codecs
import spacy
from prodigy.components.db import connect

db = connect()
prodigy_annotations = db.get_dataset("db_skill_init")
examples = ((eg["text"], eg) for eg in prodigy_annotations)

#nlp = spacy.blank("de")
#nlp = spacy.load("de_dep_news_trf", exclude=["tagger","morphologizer","parser","lemmatizer","attribute_ruler"])
# Exclude "ner" as well, so the model's own predictions don't end up in the export.
nlp = spacy.load("de_core_news_lg", exclude=["ner","tagger","morphologizer","parser","lemmatizer","attribute_ruler"])

with codecs.open('./export/iob.txt', 'w', encoding='utf-8') as outfile:
    for doc, eg in nlp.pipe(examples, as_tuples=True):
        # Apply the annotated character spans from the DB to the Doc, so that
        # t.ent_iob_ / t.ent_type_ reflect the annotations rather than model output.
        spans = [doc.char_span(s["start"], s["end"], label=s["label"], alignment_mode="contract")
                 for s in eg.get("spans", [])]
        doc.set_ents([s for s in spans if s is not None])
        for t in doc:
            tag = f"{t.ent_iob_}-{t.ent_type_}" if t.ent_type_ else "O"
            outfile.write(f"{t.text} {tag}\n")

I think you could also first export to spaCy's format using Prodigy, after which you could use the convert command from spaCy. The benefit is that the Prodigy command comes with an --eval-split setting. I have never exported to IOB before, but this might be worth a try.
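
As a rough, untested sketch, the export step could look like this (./corpus is just a placeholder output directory); for the conversion step, the spacy convert docs are the place to check:

python -m prodigy data-to-spacy ./corpus --ner db_skill_init --eval-split 0.2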