Is there a vectorized way to get [label, text]?

Hi Ines!

I am wondering if there is a way to get [label, text] in one shot as a vector, without looping. I understand that ner.pipe processes all the texts for you, but what you get back is an object whose .ents contain 'label' and 'text'.

Thank you very much!

I'm not 100% sure I understand the question correctly – so you want to extract the text and entity label of each entity as a vector/array instead of strings?

Internally, spaCy stores everything as integer IDs, and the strings are only computed when you access them. The same goes for Span objects like the ones in doc.ents, which are only views of the Doc. So instead of accessing the entity spans and getting their texts, you can also get the Token.orth, Token.ent_type and Token.ent_iob for each token, or use Doc.to_array to get the same values as a single numpy array.
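As a rough sketch of what that looks like (assuming a recent spaCy version; the blank pipeline and the hand-set entities are only stand-ins for a trained model, and the example sentence is made up):

```python
import spacy
from spacy.attrs import ORTH, ENT_TYPE, ENT_IOB
from spacy.tokens import Span

# Assumption: a blank English pipeline has no NER component,
# so the entities are set by hand purely for illustration.
nlp = spacy.blank("en")
doc = nlp("Ines lives in Berlin")
doc.ents = [Span(doc, 0, 1, label="PERSON"), Span(doc, 3, 4, label="GPE")]

# Per-token integer IDs: word ID, entity label ID, IOB code.
token_ids = [(t.orth, t.ent_type, t.ent_iob) for t in doc]

# The same values as one numpy array with one row per token
# and one column per requested attribute.
arr = doc.to_array([ORTH, ENT_TYPE, ENT_IOB])
print(arr.shape)
```

The IDs in the first two columns are hashes that can be looked up in nlp.vocab.strings if you need the strings back.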

Thanks for your response, Ines!

To clarify, I created a list of text strings to feed into my trained prodigy model:
my_texts = list(my_pd_frame)
Run through my model:
mydocs = list(nlp.pipe(my_texts))

But I cannot convert my mydocs list to a numpy array. Are you suggesting converting the object returned by nlp.pipe?

Thank you very much in advance for your help!

mydocs here is a list of spaCy Doc objects. Doc objects provide various methods and attributes for accessing the annotations – for instance, Doc.to_array, which outputs the attributes you're interested in as a numpy array. For example:

my_np_arrays = [doc.to_array(["ORTH", "ENT_TYPE", "ENT_IOB"]) for doc in nlp.pipe(my_texts)]
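If you need the strings back afterwards, here is a minimal sketch of one way to do it. The helper names decode_array and entity_pairs are hypothetical (they are not part of spaCy), and the blank pipeline with hand-set entities again stands in for your trained model:

```python
import spacy
from spacy.tokens import Span


def decode_array(doc, arr):
    """Hypothetical helper: map the ORTH and ENT_TYPE ID columns back to strings.

    Expects the columns in the order ORTH, ENT_TYPE, ENT_IOB.
    """
    strings = doc.vocab.strings
    rows = []
    for orth, ent_type, iob in arr:
        # An ENT_TYPE of 0 means the token has no entity label.
        label = strings[int(ent_type)] if int(ent_type) else ""
        rows.append((strings[int(orth)], label, int(iob)))
    return rows


def entity_pairs(docs):
    """Hypothetical helper: collect [label, text] for every entity in every doc."""
    return [[ent.label_, ent.text] for doc in docs for ent in doc.ents]


# Assumption: entities are set by hand because a blank pipeline has no NER.
nlp = spacy.blank("en")
doc = nlp("Ines lives in Berlin")
doc.ents = [Span(doc, 0, 1, label="PERSON"), Span(doc, 3, 4, label="GPE")]

arr = doc.to_array(["ORTH", "ENT_TYPE", "ENT_IOB"])
rows = decode_array(doc, arr)
pairs = entity_pairs([doc])
```

The array route gives you one row per token, while entity_pairs gives one [label, text] entry per entity span, which may be closer to what you were originally after.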