custom sense2vec or pre-trained vector

Hello,
To begin with, it's been super fun using Prodigy and its features!

My team and I have a script running that grabs earthquake-related tweets from the stream.
Our goal is to multi-label these tweets as Natural_Hazards, Casualties, Impact, High_intensity etc.

But I have one question about the sense2vec vectors.
@ines
Should I build a custom sense2vec vector that's big enough and consists only of earthquake-related tweets, or should I just get a pre-trained general Twitter vector, like the one GloVe offers (trained on around 2B tweets)?
And if I end up training a custom sense2vec vector, are the scripts in the sense2vec repository enough to preprocess the data, or do I need to hard-code some cleaning on top, for example removing punctuation and stopwords, stripping things like emojis, user mentions and hashtags, or expanding contractions? Would that hard-coded preprocessing of the corpus mess up the training of the vectors?

Thanks, that's nice to hear :slightly_smiling_face:

If you have enough raw text that's more specific, training your own sense2vec model could definitely be useful and would likely produce better suggestions. I don't know that much about earthquakes, but I'm sure there are a lot of expressions that are quite common in this context and very rarely mentioned outside of it.

The scripts in the repo should cover all preprocessing needed to train the vectors (merging multi-word expressions, concatenating tags and entity labels if available etc.).
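To give a rough idea of what that step produces (this is just a minimal sketch using spaCy directly, not the repo's actual script): entities get merged into single tokens, and every token is written out as text|SENSE, where the sense is the entity label or part-of-speech tag.

import spacy

nlp = spacy.load("en_core_web_sm")  # any pretrained pipeline with a tagger and NER

def doc_to_line(doc):
    # Merge named entities into single tokens, so e.g. "New Zealand" becomes one token
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    keys = []
    for token in doc:
        if token.is_space:
            continue
        sense = token.ent_type_ or token.pos_
        # sense2vec-style key: spaces replaced by underscores, sense tag appended
        keys.append(token.text.replace(" ", "_") + "|" + sense)
    return " ".join(keys)

doc = nlp("Strong aftershock felt in Wellington")
print(doc_to_line(doc))
# prints something like: Strong|ADJ aftershock|NOUN felt|VERB in|ADP Wellington|GPE

The training scripts then learn vectors over those keys, which is what lets you later query something like "aftershock|NOUN" for similar terms.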

That said, depending on the quality of your data, some cleanup might help, like removing broken markup or links. Just make sure you're not preprocessing too much, because you may accidentally destroy potentially relevant data points. For example, emoji can be pretty interesting: sense2vec: Semantic Analysis of the Reddit Hivemind · Explosion. Same with Twitter usernames or hashtags – if you keep them intact, you could query your vectors for similar hashtags or users, which could be super interesting as well.
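If you do decide to clean things up, something minimal along these lines is usually enough (just a sketch, assuming your tweets come in as plain text): unescape HTML entities and drop links, but leave emoji, hashtags and mentions alone.

import html
import re

URL_RE = re.compile(r"https?://\S+")

def clean_tweet(text):
    # Tweet dumps often contain HTML entities like &amp;, so unescape them
    text = html.unescape(text)
    # Links rarely carry useful distributional information, so drop them
    text = URL_RE.sub("", text)
    # Deliberately keep emoji, #hashtags and @mentions intact
    return " ".join(text.split())

print(clean_tweet("Felt the #earthquake in Athens 😱 https://t.co/xyz &amp; so did my neighbours"))
# Felt the #earthquake in Athens 😱 & so did my neighbours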


Hey, thanks for the quick reply!
This is a follow-up on my project, but I don't know if it's relevant to this thread (sorry if it isn't).

Is it possible to annotate a whole sentence (tweet) with one or more categories/labels based on certain vocabularies/word annotations, or generally to classify sentences with one or more categories? An example:
Let's say I have an Impact.txt, a DidYouFeelIt.txt and more annotation files, each containing words relevant to that category.
Am I able to categorize/classify a whole tweet based on these words?

I tried annotating tweets myself:

prodigy textcat.manual py_earthquakes ./dataset.jsonl --label Impact,DidYouFeelIt,Spam

and while I annotated around 300 tweets, choosing one or more categories (or rejecting some), the terms.to-patterns recipe returned a JSONL like:

{"label":"Impact,DidYouFeelIt,Spam","pattern":"earthquake! at the pandemic"}

Does that mean it didn't even annotate the whole sentence as desired? (I classified this tweet as Impact only.)
I searched a bit online for similar problems and haven't seen anyone classifying whole sentences based on the words they contain (or doing multi-label classification in general).
I feel a bit confused, to be honest. If you can give me any guidance, it would be super helpful!
Thanks a lot in advance

What are you trying to achieve by calling terms.to-patterns? I think this is likely what's causing the confusion here, because the idea of terms.to-patterns is to create match patterns from single words or phrases produced by a recipe like terms.teach or sense2vec.teach. So it expects single words with a label that you've accepted or rejected, and will then generate a patterns file you can use to match those words. It doesn't really make sense to run this conversion on text classification annotations.
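Just to illustrate the difference (simplified, the exact fields can vary a bit between versions): terms.teach / sense2vec.teach save individual terms, roughly like

{"text": "aftershock", "answer": "accept"}
{"text": "tremor", "answer": "accept"}

and terms.to-patterns turns the accepted ones into match patterns along the lines of

{"label": "Impact", "pattern": [{"lower": "aftershock"}]}
{"label": "Impact", "pattern": [{"lower": "tremor"}]}

Your textcat.manual annotations, on the other hand, store the full tweet text plus the labels you selected, which is why converting them produced the output you saw.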