Could I do this in Prodigy?

I will be honest and say I am very new to machine learning and NER. I have played with the classifiers and found it quite easy to classify my string as a whole to a type, but I am not even sure whether what I want to do below is possible. I am hoping someone who knows a lot more than me can say if it is possible before I look at buying Prodigy.

I wanted to know if it would be possible to use NER and machine learning to parse filenames and extract all the information in them. My thought is along the lines of collecting lots of examples of the different types of filename used for different purposes, and having a model and NER trained for those types that users of the system could extend by manually annotating filenames that are not yet supported.

I have tried to do this with regex on things like TV series filenames from torrent sites. They give a nice large set of input data with a lot of variations that are easily readable to the human eye, and you can spot the patterns, but with regular expressions it means you have 50-plus expressions and still do not cover everything. It also meant the idea was a failure, as that was just one file type and I was hoping for something more useful. I did find it interesting because it involves multiple languages, and the patterns for the same data are affected by subtype, source and a number of other factors.

An example of what I want to do is something like this (yes, I want all the info, not just the series name):
‘MasterChef Australia S10E53 480p x264-mSD’
I would get things like Series: MasterChef Australia Season: 10 Episode: 53 Res: 480p VideoCodec: x264 Group: mSD

A few examples of the variations:
[Tsundere] To Aru Majutsu no Index - 12 [BDRip h264 1280x720 10bit FLAC][8FB6594C]
[Commie] Fune wo Amu - 06 [BD 720p AAC] [D644AC2F]
(Hi10) True Tears 05 (BD 720p)
(Nogizaka46) NOGIBINGO! - ノギビンゴ (Season 1-6)
12 Monkeys Season 4 Complete 720p HDTV x264 [i_c]
Аванпост / The Outpost [01x01-02 из 10] (2018) WEB-DLRip | Jaskier
Adam DeVines House S01E01 1080p WEB x264-KLINGON
Adam DeVines House - 01x01 1080p WEB x264.mkv
Adam_DeVines_House_-01x01[1080p WEB x264]
Angels Of Death S01E01 Kill Me Please DUBBED WEB x264-DARKFLiX

I can find a heap of stuff on learning by example and NER, but most if not all of it is based on language and normal sentences, not something like filenames. A lot of the samples and info on the Prodigy website make it seem like it may be the tool that could help a noob like me do this sort of thing, if Prodigy and the underlying NER and machine learning tools can support it, but I am not sure if it is even possible.

I can find samples for detecting types of flowers or other objects in images, but I can’t find anything for finding patterns in strings and tagging the data based on those patterns when the strings are not normal language and sentences.

My other concern is that a lot of the NER training examples I have seen tag based on words that you list in training, not really on the pattern of the data: the string itself, its position, the characters around it and the categories of the other words around it.

Is something like this possible, and if so, is this the right tool to start learning about it with?


In theory, maybe this would work. I’m not sure though — it’s very different from what the NER is set up to do, and so I think you’ll hit situations where you’re fighting built-in assumptions. You would also have to change the tokenization rules significantly. By default, S01E01 would be marked as a single token, while it has two pieces of information for you — so you’ll have to split that up. I think in general the tokenization will be a big part of the problem.
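To make the tokenization point concrete, here is a minimal sketch in plain Python (standard library only, not spaCy's tokenizer). The `TOKEN_RE` and `SEASON_EPISODE_RE` patterns and the `tokenize` helper are hypothetical names made up for illustration, and the splitting rules are assumptions that only cover the simple `S10E53` style seen above; variants such as `01x01` would need their own rule.

```python
import re

# Assumed splitting rules (illustrative only):
# - each bracket or parenthesis is its own token
# - a hyphen is its own token
# - underscores and whitespace act as separators and are dropped
TOKEN_RE = re.compile(r"[\[\]()]|[^\s\[\]()_\-]+|-")

# A combined marker such as S10E53 carries two pieces of information,
# so it is split into a season part and an episode part.
SEASON_EPISODE_RE = re.compile(r"^(S\d+)(E\d+)$", re.IGNORECASE)


def tokenize(filename):
    """Split a filename into tokens, breaking S10E53 into S10 and E53."""
    tokens = []
    for raw in TOKEN_RE.findall(filename):
        match = SEASON_EPISODE_RE.match(raw)
        if match:
            tokens.extend(match.groups())
        else:
            tokens.append(raw)
    return tokens


print(tokenize("MasterChef Australia S10E53 480p x264-mSD"))
# ['MasterChef', 'Australia', 'S10', 'E53', '480p', 'x264', '-', 'mSD']
```

Splitting on every hyphen is a deliberate simplification: it separates `x264-mSD` nicely but would also break up something like `WEB-DLRip`, which is the kind of case where the rules would need more care.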

My advice would be to try getting the tokenization right with spaCy, or a different tool if you have one that’s easier to work with for your use-case. If you can tokenize, then you might find spaCy’s Matcher rules a better fit for your problem.

The trick will be to define your own lexical attributes, so that you have variables to reference to make your patterns easier to write.
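As a rough illustration of that idea, the sketch below uses plain Python rather than spaCy's actual Matcher API. Everything in it is an assumption made for the example: the attribute names (`IS_SEASON` and so on), the `PATTERNS` table, the `match_tokens` helper and the tiny codec list are all hypothetical. It only shows how named lexical attributes let patterns be written over tokens instead of raw characters.

```python
import re

# Hypothetical lexical attributes: each one maps a token string to True/False.
LEXICAL_ATTRS = {
    "IS_SEASON": lambda t: re.fullmatch(r"[Ss]\d+", t) is not None,
    "IS_EPISODE": lambda t: re.fullmatch(r"[Ee]\d+", t) is not None,
    "IS_RESOLUTION": lambda t: re.fullmatch(r"\d{3,4}p", t) is not None,
    # Assumed codec list, taken from the example filenames only.
    "IS_CODEC": lambda t: t.lower() in {"x264", "h264"},
}

# Each pattern is a list of attribute names, one per consecutive token.
PATTERNS = {
    "SEASON_EPISODE": ["IS_SEASON", "IS_EPISODE"],
    "RESOLUTION": ["IS_RESOLUTION"],
    "VIDEO_CODEC": ["IS_CODEC"],
}


def match_tokens(tokens):
    """Return (label, start, end) for every pattern match; end is exclusive."""
    found = []
    for label, attrs in PATTERNS.items():
        width = len(attrs)
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            if all(LEXICAL_ATTRS[a](tok) for a, tok in zip(attrs, window)):
                found.append((label, start, start + width))
    return found


# Tokens are assumed to be already split (S10E53 -> S10, E53).
tokens = ["MasterChef", "Australia", "S10", "E53", "480p", "x264", "-", "mSD"]
for label, start, end in match_tokens(tokens):
    print(label, tokens[start:end])
# SEASON_EPISODE ['S10', 'E53']
# RESOLUTION ['480p']
# VIDEO_CODEC ['x264']
```

The point of the attribute layer is that a new filename convention usually means adding or adjusting one small predicate, rather than rewriting a long regular expression.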

It’s possible Prodigy would help you train an NER system, but you’d still want to use it in combination with a lot of rules — so it’s probably not the best place to start.

P.S. This is a lot of trouble to go to organize the ID3 tags for your pirate media :p. Surely there’s a better way.

Thanks for the feedback. It gives me a better starting point than I had before. While this could be good for my personal media, that is not really my goal. It just seemed like the easiest test for the idea. I was hoping that, if it was at all successful, I could make it more of a service, as even at work we have thousands of batch files flying everywhere that all have patterns to them from which more info could be extracted and stored, and there are other use cases too.

My hope was to get a component that can learn by example, and then build a system around it to collect those examples from people and to automatically train and retrain the system as it goes. It sounds like it is quite hard to do, and I guess that is why no one has done it, compared to things like object recognition in images.