I will be honest and say I am very new to machine learning and NER. I have played with the classifiers and found it quite easy to classify a string as a whole into a type, but I am not even sure that what I want to do below is possible. I am hoping someone who knows a lot more than me can say whether it is possible before I look at buying Prodigy.
I wanted to know if it would be possible to use NER and machine learning to parse filenames and extract all the information in them. My thought is along the lines of collecting lots of examples of the different types of filename used for different purposes, and training a model for each type. Users of the system could then extend it by manually annotating the filenames that are not yet supported.
I have tried to do this with regex, using things like TV series filenames from torrent sites. They give a nice large set of input data with a lot of variations that are easy for a human to read, and you can spot the patterns by eye. With regular expressions, though, you end up with 50-plus expressions and still do not cover everything. It also meant the idea was a failure, as that was just one file type and I was hoping for something more generally useful. I did find it interesting, because it involves multiple languages, and the pattern for the same data changes with the sub type, the source and a number of other factors.
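For context, here is a minimal sketch in Python of the kind of expression I mean. The pattern and the name `parse_scene_style` are made up for illustration and are not my actual code. It handles one naming style and nothing else, which is exactly the problem.

```python
import re

# One illustrative pattern for a single naming style:
#   "<series> S<season>E<episode> <resolution> <codec>-<group>"
SCENE_STYLE = re.compile(
    r"^(?P<series>.+?)[ ._]S(?P<season>\d{1,2})E(?P<episode>\d{1,3})"
    r"[ ._](?P<res>\d{3,4}p)[ ._](?P<codec>[xh]\.?26[45])-(?P<group>\w+)$",
    re.IGNORECASE,
)

def parse_scene_style(name):
    """Return the named fields as a dict, or None if the name does not fit."""
    match = SCENE_STYLE.match(name)
    return match.groupdict() if match else None

print(parse_scene_style("MasterChef Australia S10E53 480p x264-mSD"))
# The extra "WEB" token is enough to break this pattern:
print(parse_scene_style("Adam DeVines House S01E01 1080p WEB x264-KLINGON"))
# So is a different episode notation and bracket layout:
print(parse_scene_style("[Commie] Fune wo Amu - 06 [BD 720p AAC] [D644AC2F]"))
```

The first call returns all six fields and the other two return `None`. Each new variation needs either another expression or a more complicated one.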
An example of what I want to do is something like this (yes, I want all the info, not just the series name):
‘MasterChef Australia S10E53 480p x264-mSD’
From that I would get things like Series: MasterChef Australia, Season: 10, Episode: 53, Res: 480p, VideoCodec: x264, Group: mSD
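To make the target concrete, here is a rough sketch of how I imagine one annotated example could be stored as character-offset spans, which is the general shape of NER training data I have seen. The label names and the helper `to_spans` are my own invention, and I do not know whether a given tool's tokenizer would allow spans inside a token like `S10E53`.

```python
text = "MasterChef Australia S10E53 480p x264-mSD"

# (label, substring) pairs in left-to-right order; labels are hypothetical.
fields = [
    ("SERIES", "MasterChef Australia"),
    ("SEASON", "10"),
    ("EPISODE", "53"),
    ("RES", "480p"),
    ("CODEC", "x264"),
    ("GROUP", "mSD"),
]

def to_spans(text, fields):
    """Locate each substring in order and record its character offsets."""
    spans = []
    cursor = 0  # search only to the right of the previous span
    for label, value in fields:
        start = text.index(value, cursor)
        end = start + len(value)
        spans.append({"start": start, "end": end, "label": label})
        cursor = end
    return spans

example = {"text": text, "spans": to_spans(text, fields)}
for span in example["spans"]:
    print(span["label"], repr(text[span["start"]:span["end"]]))
```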
A few examples of the variations:
[Tsundere] To Aru Majutsu no Index - 12 [BDRip h264 1280x720 10bit FLAC][8FB6594C]
[Commie] Fune wo Amu - 06 [BD 720p AAC] [D644AC2F]
(Hi10) True Tears 05 (BD 720p)
(Nogizaka46) NOGIBINGO! - ノギビンゴ (Season 1-6)
12 Monkeys Season 4 Complete 720p HDTV x264 [i_c]
Аванпост / The Outpost [01x01-02 из 10] (2018) WEB-DLRip | Jaskier
Adam DeVines House S01E01 1080p WEB x264-KLINGON
Adam DeVines House - 01x01 1080p WEB x264.mkv
Adam_DeVines_House_-01x01[1080p WEB x264]
Angels Of Death S01E01 Kill Me Please DUBBED WEB x264-DARKFLiX
I can find heaps of material on learning by example and NER, but most if not all of it is based on natural language and normal sentences, not something like filenames. A lot of the samples and info on the Prodigy website make it seem like it may be the tool that could help a noob like me do this sort of thing, provided Prodigy and the underlying NER and machine learning tools can support it, but I am not sure if it is even possible.
I can find samples for detecting types of flowers or other objects in images, but I can't find anything for finding patterns in strings and tagging the data based on those patterns when the strings are not normal language and sentences.
My other concern is that a lot of the NER training examples I have seen seem to tag based on the words you list in training, and not really on the pattern of the data: the shape of the string, its position, the characters around it and the categories of the words around it.
Is something like this possible, and if it is, is this the right tool to start learning about it with?
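By "pattern of the data" I mean something like the sketch below, which reduces a token to a shape by replacing uppercase letters, lowercase letters and digits with placeholder characters. This is only my own illustration of the idea. The function `shape` is hypothetical, and I do not know whether or how the NER models behind Prodigy use features like this.

```python
def shape(token):
    """Map a token to a coarse shape: X = uppercase, x = other letter, d = digit."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)  # keep punctuation such as '-' or '[' as-is
    return "".join(out)

# Tokens never seen together in training still share a shape:
for token in ["S10E53", "S01E01", "480p", "720p", "x264", "h264", "01x01"]:
    print(token, "->", shape(token))
```

Here `S10E53` and `S01E01` both become `XddXdd`, and `480p` and `720p` both become `dddx`, so a model looking at shapes and neighbours would not need every episode number listed in advance.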
Regards,
Chris