I want to use Prodigy to manually annotate named entities with `ner.manual` (but possibly also use `ner.teach`) at the character-as-token (as opposed to word-as-token) level. This is for use with a very specific dataset that often runs “words” together and heavily abbreviates them. For example, the phrase “Two Green Lamborghini Gallardos and Four Red Ferrari Enzos” might be written as “2LAMBGDOSGRN, 4FARRENZRD”.
I assume I can do the basic annotation (i.e. `ner.manual`) part by replacing the spaCy tokeniser in the pipeline with something that just splits the text into individual characters?
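Something like this minimal sketch is what I have in mind (`CharTokenizer` is just my own name for it, not anything built in):

```python
import spacy
from spacy.tokens import Doc

class CharTokenizer:
    """Naive tokeniser: every character becomes its own token."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = list(text)
        # All trailing-space flags are False: space characters (if any)
        # are emitted as tokens in their own right, so doc.text round-trips.
        spaces = [False] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = CharTokenizer(nlp.vocab)

doc = nlp("2LAMBGDOSGRN")
print([t.text for t in doc])
# ['2', 'L', 'A', 'M', 'B', 'G', 'D', 'O', 'S', 'G', 'R', 'N']
```

(I realise I’d also have to sort out serialisation so a pipeline with this tokeniser can be saved and loaded back, but is that the right general idea?)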
Also, what would it take for the `ner.teach` part to work? I’m assuming I need the above, plus vectors trained on my dataset at the character (plus character n-gram?) level, plus something to replace the dependency parser (maybe giving every token a fixed POS like “CHAR” and making each character the child of the previous one)? For the vectors, see the sketch below.
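For the vectors, I was picturing something like training word2vec over character sequences, e.g. with gensim 4.x (a sketch only; `records.txt` is a placeholder for my data):

```python
from gensim.models import Word2Vec

# Each "sentence" is the list of characters in one record, so
# word2vec learns character-level co-occurrence vectors.
with open("records.txt", encoding="utf-8") as f:
    sentences = [list(line.strip()) for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=64,  # small vectors; the character "vocabulary" is tiny
    window=4,        # a few characters of context either side
    min_count=1,
    sg=1,            # skip-gram
)
model.wv.save_word2vec_format("char_vectors.txt")
```

Then presumably something like `spacy init vectors` to package those into a pipeline that `ner.teach` can load?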
Does this sound remotely plausible?