I am streaming txt documents into Prodigy for my team to manually label some custom named entities. The formatting in the text documents is pretty clean, but it's terrible in Prodigy, probably because of the tokenization. Is there any way to make the formatting better?
Hi! What do you mean by “bad formatting” – do you have an example?
The main purpose of the pre-tokenization is to allow faster highlighting and also to flag potential problems and mismatches as early as possible in the development phase. If you’re planning on training a model later on, this is relevant because your tokenization should match the entities you want to predict.
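To illustrate the mismatch problem, here's a minimal sketch using a blank spaCy pipeline (the example text is made up): `Doc.char_span` returns a span only when the character offsets line up with token boundaries, and `None` when they cut through a token – exactly the kind of mismatch you'd want to catch early.

```python
import spacy

# A blank pipeline is enough to check tokenization alignment
nlp = spacy.blank("en")
doc = nlp("hello world")

# Offsets that line up with token boundaries give you a Span...
assert doc.char_span(0, 5).text == "hello"
# ...offsets that cut through a token return None, i.e. a mismatch
assert doc.char_span(0, 4) is None
```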
If you want to tokenize the text differently, you can either plug in a customised model with different tokenization rules, or provide your own tokenization via the "tokens" property in the data. Each token should be a "highlightable unit". You could also mark tokens as "disabled": true to make them unselectable (for example, punctuation or other words that should never be part of an entity). This can significantly reduce human error.
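For example, a pre-tokenized task with disabled tokens could look something like this – the text here is just an invented example, and each token carries its character offsets, an "id", a "ws" flag for trailing whitespace, and the optional "disabled" flag:

```python
# Sketch of a pre-tokenized annotation task with some tokens disabled
text = "Hello, world!"
task = {
    "text": text,
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0, "ws": False},
        {"text": ",", "start": 5, "end": 6, "id": 1, "ws": True, "disabled": True},
        {"text": "world", "start": 7, "end": 12, "id": 2, "ws": False},
        {"text": "!", "start": 12, "end": 13, "id": 3, "ws": False, "disabled": True},
    ],
}

# Sanity check: every token's offsets should slice back out of the raw text
for token in task["tokens"]:
    assert text[token["start"]:token["end"]] == token["text"]
```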
Here you can see the txt file and how it's displayed – this would be difficult for a person to annotate:
```python
from os import listdir
from os.path import join

for quote_file in listdir(text_dir)[:50]:
    full_path = join(text_dir, quote_file)
    file_text = ""
    with open(full_path, encoding="utf-16") as f:
        file_text = f.read()

if __name__ == "__main__":
    data_dir = "Management Liability Documents"
```
Thanks for the example – I understand what you mean now! The thing here is that the ner_manual interface operates on raw tokenized text, and it also doesn't use a fixed-width font by default (whereas your document relies on that for alignment).
- Change the main theme font to a monospace font and/or use the card CSS or global CSS option to change the main display font of the text. This will make all characters the same width so alignment using spaces actually works.
- You might also want to increase the card width or decrease the font size to fit more text on one line.
- Newline tokens currently need to be single tokens containing only the newline (not several newlines or other characters). We might be able to relax that if the token contains only newlines, but for now, try replacing multiple newlines with single newlines. Later on, you can also customise the tokenization rules or supply your own tokens via the "tokens" property.
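To make the first two points concrete, the monospace font and wider card could look something like this in your prodigy.json – the "global_css" and "custom_theme" settings are the documented hooks for this, but the exact selector and values here are assumptions you'd want to adjust to your setup:

```json
{
  "global_css": ".prodigy-content { font-family: monospace; font-size: 14px }",
  "custom_theme": {"cardMaxWidth": 1200}
}
```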
That said, the ner_manual interface is definitely optimised for natural language, written text and named entity recognition rather than highlighting sequences in tabular data.
Btw, are you able to share what exactly you’re looking to label? Because your data is so structured and you’re parsing it anyways, I’m wondering if there are some clever tricks you can use to automate some of it or make it even easier for the annotators.
Hello Ines, I can share them – is there an email I could send them to? Basically I am trying to do NER for tabular data as well as text data. For tabular data spaCy seems to fall apart, so we are experimenting with a Faster R-CNN, with Prodigy and YOLO as an annotator for the position file for tabular object detection, plus some conversions to massage it into Luminoth. But I always wondered if there is a more clever way to extract some tabular data by manual tagging to create a training file, or some clever semantic linking or coreference, but we are a small startup and don't have infinite time or money to play around. Any thoughts on NER for tabular data as it pertains to Prodigy/spaCy would be helpful.
It’s true that a standard pre-trained model that was trained on web and newspaper text and natural language in general probably doesn’t do very well on tabular data out-of-the-box.
I’ve seen some approaches that framed a similar task as a computer vision problem and would predict bounding boxes and then apply a separate OCR step to extract the text. This could also be pretty straightforward in terms of labelling, because your annotators only need to draw boxes around the text. So this is definitely something you could try!