Bad formatting in gui for manual tagging

rkeyvani · March 7, 2019, 6:32pm

I am streaming .txt documents into Prodigy for my team to manually label some custom named entities. The formatting in the text documents is pretty clean, but it's terrible in Prodigy, probably because of the tokenization. Is there any way to make the formatting better?

ines · March 7, 2019, 6:44pm

Hi! What do you mean by “bad formatting” – do you have an example?

The main purpose of the pre-tokenization is to allow faster highlighting and also to flag potential problems and mismatches as early as possible in the development phase. If you’re planning on training a model later on, this is relevant because your tokenization should match the entities you want to predict.

If you want to tokenize the text differently, you can either plug in a customised model with different tokenization rules, or provide your own tokenization via the "tokens" property in the data. Each token should be a “highlightable unit”. You could also mark tokens as "disabled": true to make them unselectable, for example punctuation or other words that should never be part of an entity. This can significantly reduce human error.
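As a rough sketch of the second option, the snippet below builds a task with its own "tokens" list by splitting on whitespace and marking punctuation-only tokens as disabled. The helper name `make_task` and the `disabled_pattern` parameter are made up for illustration, and the token fields used here ("text", "start", "end", "id") are an assumption about the expected token format, so check them against your PRODIGY_README.html. Newline tokens are not produced by this sketch.

```python
import json
import re


def make_task(text, disabled_pattern=r"^\W+$"):
    """Build a task dict with whitespace-delimited tokens and character offsets.

    Hypothetical helper: the token fields are an assumption about the format
    Prodigy expects for pre-tokenized input.
    """
    tokens = []
    for i, match in enumerate(re.finditer(r"\S+", text)):
        token = {
            "text": match.group(),
            "start": match.start(),
            "end": match.end(),
            "id": i,
        }
        # Tokens made up only of non-word characters become unselectable.
        if re.match(disabled_pattern, match.group()):
            token["disabled"] = True
        tokens.append(token)
    return {"text": text, "tokens": tokens}


if __name__ == "__main__":
    # One JSON object per line, so the output can be streamed in as JSONL.
    print(json.dumps(make_task("Policy No. : ABC-123")))
```

With this input, the ":" token is the only one marked as disabled, so annotators cannot accidentally include it in a span.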

rkeyvani · March 8, 2019, 10:37am

Hello Ines,
Here you can see the .txt file and how it's displayed in spaCy. This would be difficult for a person to annotate:

before
after

script

from os import listdir
from os.path import join
import json


def load_text_files(text_dir):
    # Print the first 50 files in text_dir as one {"text": ...} JSON object per line.
    for quote_file in listdir(text_dir)[:50]:
        full_path = join(text_dir, quote_file)
        with open(full_path, encoding="utf-16") as f:
            file_text = f.read()
        print(json.dumps({"text": file_text}))


if __name__ == "__main__":
    data_dir = "Management Liability Documents"
    load_text_files(data_dir)

ines · March 8, 2019, 11:11am

Thanks for the example – I understand what you mean now! The thing here is that the ner_manual interface operates on raw tokenized text and it also doesn’t use a fixed-width font by default (whereas your document relies on that for alignment).

Change the main theme font to a monospace font and/or use the card CSS or global CSS option to change the main display font of the text. This will make all characters the same width so alignment using spaces actually works.
You might also want to increase the card width or decrease the font size to fit more text on one line.
Newline tokens currently need to be single tokens containing only the newline (not several or something else). We might be able to relax that if the token contains only newlines, but for now, try replacing multiple newlines with single newlines. Later on you can also customise the tokenization rules or supply your own tokens via the "tokens" property.

That said, the ner_manual interface is definitely optimised for natural language, written text and named entity recognition rather than highlighting sequences in tabular data.

Btw, are you able to share what exactly you’re looking to label? Because your data is so structured and you’re parsing it anyways, I’m wondering if there are some clever tricks you can use to automate some of it or make it even easier for the annotators.

rkeyvani · March 19, 2019, 6:34pm

Hello Ines, I can share them. Is there an email address I could send them to? Basically, I am trying to do NER for tabular data as well as text data. For tabular data, spaCy seems to fall apart. We are experimenting with a Faster R-CNN and Prodigy YOLO as an annotator for the position file for tabular object detection, with some conversions to massage it into Luminoth. But I always wondered if there is a more clever way to extract some tabular data, by manual tagging to create a training file, or some clever semantic linking or coreference. We are a small startup and don't have infinite time or money to play around. Any thoughts on NER for tabular data as it pertains to Prodigy/spaCy would be helpful.

ines · March 20, 2019, 11:13pm

It’s true that a standard pre-trained model that was trained on web and newspaper text and natural language in general probably doesn’t do very well on tabular data out-of-the-box.

I’ve seen some approaches that framed a similar task as a computer vision problem and would predict bounding boxes and then apply a separate OCR step to extract the text. This could also be pretty straightforward in terms of labelling, because your annotators only need to draw boxes around the text. So this is definitely something you could try!

rkeyvani · March 21, 2019, 8:42pm

Okay, thank you Ines. Last question: you mentioned a couple of suggestions (quoted below) to make annotation cleaner. Do you have an example of this, or a reference to a thread with more detail on how to set up Prodigy with the custom format?

Change the main theme font to a monospace font and/or use the card CSS or global CSS option to change the main display font of the text. This will make all characters the same width so alignment using spaces actually works.
You might also want to increase the card width or decrease the font size to fit more text on one line.
Newline tokens currently need to be single tokens containing only the newline (not several or something else). We might be able to relax that if the token contains only newlines, but for now, try replacing multiple newlines with single newlines. Later on you can also customise the tokenization rules or supply your own tokens via the "tokens" property.

ines · March 22, 2019, 9:02am

Those are the "card_css" and "global_css" options in the prodigy.json or recipe config. You can find more details on that in your PRODIGY_README.html or here.

Those are the "cardMaxWidth", "largeText", "mediumText" and "smallText" settings in "custom_theme". For example:

{"custom_theme": {"cardMaxWidth": 1000, "largeText": 12, "mediumText": 12, "smallText": 10}}
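To combine the font and width settings in one place, here is a minimal sketch that builds the config as a Python dict and prints it as JSON that could be pasted into prodigy.json. The option names come from this thread; the `.prodigy-content` selector is an assumption about the class that wraps the displayed text, so inspect the page in your browser to confirm it for your version.

```python
import json

# Assumption: ".prodigy-content" is the CSS class around the annotated text.
# Verify the class name in your Prodigy version before relying on it.
config = {
    "global_css": ".prodigy-content { font-family: monospace; }",
    "custom_theme": {
        "cardMaxWidth": 1000,  # wider card so more text fits on one line
        "smallText": 10,
        "mediumText": 12,
        "largeText": 12,
    },
}

if __name__ == "__main__":
    print(json.dumps(config, indent=2))
```

The same dict could probably also be returned under the "config" key of a custom recipe instead of being written to prodigy.json.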

This would just be a pre-processing task. So even running a simple search and replace over the raw text should do the trick.
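A minimal sketch of that search and replace, assuming the goal is simply to turn every run of line breaks into a single newline before the text is sent to Prodigy. The function name `collapse_newlines` is made up for illustration. Blank lines that contain only spaces or tabs are treated as empty, while the leading spaces of the next real line are kept so that column alignment survives.

```python
import re


def collapse_newlines(text):
    """Replace runs of consecutive line breaks with a single newline."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A newline followed by one or more blank (or whitespace-only) lines
    # collapses to a single newline.
    return re.sub(r"\n(?:[ \t]*\n)+", "\n", text)


if __name__ == "__main__":
    sample = "Insured:  ACME\r\n\r\n\r\nLimit:    1,000\n \n  Notes"
    print(repr(collapse_newlines(sample)))
    # 'Insured:  ACME\nLimit:    1,000\n  Notes'
```

In the loader script from earlier in the thread, this could be applied to file_text right before the json.dumps call.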
