Custom Span Categorizer - Linebreaks?

If a span that I am labelling has line breaks as part of the string (e.g. "Mr. Peter \n Johnson") and I want to annotate "Peter \n Johnson", does this pose an issue at all? I guess I am wondering whether line breaks or any other special characters should be avoided when labelling for spancat.

Hi! By default, Prodigy sets "allow_newline_highlight": false to automatically prevent newlines from being included in entities and spans. For a lot of use cases (especially named entity recognition), this makes sense, so you don't end up with inconsistent spans that contain stray whitespace. But I guess for span categorization, it's a bit less important, so we should probably change the default for this recipe.

If you set "allow_newline_highlight": true in your prodigy.json, you should be able to highlight line breaks in the UI.
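
If you do allow newlines in highlights, one way to keep the spans consistent is to trim whitespace from the edges of each span afterwards while leaving any line breaks inside it untouched. Here's a minimal sketch of that idea in plain Python. It assumes spans are dicts with character offsets under "start" and "end" plus a "label"; the trim_span helper and the PERSON label are made up for illustration and aren't part of Prodigy or spaCy.

```python
def trim_span(text: str, span: dict) -> dict:
    """Return a copy of the span with whitespace trimmed from both edges.

    Whitespace (including newlines) inside the span is left as it is.
    """
    start, end = span["start"], span["end"]
    # Move the start forward past any leading whitespace.
    while start < end and text[start].isspace():
        start += 1
    # Move the end backward past any trailing whitespace.
    while end > start and text[end - 1].isspace():
        end -= 1
    return {**span, "start": start, "end": end}


text = "Mr. Peter \n Johnson\n"
# Hypothetical annotation that accidentally includes a leading space
# and a trailing newline.
span = {"start": 3, "end": 20, "label": "PERSON"}

trimmed = trim_span(text, span)
print(repr(text[trimmed["start"]:trimmed["end"]]))
```

The internal "\n" stays part of the span, so this only cleans up the boundaries rather than removing line breaks from the annotation.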

Thank you! I did some testing with \n stripped out and I'm actually quite happy with the results without them. That being said, I may try adding them back in, just to see what happens. Really loving Prodigy + spaCy, great work! Thanks again for the explanation.
