Missing full stop when using "dep.correct"

joseph8198 · May 12, 2022, 2:06am

I am trying to create a CoNLL-U format dataset from multiple tagged datasets, each covering a different task (POS, NER, dep, etc.).
My first attempt is to tag the data separately for each task and then combine the results into a single CoNLL-U dataset. However, the results from the "dep.correct" recipe are always missing the full stop at the end of the sentence, as shown below.

The data was retrieved using "db-out", and I printed the "text" field.
The comparison shows the result for "pos.correct" (top) and "dep.correct" (bottom).
The command I used for both tasks is "<recipe> <database_name> en_core_web_sm <dataset.txt> --unsegmented", where <recipe> is either "dep.correct" or "pos.correct".
The missing full stop can be avoided by dropping the "--unsegmented" argument, but I need this argument for my case.
Since I am combining the datasets into a single dataset, the "text" and "tokens" sections of the tagged datasets need to be identical.
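
To make the mismatch concrete, here is a minimal sketch of how the two exports could be compared. It assumes each exported example is a JSON object with a "text" key and a "tokens" list whose items have a "text" key, and that both datasets contain the same examples in the same order. The helper names (`load_jsonl`, `token_texts`, `find_mismatches`) and the sample data are made up for illustration.

```python
import json


def load_jsonl(path):
    """Read one JSON object per line (hypothetical path to a db-out export)."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def token_texts(example):
    """Return the token strings of one example (empty list if there are none)."""
    return [tok["text"] for tok in example.get("tokens", [])]


def find_mismatches(examples_a, examples_b):
    """Return the indices where "text" or the token strings differ.

    Assumes both lists hold the same examples in the same order.
    """
    mismatches = []
    for i, (a, b) in enumerate(zip(examples_a, examples_b)):
        if a["text"] != b["text"] or token_texts(a) != token_texts(b):
            mismatches.append(i)
    return mismatches


# Made-up data that mimics the problem: the second export lost the full stop.
pos_examples = [
    {"text": "I like cats.",
     "tokens": [{"text": "I"}, {"text": "like"}, {"text": "cats"}, {"text": "."}]},
]
dep_examples = [
    {"text": "I like cats",
     "tokens": [{"text": "I"}, {"text": "like"}, {"text": "cats"}]},
]

print(find_mismatches(pos_examples, dep_examples))
```

With real exports, the two lists would come from `load_jsonl` on the files written by "db-out". Any index reported here is an example that cannot be merged as-is.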

How do I solve this issue? Is there a way to perform multiple kinds of tagging (POS, NER, dep, relations) in the same session on the same dataset?

Thanks

joseph8198 · May 12, 2022, 4:02am

I more or less solved the problem by changing line 113 of dep.py from "sents = [doc[: len(doc) - 1]] if unsegmented else doc.sents" to "sents = [doc[: len(doc)]] if unsegmented else doc.sents".
Is this line important for some reason? If not, I think I will keep the change.
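
For reference, the difference between the two slices can be seen with a plain Python list standing in for the spaCy Doc (an assumption made to keep the example dependency-free; slicing a Doc returns a Span, but the end index is exclusive in the same way). The token list is made up.

```python
# A plain list of token strings as a stand-in for a spaCy Doc.
tokens = ["I", "like", "cats", "."]

# Original line: the end index len(tokens) - 1 is exclusive,
# so the last token (the full stop) is left out.
without_last = tokens[: len(tokens) - 1]

# Changed line: the slice runs to the end and keeps every token.
full = tokens[: len(tokens)]

print(without_last)
print(full)
```

This only illustrates what each slice selects. It does not show why the recipe excluded the last token in the first place.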

So, my remaining question is: can I do multiple kinds of tagging (POS, DEP, relations, etc.) on the same dataset in the same session?

Thanks
