How to merge two contiguous entities into a single entity?

damiano · March 26, 2018, 10:20pm

Hello,
I have a big dictionary of PERSON names/surnames. Prodigy can annotate the persons via the PhraseMatcher, but I would like to create ONE single entity if two or more entities are contiguous. Is that possible somehow?
Ex.
My name is John Smith I was born…
At the moment I have two annotations, “John” and “Smith”. How can I create one entity with both?

Thank you

ines · March 27, 2018, 8:27am

It might make sense to do this as a post-process, actually – i.e. after you’ve collected the annotations. So, you use the PhraseMatcher and accept all contiguous entities. When you’re done, you export the data, iterate over the spans and compare the "start" and "end" indices of the spans to determine whether they’re contiguous or not (don’t forget that there’s a space character in between!). If they are, you replace the two spans with one that spans over the whole entity.

If you process your data with the prodigy.components.preprocess.add_tokens preprocessor, you’ll also get a "token_start" and "token_end" property on each span, which might be even easier to compare than the character offsets.
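The post-processing step described above could look roughly like this. It is only a sketch: the helper name `merge_contiguous_spans` and the `max_gap` parameter are made up for illustration, and it assumes each exported example has a "text" plus "spans" with "start", "end" and "label" keys. Two spans are merged when they share a label and only whitespace of at most `max_gap` characters separates them.

```python
def merge_contiguous_spans(text, spans, max_gap=1):
    """Merge neighbouring spans that share a label and are separated
    by at most `max_gap` whitespace characters (hypothetical helper)."""
    merged = []
    for span in sorted(spans, key=lambda s: s["start"]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["label"] == span["label"]
            and 0 <= span["start"] - prev["end"] <= max_gap
            and text[prev["end"]:span["start"]].strip() == ""
        ):
            # Extend the previous span so it covers the current one too
            prev["end"] = span["end"]
            if "token_end" in span:
                prev["token_end"] = span["token_end"]
        else:
            # Copy the span so the input data is left untouched
            merged.append(dict(span))
    return merged


example = {
    "text": "My name is John Smith I was born",
    "spans": [
        {"start": 11, "end": 15, "label": "PERSON"},
        {"start": 16, "end": 21, "label": "PERSON"},
    ],
}
example["spans"] = merge_contiguous_spans(example["text"], example["spans"])
```

After this, the example has a single PERSON span covering "John Smith". Spans with different labels, or with anything other than whitespace between them, are left as they are.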

damiano · March 27, 2018, 9:08am

Hi Ines!
So before using the annotations to train my custom model, I should create a custom script that merges the entities by looking at their boundaries. OK, that makes sense.

@ines one clarification about the PhraseMatcher: I was wrong, I am using the Matcher via --patterns of ner.teach. Now the problem is that I have multi-word patterns, since I am matching cities. So what is the correct way to use such a dictionary? Should I create patterns like:

{"label": "CITY", "pattern": [{"lower": "new"}, {"lower": "york"}]}

(I have to deal with strange cases: I have NEW YORK, New York, new york, NEw YORk and so on… It is the result of an extraction, so the output is not clean.)

The second problem is that I have around 100.000 cities, so… should I create 100.000 patterns?

ines · March 27, 2018, 9:25am

Ah, sorry, I was confused, too – I meant the PatternMatcher, which is Prodigy's built-in matcher that supports both phrase and token patterns.

Yes, in that case, matching on "lower" is definitely the best option. 100.000 patterns are too much for the token matcher, though – you can still try it, but it'll likely be way too slow. The more efficient solution would be to use phrase patterns, which are also supported by Prodigy:

{"label": "CITY", "pattern": "new york"}

However, phrase patterns will only match the exact string. Since the phrase matcher is more efficient, you can easily add several versions of the string programmatically – e.g. uppercase, lowercase and titlecase. But this will still make cases like "NEw YORk" difficult.
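Generating those case variants programmatically might look like the sketch below. The function name `phrase_patterns` and the sample city list are placeholders; the output follows the phrase pattern format shown above, one JSON object per line.

```python
import json


def phrase_patterns(label, names):
    """Yield one phrase pattern per distinct case variant of each name
    (hypothetical helper, not part of Prodigy)."""
    seen = set()
    for name in names:
        for variant in (name, name.lower(), name.upper(), name.title()):
            if variant not in seen:
                seen.add(variant)
                yield {"label": label, "pattern": variant}


cities = ["New York", "San Francisco"]  # placeholder for the real dictionary
lines = [json.dumps(pattern) for pattern in phrase_patterns("CITY", cities)]
print("\n".join(lines))
```

The printed lines can be saved to a patterns file and passed via --patterns. Irregular casings such as "NEw YORk" are still not covered by this approach.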

To work around this, you could use a little trick and convert all texts to lowercase before annotating them with Prodigy – just make sure you keep a copy of the original text somewhere in your task. A single example could then look like this:

{"text": "i like new york", "orig_text": "I like New York"}

The matcher will now correctly match all cities. The character offsets you annotate won't change with the capitalisation, so before you train your model, you can simply loop over your exported annotations and replace the "text" with the "orig_text" (original capitalisation):

new_example = {'text': example['orig_text']}
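Putting both halves of the trick together, a sketch could look like this. The helper names `lowercase_task` and `restore_original` are invented for illustration. One caveat worth guarding against: for a few Unicode characters, lowercasing changes the string length (for example "İ" becomes two code points in Python), which would shift the character offsets, so the sketch refuses such texts rather than lowercasing them silently. The length check is a cheap guard, not a full guarantee of alignment.

```python
def lowercase_task(text):
    """Build a task with lowercased "text" and the original in "orig_text"."""
    lowered = text.lower()
    if len(lowered) != len(text):
        # Offsets annotated on the lowercased text would no longer line up
        raise ValueError("lowercasing changed the text length: %r" % text)
    return {"text": lowered, "orig_text": text}


def restore_original(example):
    """Return a copy of an annotated example with the original casing
    restored; spans and other keys are kept unchanged."""
    restored = dict(example)
    restored["text"] = restored.pop("orig_text")
    return restored


task = lowercase_task("I like New York")
# Simulate what an exported annotation might contain
annotated = dict(task, spans=[{"start": 7, "end": 15, "label": "CITY"}])
restored = restore_original(annotated)
```

Here `restored["text"][7:15]` is "New York", so the span annotated on the lowercased text still points at the right characters in the original.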

damiano · March 27, 2018, 9:29am

Awesome @ines! Thank you very much

damiano · March 30, 2018, 7:37am

@ines I am following your advice about the lowercase text and the original text in orig_text.

It works perfectly, but I would like to avoid re-analyzing the entire dataset. Is there a callback to store the orig_text in the dataset immediately after annotation?

ines · March 30, 2018, 9:21am

Hmm, let me think about the best way to solve this! By design, each record in the database should reflect exactly what you’ve annotated – this is also why we generally don’t recommend modifying the records before storing them in the database. You always want to keep a reference to what the annotators saw on the screen, and not just your modified version of it.

But one thing you could do is add a custom update callback to your recipe (which is called every time the server receives new annotations). The function could modify the examples in there and store them in a new dataset. Or you could implement this on_exit and only convert the annotations from the current session. For example:

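def on_exit(ctrl):
    # fetch the annotations collected in the current session
    session_data = ctrl.db.get_dataset(ctrl.session_id)
    for eg in session_data:
        # modify the example here
        pass
    # store the converted examples in a separate dataset
    ctrl.db.add_examples(session_data, datasets=['other_dataset'])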

This would leave you with two datasets – but it’d also mean that if something goes wrong, or you want to post-process your examples differently, you can always go back and do that (and won’t accidentally destroy your data).
