How to merge two contiguous entities into a single entity?

Hello,
I have a big dictionary of PERSON names/surnames. Prodigy can annotate the persons via the PhraseMatcher, but I would like to create ONE single entity if two or more entities are contiguous. Is that possible somehow?
Ex.
My name is John Smith I was born…
At the moment I have two annotations, “John” and “Smith”. How can I create one entity with both?

Thank you

It might make sense to do this as a post-process, actually – i.e. after you’ve collected the annotations. So, you use the PhraseMatcher and accept all contiguous entities. When you’re done, you export the data, iterate over the spans and compare the "start" and "end" indices of the spans to determine whether they’re contiguous or not (don’t forget that there’s a space character in between!). If they are, you replace the two spans with one that spans over the whole entity.
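To make the post-processing step concrete, here is a minimal sketch of merging spans by character offsets. It assumes each exported example has a "text" and a list of "spans" with "start", "end" and "label" keys. The helper name merge_contiguous_spans is made up for illustration, and the rule "same label, only whitespace in between" is one possible definition of contiguous, so adjust it to your data.

```python
def merge_contiguous_spans(text, spans):
    """Merge same-label spans that are separated only by whitespace."""
    merged = []
    for span in sorted(spans, key=lambda s: s["start"]):
        prev = merged[-1] if merged else None
        # Text between the previous span and this one (usually a single space)
        gap = text[prev["end"]:span["start"]] if prev else None
        if (prev and prev["label"] == span["label"]
                and span["start"] >= prev["end"]
                and gap.strip() == ""):
            # Extend the previous span so that it covers this one as well
            prev["end"] = span["end"]
            if "token_end" in span:
                prev["token_end"] = span["token_end"]
        else:
            merged.append(dict(span))  # copy, so the input is not modified
    return merged


text = "My name is John Smith I was born"
spans = [
    {"start": 11, "end": 15, "label": "PERSON"},  # John
    {"start": 16, "end": 21, "label": "PERSON"},  # Smith
]
print(merge_contiguous_spans(text, spans))
```

You would run this over every example in the exported data and replace its "spans" with the merged list before training.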

If you process your data with the prodigy.components.preprocess.add_tokens preprocessor, you’ll also get a "token_start" and "token_end" property on each span, which might be even easier to compare than the character offsets.
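With token indices, the comparison no longer has to account for the space character. The sketch below assumes that "token_end" is the index of the last token in the span (inclusive), so two spans are adjacent when the next "token_start" equals the previous "token_end" plus one. Please check that assumption against your exported data. The function name merge_by_tokens is hypothetical.

```python
def merge_by_tokens(spans):
    """Merge same-label spans whose tokens are directly adjacent.

    Assumes "token_end" is inclusive, i.e. the index of the span's last token.
    """
    merged = []
    for span in sorted(spans, key=lambda s: s["token_start"]):
        prev = merged[-1] if merged else None
        if (prev and prev["label"] == span["label"]
                and span["token_start"] == prev["token_end"] + 1):
            # Extend both the character and the token boundaries
            prev["end"] = span["end"]
            prev["token_end"] = span["token_end"]
        else:
            merged.append(dict(span))  # copy, so the input is not modified
    return merged


# "My name is John Smith I was born": John is token 3, Smith is token 4
spans = [
    {"start": 11, "end": 15, "token_start": 3, "token_end": 3, "label": "PERSON"},
    {"start": 16, "end": 21, "token_start": 4, "token_end": 4, "label": "PERSON"},
]
print(merge_by_tokens(spans))
```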

Hi Ines!
So before using the annotations to train my custom model, I should create a custom script that merges the entities by looking at their boundaries. OK, that makes sense.

@ines one clarification about PhraseMatcher. I was wrong, I am using the Matcher via --patterns of ner.teach. Now the problem is that I have multi-word patterns, since I am matching cities. So what is the correct way to use such a dictionary? Should I create patterns like:

{"label": "CITY", "pattern": [{"lower": "new"}, {"lower": "york"}]}

(I have to deal with strange cases: I have NEW YORK, New York, new york, NEw YORk and so on… It is the result of an extraction, so the output is not clean.)

The second problem is that I have around 100,000 cities, so… should I create 100,000 patterns?

Ah, sorry, I was confused, too – I meant the PatternMatcher, which is Prodigy’s built-in matcher that supports both phrase and token patterns.

Yes, in that case, matching on "lower" is definitely the best option. 100,000 patterns are too many for the token matcher, though. You can still try it, but it’ll likely be way too slow. The more efficient solution would be to use phrase patterns, which are also supported by Prodigy:

{"label": "CITY", "pattern": "new york"}

However, phrase patterns will only match the exact string. Since the phrase matcher is more efficient, you can easily add several versions of the string programmatically – e.g. uppercase, lowercase and titlecase. But this will still make cases like “NEw YORk” difficult.
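As a rough sketch of adding several versions programmatically, the snippet below turns a list of names into phrase patterns in the lowercase, uppercase and titlecase variants, one JSON object per line. The function name phrase_patterns and the sample names are only for illustration. Note that str.title() capitalises every word, which may not be what you want for names like "Rio de Janeiro", and irregular casings such as “NEw YORk” are still not covered.

```python
import json


def phrase_patterns(names, label):
    """Yield phrase patterns for each name in a few casing variants."""
    for name in names:
        # A set removes duplicates, e.g. when the name is already lowercase
        variants = {name, name.lower(), name.upper(), name.title()}
        for variant in sorted(variants):
            yield {"label": label, "pattern": variant}


cities = ["New York", "san francisco"]
lines = [json.dumps(p) for p in phrase_patterns(cities, "CITY")]
print("\n".join(lines))
```

Writing those lines to a .jsonl file gives you something you can pass in via --patterns.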

To work around this, you could use a little trick and convert all texts to lowercase before annotating them with Prodigy – just make sure you keep a copy of the original text somewhere in your task. A single example could then look like this:

{"text": "i like new york", "orig_text": "I like New York"}
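One way to produce tasks in that shape is a small generator that wraps your input stream, sketched below. The name lowercase_stream is hypothetical. One caveat that I'm adding as a precaution: for a few Unicode characters (for example "İ"), Python's lower() changes the length of the string, which would shift the character offsets, so this sketch leaves such texts untouched.

```python
def lowercase_stream(stream):
    """Lowercase each task's text and keep the original under "orig_text"."""
    for eg in stream:
        task = dict(eg)  # copy, so the incoming example is not modified
        task["orig_text"] = eg["text"]
        lowered = eg["text"].lower()
        # Only swap in the lowercase text if the offsets stay aligned
        if len(lowered) == len(eg["text"]):
            task["text"] = lowered
        yield task


stream = [{"text": "I like New York"}]
print(list(lowercase_stream(stream)))
```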

The matcher will now correctly match all cities. The character offsets you annotate won’t change with the capitalisation, so before you train your model, you can simply loop over your exported annotations and replace the "text" with the "orig_text" (original capitalisation):

new_example = {'text': example['orig_text']}
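The one-liner above only carries over the text. If you also want to keep the spans, the answer and any other fields, a slightly longer sketch could look like the following. The function name restore_original_text is made up, and the length check is an extra safety net I'm assuming you'd want, since the offsets are only valid if both versions of the text have the same length.

```python
def restore_original_text(examples):
    """Return copies of the examples with "text" set back to "orig_text"."""
    restored = []
    for eg in examples:
        new_eg = dict(eg)  # keeps "spans", "answer" and everything else
        orig = new_eg.pop("orig_text", None)
        if orig is not None:
            if len(orig) != len(eg["text"]):
                raise ValueError("orig_text and text differ in length")
            new_eg["text"] = orig
        restored.append(new_eg)
    return restored


examples = [{
    "text": "i like new york",
    "orig_text": "I like New York",
    "spans": [{"start": 7, "end": 15, "label": "CITY"}],
    "answer": "accept",
}]
print(restore_original_text(examples))
```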

Awesome @ines! Thank you very much.

@ines I am following your advice about the lowercase text and keeping the original text in orig_text.

It works perfectly, but I would like to avoid re-processing the entire dataset afterwards. Is there a callback to store the orig_text in the dataset immediately after annotation?

Hmm, let me think about the best way to solve this! By design, each record in the database should reflect exactly what you’ve annotated – this is also why we generally don’t recommend modifying the records before storing them in the database. You always want to keep a reference to what the annotators saw on the screen, and not just your modified version of it.

But one thing you could do is add a custom update callback to your recipe (which is called every time the server receives new annotations). The function could modify the examples in there and store them in a new dataset. Or you could implement this in the on_exit callback and only convert the annotations from the current session. For example:

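def on_exit(ctrl):
    # Fetch only the examples annotated in the current session
    session_data = ctrl.db.get_dataset(ctrl.session_id)
    for eg in session_data:
        # modify the example here, e.g. restore the original capitalisation
        eg['text'] = eg.get('orig_text', eg['text'])
    # Store the modified copies in a separate dataset
    ctrl.db.add_examples(session_data, datasets=['other_dataset'])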

This would leave you with two datasets – but it’d also mean that if something goes wrong, or you want to post-process your examples differently, you can always go back and do that (and won’t accidentally destroy your data).
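For the update callback variant, a rough sketch could look like this. It assumes the database object exposes the same add_examples(examples, datasets=[...]) method used in the on_exit example, and that the target dataset already exists. The names make_update and restore_text are hypothetical helpers, not part of Prodigy.

```python
def restore_text(eg):
    """Hypothetical converter: put the original capitalisation back."""
    eg["text"] = eg.get("orig_text", eg["text"])
    return eg


def make_update(db, dataset, convert):
    """Build an update callback that stores converted copies in `dataset`."""
    def update(answers):
        # Work on shallow copies, so the annotated examples stay as they were
        converted = [convert(dict(eg)) for eg in answers]
        # Assumption: same method as in the on_exit example
        db.add_examples(converted, datasets=[dataset])
    return update
```

In a custom recipe, you would then return something like "update": make_update(db, "other_dataset", restore_text) alongside the other components, where db is your database connection. The examples the annotators actually saw still end up untouched in the main dataset.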