I would like to do entity linking using Wikidata, but would like to allow the annotator to search Wikidata and enter whichever ID they decide is the right one, rather than giving them a list to choose from.
I think this boils down to span categorization, but without a pre-defined list of labels. Each span would have a text box instead. Alternatively, it could be an NER task with an extra text box in addition to the usual NER labels.
I would also like to train an open relation extraction model, so I would also like to do relation annotation with text boxes rather than pre-defined lists of relation types.
From the docs it looks like almost everything uses a pre-defined list of labels. How can I make span annotation without a pre-defined list work? Are there any examples I can look at?
If you would like to recreate a workflow similar to @SofieVL's NEL demo, but with free-form text instead of pre-populated options, you will want to combine the ner annotation interface with the text_input interface via blocks.
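Here's a minimal sketch of what such a blocks recipe could look like (untested; the recipe name, the label set and the wikidata_id field are placeholder assumptions, so adjust them to your data):

import spacy
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

@prodigy.recipe("entity-linker.manual")
def entity_linker_manual(dataset: str, source: str):
    nlp = spacy.blank("en")           # tokenizer only; swap in your own pipeline
    stream = JSONL(source)            # one task per line of the JSONL file
    stream = add_tokens(nlp, stream)  # ner_manual needs token information
    blocks = [
        {"view_id": "ner_manual"},
        {
            "view_id": "text_input",
            "field_id": "wikidata_id",    # the answer is saved on the task under this key
            "field_label": "Wikidata ID",
            "field_placeholder": "e.g. Q42",
            "field_rows": 1,
        },
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks, "labels": ["PERSON", "ORG", "GPE"]},
    }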
On the margin, let me add that there's much to be said for providing the users with pre-populated options, though. It will speed up the annotation significantly, improve the annotators' experience and reduce the number of errors resulting from typos etc. And if what you're looking for are valid Wikidata IDs, you will certainly want to validate the answers via a validate_answer callback.
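For illustration, a sketch of such a callback (it assumes the hypothetical wikidata_id field from the recipe above, and only checks the format of the ID, not whether it actually exists in Wikidata):

import re

def validate_answer(eg):
    # Called when the annotator submits an answer; raising ValueError
    # shows the message in the UI and blocks the submission.
    wikidata_id = (eg.get("wikidata_id") or "").strip()
    if eg.get("answer") == "accept" and not re.fullmatch(r"Q\d+", wikidata_id):
        raise ValueError(f"'{wikidata_id}' doesn't look like a Wikidata ID (expected e.g. Q42)")

You'd wire it up by adding "validate_answer": validate_answer to the dict the recipe returns.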
Generally, it's advisable to reserve free-form input for situations where no alternative is available, such as when gathering subjective opinions or transcribing audio. Allowing annotators to define labels on the fly is likely to produce datasets that are unsuitable for machine learning purposes. While there may be exceptions, tasks like entity linking and relation extraction probably don't fall into this category.
Without a predefined annotation schema and clear guidelines, the resulting dataset could suffer from significant inconsistencies, as each annotator may approach the task with their own schema.
It's considered best practice to establish the data model and categories you wish to identify from the outset. This is why supervised learning recipes typically assume the existence of a predefined label set.
The text block sounds like a good solution; the only problem is that, if I'm not mistaken, the entity linking demo only shows one sentence at a time. Often, the annotators will need much more context, up to and including the whole document, to figure out which entity a given name is referring to. For instance, an article will commonly give a person's full name at the start, but in subsequent sentences you will only get a first or last name. Without the context there is no way to tell which record is the correct link.
Please note that the purpose of this annotation is to create a ground truth dataset, not to directly train a model, so I need annotators to take all context into account, not just the given sentence. This is also why it will not help me to get "None of the above" answers.
Anyway, to summarize, is there a way to show the full document text with a straightforward modification to the EL demo? If so this might be just what I need.
The other issue I am uncertain about is: the NER interface allows annotation of any number of entities, does it not? In that case, just attaching a text input block would seemingly not work, because there would be one text input block per example, but there could be multiple entities per example? Is this correct or have I misunderstood something?
> Anyway, to summarize, is there a way to show the full document text with a straightforward modification to the EL demo? If so this might be just what I need.
The length of the input examples can be anything you want. The demo just happens to use one-sentence inputs for simplicity, but you can use as much context as required. In other words, the length of the input depends on how you prepare your input file; the demo code does not split the input into sentences or modify it in any way, except for adding the NER annotations.
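For instance, a single (made-up) line in your JSONL source could hold an entire article, and it would be shown to the annotator as one task:

{"text": "Douglas Adams was born in Cambridge in 1952. Adams later studied at St John's College before moving to London.", "meta": {"source": "article-0017"}}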
> The other issue I am uncertain about is: the NER interface allows annotation of any number of entities, does it not? In that case, just attaching a text input block would seemingly not work, because there would be one text input block per example, but there could be multiple entities per example? Is this correct or have I misunderstood something?
There can of course be many NER spans per text. For the EL phase of the annotation, you want to pre-process your examples to show one NER span at a time, so that the annotators take one decision at a time. This means they will see the same text multiple times (each time with a different NER span highlighted), but it is a much more effective way of doing EL annotation: the UI is cleaner and the decision process is more focused.
This is a very common procedure, so Prodigy provides a helper for it: split_spans (documented on the "Components and Functions" page of the Prodigy docs).
So to build on the example of the modified demo code I provided in this other thread, you'd use this helper like so:
from prodigy.components.preprocess import split_spans
from prodigy.components.filters import filter_duplicates

# Make sure there's one NER mention per task
stream.apply(split_spans, stream=stream)  # UPDATED to use the newer Stream API
# For each NER mention, add the candidates from the KB to the annotation task
# (_add_options, kb and id_dict are defined in the demo code from the other thread)
stream.apply(_add_options, stream=stream, kb=kb, id_dict=id_dict)  # UPDATED to use the newer Stream API
# Drop tasks that are exact duplicates
stream.apply(filter_duplicates, stream=stream, by_input=False, by_task=True)  # UPDATED to use the newer Stream API
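To make the effect of split_spans concrete, here's what it does to a made-up example containing two mentions, so that each task concerns exactly one span:

# Hypothetical input example with two PERSON mentions:
# {"text": "Adams met Fry in London.",
#  "spans": [{"start": 0, "end": 5, "label": "PERSON"},
#            {"start": 10, "end": 13, "label": "PERSON"}]}
#
# After split_spans, the annotator sees two separate tasks:
# {"text": "Adams met Fry in London.", "spans": [{"start": 0, "end": 5, "label": "PERSON"}]}
# {"text": "Adams met Fry in London.", "spans": [{"start": 10, "end": 13, "label": "PERSON"}]}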