Hello! I am trying to annotate named entities that are not contiguous and may overlap. Examples might be: "Type 1 and Type 2 diabetes" => {"Type 1 diabetes", "Type 2 diabetes"}, or "database design, development, and support" => {"database design", "database development", "database support"}.
Hi! Non-contiguous and potentially overlapping spans can definitely be a little tricky, especially if you want the annotation process to be clear and efficient.
One option could be to just model this as a relation annotation task instead of span annotation and only focus on connecting tokens with the given entity label. For example, in your case, you would annotate "Type" + "1" → "diabetes" and "Type" + "2" → "diabetes". This way, a token or entity fragment can be part of multiple spans of different types and it'll be clearly visualised in the UI with different colours.
Sorry if this sounds a bit abstract, but I made a quick example. Initially, your sentence would look like this:
To merge expressions like "Type 1" that are part of the non-contiguous span, you can use the span highlighting mode. I used the label X here because the label doesn't matter – we'll be annotating that at the relation level so that fragments can be part of multiple, potentially different entity types if needed:
The resulting JSON data (see here for an example) will include each annotated relation and the two fragments it connects, with their token indices and offsets into the text, as well as the label. This should make it pretty easy to export the information in the format you're looking for.
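To give a rough idea of what that export step could look like, here's a minimal Python sketch. The record layout and field names ("relations", "head_span", "child_span", "start", "end", "label") are assumptions made up for illustration, as is the helper relations_to_entities, so check them against your actual JSON output before relying on this. It simply orders the two fragments of each relation by their position in the text and joins them into one entity string:

```python
import json

# Hypothetical record: the field names below are assumptions for
# illustration and may differ from the real exported data.
record_json = """
{
  "text": "Type 1 and Type 2 diabetes",
  "relations": [
    {"head_span": {"start": 0, "end": 6},
     "child_span": {"start": 18, "end": 26},
     "label": "DISEASE"},
    {"head_span": {"start": 11, "end": 17},
     "child_span": {"start": 18, "end": 26},
     "label": "DISEASE"}
  ]
}
"""


def relations_to_entities(record):
    """Turn each relation into one non-contiguous entity."""
    text = record["text"]
    entities = []
    for rel in record["relations"]:
        # Put the two connected fragments in reading order.
        fragments = sorted(
            [rel["head_span"], rel["child_span"]],
            key=lambda span: span["start"],
        )
        # Join the fragment texts with a space to get the surface form.
        surface = " ".join(text[f["start"]:f["end"]] for f in fragments)
        entities.append({
            "text": surface,
            "label": rel["label"],
            "fragments": [(f["start"], f["end"]) for f in fragments],
        })
    return entities


entities = relations_to_entities(json.loads(record_json))
for entity in entities:
    print(entity["label"], entity["text"], entity["fragments"])
```

Because the fragments keep their character offsets, the same "diabetes" fragment can show up in several entities without any conflict, which is exactly the overlap you were after.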
The only constructions that would be difficult to express with this approach are cases where you have nested expressions that you want to treat as separate entities (e.g. "Type 2 diabetes" and "Type 2 diabetes research") – but I'm not even sure that this actually makes sense conceptually.