Convert NER-tagged text to an XML file and define a new hierarchical category

Hi Guys,

@ines
I miss you guys :) I really appreciate everything you have done, and I really enjoy using Prodigy for my project.

If you remember, I have been doing custom named entity recognition with custom entity types such as:

  • TIME

  • DATE

  • PARA

  • ASTR (astronomical name)

  • LONG (longitude)

  • STAR

  • PLAN (planets)

  • NAMES

  • GEOM (geometrical name)

I have made an interactive named entity recognition display with colored tags:

Then, based on LONG, DATE, TIME, etc., I could extract the explicit observations from the whole corpus. My structured data now looks like this:

Now, generally speaking, I want to do two things:

1- I want to find a good way to save these tagged sentences, for instance in an XML file.

2- I want to define a kind of hierarchy between the tags.

As an example, for this sentence:

'On 1582 December 28 at 11h 30m, they set Mars down at 16° 47’ Cancer by observation'

which has been tagged in this way:

I would like to have either an XML file like this:

On <DATE>1582 December 28</DATE> at <TIME>11h 30m</TIME>,
they set <PLAN>Mars</PLAN> down at <LONG>16° 47’ Cancer</LONG> by
observation ^6.

or, much better and smarter, one with a new hierarchy:

<observation>
On <date> <day>1582  December 28</day>  at<hour> 11h 30m </hour></date>,
they set <object>Mars</object> down at <quantity><long>16° 47’ Cancer
</long></quantity> by
observation ^6.
</observation>

As you can see in this tagging, I have added some new entities here:

  • observation
  - object

  - date

      - hour

      - day

  - quantity

      - long

      - lat

      ...

I think the first one is more feasible, but I would be happy to hear your suggestions. If you know a better way to achieve my two aims (saving the tags and defining a hierarchy), please let me know.

Hi @robertto,

Thanks for the kind words :). I'm glad to hear your project has been going well.

We don't really have much advice about the choice of data format. Either of your suggestions seems fine, and it should be easy to transform one into the other if you want to change it later. You'll also always have the original data, so there should be no way to lose information.
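
As a minimal sketch of why nothing gets lost: flat tagged text can be parsed back into the plain sentence plus character offsets using Python's standard `xml.etree.ElementTree`. The function name `flat_xml_to_spans` and the wrapper element `s` are made up for illustration, and the sketch assumes the tags are well-formed and not nested.

```python
import xml.etree.ElementTree as ET


def flat_xml_to_spans(fragment):
    """Recover the plain text and (start, end, label) spans from flat tagged text."""
    # Wrap the fragment in a dummy root so it becomes a valid XML document.
    root = ET.fromstring("<s>" + fragment + "</s>")
    text = root.text or ""          # untagged text before the first entity
    spans = []
    for child in root:
        start = len(text)
        text += child.text or ""    # the entity text itself
        spans.append((start, len(text), child.tag))
        text += child.tail or ""    # untagged text after this entity
    return text, spans


tagged = "On <DATE>1582 December 28</DATE> at <TIME>11h 30m</TIME>, they set <PLAN>Mars</PLAN> down"
plain, spans = flat_xml_to_spans(tagged)
print(plain)
print(spans)
```

The offsets that come back are the same kind of information spaCy exposes as `ent.start_char`, `ent.end_char` and `ent.label_`, so either representation can be rebuilt from the other.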

@honnibal

Thank you for your comment.

Using spaCy's entity attributes and iterating over doc.ents, I managed to solve the first part this way:


from ipywidgets import interact
from spacy import displacy

# Assumes df (a pandas DataFrame with the sentence text in column index 4),
# nlp (the trained spaCy pipeline) and options (displaCy options) already exist.
@interact
def ShowDetail(column='SentIndex', x=(0, len(df) - 1)):
    s = df.iloc[x, 4]
    doc = nlp(s)
    displacy.render(doc, style="ent", jupyter=True, options=options)
    # Walk the entities back to front so that inserting tags does not
    # shift the character offsets of the entities still to be processed.
    for ent in reversed(doc.ents):
        replacement = "<{}>{}</{}>".format(ent.label_, ent.text, ent.label_)
        s = s[:ent.start_char] + replacement + s[ent.end_char:]
    return s
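
One thing to watch with plain string replacement is that characters such as `&` or `<` in the sentence would make the result invalid XML. A minimal sketch of the same idea with escaping, working on `(start, end, label)` tuples instead of a spaCy `Doc` so it runs without a model. The function name `spans_to_flat_xml` and the sample sentence are made up for illustration; with spaCy the tuples could be built as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.

```python
from xml.sax.saxutils import escape


def spans_to_flat_xml(text, spans):
    """Wrap each (start, end, label) span in <label>...</label>, escaping the text."""
    parts, cursor = [], 0
    for start, end, label in sorted(spans):
        parts.append(escape(text[cursor:start]))   # untagged text before the entity
        parts.append("<{0}>{1}</{0}>".format(label, escape(text[start:end])))
        cursor = end
    parts.append(escape(text[cursor:]))            # untagged text after the last entity
    return "".join(parts)


# Hypothetical sentence containing a character that must be escaped.
sentence = "Mars & Venus were observed"
print(spans_to_flat_xml(sentence, [(0, 4, "PLAN"), (7, 12, "PLAN")]))
```

Because the output is assembled left to right from slices of the original string, the offsets never need adjusting, and the order in which the entities are passed in does not matter.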


This gives me the XML format, but I do not know how to add new nested categories to my text.

I am not sure whether I can convert my text to my second suggestion. Any ideas? Should I define a new training set (NER) using spaCy, or is there an easier way?

I mean I need to have something like this for my entities:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

which, in my context, means this:

<observation>
On <date> <day>1582  December 28</day>  at<hour> 11h 30m </hour></date>,
they set <object>Mars</object> down at <quantity><long>16° 47’ Cancer
</long></quantity> by
observation ^6.
</observation>

I just want to let you know that the first suggested format has been solved. Any ideas regarding the second one (nested entities based on my workflow) would be much appreciated.

Many thanks
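
One possible way to get the nested format without training a new model is to post-process the flat entities with a fixed mapping from label to tag path. This is only a sketch under assumptions: the `HIERARCHY` mapping is read off the example above (DATE becomes date/day, TIME becomes date/hour, PLAN becomes object, LONG becomes quantity/long), the names `to_nested_xml` and `locate` are made up, and neighbouring entities that share the same top-level tag are always merged under one parent, which may be too eager if a sentence contains two separate dates.

```python
from itertools import groupby
from xml.sax.saxutils import escape

# Assumed mapping from flat NER label to nested tag path (outermost tag first).
HIERARCHY = {
    "DATE": ("date", "day"),
    "TIME": ("date", "hour"),
    "PLAN": ("object",),
    "LONG": ("quantity", "long"),
}


def to_nested_xml(text, spans, hierarchy=HIERARCHY, root="observation"):
    """Turn (start, end, label) spans into nested XML wrapped in a root element."""
    parts, cursor = [], 0
    # Consecutive entities with the same top-level tag share one parent element.
    for parent, group in groupby(sorted(spans), key=lambda s: hierarchy[s[2]][0]):
        group = list(group)
        parts.append(escape(text[cursor:group[0][0]]))   # text before the parent
        parts.append("<{}>".format(parent))
        cursor = group[0][0]
        for start, end, label in group:
            parts.append(escape(text[cursor:start]))     # text between siblings
            inner = hierarchy[label][1:]
            opening = "".join("<{}>".format(t) for t in inner)
            closing = "".join("</{}>".format(t) for t in reversed(inner))
            parts.append(opening + escape(text[start:end]) + closing)
            cursor = end
        parts.append("</{}>".format(parent))
    parts.append(escape(text[cursor:]))                  # text after the last entity
    return "<{0}>{1}</{0}>".format(root, "".join(parts))


sentence = "On 1582 December 28 at 11h 30m, they set Mars down at 16° 47’ Cancer by observation"


def locate(sub, label):
    """Build a (start, end, label) span by searching for the entity text."""
    start = sentence.index(sub)
    return (start, start + len(sub), label)


spans = [
    locate("1582 December 28", "DATE"),
    locate("11h 30m", "TIME"),
    locate("Mars", "PLAN"),
    locate("16° 47’ Cancer", "LONG"),
]
print(to_nested_xml(sentence, spans))
```

With a spaCy `Doc`, the spans would come from `doc.ents` rather than from `locate`. Keeping the hierarchy in a plain dictionary means categories such as lat can be added later without retraining or re-annotating anything.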

Hi sir, please send this code. I'm also facing this problem: I have an XML file, but I don't know how to get exactly this output.
This is my mail ID: itsmehari402@gmail.com

Please send me the code; I'm also facing this issue.