Two-word NER

Hi everyone!

I have a problem when I want to do NER for two-word entities, for example ‘crude oil’, and I read the following discussion.

Then I wrote a Python script to transform a list of terms (a text file) into patterns (a JSONL file). I don’t know whether it is right or not, but anyway, I put my code below:

# -------------------------- Input parameters --------------------------

# import argparse
# parser = argparse.ArgumentParser(description='Input the Parameter')

filename = input('Please input the text file name, for example stock.txt: ')
LABEL = input('Please input the label, for example STOCK: ')
patterns_name = input('Please input the name of the JSONL file, for example stock_patterns.jsonl: ')

# filename = 'stock.txt'
# LABEL = 'STOCK'
# patterns_name = 'stock_patterns.jsonl'

# ---------------- Read the text file line by line ----------------

result = []
with open(filename, 'r', encoding='utf-8', errors='ignore') as file:
    for mystr in file:  # read line by line
        result.append(mystr)
    
# ---------------- Build the pattern dictionaries ----------------

final = []
for item in result:
    # split on spaces, so a multi-word term becomes one token entry per word
    tokens = item.strip('\n').split(' ')
    pattern_list = [{'lower': token} for token in tokens]
    final.append({'label': LABEL, 'pattern': pattern_list})

# ---------------- Write the patterns to a JSONL file ----------------

import json
with open(patterns_name, 'w') as f:
    for item in final:
        json.dump(item, f)
        f.write('\n')
# ----------------------------------------------------------------------
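For reference, here is a minimal self-contained sketch of the same transformation, with an in-memory sample list standing in for the contents of the text file (the sample terms ‘gold’ and ‘crude oil’ are just examples):

```python
import json

# Sample terms standing in for the contents of stock.txt
sample_terms = ['gold', 'crude oil']
LABEL = 'STOCK'

for term in sample_terms:
    tokens = term.strip('\n').split(' ')
    entry = {'label': LABEL, 'pattern': [{'lower': t} for t in tokens]}
    print(json.dumps(entry))
# {"label": "STOCK", "pattern": [{"lower": "gold"}]}
# {"label": "STOCK", "pattern": [{"lower": "crude"}, {"lower": "oil"}]}
```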

This script gives me a result similar to the Prodigy command:

prodigy terms.to-patterns

But when I use the patterns to run ner.match, with the command:

prodigy ner.match commodities_ner en_core_web_sm commodities_dataset.jsonl --patterns commodities_patterns.jsonl

I received the following error:

Traceback (most recent call last):
  File "cython_src/prodigy/components/feeds.pyx", line 130, in prodigy.components.feeds.SessionFeed.get_session_stream
  File "/home/ec2-user/.local/lib/python3.7/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/waitress/channel.py", line 338, in service
    task.service()
  File "/usr/local/lib/python3.7/site-packages/waitress/task.py", line 169, in service
    self.execute()
  File "/usr/local/lib/python3.7/site-packages/waitress/task.py", line 399, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "/usr/local/lib/python3.7/site-packages/hug/api.py", line 423, in api_auto_instantiate
    return module.__hug_wsgi__(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/falcon/api.py", line 244, in __call__
    responder(req, resp, **params)
  File "/usr/local/lib/python3.7/site-packages/hug/interface.py", line 793, in __call__
    raise exception
  File "/usr/local/lib/python3.7/site-packages/hug/interface.py", line 766, in __call__
    self.render_content(self.call_function(input_parameters), context, request, response, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/hug/interface.py", line 703, in call_function
    return self.interface(**parameters)
  File "/usr/local/lib/python3.7/site-packages/hug/interface.py", line 100, in __call__
    return __hug_internal_self._function(*args, **kwargs)
  File "/usr/local/lib64/python3.7/site-packages/prodigy/app.py", line 105, in get_questions
    tasks = controller.get_questions()
  File "cython_src/prodigy/core.pyx", line 109, in prodigy.core.Controller.get_questions
  File "cython_src/prodigy/components/feeds.pyx", line 56, in prodigy.components.feeds.SharedFeed.get_questions
  File "cython_src/prodigy/components/feeds.pyx", line 61, in prodigy.components.feeds.SharedFeed.get_next_batch
  File "cython_src/prodigy/components/feeds.pyx", line 137, in prodigy.components.feeds.SessionFeed.get_session_stream
ValueError: Error while validating stream: no first example. This likely means that your stream is empty.

I tried changing the dataset so that there is no error, but the web page stays at LOADING…

May I ask if I am doing something wrong? If this method is not feasible, how can I deal with two-word entities?

Thanks!!

Hi! The error message you’re seeing indicates that the stream is empty, likely because no matches were found in the dataset (which means there’s nothing to show you).

At first glance, your script looks okay – but could you post some examples of the patterns file it created? And are all the terms you load in lowercase? You’re using {'lower': element} to look for tokens whose lowercase form matches element, but the script doesn’t ensure that element is actually lowercase. So if you end up with {"lower": "Oil"}, your pattern will never match.
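One way to rule that out is to call .lower() on each token when building the pattern. A minimal sketch (the term_to_pattern helper name here is just for illustration):

```python
def term_to_pattern(term, label):
    # Lowercase every token so a {"lower": ...} pattern can actually match
    tokens = term.strip().split(' ')
    return {'label': label, 'pattern': [{'lower': t.lower()} for t in tokens]}

print(term_to_pattern('Crude Oil', 'STOCK'))
# {'label': 'STOCK', 'pattern': [{'lower': 'crude'}, {'lower': 'oil'}]}
```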

Also, one tip to improve your script: It might not be relevant for most of your examples, but instead of just running item.split(' ') and only splitting on spaces, you might want to use spaCy’s tokenizer to split the phrases the same way spaCy would. After all, each entry in the patterns represents one token – and the tokenizer does more than just splitting on whitespace. For example, spaCy would split "crude-oil" as ["crude", "-", "oil"], whereas your logic would keep it as one token and produce a pattern that never matches.

Here’s an example:

# at the top of your script
import spacy
nlp = spacy.load('en_core_web_sm')

# in your pattern logic

for item in result:
    # only tokenize, so it's faster
    doc = nlp.tokenizer(item.strip('\n'))
    # use the token's lower_ attribute to generate the pattern
    pattern = [{'lower': token.lower_} for token in doc]
    final.append({'label': LABEL, 'pattern': pattern})
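If you want to see the difference for yourself, you can compare the tokenizer’s output to a plain split. A blank English pipeline is enough for this, so no trained model is needed:

```python
import spacy

nlp = spacy.blank('en')  # tokenizer only, no trained model required
doc = nlp.tokenizer('crude-oil')
print([token.lower_ for token in doc])  # the hyphen becomes its own token
print('crude-oil'.split(' '))           # a plain split keeps it as one string
```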

Thank you very much for your help! Prodigy now works perfectly and recognizes two-word entities!
