Hello, thanks for the insight.
Rather than describe where I put each print statement, I’ll just post the __call__ method with the print statements and a few other adjustments I made:
def __call__(self, doc):
    doc.tensor = numpy.zeros((0,), dtype='float32')
    matches = self.matcher(doc)
    spans = []
    for matchid, start, end in matches:
        entity = Span(doc, start, end, label=matchid)
        print('Found: ' + entity.text)
        try:
            spans.append((entity, entity.text))
            for token in entity:
                token._.set('is_finding', True)
            doc.ents = list(doc.ents) + [entity]
        except Exception as e:
            print('Warning: ' + str(e))
            continue
    print('Available Spans: %s'%spans)
    for s, txt in spans:
        newtok = s.merge('NN',s[-1].lemma_,s.label_)
        if newtok:
            print('Merging: ' + newtok.text)
        else:
            print('Merging failed: ' + txt)
    return doc
Here’s normal use within a Python terminal:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
loading data for finding
loading data for pathology
loading data for anatomy
loading data for negative
>>> doc = nlp('no free fluid or fluid collection within pelvis')
Found: free fluid
Found: fluid
Warning: [E098] Trying to set conflicting doc.ents: '(1, 3, 'FIND')' and '(2, 3, 'FIND')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
Found: fluid
Found: fluid collection
Warning: [E098] Trying to set conflicting doc.ents: '(4, 5, 'FIND')' and '(4, 6, 'FIND')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
Available Spans: [(free fluid, 'free fluid'), (fluid, 'fluid'), (fluid, 'fluid'), (fluid collection, 'fluid collection')]
Merging: free fluid
Merging failed: fluid
Merging: fluid
Merging: fluid collection
>>> print([t.text for t in doc])
['no', 'free fluid', 'or', 'fluid collection', 'within', 'pelvis']
>>> print(nlp.pipe_names)
['tagger2', 'measure', 'finding', 'pathology', 'anatomy', 'negative', 'phrase', 'parser']
From the Prodigy terminal:
(base) C:\Users\carlson.hang\Desktop\Code\DepTraining\Trainer\flaskr\data>python -m prodigy dep.teach test en_core_web_sm ./testdata.txt -U
loading data for finding
loading data for pathology
loading data for anatomy
loading data for negative
['tagger2', 'measure', 'finding', 'pathology', 'anatomy', 'negative', 'phrase', 'parser']
['tagger2', 'measure', 'finding', 'pathology', 'anatomy', 'negative', 'phrase', 'parser']
Added dataset test to database SQLite.
? Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!
10:17:26 - Task queue depth is 1
Found: free fluid
Found: fluid
Warning: [E098] Trying to set conflicting doc.ents: '(1, 3, 'FIND')' and '(2, 3, 'FIND')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
Found: fluid
Found: fluid collection
Warning: [E098] Trying to set conflicting doc.ents: '(4, 5, 'FIND')' and '(4, 6, 'FIND')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
Available Spans: [(free fluid, 'free fluid'), (fluid, 'fluid'), (fluid, 'fluid'), (fluid collection, 'fluid collection')]
Merging: free fluid
Merging failed: fluid
Merging: fluid
Merging: fluid collection
And unfortunately, here are the screenshots:


For this test, the only text in that file is the one shown. Again, I skip over the entity conflicts to prioritize labeling and merging the larger phrase, and my factory seems to work exactly as it does in normal use. It’s very confusing how Prodigy determines which candidates to show.
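To make "prioritize the larger phrase" concrete, here is a minimal sketch of one way to drop overlapping matches before setting doc.ents, so that the longer span wins regardless of the order the matcher returns them in. It is plain Python over (label, start, end) tuples rather than real Span objects; the function name keep_longest is made up for illustration, and the string label 'FIND' stands in for the integer match ID the matcher actually returns. The token indices are the ones from the E098 warnings above. Newer spaCy releases also ship a similar helper, spacy.util.filter_spans, which may be a better fit if it is available in your version.

```python
def keep_longest(matches):
    """Drop overlapping (label, start, end) matches, preferring longer ones."""
    # Consider the longest matches first; break ties by earliest start.
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept = []
    taken = set()  # token indices already covered by a kept match
    for label, start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((label, start, end))
            taken.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# The four matches reported for 'no free fluid or fluid collection within pelvis'
matches = [("FIND", 1, 3), ("FIND", 2, 3), ("FIND", 4, 5), ("FIND", 4, 6)]
print(keep_longest(matches))
```

With this kind of filter, 'fluid collection' (4, 6) would be kept over 'fluid' (4, 5), whereas adding entities in match order keeps whichever span happens to come first.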
If there’s anything else you would like me to test, let me know.
Currently, this is somewhat halting my project (unless I go back to manually labeling data). I might change my solution to depend on single-word tokens and use NER more, but I feel that having merged tokens (especially for medical terminology) and training word vectors on merged tokens would be a good approach to my problem.
Edit: I also use a custom tokenizer class, if that matters. It doesn’t seem to affect normal usage, but might it affect Prodigy? It essentially uses all the en defaults, except that I don’t split on ‘-’. I also realized that all my factories use lemmas in some way, so to test something without lemmas, here’s a rule from my measure factory:
[{'LIKE_NUM': True}, {'LOWER': 'by', 'OP': '?'}, {'LOWER': 'x', 'OP': '?'}, {'LIKE_NUM': True}, {'LEMMA': 'centimeter'}]
And I changed it to this, without the lemma:
[{'LIKE_NUM': True}, {'LOWER': 'by', 'OP': '?'}, {'LOWER': 'x', 'OP': '?'}, {'LIKE_NUM': True}]
Here’s the result (omitting the other output):
>>> print([t.text for t in doc])
['2 x 2', 'centimeter', 'free fluid', 'in', 'abdomen']

Do the above rules require the tagger too? I must be doing something wrong with Prodigy, but I’m not sure what that would be.