Looking for a few prods in the right direction, if that's possible. I just purchased a license for Prodigy and am very excited about the possibilities, but I'm running into a decent amount of frustration at not quite knowing how to do things. A lot of it is probably down to this being new for me, hence the longish post.
Project goal
I want to train three models:
- one that can recognise (and therefore enable me to extract) drug names from a given text
- one that can recognise dosages (for the above medications) from a given text
- one that can recognise medical symptoms (and enable me to extract / analyse) in a given text.
I have a big text blob (c. 2m words) and I want to use Prodigy to label some of that data, so that I can then train the three models to recognise these three sets of 'entities' / types in the rest of my data set.
Context on my background
I have done some studies with Python / data science cleaning / processing etc. I am able to do a fair amount of data cleaning myself, write my own functions to process the imported text etc.
I have never really used spaCy (apart from truly basic things like tokenising a small sample text), and this is my first time using Prodigy.
I can generally figure things out if I'm pointed in the right direction (i.e. if I know it's probably the right direction in which to go).
I don't have a ML / maths background, though I am two years into a software engineering online programme (Ruby / Javascript mainly and a heavy dose of problem-solving).
Data formats
The original data is in PDFs, but this is OCRed and then exported to .txt files. I also have them in a Pandas dataframe (one row for the date each document was issued, a detail which I suppose isn't so important for this task). The text that comes out of the OCR step isn't perfect, and there are a lot of weird characters in it.
Next steps / approaches
My initial thought is that I need to clean the text first to remove the most egregious of issues (multiple spaces, accented characters, strange punctuation marks and so on) BEFORE I input things into Prodigy.
Q: is that a correct assumption?
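To make the question concrete, here is a rough sketch of the kind of cleaning I have in mind. It is standard-library Python only; the function name clean_ocr_text is my own, and the choice of what to strip is an assumption rather than anything Prodigy requires. I have deliberately left accented characters alone here, since I'm not sure removing them is wise.

```python
import re
import unicodedata


def clean_ocr_text(text: str) -> str:
    """Light-touch cleanup of OCR output before annotation (hypothetical helper)."""
    # Normalise Unicode so look-alike characters (ligatures, full-width forms)
    # collapse to one canonical form.
    text = unicodedata.normalize("NFKC", text)
    # Drop control and other non-printing characters (Unicode category C*),
    # but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove stray spaces around line breaks, then cap blank lines at one.
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

My understanding is that whatever cleaning I apply before annotating would also have to be applied to any text I later run the trained model on, so I'd probably want to keep it minimal.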
From what I understand from the documentation, I then have to convert my text into a format that Prodigy likes: ideally jsonl. From this link I understand that this should be newline-delimited JSON, with 'text' as the key and a sentence as the value.
This suggests to me that I need to split the text into individual sentences. I guess I can use spaCy to do that? This spaCy documentation suggests I can convert 'files' into spaCy's JSON format, but it didn't work when I tried a generic version (converting a .txt file). If I could get that working, it seems like that'd give me the output I need for Prodigy. Something like this also might work if I have to do it manually, and then export to JSONL via vanilla Python.
I still don't have a strong sense of how many sentences should be in each entry in the JSONL file. I guess one or two would be good per example?
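Here is a minimal sketch of the manual route, assuming one or two sentences per example. The regex splitter is only a naive stand-in I wrote for illustration (it will wrongly split on abbreviations like "mg."); I'd expect spaCy's sentence segmentation to do a better job. All function names are my own.

```python
import json
import re


def split_sentences(text):
    """Naive stand-in splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_jsonl_lines(text, sentences_per_example=2):
    """Group sentences into chunks and serialise each as {"text": ...}."""
    sentences = split_sentences(text)
    lines = []
    for i in range(0, len(sentences), sentences_per_example):
        chunk = " ".join(sentences[i:i + sentences_per_example])
        # ensure_ascii=False keeps any non-ASCII characters readable in the file.
        lines.append(json.dumps({"text": chunk}, ensure_ascii=False))
    return lines


def write_jsonl(lines, path):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```

Usage would be something like write_jsonl(to_jsonl_lines(cleaned_text), "examples.jsonl"), where cleaned_text and the file name are placeholders.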
I saw while browsing the forum that there is this model called Med7 which probably goes a very long way to where I want to be with the first two models, i.e. drug names and dosages. If I could use that as an initial model, then I think I need to use ner.correct with the med7 model and then I can fine-tune it / correct it for my own data? Is that assumption correct?
Once I get that ner.correct going, I train on a whole bunch of examples, save the model, and then make sure it's actually doing what I think it should do on test examples to validate that approach.
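For the validation step, this is roughly what I imagine checking: a sketch that compares predicted entity spans against hand-labelled ones and reports precision, recall and F1. It assumes spans are represented as (start, end, label) tuples, which is my own choice for illustration, and it only counts exact matches.

```python
def span_scores(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```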
For the final part, the symptom recognition, I could probably gather together a list of 30-50 initial phrases to work as a kind of starter set of symptoms which could bootstrap the process? But I'm not really clear on what I do with those words, how I load them in as initial phrases etc. I watched this video which gave me some sense of how I might approach that, but it felt like there was a lot of magic in there, i.e. tricks which weren't documented anywhere, and I didn't recall seeing them anywhere in the Prodigy documentation. But yeah, ideally I want to be able to highlight certain words or small phrases so that my model incrementally improves to the point where it captures many / all of the symptoms it encounters in new text.
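My best guess at how the starter list gets loaded: as far as I can tell, recipes like ner.manual take a --patterns file, which is JSONL with a label and a pattern on each line. The sketch below turns a list of seed phrases into that shape. The SYMPTOM label and the function names are my own, and splitting on whitespace is only an approximation of how spaCy would tokenise the phrases.

```python
import json


def phrase_to_pattern(phrase, label="SYMPTOM"):
    """Build one case-insensitive token pattern from a seed phrase."""
    # Whitespace splitting is an assumption; spaCy's tokenizer may split differently
    # (for example around hyphens or punctuation).
    tokens = [{"lower": tok} for tok in phrase.lower().split()]
    return {"label": label, "pattern": tokens}


def patterns_jsonl(phrases, label="SYMPTOM"):
    """Serialise all non-empty seed phrases, one JSON pattern per line."""
    return "\n".join(
        json.dumps(phrase_to_pattern(p, label)) for p in phrases if p.strip()
    )
```

Writing the result of patterns_jsonl(["headache", "shortness of breath"]) to a file would then give me something to pass as the patterns file, if I've understood the format correctly.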
I'm going to go watch a bunch of the videos on the explosion.ai youtube channel that relate to Prodigy. I'm also going to look at some of the forum posts to see if someone else was where I am at some point in the past.