Stuck training some NER models (newbie)

Looking for a few prods in the right direction, if that's possible. I just purchased a license for Prodigy and am very excited about the possibilities, but I'm running up against a decent amount of frustration at not quite knowing how to do things. A lot of that is probably down to this being new to me, hence the longish post here.

Project goal

I want to train three models:

  • one that can recognise (and therefore enable me to extract) drug names from a given text
  • one that can recognise dosages (for the above medications) from a given text
  • one that can recognise medical symptoms (and enable me to extract / analyse them) in a given text.

I have a big text blob (c. 2m words) and I want to use Prodigy to label some of that data, so that I can then train the three models to recognise these three sets of 'entities' / types in the rest of my data set.

Context on my background

I have done some study of Python / data science (cleaning, processing etc.). I am able to do a fair amount of data cleaning myself, write my own functions to process the imported text and so on.

I have never really used spaCy (apart from truly basic things like tokenising a small sample text), and this is my first time using Prodigy.

I can generally figure things out if I'm pointed in the right direction (i.e. if I know it's probably the right direction in which to go).

I don't have a ML / maths background, though I am two years into a software engineering online programme (Ruby / Javascript mainly and a heavy dose of problem-solving).

Data formats

The original data is in PDFs, but these are OCRed and then exported to .txt files. I also have them in a Pandas dataframe (one row for the date each document was issued, a detail which I suppose isn't so important for this task). The text that comes out the other end isn't perfect, and there are a lot of weird characters that get spat out along the way.

Next steps / approaches

My initial thought is that I need to clean the text first to remove the most egregious of issues (multiple spaces, accented characters, strange punctuation marks and so on) BEFORE I input things into Prodigy.
Q: is that a correct assumption?
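For illustration, this is the kind of cleanup I have in mind (a rough sketch of my own; the exact rules would depend on what the OCR actually mangles):

import re
import unicodedata

def clean_text(text):
    # Normalise accented / decomposed characters to a canonical form
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs, but keep newlines intact
    text = re.sub(r"[ \t]+", " ", text)
    # Drop non-printable control characters the OCR sometimes emits
    text = "".join(ch for ch in text if ch.isprintable() or ch == "\n")
    return text.strip()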

From what I understand from the documentation, I then have to convert my text into a format that Prodigy likes: ideally JSONL. From this link I understand that these should be newline-delimited JSON, with the key 'text' and the value being a sentence.
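So if I've got that right, each line of the file would look something like this (made-up examples):

{"text": "The patient was given 75 mg of aspirin daily."}
{"text": "She reported nausea and a persistent headache."}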

This suggests to me that I need to split the text into individual sentences. I guess I can use spaCy to do that? This spaCy documentation suggests I can convert 'files' into spaCy's JSON format, but it didn't work when I tried a generic version (converting a .txt file). If I could get that working, it seems like it'd give me the output I need for Prodigy. Something like this also might work if I have to do it manually, and then export to JSONL via vanilla Python.
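If I do end up doing it manually, I'm guessing the export step is just vanilla Python along these lines (my own sketch):

import json

# assume `sentences` is a list of sentence strings produced earlier
with open("sentences.jsonl", "w", encoding="utf8") as f:
    for sent in sentences:
        f.write(json.dumps({"text": sent}) + "\n")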

I still don't have a strong sense of how many sentences should be in each entry in the JSONL file. I guess one or two would be good per example?

I saw while browsing the forum that there is this model called Med7, which probably goes a very long way towards where I want to be with the first two models — i.e. drug names and dosages. If I could use that as an initial model, then I think I need to use ner.correct with the Med7 model, and then I can fine-tune it / correct it for my own data? Is that assumption correct?

Once I get that ner.correct flow going, I annotate a whole bunch of examples, train and save the model, and then make sure it's actually doing what I think it should do on test examples to validate the approach.

For the final part, the symptom recognition, I could probably gather together a list of 30-50 initial phrases which could work as a kind of starter set of symptoms to bootstrap the process? But I'm not really clear on what I do with those words, how I load them in as initial phrases etc. I watched this video, which gave me some sense of how I might approach that, but it felt like there was a lot of magic in there — i.e. tricks which weren't documented anywhere, and which I didn't recall seeing in the Prodigy documentation. But yeah, ideally I want to be able to highlight certain words or small phrases which then incrementally improve my model such that it's able to capture many / all symptoms it encounters in new text.

I'm going to go watch a bunch of the videos on the explosion.ai youtube channel that relate to Prodigy. I'm also going to look at some of the forum posts to see if someone else was where I am at some point in the past.

Hi! It sounds like you're on the right track :slightly_smiling_face: I'd say one of the trickiest parts of this type of applied NLP is reasoning about your model and how to structure your tasks so that they're both easy to learn and easy to annotate. There's often no easy answer for it because it's so specific to the problems you're trying to solve, and it requires some trial and error. (My talk here outlines that idea and the problems in more detail.)

If you have a process that can fix the most obvious issues programmatically, that's definitely good! Just make sure that you always run the logic at runtime as well, so your model only ever gets to see preprocessed text – basically, text preprocessed in the same way as the training data.

Yes, that's an option – or if you have shorter paragraphs, you could also just split on \n\n. That depends on what your data looks like. Prodigy's ner.correct will also split sentences by default using the spaCy model you start with. So that might be all you need!
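For example, if your documents separate paragraphs with blank lines, this is all it takes:

paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]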

This part is all just related to training a model with spaCy from scratch and converting training corpora – none of that should be relevant here. If you wanted to split your text into sentences using spaCy, I think the solution might be a lot more straightforward than you think :smiley: Basically, all you'd need is something like this:

import spacy

nlp = spacy.load("en_core_web_sm")  # or whatever
texts = [...]  # assume this is your list of (cleaned) document texts
for doc in nlp.pipe(texts):
    for sent in doc.sents:
        text = sent.text  # This is the sentence text to export etc.

But as I mentioned above, if you just run ner.correct, it should already split your text into sentences for you.

Yes, pretty much. The first thing the pretrained model is going to help you with is the data creation: it can help you label things so you don't have to. Even if it only gets 50% right, that's still 50% less work for you.

Once you've created your training examples, you could then update the existing model with them and see if you can improve it on your data. How well this works depends on the data. Maybe you also decide that you want to train an entirely new model from scratch – with Prodigy and a model to help you label, you can easily create a few thousand annotations in a day or two.
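The exact arguments differ a bit between Prodigy versions, so check prodigy train --help, but the training step itself is roughly a one-liner (dataset name and output path below are just placeholders):

prodigy train ner drug_dataset en_core_web_sm --output ./drug_model --eval-split 0.2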

Taking inspiration from Med7 is definitely a good idea, since they've come up with an annotation scheme that seems to be working quite well. For NER, you typically want to avoid annotating spans with ambiguous boundaries because that's where a model will likely struggle – after all, NER is all about predicting boundaries.

You probably want to start with the more recent NER video I made, because it shows a more modern workflow and some newer features that weren't available yet when Matt's video was filmed: https://www.youtube.com/watch?v=59BKHO_xBPA

I think this might also be less magical than you expect! The underlying idea is this: you could just start and label everything from scratch by hand, but that's annoying and a lot of work. For example, you could go and manually highlight "Aspirin" or "acetylsalicylic acid" every time you see those phrases – but you already know that in 99% of the cases, it's an entity you're looking for. So you might as well have your annotation tool highlight those for you, based on a list you give it. To do this, Prodigy takes advantage of spaCy's match patterns, which give you a lot of flexibility to describe those terminology lists.
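A patterns file is itself just JSONL, with one match pattern per line – for example (illustrative entries only):

{"label": "DRUG", "pattern": [{"lower": "aspirin"}]}
{"label": "DRUG", "pattern": [{"lower": "acetylsalicylic"}, {"lower": "acid"}]}

The first matches any single token whose lowercase form is "aspirin"; the second matches the two-token phrase.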

But the next inconvenience that makes things less efficient is that you actually have to come up with these word lists. In your case, it's maybe easier because you can leverage existing dictionaries and resources. But for other stuff (like recipe ingredients, brands etc.), you'd have to first create your word lists and you might miss a lot of very common phrases (including common typos etc). So one workflow that Prodigy implements in terms.teach and sense2vec.teach is to use word vectors and find the most similar words based on a few examples. You can then quickly bootstrap a word list, just by clicking "accept" or "reject", and then export it to a patterns file using terms.to-patterns.
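As a rough sketch (the exact arguments may differ between versions, so check the recipe docs), that flow looks something like this:

prodigy terms.teach symptom_terms en_core_web_lg --seeds "headache, nausea, fatigue"
prodigy terms.to-patterns symptom_terms --label SYMPTOM > symptom_patterns.jsonl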

You don't have to do any of this – those workflows are just suggestions for how to make the annotation process faster and more efficient, and to prevent humans from doing things that a machine can do pretty well.

Thank you so much for the detailed reply. This gives me lots to be busy with tomorrow. I will post back on how I'm getting on, if only to keep track of the things I'm learning along the way!