Synthetic NER data

madhujahagirdar · February 3, 2018, 3:47am

I would like to use Apache cTAKES or a similar system to extract healthcare concepts
from healthcare-focused input text. Once I have the output from these systems,
can I convert it to Prodigy's format (JSON or another format), import it with db-in, and then train word2vec vectors with terms.train-vectors
so that I have a word2vec model specific to healthcare?

honnibal · February 5, 2018, 5:09am

In general: yes, there should be no problem with using annotations from a different tool in Prodigy. You can either import them into a dataset, or just make a .jsonl file and use it as the input.
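To illustrate the .jsonl route, here is a minimal sketch of turning character-offset annotations from another tool into one JSON object per line, using the "text" and "spans" (start, end, label) keys that Prodigy's NER tasks use. The input record layout (the "annotations", "begin" and "end" keys) and the labels are assumptions made up for this example, not the actual cTAKES output format, so a real converter would need its own parsing step first.

```python
import json


def to_prodigy_task(text, annotations):
    """Build one task dict from a text and its character-offset annotations."""
    spans = []
    for ann in annotations:
        start, end = ann["begin"], ann["end"]
        # Skip offsets that do not fit the text instead of emitting a broken span.
        if not (0 <= start < end <= len(text)):
            continue
        spans.append({"start": start, "end": end, "label": ann["label"]})
    return {"text": text, "spans": spans}


def to_jsonl(records):
    """Serialize records as newline-delimited JSON, one task per line."""
    return "\n".join(
        json.dumps(to_prodigy_task(r["text"], r["annotations"])) for r in records
    )


# Hypothetical annotator output; the key names here are assumptions.
records = [
    {
        "text": "Patient was given aspirin for chest pain.",
        "annotations": [
            {"begin": 18, "end": 25, "label": "DRUG"},
            {"begin": 30, "end": 40, "label": "SYMPTOM"},
        ],
    }
]

jsonl = to_jsonl(records)
print(jsonl)
```

Written to a file, output like this could be imported into a dataset or passed directly as the input source, whichever fits the workflow better.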

However, I’m not sure a word2vec model is what you want to build. word2vec usually works on raw text — you don’t need any annotations. Sometimes you can benefit from annotations before training word2vec, to learn vectors for longer phrases. Is that what you’re looking to do, or do you have something else in mind?
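As a rough sketch of the "longer phrases" case: one common approach is to rewrite the raw text so that each annotated multi-word span becomes a single token before the text goes to word2vec training. The helper name merge_phrases, the underscore joining convention and the example spans are all assumptions for illustration; this is not a Prodigy or spaCy function.

```python
def merge_phrases(text, spans):
    """Replace each annotated span with a single underscore-joined token."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        # Skip spans that overlap one already merged.
        if span["start"] < cursor:
            continue
        pieces.append(text[cursor:span["start"]])
        phrase = text[span["start"]:span["end"]]
        pieces.append("_".join(phrase.split()))
        cursor = span["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical annotations with character offsets into the text.
text = "Patient reports chest pain and shortness of breath."
spans = [
    {"start": 16, "end": 26, "label": "SYMPTOM"},
    {"start": 31, "end": 50, "label": "SYMPTOM"},
]

merged = merge_phrases(text, spans)
print(merged)
```

After this step, whitespace tokenization treats each merged phrase as one vocabulary item, so a vector is learned for the phrase as a whole rather than for its individual words.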
