Seeing some numbers in the CSV output file while I do not have those numbers in my phrase list

AmirNickkar · May 11, 2022, 1:39pm

Hello,

I have created a phrase list and match patterns based on a word list. After training and running the model, I see some numbers in my CSV output file (in one cell). These numbers are both integers and decimals, and there are about 120 of them among the 82k words in the output file. Could you please tell me where the problem is and how I can resolve it? By the way, some of the words in my phrase list contain numbers (at the beginning or at the end of the word), but the numbers that I see in my output file are not part of those words. Also, I used en_core_web_lg for teaching the terms. Thank you.

koaning · May 12, 2022, 6:11pm

The question is a bit unclear to me, so I'll ask some clarifying questions.

After training and running the model, I see some numbers in my CSV output file (in one cell).

What model did you try to train? Are you training on a dataset you labelled with Prodigy? How did you generate the CSV output?

Could you share the commands that you ran? If possible, could you explain where something unexpected happened and what you expected to happen instead?

AmirNickkar · May 13, 2022, 5:15pm

Thank you for your reply. Sure, let me explain it from the beginning. I had a text file containing thousands of terms (one term per line). First, I tried to teach the terms using the following command (all in Command Prompt):

python -m prodigy terms.teach phrase_list en_core_web_lg --seeds ".\data\terms.txt"

Next, I created a pattern file using the following command:

python -m prodigy terms.to-patterns phrase_list .\phrase_list_pattern.jsonl --label TERMS --spacy-model blank:en

Then I labeled (annotated) all terms in a sample of texts using the following command:

python -m prodigy ner.manual phrase_list_annotated blank:en ".\train_data_sample_texts.json" --label TERMs --patterns ".\phrase_list_pattern.jsonl"

The sample text file was an Excel file containing two columns (id and text), like the following table:

id	Context
1	text1
2	text2

Finally, to train and evaluate the model, I used the following command:

python -m prodigy train --ner phrase_list_annotated en_core_web-lg --eval-split 0.2

At the end, to get the results, I ran the following code in a Jupyter Notebook:

nlp = spacy.load(".\en_core_web-lg\model-last")
data_m = srsly.read_json('./train_data_sample_texts.json')
terms = []
id = []

if __name__ == '__main__':
    data_tuples_m = ((eg["text"], eg) for eg in data_m)
    for doc, eg in nlp.pipe(data_tuples_m, as_tuples=True, n_process=3, batch_size=200):
        m_recordid = eg["id"]
        for ent in doc.ents:
            if ent.label_ == "terms":
                terms.append(lemmatizer.lemmatize(ent.text.lower()))
                id.append(m_recordid)
                
df = pd.DataFrame({'id': id, 'terms': terms})

df.to_csv('./prodigy_train_output_terms.csv')

So now in the CSV output file I also see some numbers, as in the picture below. Although there are not many of them, I am wondering why the model picks up numbers instead of text. I know this is not normal and I want to know where I went wrong. I mean, at a minimum the model should pick up text (even unrelated text), not numbers. I used the latest versions of Prodigy, spaCy and the other required libraries. Thanks in advance.


koaning · May 16, 2022, 10:48am

I'll start with a sidenote on paths.

nlp = spacy.load(".\en_core_web-lg\model-last")
data_m = srsly.read_json('./train_data_sample_texts.json')

It seems like your examples use ./ and .\ to denote the folder. I'll assume it didn't cause any errors, but you might want to double-check since operating systems may make assumptions you did not anticipate.

For example, on my Linux machine, this is what I see:

> python -m prodigy terms.teach phrase_list en_core_web_md --seeds ".\data\terms.txt"

ℹ Initializing with 1 seed terms
dataterms.txt

Notice that the --seeds value is now interpreted as dataterms.txt. That's because my Linux machine assumes / to indicate the folder structure as opposed to \. I'd need to run that command another way.

> python -m prodigy terms.teach phrase_list en_core_web_md --seeds data/terms.txt

ℹ Initializing with 7 seed terms from data/terms.txt

Now it does recognize that I'm dealing with a file, and it's able to pick up the terms inside. Could you double-check just to make sure you're not experiencing file path issues?
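
To illustrate the difference, here is a small sketch using Python's standard pathlib module. It only compares how the same string is parsed under Windows and POSIX rules; it does not touch the file system, and the variable names are made up for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The path exactly as typed in the Windows command (a raw string keeps the backslashes).
typed = r".\data\terms.txt"

# Under Windows rules the backslash is a separator: a folder and a file.
windows_parts = PureWindowsPath(typed).parts

# Under POSIX rules (Linux, macOS) the backslash is an ordinary character,
# so the whole string is treated as one single file name.
posix_parts = PurePosixPath(typed).parts

# Forward slashes are understood by both flavours.
portable = "data/terms.txt"
portable_windows = PureWindowsPath(portable).parts
portable_posix = PurePosixPath(portable).parts

print(windows_parts, posix_parts, portable_windows, portable_posix)
```

Writing paths with forward slashes in both the commands and the notebook is probably the simplest way to keep them consistent across machines.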

That said, can you confirm that your train_data_sample_texts.json contains examples that should be matched by your seed terms but aren't?

AmirNickkar · May 16, 2022, 3:19pm

Thank you for your answer. I have corrected the path issue, but nothing has changed. Prodigy used all the seed terms from my txt file. I also checked all file paths again. Regarding your second question: the final model picked up many seed terms from train_data_sample_texts.json successfully, but yes, you are right, I can confirm that the model did not pick up some seed terms from train_data_sample_texts.json even though they should have been matched.

koaning · May 16, 2022, 4:00pm

Could you share some of these non-matched items so that I may be able to reproduce the issue locally? It'd be helpful if I had a few lines in the phrase_list_pattern.jsonl file that don't match your train_data_sample_texts.json file.

AmirNickkar · May 17, 2022, 1:02pm

Thank you for your answer. Of course, please see below a few lines of my phrase_list_pattern.jsonl file.

{"label":"terms","pattern":[{"lower":"hydrolyzed"},{"lower":"soy"},{"lower":"protein"},{"lower":"pg"},{"lower":"-"},{"lower":"propyl"},{"lower":"methylsilanediol"}]}
{"label":"terms","pattern":[{"lower":"sampangine"}]}
{"label":"terms","pattern":[{"lower":"ticlopidine"}]}
{"label":"terms","pattern":[{"lower":"trilobatin"}]}
{"label":"terms","pattern":[{"lower":"alpha"},{"lower":"-"},{"lower":"damascone"}]}
{"label":"terms","pattern":[{"lower":"di"},{"lower":"("},{"lower":"2"},{"lower":"-"},{"lower":"ethylhexyl)phthalate"}]}
{"label":"terms","pattern":[{"lower":"cis-4"},{"lower":","},{"lower":"4'-dinitrostilbene"}]}

Also, please see below a few lines of train_data_sample_texts.json:

{
		"id": 233,
		"text": "  Antioxidant and cryoprotective effects of Amur sturgeon skin gelatin hydrolysates prepared using different commercial proteases in unwashed fish mince were investigated. Gelatin hydrolysates prepared using either Alcalase or Flavourzyme, were effective in preventing lipid oxidation as evidenced by the lower thiobarbituric acid-reactive substances formation. Gelatin hydrolysates were able to retard protein oxidation as indicated by the retarded protein carbonyl formation and lower loss in sulfhydryl content. In the presence of gelatin hydrolysates, unwashed mince had higher transition temperature of myosin and higher enthalpy of myosin and actin as determined by differential scanning calorimetry. Based on low field proton nuclear magnetic resonance analysis, gelatin hydrolysates prevented the displacement of water molecules between the different compartments, thus stabilizing the water associated with myofibrils in unwashed mince induced by repeated freeze-thawing. Oligopeptides in gelatin hydrolysates more likely contributed to the cryoprotective effect. Thus, gelatin hydrolysate could act as both antioxidant and cryoprotectant in unwashed fish mince. Copyright  2015 Elsevier Ltd. All rights reserved. KEYWORDS: Amur sturgeon; Antioxidant activity; Cryoprotective effect; Gelatin hydrolysate; Unwashed fish mince"
	},
	{
		"id": 234,
		"text": "  In the present research, a combined extraction method of ultrasound-assisted extraction (UAE) in conjunction with solid phase extraction (SPE) was applied to isolation and enrichment of selected drugs (metoprolol, ticlopidine, propranolol, carbamazepine, naproxen, acenocumarol, diclofenac, ibuprofen) from fish tissues. The extracted analytes were separated and determined by ultra-high performance liquid chromatography with UV detection (UHPLC-UV) technique. The selectivity of the developed UHPLC-UV method was confirmed by comparison with ultra-high performance liquid chromatography-tandem mass spectrometry (UHPLC-MS/MS) analysis. The important parameters, such as composition of type and pH of extraction solvent, solid/liquid rate volume of extraction solvent and number of extraction cycles were studied. The ultrasonic parameters, such as time, power and temperature of the process were optimized by using a half-fraction factorial central composite design (CCD). The mixture of 10 mL of methanol and 7 mL of water (pH 2.2) (three times) was chosen for the extraction of selected drug from fish tissues. The results showed that the highest recoveries of analytes were obtained with an extraction temperature of 40C, ultrasonic power of 300 W, extraction time of 30 min. Under the optimal conditions, the linearity of method was 0.12-5.00 g/g. The determination coefficients (R(2)) were from 0.979 to 0.998. The limits of detection (LODs) and limits of quantification (LOQs) for the extracted compounds were 0.04-0.17 g/g and 0.12-0.50 g/g, respectively. The recoveries were between 85.5% and 115.8%. Copyright  2015 Elsevier B.V. All rights reserved. KEYWORDS: Chromatography; Drugs; Fish; Ultrasound-assisted extraction"
	},
	{
		"id": 235,
		"text": "  A pretreatment with microwave irradiation was applied to enhance enzyme hydrolysis of corn straw and rice husk immersed in water, aqueous glycerol or alkaline glycerol. Native and pretreated solids underwent enzyme hydrolysis using the extract obtained from the fermentation of Myceliophthora heterothallica, comparing its efficiency with that of the commercial cellulose cocktail Celluclast. The highest saccharification yields, for both corn straw and rice husk, were attained when biomass was pretreated in alkaline glycerol, method that has not been previously reported in literature. Moreover, FTIR, TG and SEM analysis revealed a more significant modification in the structure of corn straw subjected to this pretreatment. Highest global yields were attained with the crude enzyme extract, which might be the result of its content in a great variety of hydrolytic enzymes, as revealed zymogram analysis. Moreover, its hydrolysis efficiency can be improved by its supplementation with commercial -glucosidase. Copyright  2015 Elsevier Ltd. All rights reserved. KEYWORDS: Alkaline glycerol; Enzymatic hydrolysis; Lignocellulosic biomass; Microwave; Pretreatment"
	},
	{
		"id": 236,
		"text": "  English Spanish Los carbohidratos (CHO) simples en el riesgo cardiometablico, conllevan al incremento de la glucemia y los niveles de insulina y, a largo plazo a Diabetes Mellitus tipo 2 (DM2). OBJETIVO: determinar el comportamiento de cifras de glucemia en pacientes DM2 con la ingesta de dos desayunos. METODOLOGA: Se valoraron por antropometra, bioqumica y clnica 14 pacientes con DM2 a quienes se les administr 2 desayunos en tiempos diferentes con 50 g de CHO representados en galleta tipo dulce y pan blanco. RESULTADOS: se evidenci alteracin en el 92,8% de colesterol de baja Densidad (Ldlc), Colesterol Total (CT) y Colesterol de alta densidad (Hdlc) en el 50% y triacilglicerol (TG) en un 35,7%. El comportamiento de la glucemia para el desayuno con galleta no present diferencia significativa en la cifra preprandial y postprandial a las 2 y 3 horas (p= 0,051 y 0,054 respectivamente) la glucemia de las 2 horas con las 3 horas mostraron significancia (p=0,012). En el desayuno con pan blanco la glucemia preprandial y postprandial a las 2 horas aument (p= 0,006), en tanto, que a las 3 horas, la cifra reportada entre las 2 y 3 horas no presentaron diferencias significativas ( p= 0,114 y 0,051 respectivamente). Al comparar cada una de las glucemias de los desayunos en los periodos preprandial a las 2 y 3 horas no se encontraron diferencias estadsticamente significativas (p&gt;0,05). CONCLUSIN: cantidades isocalricas de carbohidratos de 2 desayunos ingeridos en das diferentes se comportaron de igual manera en las cifras de glucemia. El desayuno con galleta favorecera a la poblacin diabtica por los ingredientes utilizados en su elaboracin dada su dislipidemia. Simple Carbohydrates (CHO) in the cardiometabolic risk, lead to the increase of blood glucose and to insulin levels and in the long-term to Diabetes Mellitus type 2( T2DM). OBJECTIVE: To determine the behavior of glycemia figures in T2DM patients with intake of two breakfasts. 
METHODOLOGY: We evaluated by anthropometry, biochemical and clinical 14 patients with DM2 who were administered 2 breakfasts at different times with 50g of CHO represented in sweet biscuit and white bread. RESULTS: alteration was evident in 92.8% of low-density cholesterol (Ldlc), Total Cholesterol (TC) and high density cholesterol (Hdlc) in 50% and triacylglycerol (TG) in 35.7%. The behavior of blood sugar for breakfast with a sweet biscuit did not show significant difference in the preprandial and postprandial figure at the 2and 3 hours (p = 0.051 and 0.054 respectively) blood glucose 2 hours to 3 hours showed significance (p = 0.012). At breakfast with white bread the preprandial and postprandial blood glucose increased at the 2 hours (p = 0.006), while at the 3 hours, the number reported between 2 and 3 hours did not show significantly difference(p = 0.114 and 0.051 respectively). When comparing each of glycemia of the breakfasts in the preprandial periods at 2 and 3 hours, no statistically significant differences were found (p&gt; 0.05). CONCLUSION: isocaloric amounts of carbohydrates of two eaten breakfasts on different days acted similarly in glycemia figures. Breakfast with cookie favor the diabetic population because of the ingredients used on its preparation given their dyslipidemia. Copyright AULA MEDICA EDICIONES 2014. Published by AULA MEDICA. All rights reserved. Comment in [FIGURES PERFORMANCE OF GLYCEMIA IN TYPE 2 DIABETIC PATIENTS WITH INTAKE OF TWO BREAKFAST WITH THE SAME AMOUNT OF CARBOHYDRATES]. [Nutr Hosp. 2015]"
	}

For example, I did not see the term "ticlopidine" in my CSV output file. But again, my main concern is the numbers from the sample text file that show up in my CSV output file even though I never annotated them. Thank you.

koaning · May 18, 2022, 8:21am

This is my phrase list file:

{"label":"terms","pattern":[{"lower":"hydrolyzed"},{"lower":"soy"},{"lower":"protein"},{"lower":"pg"},{"lower":"-"},{"lower":"propyl"},{"lower":"methylsilanediol"}]}
{"label":"terms","pattern":[{"lower":"sampangine"}]}
{"label":"terms","pattern":[{"lower":"ticlopidine"}]}
{"label":"terms","pattern":[{"lower":"trilobatin"}]}
{"label":"terms","pattern":[{"lower":"alpha"},{"lower":"-"},{"lower":"damascone"}]}
{"label":"terms","pattern":[{"lower":"di"},{"lower":"("},{"lower":"2"},{"lower":"-"},{"lower":"ethylhexyl)phthalate"}]}
{"label":"terms","pattern":[{"lower":"cis-4"},{"lower":","},{"lower":"4'-dinitrostilbene"}]}

This is my sample text file.

{"id": 233, "text": "Antioxidant and cryoprotective effects of Amur sturgeon skin gelatin hydrolysates prepared using different commercial proteases in unwashed fish mince were investigated. Gelatin hydrolysates prepared using either Alcalase or Flavourzyme, were effective in preventing lipid oxidation as evidenced by the lower thiobarbituric acid-reactive substances formation. Gelatin hydrolysates were able to retard protein oxidation as indicated by the retarded protein carbonyl formation and lower loss in sulfhydryl content. In the presence of gelatin hydrolysates, unwashed mince had higher transition temperature of myosin and higher enthalpy of myosin and actin as determined by differential scanning calorimetry. Based on low field proton nuclear magnetic resonance analysis, gelatin hydrolysates prevented the displacement of water molecules between the different compartments, thus stabilizing the water associated with myofibrils in unwashed mince induced by repeated freeze-thawing. Oligopeptides in gelatin hydrolysates more likely contributed to the cryoprotective effect. Thus, gelatin hydrolysate could act as both antioxidant and cryoprotectant in unwashed fish mince. Copyright  2015 Elsevier Ltd. All rights reserved. KEYWORDS: Amur sturgeon; Antioxidant activity; Cryoprotective effect; Gelatin hydrolysate; Unwashed fish mince"}
{"id": 234, "text": "In the present research, a combined extraction method of ultrasound-assisted extraction (UAE) in conjunction with solid phase extraction (SPE) was applied to isolation and enrichment of selected drugs (metoprolol, ticlopidine, propranolol, carbamazepine, naproxen, acenocumarol, diclofenac, ibuprofen) from fish tissues. The extracted analytes were separated and determined by ultra-high performance liquid chromatography with UV detection (UHPLC-UV) technique. The selectivity of the developed UHPLC-UV method was confirmed by comparison with ultra-high performance liquid chromatography-tandem mass spectrometry (UHPLC-MS/MS) analysis. The important parameters, such as composition of type and pH of extraction solvent, solid/liquid rate volume of extraction solvent and number of extraction cycles were studied. The ultrasonic parameters, such as time, power and temperature of the process were optimized by using a half-fraction factorial central composite design (CCD). The mixture of 10 mL of methanol and 7 mL of water (pH 2.2) (three times) was chosen for the extraction of selected drug from fish tissues. The results showed that the highest recoveries of analytes were obtained with an extraction temperature of 40C, ultrasonic power of 300 W, extraction time of 30 min. Under the optimal conditions, the linearity of method was 0.12-5.00 g/g. The determination coefficients (R(2)) were from 0.979 to 0.998. The limits of detection (LODs) and limits of quantification (LOQs) for the extracted compounds were 0.04-0.17 g/g and 0.12-0.50 g/g, respectively. The recoveries were between 85.5% and 115.8%. Copyright  2015 Elsevier B.V. All rights reserved. KEYWORDS: Chromatography; Drugs; Fish; Ultrasound-assisted extraction"}
{"id": 235, "text": "A pretreatment with microwave irradiation was applied to enhance enzyme hydrolysis of corn straw and rice husk immersed in water, aqueous glycerol or alkaline glycerol. Native and pretreated solids underwent enzyme hydrolysis using the extract obtained from the fermentation of Myceliophthora heterothallica, comparing its efficiency with that of the commercial cellulose cocktail Celluclast. The highest saccharification yields, for both corn straw and rice husk, were attained when biomass was pretreated in alkaline glycerol, method that has not been previously reported in literature. Moreover, FTIR, TG and SEM analysis revealed a more significant modification in the structure of corn straw subjected to this pretreatment. Highest global yields were attained with the crude enzyme extract, which might be the result of its content in a great variety of hydrolytic enzymes, as revealed zymogram analysis. Moreover, its hydrolysis efficiency can be improved by its supplementation with commercial -glucosidase. Copyright  2015 Elsevier Ltd. All rights reserved. KEYWORDS: Alkaline glycerol; Enzymatic hydrolysis; Lignocellulosic biomass; Microwave; Pretreatment"}
{"id": 236, "text": "English Spanish Los carbohidratos (CHO) simples en el riesgo cardiometablico, conllevan al incremento de la glucemia y los niveles de insulina y, a largo plazo a Diabetes Mellitus tipo 2 (DM2). OBJETIVO: determinar el comportamiento de cifras de glucemia en pacientes DM2 con la ingesta de dos desayunos. METODOLOGA: Se valoraron por antropometra, bioqumica y clnica 14 pacientes con DM2 a quienes se les administr 2 desayunos en tiempos diferentes con 50 g de CHO representados en galleta tipo dulce y pan blanco. RESULTADOS: se evidenci alteracin en el 92,8% de colesterol de baja Densidad (Ldlc), Colesterol Total (CT) y Colesterol de alta densidad (Hdlc) en el 50% y triacilglicerol (TG) en un 35,7%. El comportamiento de la glucemia para el desayuno con galleta no present diferencia significativa en la cifra preprandial y postprandial a las 2 y 3 horas (p= 0,051 y 0,054 respectivamente) la glucemia de las 2 horas con las 3 horas mostraron significancia (p=0,012). En el desayuno con pan blanco la glucemia preprandial y postprandial a las 2 horas aument (p= 0,006), en tanto, que a las 3 horas, la cifra reportada entre las 2 y 3 horas no presentaron diferencias significativas ( p= 0,114 y 0,051 respectivamente). Al comparar cada una de las glucemias de los desayunos en los periodos preprandial a las 2 y 3 horas no se encontraron diferencias estadsticamente significativas (p&gt;0,05). CONCLUSIN: cantidades isocalricas de carbohidratos de 2 desayunos ingeridos en das diferentes se comportaron de igual manera en las cifras de glucemia. El desayuno con galleta favorecera a la poblacin diabtica por los ingredientes utilizados en su elaboracin dada su dislipidemia. Simple Carbohydrates (CHO) in the cardiometabolic risk, lead to the increase of blood glucose and to insulin levels and in the long-term to Diabetes Mellitus type 2( T2DM). OBJECTIVE: To determine the behavior of glycemia figures in T2DM patients with intake of two breakfasts. 
METHODOLOGY: We evaluated by anthropometry, biochemical and clinical 14 patients with DM2 who were administered 2 breakfasts at different times with 50g of CHO represented in sweet biscuit and white bread. RESULTS: alteration was evident in 92.8% of low-density cholesterol (Ldlc), Total Cholesterol (TC) and high density cholesterol (Hdlc) in 50% and triacylglycerol (TG) in 35.7%. The behavior of blood sugar for breakfast with a sweet biscuit did not show significant difference in the preprandial and postprandial figure at the 2and 3 hours (p = 0.051 and 0.054 respectively) blood glucose 2 hours to 3 hours showed significance (p = 0.012). At breakfast with white bread the preprandial and postprandial blood glucose increased at the 2 hours (p = 0.006), while at the 3 hours, the number reported between 2 and 3 hours did not show significantly difference(p = 0.114 and 0.051 respectively). When comparing each of glycemia of the breakfasts in the preprandial periods at 2 and 3 hours, no statistically significant differences were found (p&gt; 0.05). CONCLUSION: isocaloric amounts of carbohydrates of two eaten breakfasts on different days acted similarly in glycemia figures. Breakfast with cookie favor the diabetic population because of the ingredients used on its preparation given their dyslipidemia. Copyright AULA MEDICA EDICIONES 2014. Published by AULA MEDICA. All rights reserved. Comment in [FIGURES PERFORMANCE OF GLYCEMIA IN TYPE 2 DIABETIC PATIENTS WITH INTAKE OF TWO BREAKFAST WITH THE SAME AMOUNT OF CARBOHYDRATES]. [Nutr Hosp. 2015]"}

And when I run Prodigy like so:

python -m prodigy ner.manual phrase_list_annot blank:en sample_text.jsonl --label TERMs --patterns phrase_list_jsonl.jsonl

Then I can confirm that the phrase list is able to detect this:

Can you confirm this is happening on your machine too? If so, there may be an error in your script that turns the data into a CSV. Do you have any information on the lemmatizer.lemmatize that you're using?

AmirNickkar · May 18, 2022, 1:15pm

Yes, I confirm that my machine can recognize it. We used lemmatizer.lemmatize to standardize our texts, so that, for example, "apple" and "apples" refer to the same thing. So I have two questions: do you recommend another approach for lemmatization? And is there any relationship between lemmatization and the numbers that were picked up? Thank you.

koaning · May 19, 2022, 8:54am

I was zooming in on the lemmatizer.lemmatize method because it seems to be custom code. I'm not exactly sure if the error is introduced there, but since I'm assuming it's custom code I cannot rule it out. The fact that your machine is also able to detect ticlopidine suggests that the error is to be found in your final csv script, not in the patterns file.

When I want to grab a lemma, I usually just directly use spaCy's .lemma_ property.

import spacy
nlp = spacy.load("en_core_web_md")

[t.lemma_ for t in nlp("he buys two apples")]
# ['he', 'buy', 'two', 'apple']

doc = nlp("hi my name is Vincent D. Warmerdam")
doc.ents[0].lemma_ # 'Vincent D. Warmerdam'

Would this .lemma_ also work for you?

AmirNickkar · May 19, 2022, 10:01pm

Thank you. Maybe I did not understand what you meant by "custom code": lemmatizer.lemmatize is actually a standard lemmatization procedure in NLTK. I set it up with the following code:

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

I even removed lemmatizer.lemmatize from the script and got the same numbers in my CSV output file again, so I think lemmatizer.lemmatize has no effect on the numbers in my output file. The other thing is that you changed train_data_sample_texts.json to a "jsonl" file. Do you see any better performance or other difference when using a "jsonl" file rather than a "json" file for the sample texts? Thank you.

koaning · May 20, 2022, 12:12pm

A JSONL file can be read and processed line by line. A JSON file needs to be read in completely before it can be processed. That means that if, for example, you're only interested in finding a few examples that fit a pattern, you'll be able to do so with a smaller memory footprint when you use JSONL.
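
As a rough sketch of that difference, using only Python's standard json module: the two in-memory "files" and the helper name iter_jsonl are made up for illustration (srsly also offers read_jsonl for real files).

```python
import io
import json

# Two tiny in-memory "files" standing in for real ones (contents are made up).
json_file = io.StringIO('[{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]')
jsonl_file = io.StringIO('{"id": 1, "text": "first"}\n{"id": 2, "text": "second"}\n')

# JSON: the whole array has to be parsed before the first record is available.
all_records = json.load(json_file)

def iter_jsonl(handle):
    """Yield one record per non-empty line, without loading the rest of the file."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# JSONL: we can stop as soon as the record of interest is found.
first_match = next(eg for eg in iter_jsonl(jsonl_file) if "first" in eg["text"])

# The second line was never parsed; it is still waiting in the file.
remaining_unread = jsonl_file.read()

print(len(all_records), first_match, repr(remaining_unread))
```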

I suppose there's another debugging thing we might be able to try. Your output suggests that at some point 4:04 is added as a term, right? So then you might alter the script so you can see what example is triggering it.

nlp = spacy.load(".\en_core_web-lg\model-last")
data_m = srsly.read_json('./train_data_sample_texts.json')
terms = []
id = []

if __name__ == '__main__':
    data_tuples_m = ((eg["text"], eg) for eg in data_m)
    for doc, eg in nlp.pipe(data_tuples_m, as_tuples=True, n_process=3, batch_size=200):
        m_recordid = eg["id"]
        for ent in doc.ents:
            if ent.label_ == "terms":
                term = lemmatizer.lemmatize(ent.text.lower())
                # Let's try and figure out what docs are causing this behavior.
                if term == "4:04":
                    print(doc)
                terms.append(term)
                id.append(m_recordid)
                
df = pd.DataFrame({'id': id, 'terms': terms})

df.to_csv('./prodigy_train_output_terms.csv')

The goal is to figure out how those numbers got introduced. So you should be able to fetch a few docs that got matched and then we can try to figure out what's going wrong from there.
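
If checking for one specific value like "4:04" turns out to be too narrow, a more general filter might help. Below is a minimal sketch, assuming that "number-like" means digits joined by dots, commas or colons; the regular expression, the helper name looks_numeric and the sample rows are all made up for illustration and are not taken from your data.

```python
import re

# Matches strings made only of digits and common numeric separators,
# e.g. "4:04", "0.12" or "35,7" (an assumption about what counts as a number).
NUMERIC_RE = re.compile(r"\d+(?:[.,:]\d+)*")

def looks_numeric(term):
    """Return True when the whole term is a number-like string."""
    return NUMERIC_RE.fullmatch(term.strip()) is not None

# Made-up (id, term) pairs standing in for the values collected in the loop.
rows = [(233, "gelatin"), (234, "ticlopidine"), (234, "0.12"), (236, "4:04"), (236, "cis-4")]

# Keep only the rows whose term is number-like, together with the record id,
# so the matching source texts can be looked up afterwards.
suspicious = [(rec_id, term) for rec_id, term in rows if looks_numeric(term)]
print(suspicious)
```

Inside the loop you could call looks_numeric(term) instead of comparing against "4:04" and print the doc for every hit.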
