How to use loader to load a csv with text and label?

PlataleaMinor · June 2, 2020, 9:45am

Hi, I have a csv file labeled with person or company. The sample file is like below

text,label
ABC Company, company
BCD Company, company
Peter Smith, Person
Mary Smith, Person

I would like to load the CSV into Prodigy for text classification, but I could not find much information on how to implement a loader. Thank you for your assistance.

ines · June 2, 2020, 2:00pm

Hi! Just to make sure I understand your question correctly: you want to use your examples in the CSV file to pre-select examples containing those company and person names? Because the example you posted doesn't look like the data you're annotating, right?

In that case, you can convert them to a patterns.jsonl file and then use the file as the --patterns argument. Here's an example:

{"label": "company", "pattern": "ABC Company"}
{"label": "person", "pattern": "Peter Smith"}

If you just want to load a CSV file for annotation, the default CSV loader should also work fine if your examples have a column named text. You can also find the docs on custom loaders here. If you're using a custom recipe, all you need is a generator that yields dictionaries, so you don't have to follow any specific implementation. How you set that up is up to you.
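As a minimal sketch of the "generator that yields dictionaries" idea, assuming a CSV with a `text` column: the name `load_csv_tasks` is hypothetical and not part of Prodigy, and how you hook the generator into a custom recipe is left out here.

```python
import csv
import io


def load_csv_tasks(csv_file):
    """Yield one {"text": ...} dictionary per non-empty CSV row."""
    reader = csv.DictReader(csv_file, skipinitialspace=True)
    for row in reader:
        text = (row.get("text") or "").strip()
        if text:  # skip rows without any text
            yield {"text": text}


# In-memory stand-in for the CSV file; the empty row is skipped
sample = io.StringIO(
    "text,label\n"
    "ABC Company, company\n"
    ", company\n"
    "Mary Smith, Person\n"
)
tasks = list(load_csv_tasks(sample))
for task in tasks:
    print(task)
```

Because it's a generator, rows are read lazily, so the same function also works for files that are too large to hold in memory.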

PlataleaMinor · June 3, 2020, 7:58am

Thanks Ines. I have created a jsonl which has the exact format you mentioned. Just want to make sure I followed it correctly, the code should be

prodigy textcat.manual name_classification file.jsonl --label name,company --pattern file.jsonl

Is this correct? Thanks again

ines · June 3, 2020, 11:21am

No, that wouldn't work – the textcat.manual recipe doesn't take an argument for pattern files. It will just go through all examples in your data and ask you to label every example. You can also see some examples of text classification workflows in the docs.

What are you trying to achieve with the patterns / the examples in your CSV file? What do they mean, and what do you want to use them for? If you want to use them to select examples to annotate, maybe a workflow like match with --combine-matches and --label-task would be a better fit?

PlataleaMinor · June 9, 2020, 8:45am

Hi ines, I am considering converting the CSV to JSONL format and using db-in to resolve the issue. I have run db-out on my existing dataset to see the format. What I would like to know is whether I need to include _input_hash, _task_hash, _session_id, _view_id and answer in my JSONL.

sample jsonl
{"text":"ABC Limited","label":"company"}
{"text":"John Doe","label":"person"}

db out jsonl
{"text":"ABC Limited","_input_hash":-6232565122,"_task_hash":-520270855,"label":"company","_session_id":null,"_view_id":"classification","answer":"accept"}
{"text":"John Doe","_input_hash":-6232554122,"_task_hash":-510170855,"label":"person","_session_id":null,"_view_id":"classification","answer":"accept"}

Thanks again

ines · June 10, 2020, 7:51am

answer is required, yes, because it indicates whether the example is a positive or negative example of that label, or whether it was skipped. The hashes are required, but they're generated automatically based on the data, so you don't have to pre-set them. _session_id, _view_id are also generated automatically when you annotate.
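For illustration, here's a rough sketch of building the records with only `text`, `label` and `answer`, leaving out the hashes, `_session_id` and `_view_id` as described above. It assumes every row in the CSV is a positive example of its label (hence `"accept"`); the function name `csv_to_annotations` and the lowercasing of labels are assumptions for this example.

```python
import csv
import io
import json


def csv_to_annotations(csv_file, answer="accept"):
    """Yield annotation-style dicts with text, label and answer only."""
    reader = csv.DictReader(csv_file, skipinitialspace=True)
    for row in reader:
        yield {
            "text": row["text"].strip(),
            # Assumption: labels are normalized to lowercase
            "label": row["label"].strip().lower(),
            # Assumption: each CSV row is a positive example of its label
            "answer": answer,
        }


# In-memory stand-in for the CSV file
sample = io.StringIO(
    "text,label\n"
    "ABC Limited, company\n"
    "John Doe, Person\n"
)
records = list(csv_to_annotations(sample))

# One JSON object per line, as in the JSONL samples above
for record in records:
    print(json.dumps(record))
```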

Btw, are you sure you want to use those single phrases to train your text classifier? Text classification typically works best if you have actual text, like sentences or paragraphs, and you're typically predicting categories over the whole text. In your example, you seem to just have single phrases, more like named entities?

PlataleaMinor · June 17, 2020, 2:22am

For my use case, I would like to classify a single phrase (e.g. John Smith or Tesco Limited) as either person or company. Are you suggesting I should use ner instead of textcat? Thank you so much.

ines · June 17, 2020, 9:37am

NER is the task of detecting names etc. in context, so that's an approach you would use if your input consists of whole texts with those phrases in context. But that doesn't seem to be the case here because there's no context, so there's also nothing for the model to learn.

If your input data only consists of single phrases, text classification might not work that well either because... there's not really any text. Text classification algorithms are typically designed to predict labels over text and use the text for clues.

At least for parts of what you're trying to do here, framing it as a prediction task doesn't seem that useful? For instance, do you really need to predict "Tesco Limited"? I doubt you'd get better results using a model than you would with a database lookup and some fuzzy matching. So you probably want to focus on a rule-based system first, also so you have a good baseline and know what you need to beat.
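A very rough sketch of what such a lookup-plus-fuzzy-matching baseline could look like, using only Python's standard library. The `KNOWN_NAMES` table, the `classify` function and the 0.8 cutoff are all made up for illustration; in practice the lookup would come from your own database, and you'd tune the cutoff on real data.

```python
import difflib

# Hypothetical lookup table; in practice this would come from a database
KNOWN_NAMES = {
    "tesco limited": "company",
    "abc company": "company",
    "peter smith": "person",
    "mary smith": "person",
}


def classify(phrase, known=KNOWN_NAMES, cutoff=0.8):
    """Return the label for a phrase via exact lookup, then fuzzy matching."""
    key = phrase.strip().lower()
    if key in known:  # exact match after normalization
        return known[key]
    # Fall back to the closest known name above the similarity cutoff
    matches = difflib.get_close_matches(key, list(known), n=1, cutoff=cutoff)
    return known[matches[0]] if matches else None


print(classify("Tesco Limited"))  # exact match
print(classify("Tesco Limted"))   # typo, resolved by fuzzy matching
print(classify("zzz"))            # no close match, so no label
```

Returning `None` for unknown phrases makes it easy to count how many inputs the rules can't handle, which gives you the baseline a model would need to beat.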
