How to use a loader to load a CSV with text and label?

Hi, I have a CSV file in which each name is labeled as either person or company. A sample of the file is shown below:

ABC Company, company
BCD Company, company
Peter Smith, Person
Mary Smith, Person

I would like to load the CSV into Prodigy for text classification, but I could not find much information on how to implement a loader. Thank you for your assistance.

Hi! Just to make sure I understand your question correctly: you want to use your examples in the CSV file to pre-select examples containing those company and person names? Because the example you posted doesn't look like the data you're annotating, right?

In that case, you can convert them to a patterns.jsonl file and then use the file as the --patterns argument. Here's an example:

{"label": "company", "pattern": "ABC Company"}
{"label": "person", "pattern": "Peter Smith"}

If you just want to load a CSV file for annotation, the default CSV loader should also work fine if your examples have a column named text. You can also find the docs on custom loaders here. If you're using a custom recipe, all you need is a generator that yields dictionaries, so you don't have to follow any specific implementation. How you set that up is up to you 🙂
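
To illustrate the "generator that yields dictionaries" part, here is a rough sketch, not an official loader. It assumes the CSV has a header row with a column named text; the function name `stream_from_csv` is hypothetical.

```python
import csv
import io


def stream_from_csv(csv_file):
    """Yield one task dict per CSV row, using the "text" column."""
    for row in csv.DictReader(csv_file):
        text = (row.get("text") or "").strip()
        if text:  # skip rows with an empty or missing text value
            yield {"text": text}


# Inline sample standing in for a CSV file with a "text" header.
sample = io.StringIO("text\nABC Company\nPeter Smith\n")
tasks = list(stream_from_csv(sample))
print(tasks)
```

In a custom recipe, a generator like this could serve as the stream of examples. How the file is opened and which extra keys each dictionary carries is up to you.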

Thanks Ines. I have created a JSONL file in exactly the format you mentioned. Just to make sure I followed it correctly, should the command be:

prodigy textcat.manual name_classification file.jsonl --label name,company --pattern file.jsonl

Is this correct? Thanks again

No, that wouldn't work – the textcat.manual recipe doesn't take an argument for pattern files. It will just go through all examples in your data and ask you to label every example. You can also see some examples of text classification workflows in the docs.

What are you trying to achieve with the patterns / the examples in your CSV file? What do they mean, and what do you want to use them for? If you want to use them to select examples to annotate, maybe a workflow like match with --combine-matches and --label-task would be a better fit?

Hi Ines, I am considering converting the CSV to JSONL format and using db-in to resolve the issue. I have exported my existing dataset with db-out to see the format. What I would like to know is whether I need to include _input_hash, _task_hash, _session_id, _view_id and answer in my JSONL.

sample jsonl
{"text":"ABC Limited","label":"company"}
{"text":"John Doe","label":"person"}

db out jsonl
{"text":"ABC Limited","_input_hash":-6232565122,"_task_hash":-520270855,"label":"company","_session_id":null,"_view_id":"classification","answer":"accept"}
{"text":"John Doe","_input_hash":-6232554122,"_task_hash":-510170855,"label":"person","_session_id":null,"_view_id":"classification","answer":"accept"}

Thanks again

answer is required, yes, because it indicates whether the example is a positive or negative example of that label, or whether it was skipped. The hashes are required, but they're generated automatically based on the data, so you don't have to pre-set them. _session_id, _view_id are also generated automatically when you annotate.
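
As a small sketch of what that means for the sample JSONL above: only the answer key needs to be added by hand. The helper name `add_answer` is made up, and treating every row as "accept" is an assumption that only holds if all labels in the CSV are known to be correct.

```python
import json


def add_answer(records, answer="accept"):
    """Return copies of the records with an "answer" key added."""
    # Hashes, _session_id and _view_id are deliberately left out here,
    # since they are generated automatically.
    for record in records:
        yield {**record, "answer": answer}


records = [
    {"text": "ABC Limited", "label": "company"},
    {"text": "John Doe", "label": "person"},
]
lines = [json.dumps(task) for task in add_answer(records)]
print("\n".join(lines))
```

Written to a file, these lines could then be imported with something like prodigy db-in name_classification file.jsonl, assuming the dataset name used earlier in this thread.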

Btw, are you sure you want to use those single phrases to train your text classifier? Text classification typically works best if you have actual text, like sentences or paragraphs, and you're typically predicting categories over the whole text. In your example, you seem to just have single phrases, more like named entities?

For my use case, I would like to classify a single phrase, e.g. John Smith or Tesco Limited, as either person or company. Are you suggesting I should use NER instead of textcat? Thank you so much.

NER is the task of detecting names etc. in context, so that's an approach you would use if your input consists of whole texts with those phrases in context. But that doesn't seem to be the case here because there's no context, so there's also nothing for the model to learn.

If your input data only consists of single phrases, text classification might not work that well either because... there's not really any text. Text classification algorithms are typically designed to predict labels over text and use the text for clues.

At least for parts of what you're trying to do here, framing it as a prediction task doesn't seem that useful? For instance, do you really need to predict "Tesco Limited"? I doubt you'd get better results using a model than you would with a database lookup and some fuzzy matching. So you probably want to focus on a rule-based system first, also so you have a good baseline and know what you need to beat.
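
A rule-based baseline along those lines might look roughly like the sketch below. Everything in it is illustrative: the `KNOWN` table stands in for a real database, the suffix list and the 0.85 cutoff are arbitrary choices, and fuzzy matching here uses `difflib` from the Python standard library rather than any particular matching library.

```python
import difflib

# Hypothetical lookup table standing in for a real database of known names.
KNOWN = {
    "tesco limited": "company",
    "abc company": "company",
    "peter smith": "person",
    "mary smith": "person",
}

# Hypothetical fallback rule: a trailing token that suggests a company.
COMPANY_SUFFIXES = {"limited", "ltd", "company", "inc"}


def classify(name, cutoff=0.85):
    """Return "person", "company", or None if no rule applies."""
    tokens = name.lower().split()
    key = " ".join(tokens)  # normalize case and whitespace

    # 1. Exact lookup.
    if key in KNOWN:
        return KNOWN[key]

    # 2. Fuzzy lookup to tolerate small typos.
    close = difflib.get_close_matches(key, list(KNOWN), n=1, cutoff=cutoff)
    if close:
        return KNOWN[close[0]]

    # 3. Simple suffix rule.
    if tokens and tokens[-1] in COMPANY_SUFFIXES:
        return "company"

    return None  # unknown: leave for manual review


for example in ["Tesco Limited", "Tesco Limted", "Acme Ltd", "Jane Roe"]:
    print(example, "->", classify(example))
```

Names that come back as None are the ones the rules can't handle, which gives you a concrete measure of what a model would need to beat.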