Textcat: Multiple categories per "accept" key?

Hey everybody,

I'm currently working on a textcat (multilabel) model that categorizes bank-turnover purposes. So far I've labeled around 4,000 examples.

I find it hard to keep an overview of all my examples, so I export them to JSONL and skim through them with jq.

I noticed some strange entries in my data and am now worried that something is wrong with my dataset:

When I run this:

prodigy db-out bank_turnovers_categorization > assets/bank_turnovers_categorization_annotated.jsonl
jq -n '[inputs | .accept] | add' assets/bank_turnovers_categorization_annotated.jsonl
[
  "Sonstige Kosten",
  "Kaltmiete, Nebenkostenvorauszahlung, Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Heizkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Kaltmiete",
  "Nebenkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung",
  "Kaltmiete, Nebenkostenvorauszahlung, Heizkostenvorauszahlung",
  "Kaltmiete, Heizkostenvorauszahlung, Nebenkostenvorauszahlung",
...
]

You can see that I get several entries where the .accept key has multiple values. But they are not separate entries in the .accept array; instead, they appear as one single string in which the values are separated by commas.

That looks broken to me. I also noticed that the model is worse at recognizing bank turnovers belonging to the categories that appear in those "multi-entry" .accept examples than it is at recognizing the ones where .accept has only a single value.

Can you advise?

Thanks!

Hi @toadle,

In multilabel textcat annotation the accepted options are stored as a list of strings (the IDs of options) under the accept key. So in general, it is expected to have multiple strings under accept.

I assume that you're concerned about entries such as:

...
"Nebenkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung",
...

And you are expecting all these strings to be separate entries like so:

...
"Nebenkostenvorauszahlung",
"Kaltmiete",
"Heizkostenvorauszahlung"
...

Your jq command aggregates all the accept values across examples. Could you find and share the example that has

"Nebenkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung",

as one of its accept values? We should inspect the options field of that example.
Also, are you using a built-in recipe? If not, could you also share how the task stream is constructed in the custom recipe?
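For instance, a small Python script along these lines could surface the offending rows in the export (just a sketch; the file path is taken from your db-out command above):

```python
import json

def find_combined_accepts(lines):
    """Yield (line_number, example) for rows whose "accept" entries
    contain a comma, i.e. several labels squashed into one string."""
    for line_no, line in enumerate(lines, start=1):
        example = json.loads(line)
        if any("," in label for label in example.get("accept", [])):
            yield line_no, example

# Usage against the export (path taken from the db-out call above):
# with open("assets/bank_turnovers_categorization_annotated.jsonl") as f:
#     for line_no, example in find_combined_accepts(f):
#         print(line_no, example["accept"])
```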
Thanks!

Yes, you are right. I am concerned about the entries where multiple labels get combined into one entry on the .accept key.

Here is one of the full lines from the JSONL where this occurs:

{
  "text": "REDACTED",
  "meta": {
    "id": REDACTED,
    "created_at": "REDACTED",
    "counter_holder": "REDACTED",
    "amount": REDACTED
  },
  "options": [
    {
      "id": "Aufzüge",
      "text": "Aufzüge"
    },
    {
      "id": "Bank- & Kontoführungsgebühren",
      "text": "Bank- & Kontoführungsgebühren"
    },
    {
      "id": "Beleuchtung/Strom",
      "text": "Beleuchtung/Strom"
    },
    {
      "id": "Entwässerung",
      "text": "Entwässerung"
    },
    {
      "id": "Gebäudereinigung",
      "text": "Gebäudereinigung"
    },
    {
      "id": "Grundsteuer",
      "text": "Grundsteuer"
    },
    {
      "id": "Hausgeld",
      "text": "Hausgeld"
    },
    {
      "id": "Heizkosten",
      "text": "Heizkosten"
    },
    {
      "id": "Heizkostenvorauszahlung",
      "text": "Heizkostenvorauszahlung"
    },
    {
      "id": "Instandhaltungskosten",
      "text": "Instandhaltungskosten"
    },
    {
      "id": "Kaltmiete",
      "text": "Kaltmiete"
    },
    {
      "id": "Kaltwasser",
      "text": "Kaltwasser"
    },
    {
      "id": "Kaution",
      "text": "Kaution"
    },
    {
      "id": "Modernisierungskosten",
      "text": "Modernisierungskosten"
    },
    {
      "id": "Müllabfuhr",
      "text": "Müllabfuhr"
    },
    {
      "id": "Nachzahlung aus Nebenkostenabrechnung",
      "text": "Nachzahlung aus Nebenkostenabrechnung"
    },
    {
      "id": "Nebenkostenvorauszahlung",
      "text": "Nebenkostenvorauszahlung"
    },
    {
      "id": "Rückzahlung Kaution",
      "text": "Rückzahlung Kaution"
    },
    {
      "id": "Sach- & Haftpflichtversicherungen",
      "text": "Sach- & Haftpflichtversicherungen"
    },
    {
      "id": "Schornsteinfeger",
      "text": "Schornsteinfeger"
    },
    {
      "id": "Sonstige Ausgaben",
      "text": "Sonstige Ausgaben"
    },
    {
      "id": "Sonstige Einnahmen",
      "text": "Sonstige Einnahmen"
    },
    {
      "id": "Sonstige Kosten",
      "text": "Sonstige Kosten"
    },
    {
      "id": "Sonstige Nebenkosten",
      "text": "Sonstige Nebenkosten"
    },
    {
      "id": "Straßenreinigung",
      "text": "Straßenreinigung"
    },
    {
      "id": "TV/Fernsehen",
      "text": "TV/Fernsehen"
    },
    {
      "id": "Tilgung",
      "text": "Tilgung"
    },
    {
      "id": "Werbungskosten",
      "text": "Werbungskosten"
    },
    {
      "id": "Zinsen",
      "text": "Zinsen"
    }
  ],
  "accept": [
    "Nebenkostenvorauszahlung, Kaltmiete, Kaltmiete, Nebenkostenvorauszahlung"
  ],
  "config": {
    "choice_style": "multiple"
  },
  "answer": "accept",
  "_input_hash": 1183031980,
  "_task_hash": 1872175840
}

As stated earlier, the only thing I did to actually get those entries into the JSONL file was calling:

prodigy db-out bank_turnovers_categorization > assets/bank_turnovers_categorization_annotated.jsonl

And there is not one, but a lot of those entries:

jq -n '[inputs | .accept] | add | group_by(.) | map({key: .[0], value: length}) | from_entries' assets/bank_turnovers_categorization_annotated.jsonl
{
  "Bank- & Kontoführungsgebühren": 4,
  "Beleuchtung/Strom": 81,
  "Entwässerung": 161,
  "Entwässerung, Kaltwasser": 3,
  "Gebäudereinigung": 10,
  "Grundsteuer": 253,
  "Hausgeld": 3,
  "Hausgeld, Hausgeld": 3,
  "Hausgeld, Hausgeld, Instandhaltungskosten": 2,
  "Hausgeld, Instandhaltungskosten, Hausgeld": 1,
  "Heizkosten": 65,
  "Heizkostenvorauszahlung, Kaltmiete, Nebenkostenvorauszahlung": 48,
  "Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Kaltmiete": 42,
  "Instandhaltungskosten": 63,
  "Kaltmiete": 1057,
  "Kaltmiete, Heizkostenvorauszahlung, Nebenkostenvorauszahlung": 45,
  "Kaltmiete, Nebenkostenvorauszahlung": 148,
  "Kaltmiete, Nebenkostenvorauszahlung, Heizkostenvorauszahlung": 31,
  "Kaltmiete, Nebenkostenvorauszahlung, Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Heizkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung, Nebenkostenvorauszahlung, Kaltmiete": 1,
  "Kaltwasser": 177,
  "Kaltwasser, Sonstige Nebenkosten": 1,
  "Kaution": 75,
  "Mietminderung": 4,
  "Modernisierungskosten": 14,
  "Müllabfuhr": 20,
  "Nachzahlung aus Nebenkostenabrechnung": 5,
  "Nebenkostenvorauszahlung": 90,
  "Nebenkostenvorauszahlung, Heizkostenvorauszahlung, Kaltmiete": 38,
  "Nebenkostenvorauszahlung, Kaltmiete": 156,
  "Nebenkostenvorauszahlung, Kaltmiete, Heizkostenvorauszahlung": 45,
  "Nebenkostenvorauszahlung, Kaltmiete, Kaltmiete, Nebenkostenvorauszahlung": 1,
  "Nebenkostenvorauszahlung, Kaltmiete, Nebenkostenvorauszahlung, Kaltmiete": 1,
  "Privat": 833,
  "Rückzahlung Kaution": 50,
  "Sach- & Haftpflichtversicherungen": 140,
  "Schornsteinfeger": 1,
  "Sonstige Ausgaben": 12,
  "Sonstige Ausgaben, Aufzüge": 1,
  "Sonstige Einnahmen": 9,
  "Sonstige Kosten": 1,
  "Sonstige Nebenkosten": 39,
  "Sonstige Nebenkosten, Sonstige Ausgaben, Sonstige Nebenkosten, Sonstige Nebenkosten, Sonstige Nebenkosten, Sonstige Nebenkosten, Instandhaltungskosten, Sonstige Nebenkosten, Sonstige Nebenkosten, Sonstige Nebenkosten, Sonstige Nebenkosten, Sonstige Nebenkosten": 1,
  "Straßenreinigung": 8,
  "TV/Fernsehen": 114,
  "Tilgung": 730,
  "Tilgung, Zinsen": 18,
  "Werbungskosten": 229,
  "Zinsen": 703,
  "Zinsen, Tilgung": 11
}

In order to label them I followed the manual, adding only a few labels at a time, like this:

prodigy textcat.manual bank_turnovers_categorization assets/bank_turnovers_versicherung.jsonl --label "Sach- & Haftpflichtversicherungen,Werbungskosten,Privat"

Later I resorted to incorporating an LLM to help:

prodigy textcat.llm.correct bank_turnovers_categorization configs/spacy_textcat_llm_config.cfg assets/bank_turnovers_miete.jsonl

Hi @toadle,

I'm trying to understand the details of your workflow to reproduce the outcome you see.
Did the interface ever offer an option that looked like:

"Nebenkostenvorauszahlung, Kaltmiete, Kaltmiete, Nebenkostenvorauszahlung"

You say you were only annotating a few labels at a time, yet in the annotated example you provided there are 29 options. Also, the accepted options consisting of multiple items contain duplicates (most of the time). Were you performing any re-annotation of the previously annotated dataset, or any post-hoc aggregation? From the information you provided, I understand that you were appending examples annotated with different sets of labels to the same dataset, using a different input file every time, I assume. Please correct me if I'm wrong.
In any case, a simple re-annotation would not lead to such an outcome: the items would be appended to the accept list individually.

I'm also wondering whether there's something wrong with the LLM task responsible for parsing the LLM output, in case you were using a custom task implementation. If that were the case, you'd see options like "Nebenkostenvorauszahlung, Kaltmiete, Kaltmiete, Nebenkostenvorauszahlung" in the UI. Could you share the LLM config just in case?
Finally, if you could provide a minimal reproducible example that results in such output, that would be very helpful.

Hey @magdaaniol,

So first of all, I think the most important takeaway for me is that we've clarified that those combined entries on the .accept key are actually NOT correct, right?

So I guess I have to clean them up in the JSONL, reimport, and retrain.

What I find strange is that nowhere in my workflow (neither in Prodigy nor in the trained model) do I see these combined labels. The results always only contain the values without the combined labels. I guess that is because the model only considers the values in .options, right?

Regarding my workflow

I'm annotating examples from a database that I pull samples from. E.g. I'll create a JSONL file with examples for "Miete", which I select via a LIKE query from the database. I annotate them either with textcat.manual or with textcat.llm.correct.
What can happen is that, using a LIKE query, the exact same example comes from the database twice: in the first appearance I label it with "Miete". But in the next appearance, where I use a LIKE query for "Kaution", "Miete" is not among the labels, and I label it "Kaution" because it is actually both. Can that cause problems? I was hoping the result would be that "Miete" and "Kaution" both end up in the .accept array.

In all my annotating I never saw an option that looked like Nebenkostenvorauszahlung, Kaltmiete, Kaltmiete, Nebenkostenvorauszahlung. I only ever saw each option individually.

I wonder what this "broken" data does to the training process. So far training always worked, even though the model did not perform as well as I hoped. Does the model also only work on the .options, so that examples with comma-combined labels are effectively treated as "without a label"?

Regarding your questions: to my understanding, I re-annotated a previous dataset but also added examples. I don't know what a post-hoc aggregation is :slight_smile: ?! I did not use jq to fiddle around with the annotations. At least not that I remember :slight_smile:

I just checked my workflows each separately in order to reproduce the problem:

  1. prodigy textcat.manual reproduce_multiple_labels assets/bank_turnovers_miete.jsonl --label "Kaltmiete,Mietminderung,Nebenkostenvorauszahlung,Privat" I used this to annotate just a single example with two labels selected. I saved the one annotation and exited Prodigy. Then I ran prodigy db-out reproduce_multiple_labels > reproduce_multiple_labels.jsonl, and indeed both annotations got added to .accept separately.
  2. prodigy textcat.llm.correct reproduce_multiple_labels configs/spacy_textcat_llm_config.cfg assets/bank_turnovers_miete.jsonl I tried this and did the exact same thing. The result was the same: both selections got added to the .accept key.

For clarification this is the content of my spacy_textcat_llm_config.cfg:

[nlp]
lang = "de"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"
save_io = true

[components.llm.task]
@llm_tasks = "spacy.TextCat.v3"
labels = ["Nebenkostenvorauszahlung","Kaltmiete", "Mietminderung","Privat"]
exclusive_classes = false

[components.llm.task.label_definitions]
"Kaltmiete" = "Abbuchungen, die eindeutig Mietzahlungen sind. z.B. für Wohnungen, Garagen oder Stellplätze. Warmieten enthalten auch die Kaltmiete. Auch Nachzahlungen für Mieterhöhungen sollte hier eingeordnet werden."
"Mietminderung" = "Abbuchungen, die zwar eine Kaltmiete darstellen, aber aufgrund einer Mietminderung nicht vollständig sind."
"Nebenkostenvorauszahlung" = "Abbuchungen, die eindeutig sagen, dass sie auch Nebenkosten enthalten. z.B. für Strom, Wasser, Heizung, Warmmiete anderweitige Hinweise im Text."
"Privat" = "Abbuchungen, die eindeutig privater Natur sind."

[components.llm.model]
@llm_models = "spacy.GPT-4.v3"
name = "gpt-4o"
config = {"temperature": 0.3}

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 10
max_batches_in_mem = 10

So far I have not found a reproducible example. I can just say that of the 5596 examples in my annotations, only 874 contain multiple entries in their .accept array, whereas 597 contain an .accept array with a single entry that contains a comma.

What I'll try now: use jq to transform my dataset and split every single-entry .accept array whose entry contains a comma into separate entries.

So from

{
  "accept": ["Nebenkostenvorauszahlung, Kaltmiete, Kaltmiete, Nebenkostenvorauszahlung"]
}

I'll try to get to

{
  "accept": ["Nebenkostenvorauszahlung", "Kaltmiete"]
}

and reimport that to the prodigy-db before continuing.

OK, I did that like this:

jq -c 'if .accept != null and (.accept | type == "array") and (.accept | length == 1) and (.accept[0] | contains(","))
       then .accept = (.accept[0] | split(", ") | unique)
       else .
       end
       | .text = (.text | split("\n") | .[1:3] | join("\n"))' \
  assets/bank_turnovers_categorization_annotated.jsonl > assets/bank_turnovers_categorization_annotated_cleaned.jsonl

Heads-up: I also removed the first paragraph of the .text, because it made my examples simpler.

Then I imported the data again:
prodigy db-in bank_turnovers_categorization_cleaned < bank_turnovers_categorization_annotated_cleaned.jsonl

After that I retrained the model:

prodigy train models --textcat-multilabel bank_turnovers_categorization_cleaned --lang "de"

It turns out the model is much clearer in its results now. So this really helped.

I'm still wondering what the training process did with my former broken examples (where all .accept values were concatenated with a ",").

Also: how did it happen in the first place? I'll be on the lookout for it happening again. Sadly, I can only notice this by exporting and inspecting my annotations. Or can I see that in the SQLite DB somehow?

Hi @toadle,

so first of all I think the most important info for me is that we clarified that those combined entries on the .accept key are actually NOT correct, right?

That's right. Each string in the "accept" list is considered a label assigned to the text. So those combined entries are definitely wrong, and you did the right thing by cleaning them up, splitting on the comma. As long as no option ID itself contains a comma, that should be sufficient. It would also be good to check the resulting lists against the IDs in the options, as there should be a match: the items in the accept list are supposed to be IDs from the options.
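A quick way to run that check on each exported example (just a sketch comparing the two fields, not a Prodigy API):

```python
def invalid_accepts(example):
    """Return accept entries that don't match any option id.
    A non-empty result flags a malformed label, e.g. a comma-joined string."""
    option_ids = {opt["id"] for opt in example.get("options", [])}
    return [label for label in example.get("accept", []) if label not in option_ids]
```

Running this over the cleaned JSONL should return an empty list for every example.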
Now, to explain how the corpus is read and how Prodigy sets the annotations before training: when reading the corpus, Prodigy iterates over the option IDs and checks whether they are in the accept list. If they are, that category is set as a label for the example (or as one of the labels, in the case of multilabel textcat). So, in your first attempt with the malformed dataset, these "combined" labels were ignored because they didn't correspond to any of the categories defined for the example in the options.
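Schematically, the label assignment per example works like this (a simplification for illustration, not Prodigy's actual code):

```python
def labels_for_training(example):
    """Mark each option id as positive if it appears in accept, else negative.
    Accept entries that aren't option ids (such as comma-joined strings)
    never match anything, so they are silently dropped."""
    accepted = set(example.get("accept", []))
    return {opt["id"]: opt["id"] in accepted for opt in example.get("options", [])}
```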
So you're right there:

I guess that is because the model only considers the values in .options, right?

Re workflow
Thanks for the detailed description. I can't really see where the combined examples could come from.

What can happen is that using a LIKE-Query the exact same example comes from the database twice: In the first appearance I label it with "Miete". But in the next appearance, where I use a LIKE-query for "Kaution", "Miete" is not part of the labels and I label it "Kaution" because it is actually both. Can that cause problems?

You can definitely add new annotations to an already annotated dataset, and by default they should just be appended, so that should not be a problem. If you stored the "Kaution" and "Miete" annotations in the same dataset and the input file contained the same examples, you must have excluded by task_hash or computed a custom input hash. Otherwise, by default, Prodigy would ignore inputs with exactly the same "text" field (and no other differences with respect to the input fields: "image", "html", "input").
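To illustrate the idea (a simplified sketch, not Prodigy's actual hashing implementation): the input hash is computed only from input fields, so two pulls of the same turnover text collide by default, even when the options shown differ:

```python
import hashlib
import json

# Input fields, per the defaults described above (an assumption for this sketch).
INPUT_KEYS = ("text", "image", "html", "input")

def simple_input_hash(task):
    """Hash only the input fields: tasks with identical text collide,
    which is why re-pulled duplicates get filtered out by default."""
    payload = {key: task[key] for key in INPUT_KEYS if key in task}
    return hashlib.md5(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()
```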

By post-hoc aggregation, I just meant any dataset wrangling after the annotation was done i.e. outside Prodigy. If you don't recall anything relevant and you can't really reproduce the "combined" labels then it will be hard for me to uncover the cause of it. The llm config looks totally fine. My guess is there's some step or condition somewhere that we are missing. Let's be on the lookout for it.

Or can I see that in the SQLite db somehow?

Yes, sure, you can check the Example table, where all annotations are stored. It has the following columns: id, input_hash, content and task_hash. The content is a JSON object, just like the ones you're parsing with jq.
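A scan like this could then surface the malformed rows directly in the database (a sketch; the table and column names follow the description above, so verify them against your own prodigy.db first):

```python
import json
import sqlite3

def combined_label_row_ids(db_path):
    """Return ids of rows in the example table whose stored content
    has a comma inside one of its "accept" entries."""
    conn = sqlite3.connect(db_path)
    try:
        hits = []
        for row_id, content in conn.execute("SELECT id, content FROM example"):
            task = json.loads(content)
            if any("," in label for label in task.get("accept", [])):
                hits.append(row_id)
        return hits
    finally:
        conn.close()
```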

You could also add the validate_answer callback to the recipe to check every batch of examples for "combined" labels and bring up a pop up alert if Prodigy is trying to save an invalid accept list.
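A minimal sketch of such a callback (returned from a custom recipe under the "validate_answer" key; the exact wiring depends on your recipe):

```python
def validate_answer(eg):
    """Reject answers whose accept entries look like comma-joined labels.
    Prodigy surfaces the ValueError message as an alert in the UI."""
    combined = [label for label in eg.get("accept", []) if "," in label]
    if combined:
        raise ValueError(f"Combined labels in accept: {combined}")
```

If any of your legitimate option IDs ever contain a comma, compare against the example's options instead of checking for commas.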

In any case, I hope the two main points are clear: 1) how Prodigy handled the combined labels, and 2) how to fix the current dataset.

And finally, I would only remove parts of the text field if you're 100% sure they were irrelevant for the annotation. In general, the model should see exactly what the annotator saw, to make sure we're not missing any relevant features.