Can different data formats be combined for training?

geo-terms.jsonl (1.8 MB)
Part of the data is in the standard BERT training format, and the other part was pre-annotated with a spaCy model and then had its labels corrected. Is it feasible to merge the two datasets with db-merge and then train on the result? The file above is the data.

Hi there!

It might help to have a bit more background to answer your question. I've had a quick look at the dataset and it seems to have examples like this:

{"text":"\u91d1\u77ff\u662f\u5f53\u524d\u4e16\u754c\u5404\u56fd\u91cd\u89c6\u7684\u627e\u77ff\u5bf9\u8c61\uff0c\u5728\u6211\u56fd\u662f\u6025\u9700\u7684\u77ed\u7f3a\u77ff\u79cd\u3002","tokens":[{"text":"[CLS]","id":0,"start":0,"end":0,"tokenizer_id":101,"disabled":true,"ws":true},{"text":"\u91d1","id":1,"start":0,"end":1,"tokenizer_id":7032,"disabled":false,"ws":false},{"text":"\u77ff","id":2,"start":1,"end":2,"tokenizer_id":4771,"disabled":false,"ws":false},{"text":"\u662f","id":3,"start":2,"end":3,"tokenizer_id":3221,"disabled":false,"ws":false},{"text":"\u5f53","id":4,"start":3,"end":4,"tokenizer_id":2496,"disabled":false,"ws":false},{"text":"\u524d","id":5,"start":4,"end":5,"tokenizer_id":1184,"disabled":false,"ws":false},{"text":"\u4e16","id":6,"start":5,"end":6,"tokenizer_id":686,"disabled":false,"ws":false},{"text":"\u754c","id":7,"start":6,"end":7,"tokenizer_id":4518,"disabled":false,"ws":false},{"text":"\u5404","id":8,"start":7,"end":8,"tokenizer_id":1392,"disabled":false,"ws":false},{"text":"\u56fd","id":9,"start":8,"end":9,"tokenizer_id":1744,"disabled":false,"ws":false},{"text":"\u91cd","id":10,"start":9,"end":10,"tokenizer_id":7028,"disabled":false,"ws":false},{"text":"\u89c6","id":11,"start":10,"end":11,"tokenizer_id":6228,"disabled":false,"ws":false},{"text":"\u7684","id":12,"start":11,"end":12,"tokenizer_id":4638,"disabled":false,"ws":false},{"text":"\u627e","id":13,"start":12,"end":13,"tokenizer_id":2823,"disabled":false,"ws":false},{"text":"\u77ff","id":14,"start":13,"end":14,"tokenizer_id":4771,"disabled":false,"ws":false},{"text":"\u5bf9","id":15,"start":14,"end":15,"tokenizer_id":2190,"disabled":false,"ws":false},{"text":"\u8c61","id":16,"start":15,"end":16,"tokenizer_id":6496,"disabled":false,"ws":false},{"text":"\uff0c","id":17,"start":16,"end":17,"tokenizer_id":8024,"disabled":false,"ws":false},{"text":"\u5728","id":18,"start":17,"end":18,"tokenizer_id":1762,"disabled":false,"ws":false},{"text":"\u6211","id":19,"start":18,"end":19,"tokenizer_id":2769,"disabled":false,"ws":false},{"text":"\u56fd","id":20,"start":19,"end":20,"tokenizer_id":1744,"disabled":false,"ws":false},{"text":"\u662f","id":21,"start":20,"end":21,"tokenizer_id":3221,"disabled":false,"ws":false},{"text":"\u6025","id":22,"start":21,"end":22,"tokenizer_id":2593,"disabled":false,"ws":false},{"text":"\u9700","id":23,"start":22,"end":23,"tokenizer_id":7444,"disabled":false,"ws":false},{"text":"\u7684","id":24,"start":23,"end":24,"tokenizer_id":4638,"disabled":false,"ws":false},{"text":"\u77ed","id":25,"start":24,"end":25,"tokenizer_id":4764,"disabled":false,"ws":false},{"text":"\u7f3a","id":26,"start":25,"end":26,"tokenizer_id":5375,"disabled":false,"ws":false},{"text":"\u77ff","id":27,"start":26,"end":27,"tokenizer_id":4771,"disabled":false,"ws":false},{"text":"\u79cd","id":28,"start":27,"end":28,"tokenizer_id":4905,"disabled":false,"ws":false},{"text":"\u3002","id":29,"start":28,"end":29,"tokenizer_id":511,"disabled":false,"ws":true},{"text":"[SEP]","id":30,"start":0,"end":0,"tokenizer_id":102,"disabled":true,"ws":true}],"_input_hash":-1469823552,"_task_hash":-1482146336,"_view_id":"ner_manual","spans":[{"start":0,"end":2,"token_start":1,"token_end":2,"label":"\u77ff\u5e8a"}],"answer":"accept","_timestamp":1692634062}

So it looks like geo-terms.jsonl is a dataset that you've extracted via db-out? I'm not sure what recipe was used, but I'm getting the impression that you've used a NER interface with some sort of a custom BERT tokeniser under the hood? Could you share which recipe you've used?

You can use multiple annotated datasets to train a single model, but you'll want to make sure that both datasets used the same tokenizer.
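As a quick sanity check before merging, you could compare the token boundaries in the two exports. This is only a sketch, assuming both datasets were exported with db-out; the file names are placeholders:

import json

def token_texts(path):
    # Collect the token texts per example from a db-out JSONL export
    out = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            out[eg["text"]] = [t["text"] for t in eg.get("tokens", [])]
    return out

# Placeholder file names for the two exported datasets
bert_export = token_texts("geo-terms.jsonl")
spacy_export = token_texts("spacy-corrected.jsonl")

# For texts that appear in both exports, the tokenization should match
for text in bert_export.keys() & spacy_export.keys():
    if bert_export[text] != spacy_export[text]:
        print("Tokenization mismatch for:", text)

If the tokenization does line up, merging with db-merge (e.g. prodigy db-merge dataset_a,dataset_b merged_dataset, with placeholder dataset names) and training on the combined set should be fine.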

Side note

You might appreciate this thread where I help a user train a custom spaCy BERT model via Prodigy: