I am learning to use NER in spaCy (with Prodigy) and I have a query: what is the recommended text length for training NER models in spaCy? For example, I am working with texts of 425 words on average; is this adequate?
As far as I'm aware there's no recommended text length when it comes to predictive accuracy. It may depend on the application that you're interested in, but I'm not aware of a "general rule".
That said, what I can imagine is that it's much easier to annotate texts that are shorter; maybe one or a few sentences. The main thing you typically want to prevent is that folks need to do a lot of scrolling in order to annotate. This takes up a lot of time and also interrupts the labelling flow.
By only showing a few sentences in the labelling interface you might get more/higher quality labels, which in turn will have a big impact on your NER model. So it could make sense to split the longer texts into shorter chunks during labelling.
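One way to do this splitting is to group pre-split sentences into shorter chunks before loading them into the labelling interface. A minimal sketch, assuming the sentence splitting itself is handled elsewhere (for example by spaCy's sentencizer) and the sentences arrive as a list of strings:

```python
def chunk_sentences(sentences, max_words=100):
    """Greedily pack sentences into chunks of at most `max_words` words."""
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = ["First sentence here.", "Second one.", "A third, longer sentence follows."]
print(chunk_sentences(sentences, max_words=6))
```

Sentences are never split in the middle, so each chunk remains a coherent unit to annotate; the `max_words` threshold is just a knob to tune against your annotators' scrolling tolerance.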
If you can share more about the application I might be able to give more advice.
I happen to be collaborating with @agustinadinamarca for a couple of weeks, and regarding the same question she posted originally, I would add the following queries:
Other than "being easier to annotate", is there any other advantage in using shorter texts over longer ones?
1.1. P.S.: I know the rule of thumb here is "try and see", but in your own experience, have you noticed any trend?
1.2. In my own experience, I have trained some spaCy NER models, but I have only worked with rather small texts (<=25 words), and they tend to be "highly accurate on the first trials, but prone to catastrophic forgetting"; however, nothing like what we have right now. You might be thinking "well, just trim the texts", but point 2 below provides more details about this; I appreciate your patience.
1.3. Prodigy provides some tools to make experimentation easier, for instance train-curve. By using it, we have preliminarily concluded that, in our case and with the current text length (~425 words), "we seem to need WAY more samples" (i.e., find here a quick, actual test result). Regarding "how many more samples do we need", we will tag ~4500 more texts and see if that does indeed improve the model metrics (we are a small team, so even though that batch may seem small, it still involves a considerable time budget for us); however, we know many more samples could be needed.
We are also hesitant about reducing the "text word count" because, for our use case, the labeled entities are rather scarce on average (for reference, I'd say there are <=10 entities in those 425 words mentioned above), and by reducing the text word count we would expose ourselves to having lots of rejected texts during labeling, as there will be chunks without any entity.
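One way around the empty-chunk worry is to filter chunks up front: if the longer texts are split into character-offset windows, any window that overlaps no known entity span (from existing annotations, or from rule-based matcher hits) can be skipped before it ever reaches an annotator. A minimal sketch with illustrative offsets:

```python
def chunks_with_entities(chunks, spans):
    """Keep only (start, end) character chunks that overlap at least one entity span."""
    def overlaps(chunk, span):
        # Half-open interval overlap test on character offsets.
        return chunk[0] < span["end"] and span["start"] < chunk[1]
    return [c for c in chunks if any(overlaps(c, s) for s in spans)]

# Example offsets, loosely modelled on the sample annotation below:
spans = [{"start": 482, "end": 501}, {"start": 801, "end": 830}]
chunks = [(0, 400), (400, 900), (900, 1600)]
print(chunks_with_entities(chunks, spans))  # [(400, 900)]
```

Note this trades annotation effort for a biased sample (the model never sees confirmed-empty chunks), so it may be worth keeping a fraction of the empty chunks as negative examples.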
I hope that after these clarifications you can help us with some newer/further recommendations; they will be very much appreciated.
The most general principle I like to use in ML is that whatever I'm modelling should be applied in a use-case and that the use-case dictates what tricks to apply (not the other way around!). If you have a use-case that, for example, is about Reddit comments then it can be fine that the examples are short because many comments on Reddit are short. No need to worry about making them longer!
If you could share anything about the type of NER you're trying to tackle (insult detection, name detection, etc.) then I might be able to give more precise advice based on my own experience. Descriptions like "small texts" can mean a lot of things, and the challenges are usually more tightly coupled with a domain. In the case of internet data, I might worry about special characters and spelling a whole lot more compared to legal documents.
Not sure if what I wrote made you think otherwise, but just to confirm: Yes, we agree with you.
If you could share anything about the type of NER you're trying to tackle (insult detection, name detection, etc) then I might be able to give more precise advice based on my own experience.
Gladly. You will find a single, already-labeled sample below this paragraph. What we are aiming for is identifying specific types of professional skills through Named Entity Recognition (HARDSKILL, SOFTSKILL, SOFTWARE, SPECTOOL) in job-posting-related texts. You may notice that some special characters are kept (which we know is unusual in NLP-related tasks); however, specifically for SOFTWARE entities we might need the special characters (e.g., C++, C#), and for that reason we are keeping them. We can remove them easily, though.
{"text":"overview: our interactive marketing services initiate the introduction of new products and services to a myriad of customers all over the globe. through various sales channels, our attentive team members maintain a sensitivity to the trepidation that naturally arises when customers are faced with a novel idea. our masterful execution and consistency in customer communications has retained and expanded our client base exponentially in recent years. we have an opportunity for an account coordinator to train with our intuitive team and represent our revered roster of clients. in this role: you will be the customer\u2019s first impression of our clients you\u2019ll develop a comprehensive knowledge of our clients you\u2019ll meet people from all walks of life and find common ground you will break the ice and establish quick relationships you\u2019ll implement consultative strategies to maximize profit you\u2019ll seamlessly complete sales while answering customer inquiries you\u2019ll need: gregarious personality with excellent people skills preference for face to face interactions and variety insatiable work ethic with ambitious goals time management skills and ability to prioritize collaborative skills with a team oriented mentality resilience and tenacity when faced with challenges active listening skills, empathy, and compassion we\u2019ll give you: individualized, paid training with direct support simulations, field practice, and classroom education animated team members with weekly team activities incentives for growth, travel, and bonuses rewards and recognition for top performance salary: $36,500.00 to $54,600.00 
/year","_input_hash":-776134784,"_task_hash":-2011797352,"_is_binary":false,"tokens":[{"text":"overview","start":0,"end":8,"id":0,"ws":false},{"text":":","start":8,"end":9,"id":1,"ws":true},{"text":"our","start":10,"end":13,"id":2,"ws":true},{"text":"interactive","start":14,"end":25,"id":3,"ws":true},{"text":"marketing","start":26,"end":35,"id":4,"ws":true},{"text":"services","start":36,"end":44,"id":5,"ws":true},{"text":"initiate","start":45,"end":53,"id":6,"ws":true},{"text":"the","start":54,"end":57,"id":7,"ws":true},{"text":"introduction","start":58,"end":70,"id":8,"ws":true},{"text":"of","start":71,"end":73,"id":9,"ws":true},{"text":"new","start":74,"end":77,"id":10,"ws":true},{"text":"products","start":78,"end":86,"id":11,"ws":true},{"text":"and","start":87,"end":90,"id":12,"ws":true},{"text":"services","start":91,"end":99,"id":13,"ws":true},{"text":"to","start":100,"end":102,"id":14,"ws":true},{"text":"a","start":103,"end":104,"id":15,"ws":true},{"text":"myriad","start":105,"end":111,"id":16,"ws":true},{"text":"of","start":112,"end":114,"id":17,"ws":true},{"text":"customers","start":115,"end":124,"id":18,"ws":true},{"text":"all","start":125,"end":128,"id":19,"ws":true},{"text":"over","start":129,"end":133,"id":20,"ws":true},{"text":"the","start":134,"end":137,"id":21,"ws":true},{"text":"globe","start":138,"end":143,"id":22,"ws":false},{"text":".","start":143,"end":144,"id":23,"ws":true},{"text":"through","start":145,"end":152,"id":24,"ws":true},{"text":"various","start":153,"end":160,"id":25,"ws":true},{"text":"sales","start":161,"end":166,"id":26,"ws":true},{"text":"channels","start":167,"end":175,"id":27,"ws":false},{"text":",","start":175,"end":176,"id":28,"ws":true},{"text":"our","start":177,"end":180,"id":29,"ws":true},{"text":"attentive","start":181,"end":190,"id":30,"ws":true},{"text":"team","start":191,"end":195,"id":31,"ws":true},{"text":"members","start":196,"end":203,"id":32,"ws":true},{"text":"maintain","start":204,"end":212,"id":33,"ws":true},{"t
ext":"a","start":213,"end":214,"id":34,"ws":true},{"text":"sensitivity","start":215,"end":226,"id":35,"ws":true},{"text":"to","start":227,"end":229,"id":36,"ws":true},{"text":"the","start":230,"end":233,"id":37,"ws":true},{"text":"trepidation","start":234,"end":245,"id":38,"ws":true},{"text":"that","start":246,"end":250,"id":39,"ws":true},{"text":"naturally","start":251,"end":260,"id":40,"ws":true},{"text":"arises","start":261,"end":267,"id":41,"ws":true},{"text":"when","start":268,"end":272,"id":42,"ws":true},{"text":"customers","start":273,"end":282,"id":43,"ws":true},{"text":"are","start":283,"end":286,"id":44,"ws":true},{"text":"faced","start":287,"end":292,"id":45,"ws":true},{"text":"with","start":293,"end":297,"id":46,"ws":true},{"text":"a","start":298,"end":299,"id":47,"ws":true},{"text":"novel","start":300,"end":305,"id":48,"ws":true},{"text":"idea","start":306,"end":310,"id":49,"ws":false},{"text":".","start":310,"end":311,"id":50,"ws":true},{"text":"our","start":312,"end":315,"id":51,"ws":true},{"text":"masterful","start":316,"end":325,"id":52,"ws":true},{"text":"execution","start":326,"end":335,"id":53,"ws":true},{"text":"and","start":336,"end":339,"id":54,"ws":true},{"text":"consistency","start":340,"end":351,"id":55,"ws":true},{"text":"in","start":352,"end":354,"id":56,"ws":true},{"text":"customer","start":355,"end":363,"id":57,"ws":true},{"text":"communications","start":364,"end":378,"id":58,"ws":true},{"text":"has","start":379,"end":382,"id":59,"ws":true},{"text":"retained","start":383,"end":391,"id":60,"ws":true},{"text":"and","start":392,"end":395,"id":61,"ws":true},{"text":"expanded","start":396,"end":404,"id":62,"ws":true},{"text":"our","start":405,"end":408,"id":63,"ws":true},{"text":"client","start":409,"end":415,"id":64,"ws":true},{"text":"base","start":416,"end":420,"id":65,"ws":true},{"text":"exponentially","start":421,"end":434,"id":66,"ws":true},{"text":"in","start":435,"end":437,"id":67,"ws":true},{"text":"recent","start":438,"end":444,"id
":68,"ws":true},{"text":"years","start":445,"end":450,"id":69,"ws":false},{"text":".","start":450,"end":451,"id":70,"ws":true},{"text":"we","start":452,"end":454,"id":71,"ws":true},{"text":"have","start":455,"end":459,"id":72,"ws":true},{"text":"an","start":460,"end":462,"id":73,"ws":true},{"text":"opportunity","start":463,"end":474,"id":74,"ws":true},{"text":"for","start":475,"end":478,"id":75,"ws":true},{"text":"an","start":479,"end":481,"id":76,"ws":true},{"text":"account","start":482,"end":489,"id":77,"ws":true},{"text":"coordinator","start":490,"end":501,"id":78,"ws":true},{"text":"to","start":502,"end":504,"id":79,"ws":true},{"text":"train","start":505,"end":510,"id":80,"ws":true},{"text":"with","start":511,"end":515,"id":81,"ws":true},{"text":"our","start":516,"end":519,"id":82,"ws":true},{"text":"intuitive","start":520,"end":529,"id":83,"ws":true},{"text":"team","start":530,"end":534,"id":84,"ws":true},{"text":"and","start":535,"end":538,"id":85,"ws":true},{"text":"represent","start":539,"end":548,"id":86,"ws":true},{"text":"our","start":549,"end":552,"id":87,"ws":true},{"text":"revered","start":553,"end":560,"id":88,"ws":true},{"text":"roster","start":561,"end":567,"id":89,"ws":true},{"text":"of","start":568,"end":570,"id":90,"ws":true},{"text":"clients","start":571,"end":578,"id":91,"ws":false},{"text":".","start":578,"end":579,"id":92,"ws":true},{"text":"in","start":580,"end":582,"id":93,"ws":true},{"text":"this","start":583,"end":587,"id":94,"ws":true},{"text":"role","start":588,"end":592,"id":95,"ws":false},{"text":":","start":592,"end":593,"id":96,"ws":true},{"text":"you","start":594,"end":597,"id":97,"ws":true},{"text":"will","start":598,"end":602,"id":98,"ws":true},{"text":"be","start":603,"end":605,"id":99,"ws":true},{"text":"the","start":606,"end":609,"id":100,"ws":true},{"text":"customer","start":610,"end":618,"id":101,"ws":false},{"text":"\u2019s","start":618,"end":620,"id":102,"ws":true},{"text":"first","start":621,"end":626,"id":103,"ws":true},
{"text":"impression","start":627,"end":637,"id":104,"ws":true},{"text":"of","start":638,"end":640,"id":105,"ws":true},{"text":"our","start":641,"end":644,"id":106,"ws":true},{"text":"clients","start":645,"end":652,"id":107,"ws":true},{"text":"you","start":653,"end":656,"id":108,"ws":false},{"text":"\u2019ll","start":656,"end":659,"id":109,"ws":true},{"text":"develop","start":660,"end":667,"id":110,"ws":true},{"text":"a","start":668,"end":669,"id":111,"ws":true},{"text":"comprehensive","start":670,"end":683,"id":112,"ws":true},{"text":"knowledge","start":684,"end":693,"id":113,"ws":true},{"text":"of","start":694,"end":696,"id":114,"ws":true},{"text":"our","start":697,"end":700,"id":115,"ws":true},{"text":"clients","start":701,"end":708,"id":116,"ws":true},{"text":"you","start":709,"end":712,"id":117,"ws":false},{"text":"\u2019ll","start":712,"end":715,"id":118,"ws":true},{"text":"meet","start":716,"end":720,"id":119,"ws":true},{"text":"people","start":721,"end":727,"id":120,"ws":true},{"text":"from","start":728,"end":732,"id":121,"ws":true},{"text":"all","start":733,"end":736,"id":122,"ws":true},{"text":"walks","start":737,"end":742,"id":123,"ws":true},{"text":"of","start":743,"end":745,"id":124,"ws":true},{"text":"life","start":746,"end":750,"id":125,"ws":true},{"text":"and","start":751,"end":754,"id":126,"ws":true},{"text":"find","start":755,"end":759,"id":127,"ws":true},{"text":"common","start":760,"end":766,"id":128,"ws":true},{"text":"ground","start":767,"end":773,"id":129,"ws":true},{"text":"you","start":774,"end":777,"id":130,"ws":true},{"text":"will","start":778,"end":782,"id":131,"ws":true},{"text":"break","start":783,"end":788,"id":132,"ws":true},{"text":"the","start":789,"end":792,"id":133,"ws":true},{"text":"ice","start":793,"end":796,"id":134,"ws":true},{"text":"and","start":797,"end":800,"id":135,"ws":true},{"text":"establish","start":801,"end":810,"id":136,"ws":true},{"text":"quick","start":811,"end":816,"id":137,"ws":true},{"text":"relationships","sta
rt":817,"end":830,"id":138,"ws":true},{"text":"you","start":831,"end":834,"id":139,"ws":false},{"text":"\u2019ll","start":834,"end":837,"id":140,"ws":true},{"text":"implement","start":838,"end":847,"id":141,"ws":true},{"text":"consultative","start":848,"end":860,"id":142,"ws":true},{"text":"strategies","start":861,"end":871,"id":143,"ws":true},{"text":"to","start":872,"end":874,"id":144,"ws":true},{"text":"maximize","start":875,"end":883,"id":145,"ws":true},{"text":"profit","start":884,"end":890,"id":146,"ws":true},{"text":"you","start":891,"end":894,"id":147,"ws":false},{"text":"\u2019ll","start":894,"end":897,"id":148,"ws":true},{"text":"seamlessly","start":898,"end":908,"id":149,"ws":true},{"text":"complete","start":909,"end":917,"id":150,"ws":true},{"text":"sales","start":918,"end":923,"id":151,"ws":true},{"text":"while","start":924,"end":929,"id":152,"ws":true},{"text":"answering","start":930,"end":939,"id":153,"ws":true},{"text":"customer","start":940,"end":948,"id":154,"ws":true},{"text":"inquiries","start":949,"end":958,"id":155,"ws":true},{"text":"you","start":959,"end":962,"id":156,"ws":false},{"text":"\u2019ll","start":962,"end":965,"id":157,"ws":true},{"text":"need","start":966,"end":970,"id":158,"ws":false},{"text":":","start":970,"end":971,"id":159,"ws":true},{"text":"gregarious","start":972,"end":982,"id":160,"ws":true},{"text":"personality","start":983,"end":994,"id":161,"ws":true},{"text":"with","start":995,"end":999,"id":162,"ws":true},{"text":"excellent","start":1000,"end":1009,"id":163,"ws":true},{"text":"people","start":1010,"end":1016,"id":164,"ws":true},{"text":"skills","start":1017,"end":1023,"id":165,"ws":true},{"text":"preference","start":1024,"end":1034,"id":166,"ws":true},{"text":"for","start":1035,"end":1038,"id":167,"ws":true},{"text":"face","start":1039,"end":1043,"id":168,"ws":true},{"text":"to","start":1044,"end":1046,"id":169,"ws":true},{"text":"face","start":1047,"end":1051,"id":170,"ws":true},{"text":"interactions","start":1052,"e
nd":1064,"id":171,"ws":true},{"text":"and","start":1065,"end":1068,"id":172,"ws":true},{"text":"variety","start":1069,"end":1076,"id":173,"ws":true},{"text":"insatiable","start":1077,"end":1087,"id":174,"ws":true},{"text":"work","start":1088,"end":1092,"id":175,"ws":true},{"text":"ethic","start":1093,"end":1098,"id":176,"ws":true},{"text":"with","start":1099,"end":1103,"id":177,"ws":true},{"text":"ambitious","start":1104,"end":1113,"id":178,"ws":true},{"text":"goals","start":1114,"end":1119,"id":179,"ws":true},{"text":"time","start":1120,"end":1124,"id":180,"ws":true},{"text":"management","start":1125,"end":1135,"id":181,"ws":true},{"text":"skills","start":1136,"end":1142,"id":182,"ws":true},{"text":"and","start":1143,"end":1146,"id":183,"ws":true},{"text":"ability","start":1147,"end":1154,"id":184,"ws":true},{"text":"to","start":1155,"end":1157,"id":185,"ws":true},{"text":"prioritize","start":1158,"end":1168,"id":186,"ws":true},{"text":"collaborative","start":1169,"end":1182,"id":187,"ws":true},{"text":"skills","start":1183,"end":1189,"id":188,"ws":true},{"text":"with","start":1190,"end":1194,"id":189,"ws":true},{"text":"a","start":1195,"end":1196,"id":190,"ws":true},{"text":"team","start":1197,"end":1201,"id":191,"ws":true},{"text":"oriented","start":1202,"end":1210,"id":192,"ws":true},{"text":"mentality","start":1211,"end":1220,"id":193,"ws":true},{"text":"resilience","start":1221,"end":1231,"id":194,"ws":true},{"text":"and","start":1232,"end":1235,"id":195,"ws":true},{"text":"tenacity","start":1236,"end":1244,"id":196,"ws":true},{"text":"when","start":1245,"end":1249,"id":197,"ws":true},{"text":"faced","start":1250,"end":1255,"id":198,"ws":true},{"text":"with","start":1256,"end":1260,"id":199,"ws":true},{"text":"challenges","start":1261,"end":1271,"id":200,"ws":true},{"text":"active","start":1272,"end":1278,"id":201,"ws":true},{"text":"listening","start":1279,"end":1288,"id":202,"ws":true},{"text":"skills","start":1289,"end":1295,"id":203,"ws":false},{"text":","
,"start":1295,"end":1296,"id":204,"ws":true},{"text":"empathy","start":1297,"end":1304,"id":205,"ws":false},{"text":",","start":1304,"end":1305,"id":206,"ws":true},{"text":"and","start":1306,"end":1309,"id":207,"ws":true},{"text":"compassion","start":1310,"end":1320,"id":208,"ws":true},{"text":"we","start":1321,"end":1323,"id":209,"ws":false},{"text":"\u2019ll","start":1323,"end":1326,"id":210,"ws":true},{"text":"give","start":1327,"end":1331,"id":211,"ws":true},{"text":"you","start":1332,"end":1335,"id":212,"ws":false},{"text":":","start":1335,"end":1336,"id":213,"ws":true},{"text":"individualized","start":1337,"end":1351,"id":214,"ws":false},{"text":",","start":1351,"end":1352,"id":215,"ws":true},{"text":"paid","start":1353,"end":1357,"id":216,"ws":true},{"text":"training","start":1358,"end":1366,"id":217,"ws":true},{"text":"with","start":1367,"end":1371,"id":218,"ws":true},{"text":"direct","start":1372,"end":1378,"id":219,"ws":true},{"text":"support","start":1379,"end":1386,"id":220,"ws":true},{"text":"simulations","start":1387,"end":1398,"id":221,"ws":false},{"text":",","start":1398,"end":1399,"id":222,"ws":true},{"text":"field","start":1400,"end":1405,"id":223,"ws":true},{"text":"practice","start":1406,"end":1414,"id":224,"ws":false},{"text":",","start":1414,"end":1415,"id":225,"ws":true},{"text":"and","start":1416,"end":1419,"id":226,"ws":true},{"text":"classroom","start":1420,"end":1429,"id":227,"ws":true},{"text":"education","start":1430,"end":1439,"id":228,"ws":true},{"text":"animated","start":1440,"end":1448,"id":229,"ws":true},{"text":"team","start":1449,"end":1453,"id":230,"ws":true},{"text":"members","start":1454,"end":1461,"id":231,"ws":true},{"text":"with","start":1462,"end":1466,"id":232,"ws":true},{"text":"weekly","start":1467,"end":1473,"id":233,"ws":true},{"text":"team","start":1474,"end":1478,"id":234,"ws":true},{"text":"activities","start":1479,"end":1489,"id":235,"ws":true},{"text":"incentives","start":1490,"end":1500,"id":236,"ws":true},{"text
":"for","start":1501,"end":1504,"id":237,"ws":true},{"text":"growth","start":1505,"end":1511,"id":238,"ws":false},{"text":",","start":1511,"end":1512,"id":239,"ws":true},{"text":"travel","start":1513,"end":1519,"id":240,"ws":false},{"text":",","start":1519,"end":1520,"id":241,"ws":true},{"text":"and","start":1521,"end":1524,"id":242,"ws":true},{"text":"bonuses","start":1525,"end":1532,"id":243,"ws":true},{"text":"rewards","start":1533,"end":1540,"id":244,"ws":true},{"text":"and","start":1541,"end":1544,"id":245,"ws":true},{"text":"recognition","start":1545,"end":1556,"id":246,"ws":true},{"text":"for","start":1557,"end":1560,"id":247,"ws":true},{"text":"top","start":1561,"end":1564,"id":248,"ws":true},{"text":"performance","start":1565,"end":1576,"id":249,"ws":true},{"text":"salary","start":1577,"end":1583,"id":250,"ws":false},{"text":":","start":1583,"end":1584,"id":251,"ws":true},{"text":"$","start":1585,"end":1586,"id":252,"ws":false},{"text":"36,500.00","start":1586,"end":1595,"id":253,"ws":true},{"text":"to","start":1596,"end":1598,"id":254,"ws":true},{"text":"$","start":1599,"end":1600,"id":255,"ws":false},{"text":"54,600.00","start":1600,"end":1609,"id":256,"ws":true},{"text":"/year","start":1610,"end":1615,"id":257,"ws":false}],"_view_id":"ner_manual","spans":[{"start":482,"end":501,"token_start":77,"token_end":78,"label":"HARDSKILL"},{"start":801,"end":830,"token_start":136,"token_end":138,"label":"SOFTSKILL"},{"start":930,"end":958,"token_start":153,"token_end":155,"label":"SOFTSKILL"},{"start":972,"end":994,"token_start":160,"token_end":161,"label":"SOFTSKILL"},{"start":1039,"end":1064,"token_start":168,"token_end":171,"label":"SOFTSKILL"},{"start":1104,"end":1113,"token_start":178,"token_end":178,"label":"SOFTSKILL"},{"start":1169,"end":1182,"token_start":187,"token_end":187,"label":"SOFTSKILL"},{"start":1221,"end":1231,"token_start":194,"token_end":194,"label":"SOFTSKILL"},{"start":1236,"end":1244,"token_start":196,"token_end":196,"label":"SOFTSKILL"},{"
start":1272,"end":1288,"token_start":201,"token_end":202,"label":"SOFTSKILL"},{"start":1297,"end":1304,"token_start":205,"token_end":205,"label":"SOFTSKILL"},{"start":1310,"end":1320,"token_start":208,"token_end":208,"label":"SOFTSKILL"},{"start":1474,"end":1489,"token_start":234,"token_end":235,"label":"SOFTSKILL"}],"answer":"accept","_timestamp":1647615623}
I wonder if it's perhaps easier to detect skills in general first and worry about the type of skill in a post-processing step. It seems (but I could be wrong) that these skills are mutually exclusive, which suggests they could be easily mapped to a subclass. But that's a side-track.
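The two-step idea could be sketched as: detect generic SKILL spans first, then map each detected phrase to a subtype with a lookup table. The table below is purely illustrative (a real one would be built from your own corpus), and `classify_skill` is a hypothetical helper, not part of spaCy or Prodigy:

```python
# Illustrative subtype table; entries are assumptions, not real corpus data.
SKILL_TYPE = {
    "c++": "SOFTWARE",
    "active listening skills": "SOFTSKILL",
    "account coordinator": "HARDSKILL",
}

def classify_skill(phrase, default="SKILL"):
    """Map a detected skill phrase to a subtype, falling back to a default."""
    return SKILL_TYPE.get(phrase.lower().strip(), default)

print(classify_skill("Active listening skills"))  # SOFTSKILL
print(classify_skill("unknown phrase"))           # SKILL
```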
Without knowing the details of your task it does occur to me that if the task is to detect things in a job posting, then it makes sense to have the unit of work be the job post. That way, things like lists of items can be part of the same document, which I imagine might be a theme. Is there a reason why it's hard to work with a full job post as a unit?
If I am understanding correctly, are you suggesting to just "extract a single entity" (say, 'SKILL') and "then split the results based on some corpus of each category"? If so, we already have a project that does exactly that, with an accuracy of ~86%. However, the challenge with that approach is the "corpus building": it must be manually updated from time to time, and is therefore a highly time-consuming task. That's why we decided to experiment with another model "doing the heavy lifting for us". Feel free to include any correction if my understanding deviated from your suggestion.
Is there a reason why it's hard to work with a full job post as a unit?
Not at all; we have scraped hundreds of thousands of job postings (similar to the example provided), so the only bottleneck now (as mentioned in a previous post) is the manual labeling itself. As you saw in the results from the train-curve command, we currently need way more labeled samples (which, BTW, is what we are collecting right now).
I think I will return after some more labeled samples have been collected, but if you have another comment or idea, we are all ears.
Can you provide more explanation on this bottleneck?
Does your training data originate from one source (eg 1 website, 1 system of record) or does it come from heterogeneous (different) data sources which could change in the future?
I suspect it is the second and that’s what you mean by “corpus building” for each data source. But I suspect this would still be an issue even for predicting type of skill too.
I think it would add nothing in this case, as we are developing a new model from scratch, so "digging into the details of a soon-to-be-obsolete model" seems a bit useless to me. Again, we are designing a NER model where several labels must be identified, and the particularity of the input texts is that they are long (i.e., ~3k characters or ~425 words each).
For the "brand new model" we are designing right now (again, let's focus on this one), all the texts come from the same source (job postings scraped from LinkedIn and other similar websites), and therefore they are similar to what the model will eventually deal with.
Hello @koaning , hope this message finds you well.
After labeling some more texts, we have run some tests, and some conclusions can be extracted from those:
The benchmark obtained via train-curve so far is slightly above 0.55.
Past ~2000 labeled texts, the model "stops learning": adding more texts does not (seem to) improve performance.
Questions:
I see that train-curve has a --base-model parameter, which defaults to None. If the command run was prodigy train-curve --ner station1_job1, would the pipeline used by train-curve contain ['tok2vec', 'ner'] only, or what pipeline is used? (This question is important to make sure that the whole "job posting" text is processed, and there is no sentencizing or similar.)
Provided the "job posting" text length is kept, and knowing that increasing the amount of labeled texts does not seem a promising approach for obtaining better model performance (at least by the results obtained via train-curve), what other hints could improve the model? (BTW, I am basically following the hints provided in Ines Montani's video; however, in that video, other than adding new labeled samples, I cannot see any "hyperparameter tuning" or anything similar that I could play around with.)
Speaking of "hyperparameter tuning", would there be any benefit in training the model using spaCy directly? (At least by what is shown here, "using Prodigy" is apparently considered a "different model training method" than "using spaCy", but other than the syntax I don't understand why they would be considered different.)
I think we can start working on these 3 questions, to see what else could be done.
When you don't specify a --base-model then a blank model from the associated language (via --lang) will be used. These blank models do come with their own tokeniser out of the box (sentence segmentation is only available if you add a component like the sentencizer).
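A quick way to verify what a blank pipeline contains, assuming a local spaCy install:

```python
import spacy

# A blank English pipeline lists no trained components; tokenisation is
# built into the pipeline object itself rather than appearing as a pipe.
nlp = spacy.blank("en")
print(nlp.pipe_names)  # []

doc = nlp("C++ and C# are programming languages.")
print([t.text for t in doc])
```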
You can play around with hyperparameters by providing alternative config files via the --config setting. You can generate one on the spaCy docs if you like. Note that there's a setting for efficiency (which is the default) and another one for accuracy.
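For a concrete feel of what such a config override can touch, here is an illustrative fragment (the values are placeholders to experiment with, not recommendations):

```
[training]
dropout = 0.1
max_steps = 20000

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
```

Settings you don't override keep their defaults from the generated config, so you can start from the quickstart output and change one knob at a time.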
You can do more elaborate customisations by interfacing with spaCy directly, but I would only consider that if you've gotten a good reason to go there. At the moment I would just keep it at changing your configuration file and interact with prodigy that way.
Have you looked at the kinds of mistakes that your model makes? Is there a specific label that has bad performance? Understanding this better might also help inspire us towards a better model.
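A simple way to start that error analysis is to count exact-match hits, misses, and spurious predictions per label, given gold and predicted spans as (start, end, label) tuples. This is a hand-rolled sketch (spaCy's own Scorer can produce per-label metrics too); it helps spot whether one label, say SOFTSKILL vs SPECTOOL, is driving the low overall score:

```python
from collections import Counter

def per_label_errors(gold, predicted):
    """Count exact-match correct, missed, and spurious spans per label."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = Counter(label for *_, label in gold_set & pred_set)
    missed = Counter(label for *_, label in gold_set - pred_set)
    spurious = Counter(label for *_, label in pred_set - gold_set)
    return {"correct": correct, "missed": missed, "spurious": spurious}

# Offsets loosely modelled on the sample annotation above:
gold = [(482, 501, "HARDSKILL"), (801, 830, "SOFTSKILL")]
pred = [(482, 501, "HARDSKILL"), (900, 910, "SOFTSKILL")]
print(per_label_errors(gold, pred))
```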