Hi, apologies if this has already been asked elsewhere. From a .txt file, I want to create a .jsonl file to be annotated manually for NER.
But I would like to pre-populate some of the words with annotations (the obvious ones, etc.). I was thinking that I would need to load it in a format very similar to a db-out .jsonl file, but wondered which fields/keys I absolutely need. So I presume that I need "text" and "spans"? What about "tokens" and the "_input_hash", etc.?
Also, is there any guidance on the best way to do this with a .txt file? Using Python?
Thank you!
Yes, exactly! Prodigy's input and output formats are essentially the same, so you can load in data in the same format as the data you export with db-out.
The only field that's absolutely needed for NER is "text", plus "spans" if you want to pre-populate the entities. If no "tokens" are provided, the ner.manual recipe will tokenize the text with spaCy, add the tokens and map the existing "spans" to them automatically. This typically works fine, but it can lead to problems if the pre-populated spans don't map to valid tokens. For example, say you have a text like "U.S." and a pre-defined span for "U.": the default tokenizer keeps "U.S." as a single token, so that span wouldn't be valid. If you're working with custom tokenization, you can provide it via the "tokens" key and Prodigy will respect it.
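For instance, a task with custom tokenization could look something like this (just a sketch; the token dicts follow the "text" / "start" / "end" / "id" format that db-out exports, and the "GPE" label is only an example). Here the text is split so that the span over "U." maps cleanly onto the first two tokens:
{
  "text": "U.S.",
  "tokens": [
    {"text": "U", "start": 0, "end": 1, "id": 0},
    {"text": ".", "start": 1, "end": 2, "id": 1},
    {"text": "S", "start": 2, "end": 3, "id": 2},
    {"text": ".", "start": 3, "end": 4, "id": 3}
  ],
  "spans": [{"start": 0, "end": 2, "label": "GPE"}]
}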
The other keys, like the hashes, will all be added automatically. You can, in theory, pre-define them (e.g. if you want to implement your own custom hashing), but that's typically not needed.
Do you already have the annotations that you want to add and if so, how are they formatted? Or do you have lists of words or phrases you know are always going to be a certain entity?
Basically, assuming your .txt file looks like this:
This is a text about Facebook.
The result you want to create for this example could be:
{
  "text": "This is a text about Facebook.",
  "spans": [{"start": 21, "end": 29, "label": "ORG"}]
}
If you only have word lists that include words like "Facebook", one option would be to use spaCy's matchers or even just simple regular expressions. You can do it in Python, or (technically) any other language, tool or library that lets you create JSON-formatted data. (Instead of a .jsonl file, you can also load a .json file into Prodigy, if you find that easier to work with.)
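For example, here's a minimal sketch using spaCy's PhraseMatcher, assuming one text per line (the file names, the "ORG" label and the term list are placeholders):

import json
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # blank pipeline, we only need the tokenizer
matcher = PhraseMatcher(nlp.vocab)
# placeholder term list: everything in it gets pre-labelled as ORG
matcher.add("ORG", [nlp.make_doc("Facebook")])

with open("input.txt", encoding="utf8") as f, open("output.jsonl", "w", encoding="utf8") as out:
    for line in f:
        line = line.strip()
        if not line:
            continue
        doc = nlp(line)
        spans = []
        for match_id, start, end in matcher(doc):
            span = doc[start:end]  # convert token indices to a character span
            spans.append({"start": span.start_char, "end": span.end_char,
                          "label": nlp.vocab.strings[match_id]})
        out.write(json.dumps({"text": doc.text, "spans": spans}) + "\n")

The output file can then be loaded straight into ner.manual, and the matched spans will show up pre-highlighted.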
If one has multiple .txt files (it's a dataset of 1,000+ papers converted from PDF to .txt), would you advise converting them into a single JSON/JSONL file (like done here), or staying with the original file structure?
Are there any functions to create that labeled JSONL (with the matched labels, e.g. "spans": [{"start": 21, "end": 29, "label": "ORG"}])?
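For context, this is roughly what I have in mind, building on the sketch above (the "papers" folder and the term list are made up):

import json
from pathlib import Path

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ORG", [nlp.make_doc("Facebook")])  # placeholder terms

with open("papers.jsonl", "w", encoding="utf8") as out:
    # hypothetical folder containing the converted .txt files
    for path in sorted(Path("papers").glob("*.txt")):
        for line in path.read_text(encoding="utf8").splitlines():
            line = line.strip()
            if not line:
                continue
            doc = nlp(line)
            spans = [{"start": doc[s:e].start_char, "end": doc[s:e].end_char,
                      "label": nlp.vocab.strings[mid]} for mid, s, e in matcher(doc)]
            # "meta" is shown in the annotation UI, so each example can be
            # traced back to its source file
            out.write(json.dumps({"text": doc.text, "spans": spans,
                                  "meta": {"source": path.name}}) + "\n")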