What is the best way to load a dataset of complex JSON documents for annotating?

I have a cybersecurity-related dataset containing thousands of complex JSON documents. A typical example:

{
	'reportId': '09-87654321', 
	'title': 'Indicator Report: TrickBot Activity Report (Apr 17, 2009)', 
	'ThreatScape': ['Cyber Crime'], 
	'audience': ['Operational'], 
	'publishDate': 1234567890, 
	'version': '1', 
	'version1PublishDate': 1234567890, 
	'intelligenceType': 'malware', 
	'reportType': 'Indicator Report', 
	'report_details': {
		'reportId': '09-87654321', 
		'title': 'Indicator Report: TrickBot Activity Report (Apr 17, 2009)', 
		'execSummary': "<p>TrickBot (aka TrickLoader) is a banking Trojan with the capability to facilitate fraud by way of capturing users' login credentials for identified banking entities, conduct man-in-the-middle (MitM) sessions, and deliver scripts to the victim that can perform a variety of tasks, including inciting the user to enter personally identifiable information (PII) and passwords.</p>", 
		'ThreatScape': {'product': ['ThreatScape Cyber Crime']}, 
		'audience': ['Operational'], 
		'publishDate': 'April 17, 2009 17:45:00 PM', 
		'version': '1', 
		'reportType': 'Indicator Report', 
		'analysis': '<div><p>Many similarities between TrickBot and the now-defunct Dyre banking Trojan exist, including web-inject types, code structure, and check-in types for command and control (C&amp;C) communications. Like Dyre, TrickBot also uses compromised infrastructure for some of the C&amp;C communications. Although we believe that the two projects (Dyre and TrickBot) are related, the code for TrickBot has been completely rewritten so we do not believe this is a new variant, but instead its own code family.</p><p>The TrickBot indicators in the report include controller URLs and nodes. These nodes or controllers are used for command and control communications, downloads, and configuration downloads (including injects). IOCs marked as "Attacker" are purely malicious. Any IOC marked as "Compromised" has been malicious at one point but may have been remediated. IP addresses marked as "Related" often include controller nodes hosted on legitimate infrastructure, possibly containing hundreds or thousands of additional hosts.</p></div>', 
		'previousVersionSection': {
			'previousVersion': [
				{
					'versionNumber': '1.0', 
					'title': 'Indicator Report: TrickBot Activity Report (Apr 17, 2009)', 
					'publishDate': 'April 17, 2009 17:45:00 PM'
				}
			]
		}, 
		'version1PublishDate': 'April 17, 2009 17:45:00 PM', 
		'tagSection': {
			'main': {
				'targetGeographies': { 'targetGeography': ['Global'] }, 
				'intendedEffects': {
					'intendedEffect': [
						'Financial Theft (Fraud/Extortion/Financial Loss/Stealing Money)', 
						'IP or Confidential Business Information Theft', 
						'Credential Theft/Account Takeover', 
						'Financial Theft'
					]
				}, 
				'affectedSystems': {'affectedSystem': ['Users/Application and Software']}, 
				'motivations': {'motivation': ['Financial or Economic']}, 
				'targetedInformations': {'targetedInformation': ['Customer Data', 'Financial Data', 'Credentials']}, 
				'ttps': {'ttp': ['Malware Propagation and Deployment', 'Fraud']}, 
				'malwareFamilies': {'malwareFamily': [{'name': 'trickbot'}]}
			}
		}, 
		'riskRating': 'LOW', 
		'intelligenceType': 'malware'
	}
}

I'd like to build an NER model with this dataset. To do that, I need to label entities such as malware and the organizations affected by it.

What is the best way to load this dataset in Prodigy for annotating?
Everything I have tried so far produced one of two results:

  1. a multi-line JSON document was cut into many small, seemingly random segments; or
  2. a large JSON document was rendered as one very long single-line doc

How can I load the dataset so that each complex JSON document can be rendered in Prodigy with proper JSON format?

Hi @howardBayes ,

If I understand correctly, you need the whole JSON file and its structure to show up in the Prodigy interface? Or do you only need specific text from that JSON file so that you can label some of its spans?

If it's the former, you can perhaps use a custom interface so that the nesting is displayed properly. You can try wrapping each pretty-printed document in a <pre> tag so that its indentation and line breaks are preserved, as in a code block. If it's the latter, you first need to decide which text you want to display in Prodigy. Then, create a JSONL file containing those texts and load it for annotation.
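As a rough sketch of both options, here is how the task dictionaries could be built with the standard library only. The helper names (to_html_task, to_text_task, write_jsonl) and the abbreviated sample document are made up for illustration. The "html" and "text" keys are assumed to be what the HTML interface and the NER recipes read, so it is worth checking them against the docs for your Prodigy version. Note that the sample posted above uses single quotes, so it looks like a Python dict rather than strict JSON and may need ast.literal_eval instead of json.loads.

```python
import html
import json


def to_html_task(doc):
    """Option 1: pretty-print the whole document inside <pre>."""
    pretty = json.dumps(doc, indent=2, ensure_ascii=False)
    # Escape so embedded markup such as <p> is shown literally, not rendered.
    return {"html": "<pre>" + html.escape(pretty) + "</pre>"}


def to_text_task(doc, field_path):
    """Option 2: follow a list of keys to one string field to annotate."""
    value = doc
    for key in field_path:
        value = value[key]
    return {"text": value, "meta": {"reportId": doc.get("reportId")}}


def write_jsonl(tasks, path):
    """Write one JSON object per line, which is the JSONL layout."""
    with open(path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task, ensure_ascii=False) + "\n")


# Abbreviated stand-in for one report from the dataset.
doc = {
    "reportId": "09-87654321",
    "report_details": {
        "execSummary": "<p>TrickBot (aka TrickLoader) is a banking Trojan.</p>"
    },
}

html_task = to_html_task(doc)
text_task = to_text_task(doc, ["report_details", "execSummary"])
```

Calling write_jsonl([text_task], "reports.jsonl") would then give a file with one task per line. The file name is only an example.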

Hi @ljvmiranda921 ,

Thank you for your reply. I will definitely try the custom interface.

The schema of the JSON documents is unknown, since the data come from various sources. It is not practical to assume a single JSON schema, or even a handful of them.
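When the schema is not known in advance, one possible workaround is to ignore it entirely: walk each document recursively and collect every string value that looks like prose, along with the path where it was found. The sketch below assumes that "looks like prose" can be approximated by a minimum word count. The min_words threshold, the helper names and the shortened sample document are all illustrative and are not part of Prodigy.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(value):
    """Replace HTML tags with spaces, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", value)).strip()


def iter_text_fields(node, path=(), min_words=5):
    """Yield (path, text) for every string leaf long enough to be prose."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_text_fields(child, path + (str(key),), min_words)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_text_fields(child, path + (str(index),), min_words)
    elif isinstance(node, str):
        text = strip_markup(node)
        if len(text.split()) >= min_words:
            yield ".".join(path), text


# Shortened stand-in for one report; real documents can have any shape.
doc = {
    "reportId": "09-87654321",
    "report_details": {
        "execSummary": "<p>TrickBot (aka TrickLoader) is a banking Trojan.</p>",
        "analysis": "<div><p>Like Dyre, TrickBot also uses compromised "
                    "infrastructure for some of the C&amp;C communications.</p></div>",
        "riskRating": "LOW",
    },
}

# One annotation task per prose field, with the source path kept in "meta".
tasks = [{"text": text, "meta": {"path": path}} for path, text in iter_text_fields(doc)]
```

Short values such as IDs and ratings are skipped, and keeping the path in "meta" makes it possible to trace each annotated text back to its place in the original document.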