Advanced Dataset and Metadata Editing

The Dataverse user interface provides an easy way to modify dataset descriptions, files, and metadata. However, if many files in a dataset need the same modification, or if you are developing an application that requires API access to datasets or files, you may need to work with dataset and file information from the command line. For example, you may need to pull descriptive information about multiple datasets or about the files within a dataset, or cycle through all of the files in a dataset and add tags or prefixes to each one. Editing each file individually would be time-consuming; the API provides calls that retrieve and modify this information more efficiently. It also returns the information in JSON format, which most scripting languages can read directly.

Using the Dataverse Native API

Dataverse provides an extensive REST-based Native API that lets users perform, from the command line, many of the same tasks available in the GUI. These API calls can perform tasks such as creating collections, viewing the contents of a collection, and listing other information about a collection.
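
As a minimal sketch of what such a call looks like, the Python below builds (but does not send) two Native API requests using only the standard library. The host name, the collection alias "my-collection", the placeholder token, and the helper name build_native_api_request are assumptions made for illustration; the endpoint paths and the X-Dataverse-key header follow the Native API conventions.

```python
import urllib.parse
import urllib.request


def build_native_api_request(base_url, path, api_token=None, params=None):
    """Build (but do not send) a GET request for a Native API endpoint.

    Hypothetical helper for illustration; it is not part of Dataverse.
    """
    url = base_url.rstrip("/") + "/api/" + path.lstrip("/")
    if params:
        url += "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, method="GET")
    if api_token:
        # Dataverse reads the API token from this header.
        request.add_header("X-Dataverse-key", api_token)
    return request


# List the contents of a collection (the alias is a placeholder).
contents_request = build_native_api_request(
    "https://example.dataverse.org",
    "dataverses/my-collection/contents",
    api_token="xxxxxxxx-xxxx",
)

# Fetch a dataset's JSON by persistent identifier (the DOI is a placeholder).
dataset_request = build_native_api_request(
    "https://example.dataverse.org",
    "datasets/:persistentId/",
    params={"persistentId": "doi:10.5072/FK2/XXXXXX"},
)

print(contents_request.full_url)
print(dataset_request.full_url)
```

Passing either request to urllib.request.urlopen() would send it to the server; the sketch stops short of that so it can be run without network access or a real token.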

API commands can also be used in a script or program to automate repetitive functions such as downloading files, editing/replacing files, adding tags, and changing file metadata.
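
To illustrate the kind of automation this enables, the sketch below walks the file list in a dataset's JSON and derives a download URL for each file. The sample JSON is invented for the example and trimmed to the fields the code reads; the helper name download_urls and the host name are likewise assumptions. The /api/access/datafile/{id} path is the Data Access API route for downloading a file by its numeric ID.

```python
import json

# Invented sample, shaped like the "files" portion of a dataset's JSON.
SAMPLE_RESPONSE = json.loads("""
{"status": "OK", "data": {"latestVersion": {"files": [
  {"label": "donor1.ome.tif", "dataFile": {"id": 101, "filename": "donor1.ome.tif"}},
  {"label": "donor1.xml", "dataFile": {"id": 102, "filename": "donor1.xml"}}
]}}}
""")


def download_urls(dataset_json, base_url):
    """Map each file name in the dataset to its download URL."""
    base = base_url.rstrip("/")
    files = dataset_json["data"]["latestVersion"]["files"]
    return {
        entry["dataFile"]["filename"]: f"{base}/api/access/datafile/{entry['dataFile']['id']}"
        for entry in files
    }


urls = download_urls(SAMPLE_RESPONSE, "https://example.dataverse.org")
for name, url in urls.items():
    print(name, url)
```

A script would typically fetch each URL in turn, adding the API token header for restricted or unpublished files.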

Examples:

Dataverse Toolbox Python scripts provided by Don Sizemore of the Odum Institute

Using PyDataverse Python Module for Dataverse

PyDataverse is a Python module for Dataverse you can use for:

  • accessing the Dataverse APIs

  • manipulating and using the Dataverse (meta)data - Dataverses, Datasets, Datafiles

PyDataverse lets you use the API to create collections, upload and download files from a collection, and retrieve all of the metadata related to a collection and its files. It also supports extensive metadata editing, either through CSV templates with pre-defined data formats or through API calls that edit specific metadata fields in a dataset or file (e.g., tags or descriptions).

Example:

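The code below is an example of a program one of our depositors wrote to cycle through the files in a dataset and assign a description, a tag, and a directory label to each file. It uses the NativeApi and DataAccessApi classes from PyDataverse. The description is chosen from the file name (full-resolution image, reduced image, thumbnail image, or XML metadata), and the tag is the donor life stage read from the XML files in the dataset. The final update call is commented out, so the script as shown only prints the metadata it would send.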

# Reference: https://pydataverse.readthedocs.io/en/latest/reference.html
from pyDataverse.api import NativeApi, DataAccessApi
import json
from lxml import etree
import os

DATAVERSE = "https://example.dataverse.org"
DATASET_DOI = "doi:10.5072/FK2/XXXXXX"

# Use environment variable for API TOKEN
# export APITOKEN=....
apitoken = os.environ['APITOKEN']

api = NativeApi(DATAVERSE, apitoken)
dataset = api.get_dataset(DATASET_DOI, auth=apitoken)
data_api = DataAccessApi(DATAVERSE, apitoken)

donor_dict = {}  # key: donor id, value: life cycle
files_list = dataset.json()['data']['latestVersion']['files']

# creates a dict key donorID and value donorLifeStage from the xml files
xml_files = [f for f in files_list if "xml" in f["dataFile"]["filename"]]
for x in xml_files:
    xml_content = data_api.get_datafile(x["dataFile"]["id"]).text.encode('utf-8')
    xml_root = etree.fromstring(xml_content)
    donor_id = xml_root.find(".//{http://mother-db.org/mdb}donorID").text
    donor_life_stage = xml_root.find(".//{http://mother-db.org/mdb}donorLifeStage").text
    if donor_id not in donor_dict:
        donor_dict[donor_id] = donor_life_stage
print(donor_dict)

for file in files_list:
    file_name = file["dataFile"]["filename"]
    file_id = file["dataFile"]["id"]
    print(file_name, file_id)

    # get_datafile_metadata(identifier, is_filepid=False, is_draft=False, auth=True)
    file_metadata = api.get_datafile_metadata(file_id, is_draft=True)
    file_metadata = file_metadata.json()

    if "ome.tif" in file_metadata['label']:
        file_metadata['description'] = "Full Resolution Image"
    elif "reduced" in file_metadata['label']:
        file_metadata['description'] = "Reduced Image"
    elif "thumbnail" in file_metadata['label']:
        file_metadata['description'] = "Thumbnail Image"
    elif "xml" in file_metadata['label']:
        file_metadata['description'] = "Metadata"
    else:
        pass

    for d in donor_dict:
        if d in file_metadata['label']:
            # add lifeStage as tag
            file_metadata['categories'] = [donor_dict[d]]
            # add donor as directoryLabel
            file_metadata['directoryLabel'] = d

    # update_datafile_metadata(identifier, json_str=None, is_filepid=False)
    file_metadata_json_str = json.dumps(file_metadata)
    print(file_metadata_json_str)
    #update_response = api.update_datafile_metadata(file_id, json_str=file_metadata_json_str)
    #update_response.check_returncode()
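
Before enabling the update call, it can help to check the labeling rules on their own, without touching the server. The sketch below is one possible restructuring of the depositor's rules, not part of the original program: it gathers them into a pure function that takes a file's metadata and the donor dictionary and returns the metadata that would be sent. The function name plan_update, the sample donor ID "D001", and the life stage "adult" are invented for illustration.

```python
# (substring in the file label, description to assign); first match wins,
# mirroring the if/elif chain in the program above.
DESCRIPTION_RULES = [
    ("ome.tif", "Full Resolution Image"),
    ("reduced", "Reduced Image"),
    ("thumbnail", "Thumbnail Image"),
    ("xml", "Metadata"),
]


def plan_update(file_metadata, donor_dict):
    """Return a copy of file_metadata with description, tag, and folder set."""
    updated = dict(file_metadata)
    label = updated["label"]

    for marker, description in DESCRIPTION_RULES:
        if marker in label:
            updated["description"] = description
            break

    for donor_id, life_stage in donor_dict.items():
        if donor_id in label:
            updated["categories"] = [life_stage]  # life stage becomes the tag
            updated["directoryLabel"] = donor_id  # donor ID becomes the folder

    return updated


# Hypothetical inputs: the donor ID and life stage are made up.
example = plan_update({"label": "D001_thumbnail.png"}, {"D001": "adult"})
print(example)
```

Because the function returns a new dictionary and makes no API calls, its output can be inspected for every file first and then serialized with json.dumps() for update_datafile_metadata once the results look right.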