Advanced Dataset and Metadata Editing

1 Using the Dataverse Native API
2 Using PyDataverse Python Module for Dataverse

The Dataverse user interface provides an easy way to modify dataset descriptions, files, and metadata. However, if many files in a dataset need the same modification, or if you are developing an application that requires programmatic access to a dataset or its files, you may need to work with dataset and file information from the command line. For example, you may need to pull descriptive information about multiple datasets or about the files within a dataset, or cycle through all of the files in a dataset and add tags or prefixes to each one. Editing each file individually would be time-consuming, and the API provides calls that retrieve and update this information more efficiently. It also returns the information in JSON format, which most scripting languages can parse readily.

Using the Dataverse Native API

Dataverse provides an extensive REST-based Native API that lets users perform many of the tasks available in the GUI from the command line. These API calls can perform tasks such as creating collections, viewing the contents of a collection, and listing other information about a collection.
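
As a minimal sketch of what such a call looks like from a script, the snippet below builds a request that lists the contents of a collection. It assumes the `/api/dataverses/{alias}/contents` endpoint and the `X-Dataverse-key` header described in the Dataverse API guide. The installation URL, the `root` alias, and the helper names `build_contents_request` and `fetch_json` are placeholders chosen for illustration.

```python
import json
import urllib.request

BASE_URL = "https://example.dataverse.org"  # placeholder installation URL


def build_contents_request(base_url, collection_alias, api_token=None):
    """Build a GET request that lists the contents of a collection."""
    url = f"{base_url.rstrip('/')}/api/dataverses/{collection_alias}/contents"
    headers = {}
    if api_token:
        # The API token is sent in a request header (assumed name: X-Dataverse-key).
        headers["X-Dataverse-key"] = api_token
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(request):
    """Send the request and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# "root" is a placeholder alias; substitute the alias of your own collection.
request = build_contents_request(BASE_URL, "root")
print(request.get_method(), request.full_url)
# contents = fetch_json(request)  # uncomment to query a real installation
```

Keeping the request construction separate from the network call makes it easy to inspect the URL before anything is sent.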

API commands can also be used in a script or program to automate repetitive functions such as downloading files, editing/replacing files, adding tags, and changing file metadata.
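
The sketch below illustrates the kind of loop such a script might contain: it walks the file list in a dataset's JSON description and builds a download URL for each file. The `data.latestVersion.files` layout matches the one used by the PyDataverse example later on this page. The sample values are made up, and the `/api/access/datafile/{id}` path is assumed from the Dataverse Data Access API. `list_files` and `download_url` are illustrative names.

```python
# Made-up sample shaped like the JSON returned when a dataset is retrieved.
SAMPLE_DATASET = {
    "data": {
        "latestVersion": {
            "files": [
                {"label": "scan_001.ome.tif",
                 "dataFile": {"id": 101, "filename": "scan_001.ome.tif"}},
                {"label": "scan_001.xml",
                 "dataFile": {"id": 102, "filename": "scan_001.xml"}},
            ]
        }
    }
}


def list_files(dataset_json):
    """Return (file id, file name) pairs for the latest dataset version."""
    files = dataset_json["data"]["latestVersion"]["files"]
    return [(f["dataFile"]["id"], f["dataFile"]["filename"]) for f in files]


def download_url(base_url, file_id):
    """Build the download URL for one file (assumed Data Access API path)."""
    return f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"


for file_id, filename in list_files(SAMPLE_DATASET):
    print(filename, download_url("https://example.dataverse.org", file_id))
```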

Examples:

Dataverse Toolbox Python scripts provided by Don Sizemore of the Odum Institute

Using PyDataverse Python Module for Dataverse

PyDataverse is a Python module for Dataverse you can use for:

accessing the Dataverse APIs
manipulating and using the Dataverse (meta)data - Dataverses, Datasets, Datafiles

PyDataverse allows you to use the API to create collections, upload and download files from a collection, and retrieve all of the metadata related to a collection and its files. It also supports extensive metadata editing via CSV templates with pre-defined data formats, and it can use API calls to edit specific metadata fields in a dataset or file (e.g., tags or description).
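
To illustrate the idea of driving metadata edits from a spreadsheet, here is a minimal sketch that turns CSV rows into per-file metadata JSON strings. The column layout (`file_id`, `description`, `categories`) is a hypothetical one invented for this example and is not PyDataverse's own CSV template format. The `description` and `categories` field names are the ones used in the depositor's script on this page.

```python
import csv
import io
import json

# Hypothetical CSV layout (not PyDataverse's own template): one row per file,
# with multiple tags separated by semicolons.
CSV_TEXT = """file_id,description,categories
101,Full Resolution Image,adult
102,Thumbnail Image,adult;reviewed
"""


def rows_to_metadata(csv_text):
    """Map each file ID to a JSON string of the metadata fields to set."""
    updates = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        metadata = {"description": row["description"]}
        categories = [c.strip() for c in row["categories"].split(";") if c.strip()]
        if categories:
            metadata["categories"] = categories
        updates[int(row["file_id"])] = json.dumps(metadata)
    return updates


updates = rows_to_metadata(CSV_TEXT)
for file_id, json_str in updates.items():
    print(file_id, json_str)
```

Each resulting string has the same general shape as the `json_str` argument that the depositor's script on this page prepares for `update_datafile_metadata`. Whether a partial set of fields is accepted should be checked against your Dataverse installation before running a bulk update.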

Example:

The code below is an example of a program one of our depositors wrote to cycle through the files in a dataset and assign a description, a tag, and a directory label to each one. It uses PyDataverse's NativeApi and DataAccessApi classes to read donor information from the dataset's XML files, set each file's description according to the image size indicated by its file name (full resolution, reduced, or thumbnail), and add the donor's life stage as a tag.

# Reference: https://pydataverse.readthedocs.io/en/latest/reference.html
from pyDataverse.api import NativeApi, DataAccessApi
import json
from lxml import etree
import os

DATAVERSE = "https://example.dataverse.org" 
DATASET_DOI = "doi:10.5072/FK2/XXXXXX"

# Use environment variable for API TOKEN
# export APITOKEN=....
apitoken = os.environ['APITOKEN']
api = NativeApi(DATAVERSE, apitoken)
dataset = api.get_dataset(DATASET_DOI, auth=apitoken)
data_api = DataAccessApi(DATAVERSE, apitoken)

donor_dict = {}  # key: donor ID, value: donor life stage

files_list = dataset.json()['data']['latestVersion']['files']

# creates a dict key donorID and value donorLifeStage from the xml files 
xml_files = [ f for f in files_list if "xml" in f["dataFile"]["filename"] ]
for x in xml_files:
    xml_content = data_api.get_datafile(x["dataFile"]["id"]).text.encode('utf-8')
    xml_root = etree.fromstring(xml_content)
    donor_id = xml_root.find(".//{http://mother-db.org/mdb}donorID").text
    donor_life_stage = xml_root.find(".//{http://mother-db.org/mdb}donorLifeStage").text
    if donor_id not in donor_dict:
        donor_dict[donor_id] = donor_life_stage
print(donor_dict)

for file in files_list:
    file_name = file["dataFile"]["filename"]
    file_id = file["dataFile"]["id"]
    print(file_name, file_id)
    # get_datafile_metadata(identifier, is_filepid=False, is_draft=False, auth=True)
    file_metadata = api.get_datafile_metadata(file_id, is_draft=True)
    file_metadata = file_metadata.json() 
    if "ome.tif" in file_metadata['label']:
        file_metadata['description'] = "Full Resolution Image"
    elif "reduced" in file_metadata['label']:
        file_metadata['description'] = "Reduced Image"
    elif "thumbnail" in file_metadata['label']:
        file_metadata['description'] = "Thumbnail Image"
    elif "xml" in file_metadata['label']:
        file_metadata['description'] = "Metadata"
    # Files that match none of the patterns keep their existing description.

    for d in donor_dict:
        if d in file_metadata['label']:
            # add lifeStage as tag
            file_metadata['categories'] = [ donor_dict[d] ]
            # add donor as directoryLabel
            file_metadata['directoryLabel'] = d

    # update_datafile_metadata(identifier, json_str=None, is_filepid=False)
    file_metadata_json_str = json.dumps(file_metadata) 
    print(file_metadata_json_str)
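    # Dry run: the update call is left commented out so that the script only
    # prints the planned changes. Uncomment the two lines below to apply them.
    #update_response = api.update_datafile_metadata(file_id, json_str=file_metadata_json_str)
    #update_response.check_returncode()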
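
Because the update call in the script above is commented out, it is natural to check the tagging rules on a few sample file names before applying them to a real dataset. The sketch below restates those rules as a pure function that needs no API access. `plan_metadata`, `DESCRIPTION_RULES`, and the sample donor ID and file names are illustrative and are not part of the depositor's program.

```python
# Same substring rules as the script above, checked in the same order.
DESCRIPTION_RULES = [
    ("ome.tif", "Full Resolution Image"),
    ("reduced", "Reduced Image"),
    ("thumbnail", "Thumbnail Image"),
    ("xml", "Metadata"),
]


def plan_metadata(label, donor_dict):
    """Return the metadata fields the script would set for one file label."""
    changes = {}
    for marker, description in DESCRIPTION_RULES:
        if marker in label:
            changes["description"] = description
            break  # first matching rule wins, as in the if/elif chain
    for donor_id, life_stage in donor_dict.items():
        if donor_id in label:
            changes["categories"] = [life_stage]   # life stage as tag
            changes["directoryLabel"] = donor_id   # donor as folder
    return changes


# Made-up donor ID and file names, for illustration only.
donors = {"D001": "adult"}
for name in ["D001_scan.ome.tif", "D001_thumbnail.png", "notes.txt"]:
    print(name, plan_metadata(name, donors))
```

Note that the matching is by substring, so a donor ID that is a prefix of another (for example `D1` and `D10`) would match both sets of files. Check a sample of file names this way before uncommenting the update call.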
