Additional Information About Uploading Files
Uploading Files via the Web Interface
The default method of uploading files in Dataverse is the web interface. You can upload files by choosing one or more files from your hard drive and clicking “Open”, or by dragging and dropping the files into the file upload area. You can also use the Dropbox integration to upload files directly from your Dropbox account.
Uploading files this way adds them all at the same level in the dataset. If you have a long file list, however, organizing the files into folders can help you direct users to important documentation and assist navigation, and some projects have dependencies that require a particular folder structure. The best way to reproduce a folder structure in Dataverse is to upload it as one or more compressed ZIP files.
Any individual file must be under 4GB. However, uploads of files over 2GB (and of some files below that threshold) may be slow or stall due to variables outside of ASU’s control. If you have data files over 4GB, we will consider support options on a case-by-case basis and consult with your library liaison. Please use our Contact Researcher Support form if you have trouble uploading files.
If you have a large number of files to upload, you can compress the files and upload the ZIP file. Once the ZIP file is uploaded, it will be extracted, and the individual files will appear in your file list. If you want a file to remain compressed as a ZIP file after upload through the web interface, you must double-zip it: the outer ZIP will be extracted, leaving only the original inner ZIP file.
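For example, you can create a double-zipped file from the command line with the zip utility (the filenames here are illustrative):

# Create the ZIP you actually want to keep in the dataset
zip -r mydata.zip mydata/

# Wrap it in a second ZIP; Dataverse will strip this outer layer on upload
zip mydata-wrapped.zip mydata.zip

Upload mydata-wrapped.zip; after extraction, mydata.zip will appear in your file list, still compressed.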
The best practice when uploading compressed files through the web interface is to create your folder structure first, then compress the folders from the directory level at which you want subdirectories to appear in the dataset.
For example, the following is a sample folder structure (in Windows) for a project; the names shown are illustrative:
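Project
    Documentation
        readme.txt
    ProcessedData
        results.csv
    RawData
        raw01.csv
        instruments.zip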
You can compress these folders into a single ZIP file (view recording).
Then, upload the ZIP file to the Dataverse dataset (view recording).
The files will be extracted, resulting in a tree structure in Dataverse that mirrors the folder structure above.
NOTE: The ZIP file in the RawData folder (instruments.zip in the example above) was not extracted because it was double-zipped (a ZIP within a ZIP).
NOTE: There is no limit on the size of an uploaded ZIP file, but there is a limit on the number of files it can contain: 1,000.
Uploading Files via Direct Upload
For most file uploads, the standard user interface will be adequate and work well. A normal file upload uses a portion of the available network bandwidth, as well as some temporary storage on the user’s computer (or the server). With larger files, these resources can be insufficient and the upload can fail.
If users are uploading large files (i.e., greater than 3GB), it may be necessary to take advantage of AWS multi-part upload functionality to help the uploads succeed. This allows a large file to be broken into smaller pieces for upload, which eliminates many of the resource issues that can occur when trying to upload a single large file. This involves configuring the specific dataset to use the direct upload capability when uploading files.
It is important for our team to work with you to assess your files and determine the best upload option for your project.
NOTE: If direct upload is configured for your dataset’s file store, files compressed in ZIP format will not be extracted upon upload, unlike the behavior described in the “Uploading Files via the Web Interface” section above.
For more information on direct upload, see: Big Data Support — Dataverse.org
Uploading Files via DVWebLoader
DVWebLoader is a small web application that can be configured with Dataverse to upload a whole directory/folder tree of files into a Dataverse dataset, retaining the files’ relative paths within the dataset. Before uploading, DVWebLoader will check the dataset contents and, by default, will not upload files that already exist in it. Users can modify the default selection by checking/unchecking specific files before initiating the upload. DVWebLoader currently works with S3 stores with direct upload enabled and will not work with other types of stores in Dataverse.
You do not need to double-zip files to keep them compressed as ZIP files when using DVWebLoader.
To upload files into a dataset using DVWebLoader, perform the following steps:
Go to the file upload page in the dataset by clicking the "Upload Files" button from the main landing page of the dataset.
You will see a page that gives various methods for uploading. Choose the "Upload a Folder" button.
A screen will appear with a button allowing you to “Select a Directory”. Click the button and choose a folder on your local device to upload to the dataset.
You will not see the files within the folder when choosing it, but they will appear on the file transfer page. Any files that are already uploaded to the dataset will be unchecked by default. If you want to overwrite an existing file, check its box to upload it again.
Click the "Start Uploads" button. You will see a progress bar indicating the status of the uploaded files. When the file upload is complete, a green status will appear indicating that the files have been uploaded.
Uploading Files via the Command Line
The Dataverse community provides a tool called DVUploader to help with uploading datasets that contain many files or large files.
The DVUploader is a Java application packaged as a single .jar file. Visit the tool's official home on GitHub for the current version and more detailed documentation. This application allows you to upload a directory or folder structure from the command line.
NOTE: You must have Java 8 or above installed in order to run the Java application.
You can place the DVUploader .jar file at the directory level at which you want subdirectories to appear. Then you can run the tool using the following syntax:
java -jar DVUploader-v1.1.0.jar -recurse -directupload -key=<your api token> -did=doi:10.48349/ASU/<your dataset code> -server=https://dataverse.asu.edu <directory path>
<your api token> - an API token can be obtained from your Dataverse account profile
<your dataset code> - the last 6 digits of the DOI of your dataset
<directory path> - the path on your system pointing to the files
For example, the following is a sample folder structure (in Linux) for a project; the names shown are illustrative:
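testfiles/
    Documentation/
        readme.txt
    ProcessedData/
        results.csv
    RawData/
        rawdata.zip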
You can place the DVUploader tool at the directory level at which you want subdirectories to appear.
You can then run the DVUploader command:
java -jar DVUploader-v1.1.0.jar -recurse -key=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx -did=doi:10.xxxxx/ASU/xxxxxx -server=https://dataverse.asu.edu testfiles
The resulting structure (tree view) in the dataset, continuing the illustrative example above, looks like this:
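Documentation/
    readme.txt
ProcessedData/
    results.csv
RawData/
    raw01.dat
    raw02.dat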
View a recording of this process.
NOTE: The ZIP file in the RawData folder (rawdata.zip in the example) was extracted into individual files!
If you have large files (greater than 3GB), you should use the -directupload flag. This allows the API to utilize the multi-part upload feature of S3, which divides large files into parts for upload.
The command will look like this:
java -jar DVUploader-v1.1.0.jar -recurse -directupload -key=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx -did=doi:10.xxxxx/ASU/xxxxxx -server=https://dataverse.asu.edu testfiles
NOTE: If you use the -directupload flag, ZIP files will not be extracted!
Ingesting Tabular Files
When files of type XLSX, CSV, R, STATA, or SPSS are uploaded to Dataverse, an ingest process analyzes each file and converts it to a simple tabular format that can be read by analysis tools. Dataverse does its best to extract the important data from the original file format into columns of data that can be read by data curation tools. The original copy of the file is retained, and users can download either version of the file.
When converting comma-separated (CSV) files and spreadsheets, there are formatting requirements that must be met in order for the ingest to occur. If these requirements are not met, the file will not be ingested and will remain in its original format.
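For instance, a CSV with a single header row and the same number of values in every row, like the illustrative file below, is the kind of layout that typically ingests cleanly:

year,site,temperature
2021,A,15.2
2021,B,14.8
2022,A,16.1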
For more information on the tabular file ingest process, see: Tabular Data, Representation, Storage and Ingest — Dataverse.org
Uploading Files via Globus File Transfer
Globus is a file transfer tool that optimizes the uploading and downloading of large or numerous files between endpoints. Globus allows users to define “endpoints” on a variety of platforms and devices and to transfer files between them, making the process of populating a dataset much more efficient. An enterprise installation of Globus is in use at Arizona State University and is funded by the ASU Knowledge Enterprise (KE) Research Technology Office (RTO). This method of file transfer utilizes the Dataverse Globus transfer app that was developed by Scholars Portal/Borealis specifically for the Dataverse platform.
To access the Globus upload option, the user must have the Globus Connect Personal application installed on their computer. It may be running on multiple devices, and the user can access any location where it is running from their local device.
A user may use Globus to transfer collections of materials to and from Dataverse. The user must have “write” access to a collection in order to see it in the Dataverse Globus integration console. Write access is granted within Globus by whoever created the collection, so that will need to be coordinated independently.
The following document outlines the steps for using the Globus transfer upload method from within the Research Data Repository.
You do not need to double-zip files to keep them compressed as ZIP files when using Globus File Transfer.
Uploading Files Using Google Cloud Storage
The following steps can be used to bulk upload files located in Google Cloud Storage directly to Dataverse. This general process has been tested and was successful in our environment, but we can’t guarantee that it will work for every installation. For this process, you will need the gcsfuse software installed on your Linux server.
Log in to the ASU Library Research Data Repository and obtain an API token (in the dropdown menu under your account name)
Access a server in the same region as your storage (for best performance)
Mount your Google Cloud bucket using gcsfuse (see the sketch after this list)
Use one of the above upload methods to transfer the files from Google Cloud Storage to Dataverse
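As a rough illustration, mounting a bucket and then running DVUploader against it might look like this (the bucket name, mount point, and directory path are hypothetical, and authentication details will depend on your environment):

# Authenticate to Google Cloud (one common approach)
gcloud auth application-default login

# Create a mount point and mount the bucket with gcsfuse
mkdir -p /mnt/my-bucket
gcsfuse my-research-bucket /mnt/my-bucket

# The bucket contents now appear as local files; for example, upload them with DVUploader
java -jar DVUploader-v1.1.0.jar -recurse -key=<your api token> -did=doi:10.48349/ASU/<your dataset code> -server=https://dataverse.asu.edu /mnt/my-bucket/projectfiles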