How Do I Use SamRand?

SamRand supports two modes of use: as a standalone application, and as a module within your Python script.

What Should My Dataset Look Like?

SamRand currently supports two types of datasets: CSV files and JSON files. CSV files are expected to use commas as delimiters, with double quotes around text (the default Python CSV settings). JSON files are expected to be valid. Examples of both dataset types are included in the test folder of this repository.
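As a rough illustration of these expectations, the following sketch uses only Python's standard library to write a small dataset in both formats and read it back. The file names, column names, and the list-of-rows JSON layout are made up for the example; they are assumptions, not taken from the files in the test folder.

```python
import csv
import json
import os
import tempfile

# Hypothetical rows, for illustration only.
header = ["id", "name", "group"]
rows = [[1, "Alice, A.", "x"], [2, "Bob", "y"]]

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "example.csv")
json_path = os.path.join(out_dir, "example.json")

# csv.writer defaults: comma delimiter, double quotes added only where
# a field needs them (here, the name that contains a comma).
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

# json.dump emits valid JSON; the list-of-rows layout is an assumption.
with open(json_path, "w") as f:
    json.dump(rows, f)

# Read both files back with default settings to confirm they parse.
with open(csv_path, newline="") as f:
    csv_rows = list(csv.reader(f))
with open(json_path) as f:
    json_rows = json.load(f)
```

Note that values read back from CSV are strings, while JSON preserves numbers.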

As a standalone application

Once installed, you can use SamRand as a standalone application in your terminal of choice. It supports the following arguments:

-h, --help: Shows a help message and exits.
--dataset: The file containing your dataset.
--size: The required sample size (n).
--header: When using a CSV dataset file, use this flag to indicate that the first row is a header.
--replacement: Extract samples with replacement. Not including this flag means without replacement (the default behavior).
--stratify: Balance the extracted sample so that it reflects the population's distribution.
--strata '[0, 1, 2, ...]': When using stratification, use this parameter to indicate which fields should be used as a basis for stratification. Accepts valid JSON arrays of column indices starting with 0.
--output: The output format of the samples. Default is JSON. Can be one of [CSV|JSON].
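To make the replacement and stratification options more concrete, here is a minimal sketch of proportional stratified sampling in plain Python. It is not SamRand's implementation: the function name `stratified_sample`, its parameters, and the rounding rule are assumptions chosen for illustration. The strata are parsed from a JSON array of column indices, in the same form that --strata accepts.

```python
import json
import random
from collections import defaultdict


def stratified_sample(rows, size, strata, replacement=False, seed=None):
    """Proportional stratified sample; an illustrative sketch only."""
    rng = random.Random(seed)

    # Group rows by the values found in the strata columns.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in strata)].append(row)

    sample = []
    for members in groups.values():
        # Allocate slots in proportion to the stratum's share of the
        # population. Rounding means the total can differ from `size`.
        k = round(size * len(members) / len(rows))
        if replacement:
            # With replacement: the same row may be drawn more than once.
            sample.extend(rng.choices(members, k=k))
        else:
            # Without replacement: each row is drawn at most once.
            sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Column indices arrive as a JSON array, as with --strata '[4, 5]'.
strata = json.loads("[1]")

# Hypothetical population: 60 rows in stratum "a", 40 in stratum "b".
population = [[i, "a"] for i in range(60)] + [[i, "b"] for i in range(60, 100)]
sample = stratified_sample(population, 10, strata, seed=0)
```

With a 60/40 split and a sample size of 10, the sketch draws 6 rows from the first stratum and 4 from the second, so the sample mirrors the population's distribution.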

A typical SamRand command looks like the following example, which draws a stratified sample of size 30 from a CSV dataset with a header, then writes the sample to stdout in CSV format:

$ SamRand --dataset datasets/dataset.csv \
--size 30 \
--header \
--stratify \
--strata '[4, 5]' \
--output CSV

To send the results somewhere other than stdout, redirect the output to a file using your shell's redirection syntax. For instance, to redirect the above command's output to a CSV file in a standard bash session:

$ SamRand --dataset datasets/dataset.csv \
--size 30 \
--header \
--stratify \
--strata '[4, 5]' \
--output CSV > output.csv

As a Python module

You can also use SamRand within a Python script to sample datasets on the fly. To do so, import SamRand as a dependency and give it the necessary information:

import samrand as sr

# Load the dataset from a JSON file.
dataset_path = '/path/to/my/dataset.json'
dataset = sr.reader.read_json(dataset_path)

# Draw a stratified sample of size 30, with replacement.
sample = sr.sampler.sample(dataset, 30, stratify=True, replacement=True)

Further documentation can be found here.