What is SamRand?

SamRand is a tool designed to sample datasets and produce statistically representative samples for research purposes.

Who is this for?

I developed this primarily for researchers who work with large datasets and need to sample them for qualitative analysis. Although it was built to pull samples for qualitative work, it can be used for quantitative work as well. And although it was written with researchers in mind, nothing prevents its use outside research. As such, this project is licensed under the MIT license.

How does SamRand sample a dataset?

SamRand's sampling approach depends on the settings you use when sampling a dataset. The most important of these is the choice of stratification:

No Stratification: SamRand will select rows from the dataset at random without attempting to represent any existing groups within the dataset's population.
Stratification with Unknown Dimensions: SamRand will perform a single-level clustering along the dimension with the least variance (to guarantee diverse strata). Rows are then drawn from each resulting stratum in proportion to its share of the dataset. For instance, if location is the dimension with the least variance and takes the value X or Y in a 60:40 split, a sample of size 10 will contain 6 rows from location X and 4 from location Y.
Stratification with Known Dimensions: If you provide specific dimensions (column indices) when invoking SamRand, it will apply multi-level clustering to generate strata. This means it will split the data by the first dimension, then split each resulting stratum by the second dimension, and so on.
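The stratified modes above can be illustrated with a minimal sketch. This is not SamRand's actual code: the function names `stratify` and `proportional_sample` are hypothetical, rows are assumed to be tuples, and grouping rows by the tuple of their values in the chosen columns is used as a stand-in for splitting by the first dimension, then the second, and so on. Quotas are simply rounded, so the total may differ slightly from the requested size in general.

```python
import random
from collections import defaultdict

def stratify(rows, dimensions):
    """Group rows into strata keyed by their values at the given column indices.

    Keying on the tuple of values gives the same strata as splitting by the
    first dimension, then splitting each result by the second, and so on.
    """
    strata = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in dimensions)
        strata[key].append(row)
    return dict(strata)

def proportional_sample(rows, dimensions, size, seed=None):
    """Draw from each stratum in proportion to its share of the dataset."""
    rng = random.Random(seed)
    sample = []
    for members in stratify(rows, dimensions).values():
        # Hypothetical rounding rule: nearest integer to the proportional share.
        quota = round(size * len(members) / len(rows))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# The 60:40 location example: column 0 is the location, column 1 a row id.
rows = [("X", i) for i in range(60)] + [("Y", i) for i in range(40)]
sample = proportional_sample(rows, dimensions=[0], size=10, seed=1)
```

With this data the sample holds 6 rows from location X and 4 from location Y. Passing more than one column index, for example `dimensions=[0, 1]`, produces one stratum per distinct combination of values.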

Important Note: Depending on how your dataset is distributed, some strata may contain only a single row. SamRand will extract at least one row from each stratum. This can inflate the sample beyond the size you specified. To reconcile the difference, SamRand will (once it has a representative sample) remove rows at random from that sample until it shrinks down to the desired size. Consequently, rows from larger strata have a higher probability of being removed towards the end of the sampling process.
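The inflate-then-trim behaviour described in this note can be sketched as follows. This is an illustration under assumptions, not SamRand's implementation: `sample_with_minimum` is a hypothetical name, strata are assumed to arrive as a dict of lists, and quotas are rounded to the nearest integer with a floor of one row per stratum.

```python
import random

def sample_with_minimum(strata, size, seed=None):
    """Take at least one row per stratum, then trim at random to `size`."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        # Proportional quota, but never less than one row per stratum.
        quota = max(1, round(size * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    # The minimum-of-one rule may overshoot; drop random rows until it fits.
    while len(sample) > size:
        sample.pop(rng.randrange(len(sample)))
    return sample

strata = {
    "X": [("X", i) for i in range(95)],
    "Y": [("Y", 0)],                      # a single-row stratum
    "Z": [("Z", i) for i in range(4)],
}
sample = sample_with_minimum(strata, size=5, seed=0)
```

Here the quotas come to 5, 1, and 1, so seven rows are drawn and two are removed at random. Because every row in the inflated sample is equally likely to be removed, strata that contributed more rows lose rows more often.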

You also choose whether to sample with or without replacement:

With Replacement: Rows previously sampled may be sampled again, which means that the sample may contain duplicate rows.
Without Replacement: Rows previously sampled may not be sampled again, which means that the sample will not contain duplicates unless the dataset itself contains duplicates.
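The difference between the two modes can be seen with Python's standard `random` module. The `draw` helper below is a hypothetical illustration, not part of SamRand: `random.choices` draws with replacement and `random.sample` draws without.

```python
import random

def draw(rows, size, replacement, seed=None):
    """Draw `size` rows, either with or without replacement."""
    rng = random.Random(seed)
    if replacement:
        # Each draw is independent, so the same row can appear more than once.
        return rng.choices(rows, k=size)
    # Each row can be picked at most once; raises ValueError if size > len(rows).
    return rng.sample(rows, size)

rows = list(range(5))
without_replacement = draw(rows, 5, replacement=False, seed=0)
with_replacement = draw(rows, 20, replacement=True, seed=0)
```

Sampling without replacement cannot return more rows than the dataset holds, whereas sampling with replacement can return a sample of any size.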

If there is a sampling strategy you'd like to see implemented or fixed, feel free to open an issue. I will try to get around to it. Alternatively, you can submit a merge request. Stay up-to-date by monitoring SamRand's issues page.