DataFrame objects are containers for holding tabulated data. They inherent collections.OrderedDict and hold data in numpy.array objects.
Methods to get data into a DataFrame, manipulate and manage data, and write data. For the most part these remove the user from the fact they are using sqlite3 to manipulate data.
initialize a DataFrame object.
assign a column in the table
- args:
key: hashable object to associate with item
item: an iterable that is put in an np.array or np.ma.array
- kwds:
- mask: mask value passed to np.ma.MaskedArray.__init__()
- returns:
- None
- example:
>>> ... >>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 female John Smith 51 male Jane Doe 49 female >>> import numpy as np >>> df['log10(age)'] = np.log10(df['age']) >>> print(df) first last age gender log10(age) =============================================== Roger Lew 28 male 1.447 Bosco Robinson 5 male 0.699 Megan Whittington 26 female 1.415 John Smith 51 male 1.708 Jane Doe 49 female 1.690 >>>
delete a column from the table
- args:
- key: associated with the item to delete
- returns:
- None
- example:
>>> ... >>> print(df) first last age gender log10(age) =============================================== Roger Lew 28 male 1.447 Bosco Robinson 5 male 0.699 Megan Whittington 26 female 1.415 John Smith 51 male 1.708 Jane Doe 49 female 1.690 >>> del df['log10(age)'] >>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 female John Smith 51 male Jane Doe 49 female >>>
returns human friendly string representation of object
- args:
- None
- returns:
- string with easy to read representation of table
loads tabulated data from a plain text file
- args:
- fname: path and name of datafile
- kwds:
skip: number of lines to skip before looking for column labels. (default = 0)
delimiter: string to seperate values (default = “’”)
labels: bool specifiying whether first row (after skip) contains labels. (default = True)
- returns:
- None
attaches a second DataFrame to self
insert a row into the table
>>> from pyvttbl import DataFrame
>>> from collections import namedtuple
>>> Person = namedtuple('Person',['first','last','age','gender'])
>>> df =DataFrame()
>>> df.insert(Person('Roger', 'Lew', 28, 'male')._asdict())
>>> df.insert(Person('Bosco', 'Robinson', 5, 'male')._asdict())
>>> df.insert(Person('Megan', 'Whittington', 26, 'female')._asdict())
>>> print(df)
first last age gender
==================================
Roger Lew 28 male
Bosco Robinson 5 male
Megan Whittington 26 female
>>>
sort the table in-place
>>> from pyvttbl import DataFrame
>>> from collections import namedtuple
>>> Person = namedtuple('Person',['first','last','age','gender'])
>>> df =DataFrame()
>>> df.insert(Person('Roger', 'Lew', 28, 'male')._asdict())
>>> df.insert(Person('Bosco', 'Robinson', 5, 'male')._asdict())
>>> df.insert(Person('Megan', 'Whittington', 26, 'female')._asdict())
>>> df.insert(Person('John', 'Smith', 51, 'male')._asdict())
>>> df.insert(Person('Jane', 'Doe', 49, 'female')._asdict())
>>> df.sort(['gender', 'age'])
>>> print(df)
first last age gender
==================================
Megan Whittington 26 female
Jane Doe 49 female
Bosco Robinson 5 male
Roger Lew 28 male
John Smith 51 male
>>>
determines rows in table that satisfy the conditions given by where and returns the values of key in the remaining rows
- args:
- key: column label of data to return
- kwds:
- where: constraints to apply to table before returning data
- returns:
- a list
- example:
>>> ... >>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 female John Smith 51 male Jane Doe 49 female >>> df.select_col('age', where='gender == "male"') [28, 5, 51] >>>
iterate over the rows in table
>>> print(df)
first last age gender
==================================
Roger Lew 28 male
Bosco Robinson 5 male
Megan Whittington 26 male
John Smith 51 female
Jane Doe 49 female
>>> for case in df.row_iter():
print(case)
OrderedDict([('first', 'Roger'), ('last', 'Lew'), ('age', 28), ('gender', 'male')])
OrderedDict([('first', 'Bosco'), ('last', 'Robinson'), ('age', 5), ('gender', 'male')])
OrderedDict([('first', 'Megan'), ('last', 'Whittington'), ('age', 26), ('gender', 'male')])
OrderedDict([('first', 'John'), ('last', 'Smith'), ('age', 51), ('gender', 'female')])
OrderedDict([('first', 'Jane'), ('last', 'Doe'), ('age', 49), ('gender', 'female')])
>>>
produces a contingency table according to the arguments and keywords provided.
- args:
- val: the colname to place as the data in the table
- kwds:
- rows: list of colnames whos combinations will become rows
- in the table if left blank their will be one row
- cols: list of colnames whos combinations will become cols
- in the table if left blank their will be one col
- aggregate: function applied across data going into each cell
- of the table <http://www.sqlite.org/lang_aggfunc.html>_
where: list of tuples or list of strings for filtering data
- method:
‘valid’: only returns rows or columns with valid entries.
- ‘full’: return full factorial combinations of the
- conditions specified by rows and cols
- returns:
- PyvtTbl object
Applies the where filter to a copy of the DataFrame, and returns the new DataFrame. The associated DataFrame is not copied.
- args:
- where: criterion to apply to new table
- returns:
- a new DataFrame
- example:
>>> ... >>> print(df) first last age gender ================================== Roger Lew 28 male Bosco Robinson 5 male Megan Whittington 26 female John Smith 51 male Jane Doe 49 female >>> print(df.where('age > 20 and age < 45')) first last age gender ================================== Roger Lew 28 male Megan Whittington 26 female >>>
Applies the where filter in-place.
prints the descriptive information for each column in DataFrame
validate the data in the table.
True: provide real-time feedback
False: don’t provide feedback (default)
True: print a report upon completion
False: don’t print report (default)
True: the criteria was satisfied
False: the critera was not satisfied
>>> ...
>>> print(df)
first last age gender
==================================
Roger Lew 28 male
Bosco Robinson 5 male
Megan Whittington 26 female
John Smith 51 male
Jane Doe 49 female
>>> def isint(x):
try : return int(x)-float(x)==0
except: return False
>>> df.validate({'age' : lambda x: isint(x),
'gender' : lambda x: x in ['male', 'female']},
verbose=True, report=True)
Validating gender:
.....
Validating age:
.....
Report:
Values tested: 10
Values passed: 10
Values failed: 0
***Validation PASSED***
True
>>>
returns the size of the data in the table as a tuple
binds a sqlite3 aggregator to DataFrame
- args:
name: string to be associated with the aggregator
arity: the number of inputs required by the aggregator
func: the aggregator class
- returns:
- None
Methods to visualize data.
Makes a histogram plot
where: criterion to apply to table before running analysis
bins: number of bins (default = 10)
range: list of length 2 defining min and max bin edges
Creates a scatter plot with the specified parameters
aname: variable on x-axis
bname: variable on y-axis
None: no model fitting
‘linear’: f(x) = a + b*x
‘exponential’: f(x) = a * x**b
‘logarithmic’: f(x) = a * log(x) + b
‘polynomial’: f(x) = a * x**2 + b*x + c
‘power’: f(x) = a * x**b
Makes a box plot
makes an interaction plot
Plots a matrix of scatterplots
‘kde’: Kernel Density Estimation
‘hist’: 20 bin Histogram
None: just labels
None: no model fitting
‘linear’: f(x) = a + b*x (default)
‘exponential’: f(x) = a * x**b
‘logarithmic’: f(x) = a * log(x) + b
‘polynomial’: f(x) = a * x**2 + b*x + c
‘power’: f(x) = a * x**b
Methods to conduct descriptive and inferential statistical analyses.
Conducts a descriptive statistical analysis of the data in self[key].
Conducts a histogram analysis of the data in self[key].
where: criterion to apply to table before running analysis
bins: number of bins (default = 10)
range: list of length 2 defining min and max bin edges
Calculates means, counts, standard errors, and confidence intervals for the marginal conditions of the factorial combinations specified in the factors list.
factors: list of column labels to segregate data
where: criterion to apply to table before running analysis
Conducts a one-way analysis of variance on val over the conditions in factor. The conditions do not necessarily need to have equal numbers of samples.
- args:
val: dependent variable
factor: a dummy coded column label
- kwds:
- posthoc:
‘tukey’: conduct Tukey posthoc tests
‘SNK’: conduct Newman-Keuls posthoc tests
- where:
- conditions to apply before running analysis
- return:
- an pyvttbl.stats.Anova1way object
conducts a betweeen, within, or mixed, analysis of variance
wfactors: list of within variable factor labels
bfactors: list of between variable factor labels
sub: label coding subjects (or the isomorphism)
transform: string specifying a data transformation
STRING OPTION TRANSFORM COMMENTS ‘’ X default ‘log’,’log10’ numpy.log(X) base 10 transform ‘reciprocal’, ‘inverse’ 1/X ‘square-root’, ‘sqrt’ numpy.sqrt(X) ‘arcsine’, ‘arcsin’ numpy.arcsin(X) ‘windsor 10’ windsor(X, 10) 10% windosr trim
conducts a one-way chi-square goodness-of-fit test on the data in observed
alpha: the type-I error probability
conducts a two-way chi-square goodness-of-fit test on the data in observed
rfactor: column key
cfactor: column key
alpha: the type-I error probability
produces a correlation matrix and conducts step-down significance testing on the column labels in variables.
- args:
- variables: column keys to include in correlation matrix
- kwds:
- coefficient:
- { ‘pearson’, ‘spearman’, ‘kendalltau’, ‘pointbiserial’ }
alpha: the type-I error probability
- where:
- conditions to apply before running analysis
- return:
- an pyvttbl.stats.Correlation object
produces a correlation matrix and conducts step-down significance testing on the column labels in variables.
- args:
- aname: column key
- kwds:
- bname: is not specified a one-sample t-test is performed on
- comparing the values in column aname with a hypothesized population mean.
- pop_mean: specifies the null population mean for one-sample t-test.
- Ignored if bname is supplied
- paired:
True: a paired t-test is conducted
False: an independent samples t-test is conducted
- equal_variance:
True: assumes aname and bname have equal variance
False: assumes aname and bname have unequal variance
- where:
- conditions to apply before running analysis
- return:
- an pyvttbl.stats.Ttest object
These methods are for working interfacing DataFrame with sqlite3.
returns the sqlite3 type associated with the provided key
returns the numpy type object associated with the provided key
a numpy object specifiying the type associated with the data in self[key]:
sql type numpy type ‘null’ np.dtype(object) ‘integer’ np.dtype(int) ‘real’ np.dtype(float) ‘text’ np.dtype(str)
returns the default fill value for invalid data associated with the provided key.
- args:
- key: key in DataFrame (raises KeyError if key not in self)
- returns:
string, float, or int associated with the data in self[key]
sql type default ‘null’ ‘?’ ‘integer’ 999999 ‘real’ 1e20 ‘text’ ‘N/A’
private method to check if the items in self have equal lengths
returns True if all the items are equal
returns False otherwise
determine the sqlite3 datatype of iterable
- args:
- iterable: a 1-d iterable (list, tuple, np.array, etc.)
private method to execute sqlite3 query
private method to execute sqlite3 queries
determines the indices cooresponding to the conditions specified by the where argument.
- args:
- where: a string criterion without the ‘where’
- returns:
- a list of indices
build or rebuild sqlite table with columns in nsubset based on the where list.
- args:
nsubset: a list of keys to include in the table
where: criterion the entries in the table must satisfy
- returns:
- None