pyvttbl logo

Table Of Contents

Previous topic

Quick-Start Guide

Next topic

PyvtTbl Overview

This Page

DataFrame Overview

Description

DataFrame objects are containers for holding tabulated data. They inherent collections.OrderedDict and hold data in numpy.array objects.

Public Methods

Methods to get data into a DataFrame, manipulate and manage data, and write data. For the most part these remove the user from the fact they are using sqlite3 to manipulate data.


DataFrame.__init__(*args, **kwds)

initialize a DataFrame object.

Subclass of collections. OrderedDict.
Understands the typical initialization dict signatures.
Keys must be hashable.
Values become numpy.arrays or numpy.ma.MaskedArrays.

DataFrame.__setitem__(key, item, mask=None)

assign a column in the table

args:

key: hashable object to associate with item

item: an iterable that is put in an np.array or np.ma.array

kwds:
mask: mask value passed to np.ma.MaskedArray.__init__()
returns:
None
df.__setitem__(key, item) <==> df[key] = item
The assigned item must be iterable. To add a single row use the insert method. To another table to this one use the attach method.
example:
>>> ...
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
John    Smith          51   male   
Jane    Doe            49   female 
>>> import numpy as np
>>> df['log10(age)'] = np.log10(df['age'])
>>> print(df)
first      last       age   gender   log10(age) 
===============================================
Roger   Lew            28   male          1.447 
Bosco   Robinson        5   male          0.699 
Megan   Whittington    26   female        1.415 
John    Smith          51   male          1.708 
Jane    Doe            49   female        1.690 
>>> 

DataFrame.__delitem__(key)

delete a column from the table

args:
key: associated with the item to delete
returns:
None
df.__delitem__(key) <==> del df[key]
example:
>>> ...
>>> print(df)
first      last       age   gender   log10(age) 
===============================================
Roger   Lew            28   male          1.447 
Bosco   Robinson        5   male          0.699 
Megan   Whittington    26   female        1.415 
John    Smith          51   male          1.708 
Jane    Doe            49   female        1.690 
>>> del df['log10(age)']
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
John    Smith          51   male   
Jane    Doe            49   female 
>>> 

DataFrame.__str__()

returns human friendly string representation of object

args:
None
returns:
string with easy to read representation of table
df.__str__() <==> str(df)

DataFrame.read_tbl(fname, skip=0, delimiter=', ', labels=True)

loads tabulated data from a plain text file

args:
fname: path and name of datafile
kwds:

skip: number of lines to skip before looking for column labels. (default = 0)

delimiter: string to seperate values (default = “’”)

labels: bool specifiying whether first row (after skip) contains labels. (default = True)

returns:
None
Checks and renames duplicate column labels as well as checking
for missing cells. readTbl will warn and skip over missing lines.

DataFrame.attach(other)

attaches a second DataFrame to self

args:
other: a DataFrame object whose key set matches self
return:
None

DataFrame.insert(row)

insert a row into the table

args:
row: should be mappable. e.g. a dict or a list with key/value pairs.
returns:
None
example:
>>> from pyvttbl import DataFrame
>>> from collections import namedtuple
>>> Person = namedtuple('Person',['first','last','age','gender'])
>>> df =DataFrame()
>>> df.insert(Person('Roger', 'Lew', 28, 'male')._asdict())
>>> df.insert(Person('Bosco', 'Robinson', 5, 'male')._asdict())
>>> df.insert(Person('Megan', 'Whittington', 26, 'female')._asdict())
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
>>> 

DataFrame.sort(order=None)

sort the table in-place

kwds:
order: is a list of factors to sort by to reverse order append ” desc” to the factor
returns:
None
example:
>>> from pyvttbl import DataFrame
>>> from collections import namedtuple
>>> Person = namedtuple('Person',['first','last','age','gender'])
>>> df =DataFrame()
>>> df.insert(Person('Roger', 'Lew', 28, 'male')._asdict())
>>> df.insert(Person('Bosco', 'Robinson', 5, 'male')._asdict())
>>> df.insert(Person('Megan', 'Whittington', 26, 'female')._asdict())
>>> df.insert(Person('John', 'Smith', 51, 'male')._asdict())
>>> df.insert(Person('Jane', 'Doe', 49, 'female')._asdict())
>>> df.sort(['gender', 'age'])
>>> print(df)
first      last       age   gender 
==================================
Megan   Whittington    26   female 
Jane    Doe            49   female 
Bosco   Robinson        5   male   
Roger   Lew            28   male   
John    Smith          51   male   
>>> 

DataFrame.select_col(key, where=None)

determines rows in table that satisfy the conditions given by where and returns the values of key in the remaining rows

args:
key: column label of data to return
kwds:
where: constraints to apply to table before returning data
returns:
a list
example:
>>> ...
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
John    Smith          51   male   
Jane    Doe            49   female 
>>> df.select_col('age', where='gender == "male"')
[28, 5, 51]
>>> 

DataFrame.row_iter()

iterate over the rows in table

args:
None
returns:
iterator that yields OrderedDict objects with (key,value) pairs cooresponding to the data in each row
example:
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   male   
John    Smith          51   female 
Jane    Doe            49   female 
>>> for case in df.row_iter():
        print(case)
OrderedDict([('first', 'Roger'), ('last', 'Lew'), ('age', 28), ('gender', 'male')])
OrderedDict([('first', 'Bosco'), ('last', 'Robinson'), ('age', 5), ('gender', 'male')])
OrderedDict([('first', 'Megan'), ('last', 'Whittington'), ('age', 26), ('gender', 'male')])
OrderedDict([('first', 'John'), ('last', 'Smith'), ('age', 51), ('gender', 'female')])
OrderedDict([('first', 'Jane'), ('last', 'Doe'), ('age', 49), ('gender', 'female')])
>>> 

DataFrame.pivot(val, rows=None, cols=None, aggregate='avg', where=None, attach_rlabels=False, method='valid')

produces a contingency table according to the arguments and keywords provided.

args:
val: the colname to place as the data in the table
kwds:
rows: list of colnames whos combinations will become rows
in the table if left blank their will be one row
cols: list of colnames whos combinations will become cols
in the table if left blank their will be one col
aggregate: function applied across data going into each cell
of the table <http://www.sqlite.org/lang_aggfunc.html>_

where: list of tuples or list of strings for filtering data

method:

‘valid’: only returns rows or columns with valid entries.

‘full’: return full factorial combinations of the
conditions specified by rows and cols
returns:
PyvtTbl object

DataFrame.where(where)

Applies the where filter to a copy of the DataFrame, and returns the new DataFrame. The associated DataFrame is not copied.

args:
where: criterion to apply to new table
returns:
a new DataFrame
example:
>>> ...
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
John    Smith          51   male   
Jane    Doe            49   female
>>> print(df.where('age > 20 and age < 45'))
first      last       age   gender 
==================================
Roger   Lew            28   male   
Megan   Whittington    26   female 
>>> 

DataFrame.where_update(where)

Applies the where filter in-place.

args:
where: criterion to apply to table
returns:
None

DataFrame.summary(where=None)

prints the descriptive information for each column in DataFrame

kwds:
where: criterion to apply to table before running analysis
returns:
None

DataFrame.validate(criteria, verbose=False, report=False)

validate the data in the table.

args:
criteria: a dict whose keys should coorespond to columns in the table. The values should be functions which take a single parameter and return a boolean.
kwds:
verbose:

True: provide real-time feedback

False: don’t provide feedback (default)

report:

True: print a report upon completion

False: don’t print report (default)

returns:

True: the criteria was satisfied

False: the critera was not satisfied

example:
>>> ...
>>> print(df)
first      last       age   gender 
==================================
Roger   Lew            28   male   
Bosco   Robinson        5   male   
Megan   Whittington    26   female 
John    Smith          51   male   
Jane    Doe            49   female
>>> def isint(x):
        try : return int(x)-float(x)==0
        except:  return False
>>> df.validate({'age' : lambda x: isint(x),
                 'gender' : lambda x: x in ['male', 'female']},
                 verbose=True, report=True)                    
Validating gender:
.....
Validating age:
.....
Report:
  Values tested: 10 
  Values passed: 10 
  Values failed: 0
***Validation PASSED***
True
>>>

DataFrame.types()
returns a list of the sqlite3 datatypes of the columns
args:
None
returns:
an ordered list of sqlite3 types.
order matches self.keys()

DataFrame.shape()

returns the size of the data in the table as a tuple

args:
None
returns:
tuple (number of columns, number of rows)

DataFrame.bind_aggregate(name, arity, func)

binds a sqlite3 aggregator to DataFrame

args:

name: string to be associated with the aggregator

arity: the number of inputs required by the aggregator

func: the aggregator class

returns:
None
DataFrame.aggregates is a list of the available aggregators.
For information on rolling your own aggregators see: http://docs.python.org/library/sqlite3.html

pyvttbl.plotting Wrappers

Methods to visualize data.


DataFrame.histogram_plot(val, **kwargs)

Makes a histogram plot

args:
key: column label of dependent variable
kwds:

where: criterion to apply to table before running analysis

bins: number of bins (default = 10)

range: list of length 2 defining min and max bin edges


DataFrame.scatter_plot(aname, bname, **kwargs)

Creates a scatter plot with the specified parameters

args:

aname: variable on x-axis

bname: variable on y-axis

kwds:
alpha:
amount of transparency applied
trend :

None: no model fitting

‘linear’: f(x) = a + b*x

‘exponential’: f(x) = a * x**b

‘logarithmic’: f(x) = a * log(x) + b

‘polynomial’: f(x) = a * x**2 + b*x + c

‘power’: f(x) = a * x**b


DataFrame.box_plot(val, factors=None, **kwargs)

Makes a box plot

args:
df:
a pyvttbl.DataFrame object
val:
the label of the dependent variable
kwds:
factors:
a list of factors to include in boxplot
where:
a string, list of strings, or list of tuples applied to the DataFrame before plotting
fname:
output file name
quality:
{‘low’ | ‘medium’ | ‘high’} specifies image file dpi

DataFrame.interaction_plot(val, xaxis, **kwargs)

makes an interaction plot

args:
df:
a pyvttbl.DataFrame object
val:
the label of the dependent variable
xaxis:
the label of the variable to place on the xaxis of each subplot
kwds:
seplines:
label specifying seperate lines in each subplot
sepxplots:
label specifying seperate horizontal subplots
sepyplots:
label specifying separate vertical subplots
xmin:
(‘AUTO’ by default) minimum xaxis value across subplots
xmax:
(‘AUTO’ by default) maximum xaxis value across subplots
ymin:
(‘AUTO’ by default) minimum yaxis value across subplots
ymax:
(‘AUTO’ by default) maximum yaxis value across subplots
where:
a string, list of strings, or list of tuples applied to the DataFrame before plotting
fname:
output file name
quality:
{‘low’ | ‘medium’ | ‘high’} specifies image file dpi
yerr:
{float, ‘ci’, ‘stdev’, ‘sem’} designates errorbars across datapoints in all subplots

DataFrame.scatter_matrix(variables, **kwargs)

Plots a matrix of scatterplots

args:
variables:
column labels to include in scatter matrix
kwds:
alpha:
amount of transparency applied
grid:
setting this to True will show the grid
diagonal:

‘kde’: Kernel Density Estimation

‘hist’: 20 bin Histogram

None: just labels

trend :

None: no model fitting

‘linear’: f(x) = a + b*x (default)

‘exponential’: f(x) = a * x**b

‘logarithmic’: f(x) = a * log(x) + b

‘polynomial’: f(x) = a * x**2 + b*x + c

‘power’: f(x) = a * x**b

alternate_labels: Specifies whether the labels and ticks should
alternate. Default is True. When False tick labels will be on the left and botttom, and variable labels will be on the top and right.

pyvttbl.stats Wrappers

Methods to conduct descriptive and inferential statistical analyses.


DataFrame.descriptives(key, where=None)

Conducts a descriptive statistical analysis of the data in self[key].

args:
key: column label
kwds:
where: criterion to apply to table before running analysis
returns:
a pyvttbl.stats. Descriptives object

DataFrame.histogram(key, where=None, bins=10, range=None, density=False, cumulative=False)

Conducts a histogram analysis of the data in self[key].

args:
key: column label of dependent variable
kwds:

where: criterion to apply to table before running analysis

bins: number of bins (default = 10)

range: list of length 2 defining min and max bin edges

returns:
a pyvttbl.stats. Descriptives object

DataFrame.marginals(key, factors, where=None)

Calculates means, counts, standard errors, and confidence intervals for the marginal conditions of the factorial combinations specified in the factors list.

args:
key: column label (of the dependent variable)
kwds:

factors: list of column labels to segregate data

where: criterion to apply to table before running analysis

returns:
a pyvttbl.stats. Marginals object

DataFrame.anova1way(val, factor, posthoc='tukey', where=None)

Conducts a one-way analysis of variance on val over the conditions in factor. The conditions do not necessarily need to have equal numbers of samples.

args:

val: dependent variable

factor: a dummy coded column label

kwds:
posthoc:

‘tukey’: conduct Tukey posthoc tests

‘SNK’: conduct Newman-Keuls posthoc tests

where:
conditions to apply before running analysis
return:
an pyvttbl.stats.Anova1way object

DataFrame.anova(dv, sub='SUBJECT', wfactors=None, bfactors=None, measure='', transform='', alpha=0.05)

conducts a betweeen, within, or mixed, analysis of variance

args:
dv: label containing dependent variable
kwds:

wfactors: list of within variable factor labels

bfactors: list of between variable factor labels

sub: label coding subjects (or the isomorphism)

measure: string to describe dv (outputs ‘<dv> of
<measure>’) intended when dv name is generic (e.g., MEAN, RMS, SD, ...)

transform: string specifying a data transformation

STRING OPTION TRANSFORM COMMENTS
‘’ X default
‘log’,’log10’ numpy.log(X) base 10 transform
‘reciprocal’, ‘inverse’ 1/X  
‘square-root’, ‘sqrt’ numpy.sqrt(X)  
‘arcsine’, ‘arcsin’ numpy.arcsin(X)  
‘windsor 10’ windsor(X, 10) 10% windosr trim

DataFrame.chisquare1way(observed, expected_dict=None, alpha=0.05, where=None)

conducts a one-way chi-square goodness-of-fit test on the data in observed

args:
observed: column label containing categorical observations
kwds:
expected_dict: a dictionary object with keys matching the categories
in observed and values with the expected counts. The categories in the observed column must be a subset of the keys in the expected_dict. If expected_dict is None, the total N is assumed to be equally distributed across all groups.

alpha: the type-I error probability

where:
conditions to apply before running analysis
return:
an pyvttbl.stats.ChiSquare1way object

DataFrame.chisquare2way(rfactor, cfactor, alpha=0.05, where=None)

conducts a two-way chi-square goodness-of-fit test on the data in observed

args:

rfactor: column key

cfactor: column key

kwds:

alpha: the type-I error probability

where:
conditions to apply before running analysis
return:
an pyvttbl.stats.ChiSquare2way object

DataFrame.correlation(variables, coefficient='pearson', alpha=0.05, where=None)

produces a correlation matrix and conducts step-down significance testing on the column labels in variables.

args:
variables: column keys to include in correlation matrix
kwds:
coefficient:
{ ‘pearson’, ‘spearman’, ‘kendalltau’, ‘pointbiserial’ }

alpha: the type-I error probability

where:
conditions to apply before running analysis
return:
an pyvttbl.stats.Correlation object

DataFrame.ttest(aname, bname=None, pop_mean=0.0, paired=False, equal_variance=True, where=None)

produces a correlation matrix and conducts step-down significance testing on the column labels in variables.

args:
aname: column key
kwds:
bname: is not specified a one-sample t-test is performed on
comparing the values in column aname with a hypothesized population mean.
pop_mean: specifies the null population mean for one-sample t-test.
Ignored if bname is supplied
paired:

True: a paired t-test is conducted

False: an independent samples t-test is conducted

equal_variance:

True: assumes aname and bname have equal variance

False: assumes aname and bname have unequal variance

where:
conditions to apply before running analysis
return:
an pyvttbl.stats.Ttest object

Private Methods

These methods are for working interfacing DataFrame with sqlite3.


DataFrame._get_sqltype(key)

returns the sqlite3 type associated with the provided key

args:
key: key in DataFrame (raises KeyError if key not in self)
returns:
a string specifiying the sqlite3 type associated with the data in self[key]:
{ ‘null’, ‘integer’, ‘real’, ‘text’}

DataFrame._get_nptype(key)

returns the numpy type object associated with the provided key

args:
key: key in DataFrame (raises KeyError if key not in self)
returns:

a numpy object specifiying the type associated with the data in self[key]:

sql type numpy type
‘null’ np.dtype(object)
‘integer’ np.dtype(int)
‘real’ np.dtype(float)
‘text’ np.dtype(str)

DataFrame._get_mafillvalue(key)

returns the default fill value for invalid data associated with the provided key.

args:
key: key in DataFrame (raises KeyError if key not in self)
returns:

string, float, or int associated with the data in self[key]

sql type default
‘null’ ‘?’
‘integer’ 999999
‘real’ 1e20
‘text’ ‘N/A’
returned values match the defaults associated with np.ma.MaskedArray

DataFrame._are_col_lengths_equal()

private method to check if the items in self have equal lengths

args:
None
returns:

returns True if all the items are equal

returns False otherwise


DataFrame._determine_sqlite3_type(iterable)

determine the sqlite3 datatype of iterable

args:
iterable: a 1-d iterable (list, tuple, np.array, etc.)
returns:
sqlite3 type as string: ‘null’, ‘integer’, ‘real’, or ‘text’

DataFrame._execute(query, t=None)

private method to execute sqlite3 query

When the PRINTQUERIES bool is true it prints the queries before executing them

DataFrame._executemany(query, tlist)

private method to execute sqlite3 queries

When the PRINTQUERIES bool is true it prints the queries before executing them. The execute many method is about twice as fast for building tables as the execute method.

DataFrame._get_indices_where(where)

determines the indices cooresponding to the conditions specified by the where argument.

args:
where: a string criterion without the ‘where’
returns:
a list of indices

DataFrame._build_sqlite3_tbl(nsubset, where=None)

build or rebuild sqlite table with columns in nsubset based on the where list.

args:

nsubset: a list of keys to include in the table

where: criterion the entries in the table must satisfy

returns:
None
where can be a list of tuples. Each tuple should have three elements. The first should be a column key (label). The second should be an operator: in, =, !=, <, >. The third element should contain value for the operator.
where can also be a list of strings. or a single string.
sqlite3 table is built in memory and has the id TBL

This software is funded in part by NIH Grant P20 RR016454.
© Copyright 2012, Roger Lew. Created using Sphinx 1.1.3.