Start Here: datascience Tutorial

This is a brief introduction to the functionality in datascience. For a complete reference guide, please see Tables (datascience.tables).

For other useful tutorials and examples, see:

Getting Started

The most important functionality in the package is is the Table class, which is the structure used to represent columns of data. First, load the class:

In [1]: from datascience import Table

In the IPython notebook, type Table. followed by the TAB-key to see a list of members.

Note that for the Data Science 8 class we also import additional packages and settings for all assignments and labs. This is so that plots and other available packages mirror the ones in the textbook more closely. The exact code we use is:

# HIDDEN

import matplotlib
matplotlib.use('Agg')
from datascience import Table
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')

In particular, the lines involving matplotlib allow for plotting within the IPython notebook.

Creating a Table

A Table is a sequence of labeled columns of data.

A Table can be constructed from scratch by extending an empty table with columns.

In [2]: t = Table().with_columns([
   ...:     'letter', ['a', 'b', 'c', 'z'],
   ...:     'count',  [  9,   3,   3,   1],
   ...:     'points', [  1,   2,   2,  10],
   ...: ])
   ...: 

In [3]: print(t)
letter | count | points
a      | 9     | 1
b      | 3     | 2
c      | 3     | 2
z      | 1     | 10

More often, a table is read from a CSV file (or an Excel spreadsheet). Here’s the content of an example file:

In [4]: cat sample.csv
x,y,z
1,10,100
2,11,101
3,12,102

And this is how we load it in as a Table using read_table():

In [5]: Table.read_table('sample.csv')
Out[5]: 
x    | y    | z
1    | 10   | 100
2    | 11   | 101
3    | 12   | 102

CSVs from URLs are also valid inputs to read_table():

In [6]: Table.read_table('http://data8.org/textbook/notebooks/sat2014.csv')
Out[6]: 
State        | Participation Rate | Critical Reading | Math | Writing | Combined
North Dakota | 2.3                | 612              | 620  | 584     | 1816
Illinois     | 4.6                | 599              | 616  | 587     | 1802
Iowa         | 3.1                | 605              | 611  | 578     | 1794
South Dakota | 2.9                | 604              | 609  | 579     | 1792
Minnesota    | 5.9                | 598              | 610  | 578     | 1786
Michigan     | 3.8                | 593              | 610  | 581     | 1784
Wisconsin    | 3.9                | 596              | 608  | 578     | 1782
Missouri     | 4.2                | 595              | 597  | 579     | 1771
Wyoming      | 3.3                | 590              | 599  | 573     | 1762
Kansas       | 5.3                | 591              | 596  | 566     | 1753
... (41 rows omitted)

It’s also possible to add columns from a dictionary, but this option is discouraged because dictionaries do not preserve column order.

In [7]: t = Table().with_columns({
   ...:     'letter': ['a', 'b', 'c', 'z'],
   ...:     'count':  [  9,   3,   3,   1],
   ...:     'points': [  1,   2,   2,  10],
   ...: })
   ...: 

In [8]: print(t)
count | letter | points
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10

Accessing Values

To access values of columns in the table, use column(), which takes a column label or index and returns an array. Alternatively, columns() returns a list of columns (arrays).

In [9]: t
Out[9]: 
count | letter | points
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10

In [10]: t.column('letter')
Out[10]: 
array(['a', 'b', 'c', 'z'], 
      dtype='<U1')

In [11]: t.column(1)
Out[11]: 
array(['a', 'b', 'c', 'z'], 
      dtype='<U1')

You can use bracket notation as a shorthand for this method:

In [12]: t['letter'] # This is a shorthand for t.column('letter')
Out[12]: 
array(['a', 'b', 'c', 'z'], 
      dtype='<U1')

In [13]: t[1]        # This is a shorthand for t.column(1)
Out[13]: 
array(['a', 'b', 'c', 'z'], 
      dtype='<U1')

To access values by row, row() returns a row by index. Alternatively, rows() returns an list-like Rows object that contains tuple-like Row objects.

In [14]: t.rows
Out[14]: 
Rows(count | letter | points
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10)

In [15]: t.rows[0]
Out[15]: Row(count=9, letter='a', points=1)

In [16]: t.row(0)
Out[16]: Row(count=9, letter='a', points=1)

In [17]: second = t.rows[1]

In [18]: second
Out[18]: Row(count=3, letter='b', points=2)

In [19]: second[0]
Out[19]: 3

In [20]: second[1]
Out[20]: 'b'

To get the number of rows, use num_rows.

In [21]: t.num_rows
Out[21]: 4

Manipulating Data

Here are some of the most common operations on data. For the rest, see the reference (Tables (datascience.tables)).

Adding a column with with_column():

In [22]: t
Out[22]: 
count | letter | points
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10

In [23]: t.with_column('vowel?', ['yes', 'no', 'no', 'no'])
Out[23]: 
count | letter | points | vowel?
9     | a      | 1      | yes
3     | b      | 2      | no
3     | c      | 2      | no
1     | z      | 10     | no

In [24]: t # .with_column returns a new table without modifying the original
Out[24]: 
count | letter | points
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10

In [25]: t.with_column('2 * count', t['count'] * 2) # A simple way to operate on columns
Out[25]: 
count | letter | points | 2 * count
9     | a      | 1      | 18
3     | b      | 2      | 6
3     | c      | 2      | 6
1     | z      | 10     | 2

Selecting columns with select():

In [26]: t.select('letter')
Out[26]: 
letter
a
b
c
z

In [27]: t.select(['letter', 'points'])
Out[27]: 
letter | points
a      | 1
b      | 2
c      | 2
z      | 10

Renaming columns with relabeled():

In [28]: t
Out[28]: 
count | letter | points
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10

In [29]: t.relabeled('points', 'other name')
Out[29]: 
count | letter | other name
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10

In [30]: t
Out[30]: 
count | letter | points
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10

In [31]: t.relabeled(['letter', 'count', 'points'], ['x', 'y', 'z'])
Out[31]: 
y    | x    | z
9    | a    | 1
3    | b    | 2
3    | c    | 2
1    | z    | 10

Selecting out rows by index with take() and conditionally with where():

In [32]: t
Out[32]: 
count | letter | points
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10

In [33]: t.take(2) # the third row
Out[33]: 
count | letter | points
3     | c      | 2

In [34]: t.take[0:2] # the first and second rows
Out[34]: 
count | letter | points
9     | a      | 1
3     | b      | 2
In [35]: t.where('points', 2) # rows where points == 2
Out[35]: 
count | letter | points
3     | b      | 2
3     | c      | 2

In [36]: t.where(t['count'] < 8) # rows where count < 8
Out[36]: 
count | letter | points
3     | b      | 2
3     | c      | 2
1     | z      | 10

In [37]: t['count'] < 8 # .where actually takes in an array of booleans
Out[37]: array([False,  True,  True,  True], dtype=bool)

In [38]: t.where([False, True, True, True]) # same as the last line
Out[38]: 
count | letter | points
3     | b      | 2
3     | c      | 2
1     | z      | 10

Operate on table data with sort(), group(), and pivot()

In [39]: t
Out[39]: 
count | letter | points
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10

In [40]: t.sort('count')
Out[40]: 
count | letter | points
1     | z      | 10
3     | b      | 2
3     | c      | 2
9     | a      | 1

In [41]: t.sort('letter', descending = True)
Out[41]: 
count | letter | points
1     | z      | 10
3     | c      | 2
3     | b      | 2
9     | a      | 1
# You may pass a reducing function into the collect arg
# Note the renaming of the points column because of the collect arg
In [42]: t.select(['count', 'points']).group('count', collect=sum)
Out[42]: 
count | points sum
1     | 10
3     | 4
9     | 1
In [43]: other_table = Table().with_columns([
   ....:     'mar_status',  ['married', 'married', 'partner', 'partner', 'married'],
   ....:     'empl_status', ['Working as paid', 'Working as paid', 'Not working',
   ....:                     'Not working', 'Not working'],
   ....:     'count',       [1, 1, 1, 1, 1]])
   ....: 

In [44]: other_table
Out[44]: 
mar_status | empl_status     | count
married    | Working as paid | 1
married    | Working as paid | 1
partner    | Not working     | 1
partner    | Not working     | 1
married    | Not working     | 1

In [45]: other_table.pivot('mar_status', 'empl_status', 'count', collect=sum)
Out[45]: 
empl_status     | married | partner
Not working     | 1       | 2
Working as paid | 2       | 0

Visualizing Data

We’ll start with some data drawn at random from two normal distributions:

In [46]: normal_data = Table().with_columns([
   ....:     'data1', np.random.normal(loc = 1, scale = 2, size = 100),
   ....:     'data2', np.random.normal(loc = 4, scale = 3, size = 100)])
   ....: 

In [47]: normal_data
Out[47]: 
data1     | data2
1.89477   | 1.5579
-1.3217   | 3.4637
1.77446   | 4.33036
0.591197  | 4.24556
2.25328   | 5.63813
1.46267   | -0.0871954
1.63064   | 0.324972
-1.27609  | 5.60276
3.39954   | 2.16965
-0.261784 | 6.89433
... (90 rows omitted)

Draw histograms with hist():

In [48]: normal_data.hist()
_build/html/_images/hist.png
In [49]: normal_data.hist(bins = range(-5, 10))
_build/html/_images/hist_binned.png
In [50]: normal_data.hist(bins = range(-5, 10), overlay = True)
_build/html/_images/hist_overlay.png

If we treat the normal_data table as a set of x-y points, we can plot() and scatter():

In [51]: normal_data.sort('data1').plot('data1') # Sort first to make plot nicer
_build/html/_images/plot.png
In [52]: normal_data.scatter('data1')
_build/html/_images/scatter.png
In [53]: normal_data.scatter('data1', fit_line = True)
_build/html/_images/scatter_line.png

Use barh() to display categorical data.

In [54]: t
Out[54]: 
count | letter | points
9     | a      | 1
3     | b      | 2
3     | c      | 2
1     | z      | 10

In [55]: t.barh('letter')
_build/html/_images/barh.png

Exporting

Exporting to CSV is the most common operation and can be done by first converting to a pandas dataframe with to_df():

In [56]: normal_data
Out[56]: 
data1     | data2
1.89477   | 1.5579
-1.3217   | 3.4637
1.77446   | 4.33036
0.591197  | 4.24556
2.25328   | 5.63813
1.46267   | -0.0871954
1.63064   | 0.324972
-1.27609  | 5.60276
3.39954   | 2.16965
-0.261784 | 6.89433
... (90 rows omitted)

# index = False prevents row numbers from appearing in the resulting CSV
In [57]: normal_data.to_df().to_csv('normal_data.csv', index = False)

An Example

We’ll recreate the steps in Chapter 3 of the textbook to see if there is a significant difference in birth weights between smokers and non-smokers using a bootstrap test.

For more examples, check out the TableDemos repo.

From the text:

The table baby contains data on a random sample of 1,174 mothers and their newborn babies. The column birthwt contains the birth weight of the baby, in ounces; gest_days is the number of gestational days, that is, the number of days the baby was in the womb. There is also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.
In [58]: baby = Table.read_table('http://data8.org/textbook/notebooks/baby.csv')

In [59]: baby # Let's take a peek at the table
Out[59]: 
Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker
120          | 284              | 27           | 62              | 100                       | False
113          | 282              | 33           | 64              | 135                       | False
128          | 279              | 28           | 64              | 115                       | True
108          | 282              | 23           | 67              | 125                       | True
136          | 286              | 25           | 62              | 93                        | False
138          | 244              | 33           | 62              | 178                       | False
132          | 245              | 23           | 65              | 140                       | False
120          | 289              | 25           | 62              | 125                       | False
143          | 299              | 30           | 66              | 136                       | True
140          | 351              | 27           | 68              | 120                       | False
... (1164 rows omitted)

# Select out columns we want.
In [60]: smoker_and_wt = baby.select(['m_smoker', 'birthwt'])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-60-98d2eeccb488> in <module>()
----> 1 smoker_and_wt = baby.select(['m_smoker', 'birthwt'])

../datascience/tables.py in select(self, *column_label_or_labels)
    589         table = Table()
    590         for label in labels:
--> 591             self._add_column_and_format(table, label, np.copy(self[label]))
    592         return table
    593 

../datascience/tables.py in __getitem__(self, index_or_label)
    168     def __getitem__(self, index_or_label):
    169         label = self._as_label(index_or_label)
--> 170         return self.column(label)
    171 
    172     def __setitem__(self, label, values):

../datascience/tables.py in column(self, index_or_label)
    266             An instance of ``numpy.array``.
    267         """
--> 268         return self._columns[self._as_label(index_or_label)]
    269 
    270     @property

KeyError: 'm_smoker'

In [61]: smoker_and_wt
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-61-1c8046ed122a> in <module>()
----> 1 smoker_and_wt

NameError: name 'smoker_and_wt' is not defined

Let’s compare the number of smokers to non-smokers.

In [62]: smoker_and_wt.select('m_smoker').hist(bins = [0, 1, 2]);
_build/html/_images/m_smoker.png

We can also compare the distribution of birthweights between smokers and non-smokers.

# Non smokers
# We do this by grabbing the rows that correspond to mothers that don't
# smoke, then plotting a histogram of just the birthweights.
In [63]: smoker_and_wt.where('m_smoker', 0).select('birthwt').hist()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-63-ac023d72d786> in <module>()
----> 1 smoker_and_wt.where('m_smoker', 0).select('birthwt').hist()

NameError: name 'smoker_and_wt' is not defined

# Smokers
In [64]: smoker_and_wt.where('m_smoker', 1).select('birthwt').hist()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-64-14a160d7b1d6> in <module>()
----> 1 smoker_and_wt.where('m_smoker', 1).select('birthwt').hist()

NameError: name 'smoker_and_wt' is not defined
_build/html/_images/not_m_smoker_weights.png _build/html/_images/m_smoker_weights.png

What’s the difference in mean birth weight of the two categories?

In [65]: nonsmoking_mean = smoker_and_wt.where('m_smoker', 0).column('birthwt').mean()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-65-f41eeafe6bd7> in <module>()
----> 1 nonsmoking_mean = smoker_and_wt.where('m_smoker', 0).column('birthwt').mean()

NameError: name 'smoker_and_wt' is not defined

In [66]: smoking_mean = smoker_and_wt.where('m_smoker', 1).column('birthwt').mean()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-66-025a08c0d2c6> in <module>()
----> 1 smoking_mean = smoker_and_wt.where('m_smoker', 1).column('birthwt').mean()

NameError: name 'smoker_and_wt' is not defined

In [67]: observed_diff = nonsmoking_mean - smoking_mean
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-67-1dc0201ce02b> in <module>()
----> 1 observed_diff = nonsmoking_mean - smoking_mean

NameError: name 'nonsmoking_mean' is not defined

In [68]: observed_diff
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-68-dbee69efdaa6> in <module>()
----> 1 observed_diff

NameError: name 'observed_diff' is not defined

Let’s do the bootstrap test on the two categories.

In [69]: num_nonsmokers = smoker_and_wt.where('m_smoker', 0).num_rows
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-69-66887dbcfe42> in <module>()
----> 1 num_nonsmokers = smoker_and_wt.where('m_smoker', 0).num_rows

NameError: name 'smoker_and_wt' is not defined

In [70]: def bootstrap_once():
   ....:     """
   ....:     Computes one bootstrapped difference in means.
   ....:     The table.sample method lets us take random samples.
   ....:     We then split according to the number of nonsmokers in the original sample.
   ....:     """
   ....:     resample = smoker_and_wt.sample(with_replacement = True)
   ....:     bootstrap_diff = resample.column('birthwt')[:num_nonsmokers].mean() - \
   ....:         resample.column('birthwt')[num_nonsmokers:].mean()
   ....:     return bootstrap_diff
   ....: 

In [71]: repetitions = 1000

In [72]: bootstrapped_diff_means = np.array(
   ....:     [ bootstrap_once() for _ in range(repetitions) ])
   ....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-72-167c9ef03394> in <module>()
      1 bootstrapped_diff_means = np.array(
----> 2     [ bootstrap_once() for _ in range(repetitions) ])

<ipython-input-72-167c9ef03394> in <listcomp>(.0)
      1 bootstrapped_diff_means = np.array(
----> 2     [ bootstrap_once() for _ in range(repetitions) ])

<ipython-input-70-0af534563c70> in bootstrap_once()
      5     We then split according to the number of nonsmokers in the original sample.
      6     """
----> 7     resample = smoker_and_wt.sample(with_replacement = True)
      8     bootstrap_diff = resample.column('birthwt')[:num_nonsmokers].mean() -         resample.column('birthwt')[num_nonsmokers:].mean()
      9     return bootstrap_diff

NameError: name 'smoker_and_wt' is not defined

In [73]: bootstrapped_diff_means[:10]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-73-7ce34a8f5ddf> in <module>()
----> 1 bootstrapped_diff_means[:10]

NameError: name 'bootstrapped_diff_means' is not defined

In [74]: num_diffs_greater = (abs(bootstrapped_diff_means) > abs(observed_diff)).sum()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-74-fea00db6f1ce> in <module>()
----> 1 num_diffs_greater = (abs(bootstrapped_diff_means) > abs(observed_diff)).sum()

NameError: name 'bootstrapped_diff_means' is not defined

In [75]: p_value = num_diffs_greater / len(bootstrapped_diff_means)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-75-053c4de37c34> in <module>()
----> 1 p_value = num_diffs_greater / len(bootstrapped_diff_means)

NameError: name 'num_diffs_greater' is not defined

In [76]: p_value
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-76-1336a4a7d880> in <module>()
----> 1 p_value

NameError: name 'p_value' is not defined

Drawing Maps

To come.