Start Here: datascience Tutorial
This is a brief introduction to the functionality in datascience. For a complete reference guide, please see Tables (datascience.tables).
For other useful tutorials and examples, see:
Getting Started
The most important functionality in the package is the Table
class, which is the structure used to represent columns of data. First, load
the class:
In [1]: from datascience import Table
In the IPython notebook, type Table. followed by the TAB-key to see a list of members.
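Outside a notebook, a comparable list can be produced with Python's built-in dir(). The sketch below uses a hypothetical helper, public_members, which is not part of datascience. It is demonstrated on the built-in dict type so that it runs without the package installed; passing Table instead should give a list similar to what TAB completion shows.

```python
def public_members(obj):
    """Return the sorted attribute names of obj that do not start with an underscore."""
    return sorted(name for name in dir(obj) if not name.startswith('_'))


# Demonstrated on a built-in type; pass Table to inspect the datascience class.
print(public_members(dict))
```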
Note that for the Data Science 8 class we also import additional packages and settings for all assignments and labs. This is so that plots and other available packages mirror the ones in the textbook more closely. The exact code we use is:
# HIDDEN
import matplotlib
matplotlib.use('Agg')
from datascience import Table
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')
In particular, the lines involving matplotlib allow for plotting within the IPython notebook.
Creating a Table
A Table is a sequence of labeled columns of data.
A Table can be constructed from scratch by extending an empty table with columns.
In [2]: t = Table().with_columns([
...: 'letter', ['a', 'b', 'c', 'z'],
...: 'count', [ 9, 3, 3, 1],
...: 'points', [ 1, 2, 2, 10],
...: ])
...:
In [3]: print(t)
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
More often, a table is read from a CSV file (or an Excel spreadsheet). Here’s the content of an example file:
In [4]: cat sample.csv
x,y,z
1,10,100
2,11,101
3,12,102
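To follow along without an existing file, sample.csv can be written from Python first. This is a minimal sketch that uses only the standard library's csv module; the helper name write_sample_csv is hypothetical and not part of datascience.

```python
import csv


def write_sample_csv(path='sample.csv'):
    """Write the three-column example file shown above and return its path."""
    rows = [
        ['x', 'y', 'z'],   # header row: these become the column labels
        [1, 10, 100],
        [2, 11, 101],
        [3, 12, 102],
    ]
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    return path
```

Calling write_sample_csv() with no arguments writes sample.csv to the current working directory.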
And this is how we load it in as a Table using read_table():
In [5]: Table.read_table('sample.csv')
Out[5]:
x | y | z
1 | 10 | 100
2 | 11 | 101
3 | 12 | 102
CSVs from URLs are also valid inputs to read_table():
In [6]: Table.read_table('http://data8.org/textbook/notebooks/sat2014.csv')
Out[6]:
State | Participation Rate | Critical Reading | Math | Writing | Combined
North Dakota | 2.3 | 612 | 620 | 584 | 1816
Illinois | 4.6 | 599 | 616 | 587 | 1802
Iowa | 3.1 | 605 | 611 | 578 | 1794
South Dakota | 2.9 | 604 | 609 | 579 | 1792
Minnesota | 5.9 | 598 | 610 | 578 | 1786
Michigan | 3.8 | 593 | 610 | 581 | 1784
Wisconsin | 3.9 | 596 | 608 | 578 | 1782
Missouri | 4.2 | 595 | 597 | 579 | 1771
Wyoming | 3.3 | 590 | 599 | 573 | 1762
Kansas | 5.3 | 591 | 596 | 566 | 1753
... (41 rows omitted)
It’s also possible to add columns from a dictionary, but this option is discouraged because dictionaries do not preserve column order on Python versions before 3.7, as in the output below.
In [7]: t = Table().with_columns({
...: 'letter': ['a', 'b', 'c', 'z'],
...: 'count': [ 9, 3, 3, 1],
...: 'points': [ 1, 2, 2, 10],
...: })
...:
In [8]: print(t)
points | count | letter
1 | 9 | a
2 | 3 | b
2 | 3 | c
10 | 1 | z
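One way to keep a fixed column order is to start from a list of (label, values) pairs and flatten it into the alternating label/values list used in the first example. The sketch below is plain Python; flatten_columns is a hypothetical helper, not part of datascience.

```python
def flatten_columns(pairs):
    """Turn [(label, values), ...] into [label, values, label, values, ...]."""
    flat = []
    for label, values in pairs:
        flat.extend([label, values])
    return flat


pairs = [
    ('letter', ['a', 'b', 'c', 'z']),
    ('count', [9, 3, 3, 1]),
    ('points', [1, 2, 2, 10]),
]
columns = flatten_columns(pairs)

# Every second item, starting at 0, is a label, in the order the pairs were given.
print(columns[0::2])
```

The resulting list has the same shape as the argument in the first example, so it can be passed as Table().with_columns(columns).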
Accessing Values
To access values of columns in the table, use column(), which takes a column label or index and returns an array. Alternatively, columns() returns a list of columns (arrays).
In [9]: t
Out[9]:
points | count | letter
1 | 9 | a
2 | 3 | b
2 | 3 | c
10 | 1 | z
In [10]: t.column('letter')