Welcome to datascience’s documentation!

Release: 0.8.2
Date: February 05, 2017

The datascience package was written for use in Berkeley’s DS 8 course and contains useful functionality for investigating and graphically displaying data.

Start Here: datascience Tutorial

This is a brief introduction to the functionality in datascience. For a complete reference guide, please see Tables (datascience.tables).

For other useful tutorials and examples, see the TableDemos repo.

Getting Started

The most important functionality in the package is the Table class, which is the structure used to represent columns of data. First, load the class:

In [1]: from datascience import Table

In the IPython notebook, type Table. followed by the TAB-key to see a list of members.

Note that for the Data Science 8 class we also import additional packages and settings for all assignments and labs. This is so that plots and other available packages mirror the ones in the textbook more closely. The exact code we use is:

# HIDDEN

import matplotlib
matplotlib.use('Agg')
from datascience import Table
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')

In particular, the lines involving matplotlib allow for plotting within the IPython notebook.

Creating a Table

A Table is a sequence of labeled columns of data.

A Table can be constructed from scratch by extending an empty table with columns.

In [2]: t = Table().with_columns([
   ...:     'letter', ['a', 'b', 'c', 'z'],
   ...:     'count',  [  9,   3,   3,   1],
   ...:     'points', [  1,   2,   2,  10],
   ...: ])
   ...: 

In [3]: print(t)
letter | count | points
a      | 9     | 1
b      | 3     | 2
c      | 3     | 2
z      | 1     | 10

More often, a table is read from a CSV file (or an Excel spreadsheet). Here’s the content of an example file:

In [4]: cat sample.csv
x,y,z
1,10,100
2,11,101
3,12,102

And this is how we load it in as a Table using read_table():

In [5]: Table.read_table('sample.csv')
Out[5]: 
x    | y    | z
1    | 10   | 100
2    | 11   | 101
3    | 12   | 102

CSVs from URLs are also valid inputs to read_table():

In [6]: Table.read_table('http://data8.org/textbook/notebooks/sat2014.csv')
Out[6]: 
State        | Participation Rate | Critical Reading | Math | Writing | Combined
North Dakota | 2.3                | 612              | 620  | 584     | 1816
Illinois     | 4.6                | 599              | 616  | 587     | 1802
Iowa         | 3.1                | 605              | 611  | 578     | 1794
South Dakota | 2.9                | 604              | 609  | 579     | 1792
Minnesota    | 5.9                | 598              | 610  | 578     | 1786
Michigan     | 3.8                | 593              | 610  | 581     | 1784
Wisconsin    | 3.9                | 596              | 608  | 578     | 1782
Missouri     | 4.2                | 595              | 597  | 579     | 1771
Wyoming      | 3.3                | 590              | 599  | 573     | 1762
Kansas       | 5.3                | 591              | 596  | 566     | 1753
... (41 rows omitted)

It’s also possible to add columns from a dictionary, but this option is discouraged because dictionaries do not preserve column order.

In [7]: t = Table().with_columns({
   ...:     'letter': ['a', 'b', 'c', 'z'],
   ...:     'count':  [  9,   3,   3,   1],
   ...:     'points': [  1,   2,   2,  10],
   ...: })
   ...: 

In [8]: print(t)
letter | points | count
a      | 1      | 9
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

Accessing Values

To access values of columns in the table, use column(), which takes a column label or index and returns an array. Alternatively, the columns property returns a list of columns (arrays).

In [9]: t
Out[9]: 
letter | points | count
a      | 1      | 9
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

In [10]: t.column('letter')
Out[10]: 
array(['a', 'b', 'c', 'z'], 
      dtype='<U1')

In [11]: t.column(1)
Out[11]: array([ 1,  2,  2, 10])

You can use bracket notation as a shorthand for this method:

In [12]: t['letter'] # This is a shorthand for t.column('letter')
Out[12]: 
array(['a', 'b', 'c', 'z'], 
      dtype='<U1')

In [13]: t[1]        # This is a shorthand for t.column(1)
Out[13]: array([ 1,  2,  2, 10])

To access values by row, use row(), which returns a row by index. Alternatively, the rows property returns a list-like Rows object that contains tuple-like Row objects.

In [14]: t.rows
Out[14]: 
Rows(letter | points | count
a      | 1      | 9
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1)

In [15]: t.rows[0]
Out[15]: Row(letter='a', points=1, count=9)

In [16]: t.row(0)
Out[16]: Row(letter='a', points=1, count=9)

In [17]: second = t.rows[1]

In [18]: second
Out[18]: Row(letter='b', points=2, count=3)

In [19]: second[0]
Out[19]: 'b'

In [20]: second[1]
Out[20]: 2

To get the number of rows, use num_rows.

In [21]: t.num_rows
Out[21]: 4

Manipulating Data

Here are some of the most common operations on data. For the rest, see the reference (Tables (datascience.tables)).

Adding a column with with_column():

In [22]: t
Out[22]: 
letter | points | count
a      | 1      | 9
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

In [23]: t.with_column('vowel?', ['yes', 'no', 'no', 'no'])
Out[23]: 
letter | points | count | vowel?
a      | 1      | 9     | yes
b      | 2      | 3     | no
c      | 2      | 3     | no
z      | 10     | 1     | no

In [24]: t # .with_column returns a new table without modifying the original
Out[24]: 
letter | points | count
a      | 1      | 9
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

In [25]: t.with_column('2 * count', t['count'] * 2) # A simple way to operate on columns
Out[25]: 
letter | points | count | 2 * count
a      | 1      | 9     | 18
b      | 2      | 3     | 6
c      | 2      | 3     | 6
z      | 10     | 1     | 2

Selecting columns with select():

In [26]: t.select('letter')
Out[26]: 
letter
a
b
c
z

In [27]: t.select(['letter', 'points'])
Out[27]: 
letter | points
a      | 1
b      | 2
c      | 2
z      | 10

Renaming columns with relabeled():

In [28]: t
Out[28]: 
letter | points | count
a      | 1      | 9
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

In [29]: t.relabeled('points', 'other name')
Out[29]: 
letter | other name | count
a      | 1          | 9
b      | 2          | 3
c      | 2          | 3
z      | 10         | 1

In [30]: t
Out[30]: 
letter | points | count
a      | 1      | 9
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

In [31]: t.relabeled(['letter', 'count', 'points'], ['x', 'y', 'z'])
Out[31]: 
x    | z    | y
a    | 1    | 9
b    | 2    | 3
c    | 2    | 3
z    | 10   | 1

Selecting out rows by index with take() and conditionally with where():

In [32]: t
Out[32]: 
letter | points | count
a      | 1      | 9
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

In [33]: t.take(2) # the third row
Out[33]: 
letter | points | count
c      | 2      | 3

In [34]: t.take[0:2] # the first and second rows
Out[34]: 
letter | points | count
a      | 1      | 9
b      | 2      | 3
In [35]: t.where('points', 2) # rows where points == 2
Out[35]: 
letter | points | count
b      | 2      | 3
c      | 2      | 3

In [36]: t.where(t['count'] < 8) # rows where count < 8
Out[36]: 
letter | points | count
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

In [37]: t['count'] < 8 # .where actually takes in an array of booleans
Out[37]: array([False,  True,  True,  True], dtype=bool)

In [38]: t.where([False, True, True, True]) # same as the last line
Out[38]: 
letter | points | count
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

Operate on table data with sort(), group(), and pivot():

In [39]: t
Out[39]: 
letter | points | count
a      | 1      | 9
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

In [40]: t.sort('count')
Out[40]: 
letter | points | count
z      | 10     | 1
b      | 2      | 3
c      | 2      | 3
a      | 1      | 9

In [41]: t.sort('letter', descending = True)
Out[41]: 
letter | points | count
z      | 10     | 1
c      | 2      | 3
b      | 2      | 3
a      | 1      | 9

# You may pass a reducing function into the collect arg
# Note the renaming of the points column because of the collect arg
In [42]: t.select(['count', 'points']).group('count', collect=sum)
Out[42]: 
count | points sum
1     | 10
3     | 4
9     | 1

In [43]: other_table = Table().with_columns([
   ....:     'mar_status',  ['married', 'married', 'partner', 'partner', 'married'],
   ....:     'empl_status', ['Working as paid', 'Working as paid', 'Not working',
   ....:                     'Not working', 'Not working'],
   ....:     'count',       [1, 1, 1, 1, 1]])
   ....: 

In [44]: other_table
Out[44]: 
mar_status | empl_status     | count
married    | Working as paid | 1
married    | Working as paid | 1
partner    | Not working     | 1
partner    | Not working     | 1
married    | Not working     | 1

In [45]: other_table.pivot('mar_status', 'empl_status', 'count', collect=sum)
Out[45]: 
empl_status     | married | partner
Not working     | 1       | 2
Working as paid | 2       | 0

Visualizing Data

We’ll start with some data drawn at random from two normal distributions:

In [46]: normal_data = Table().with_columns([
   ....:     'data1', np.random.normal(loc = 1, scale = 2, size = 100),
   ....:     'data2', np.random.normal(loc = 4, scale = 3, size = 100)])
   ....: 

In [47]: normal_data
Out[47]: 
data1     | data2
-0.514402 | 3.18916
1.66847   | 4.25558
-0.545203 | 10.7839
2.86205   | 5.11639
1.15393   | 3.89101
2.44257   | 4.76492
2.90601   | 5.90341
-1.74658  | 8.57672
2.08995   | -1.97573
1.54763   | 7.81459
... (90 rows omitted)

Draw histograms with hist():

In [48]: normal_data.hist()
[image: hist.png]

In [49]: normal_data.hist(bins = range(-5, 10))
[image: hist_binned.png]

In [50]: normal_data.hist(bins = range(-5, 10), overlay = True)
[image: hist_overlay.png]

If we treat the normal_data table as a set of x-y points, we can plot() and scatter():

In [51]: normal_data.sort('data1').plot('data1') # Sort first to make plot nicer
[image: plot.png]

In [52]: normal_data.scatter('data1')
[image: scatter.png]

In [53]: normal_data.scatter('data1', fit_line = True)
[image: scatter_line.png]

Use barh() to display categorical data.

In [54]: t
Out[54]: 
letter | points | count
a      | 1      | 9
b      | 2      | 3
c      | 2      | 3
z      | 10     | 1

In [55]: t.barh('letter')
[image: barh.png]

Exporting

Exporting to CSV is the most common operation and can be done by first converting to a pandas dataframe with to_df():

In [56]: normal_data
Out[56]: 
data1     | data2
-0.514402 | 3.18916
1.66847   | 4.25558
-0.545203 | 10.7839
2.86205   | 5.11639
1.15393   | 3.89101
2.44257   | 4.76492
2.90601   | 5.90341
-1.74658  | 8.57672
2.08995   | -1.97573
1.54763   | 7.81459
... (90 rows omitted)

# index = False prevents row numbers from appearing in the resulting CSV
In [57]: normal_data.to_df().to_csv('normal_data.csv', index = False)

An Example

We’ll recreate the steps in Chapter 3 of the textbook to see if there is a significant difference in birth weights between smokers and non-smokers using a bootstrap test.

For more examples, check out the TableDemos repo.

From the text:

The table baby contains data on a random sample of 1,174 mothers and their newborn babies. The column birthwt contains the birth weight of the baby, in ounces; gest_days is the number of gestational days, that is, the number of days the baby was in the womb. There is also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.

In [58]: baby = Table.read_table('https://github.com/data-8/textbook/raw/9aa0a167bc514749338cd7754f2b339fd095ee9b/notebooks/baby.csv')

In [59]: baby # Let's take a peek at the table
Out[59]: 
birthwt | gest_days | mat_age | mat_ht | mat_pw | m_smoker
120     | 284       | 27      | 62     | 100    | 0
113     | 282       | 33      | 64     | 135    | 0
128     | 279       | 28      | 64     | 115    | 1
108     | 282       | 23      | 67     | 125    | 1
136     | 286       | 25      | 62     | 93     | 0
138     | 244       | 33      | 62     | 178    | 0
132     | 245       | 23      | 65     | 140    | 0
120     | 289       | 25      | 62     | 125    | 0
143     | 299       | 30      | 66     | 136    | 1
140     | 351       | 27      | 68     | 120    | 0
... (1164 rows omitted)

# Select out columns we want.
In [60]: smoker_and_wt = baby.select(['m_smoker', 'birthwt'])

In [61]: smoker_and_wt
Out[61]: 
m_smoker | birthwt
0        | 120
0        | 113
1        | 128
1        | 108
0        | 136
0        | 138
0        | 132
0        | 120
1        | 143
0        | 140
... (1164 rows omitted)

Let’s compare the number of smokers to non-smokers.

In [62]: smoker_and_wt.select('m_smoker').hist(bins = [0, 1, 2]);
[image: m_smoker.png]

We can also compare the distribution of birthweights between smokers and non-smokers.

# Non smokers
# We do this by grabbing the rows that correspond to mothers that don't
# smoke, then plotting a histogram of just the birthweights.
In [63]: smoker_and_wt.where('m_smoker', 0).select('birthwt').hist()
[image: not_m_smoker_weights.png]

# Smokers
In [64]: smoker_and_wt.where('m_smoker', 1).select('birthwt').hist()
[image: m_smoker_weights.png]

What’s the difference in mean birth weight of the two categories?

In [65]: nonsmoking_mean = smoker_and_wt.where('m_smoker', 0).column('birthwt').mean()

In [66]: smoking_mean = smoker_and_wt.where('m_smoker', 1).column('birthwt').mean()

In [67]: observed_diff = nonsmoking_mean - smoking_mean

In [68]: observed_diff
Out[68]: 9.2661425720249184

Let’s do the bootstrap test on the two categories.

In [69]: num_nonsmokers = smoker_and_wt.where('m_smoker', 0).num_rows

In [70]: def bootstrap_once():
   ....:     """
   ....:     Computes one bootstrapped difference in means.
   ....:     The table.sample method lets us take random samples.
   ....:     We then split according to the number of nonsmokers in the original sample.
   ....:     """
   ....:     resample = smoker_and_wt.sample(with_replacement = True)
   ....:     bootstrap_diff = resample.column('birthwt')[:num_nonsmokers].mean() - \
   ....:         resample.column('birthwt')[num_nonsmokers:].mean()
   ....:     return bootstrap_diff
   ....: 

In [71]: repetitions = 1000

In [72]: bootstrapped_diff_means = np.array(
   ....:     [ bootstrap_once() for _ in range(repetitions) ])
   ....: 

In [73]: bootstrapped_diff_means[:10]
Out[73]: 
array([ 0.80763898, -0.13269954,  0.32791261, -0.54867224, -0.50712555,
       -2.07022563,  0.95777991,  0.69985222,  0.48984871, -0.65587397])

In [74]: num_diffs_greater = (abs(bootstrapped_diff_means) > abs(observed_diff)).sum()

In [75]: p_value = num_diffs_greater / len(bootstrapped_diff_means)

In [76]: p_value
Out[76]: 0.0
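A p-value of 0.0 means that none of the 1,000 bootstrapped differences exceeded the observed difference. The pooled-resampling logic above can be sketched in plain NumPy, independent of the Table class; the data below are hypothetical stand-ins for the baby table:

```python
import numpy as np

# Hypothetical stand-in data: birth weights for 700 "nonsmokers"
# and 474 "smokers" (the real data comes from baby.csv above).
rng = np.random.default_rng(0)
weights = np.concatenate([rng.normal(123, 17, 700), rng.normal(114, 17, 474)])
num_nonsmokers = 700
observed_diff = weights[:num_nonsmokers].mean() - weights[num_nonsmokers:].mean()

def bootstrap_once():
    # Resample all weights with replacement, then split at the original
    # nonsmoker count, mirroring the Table-based bootstrap_once above.
    resample = rng.choice(weights, size=len(weights), replace=True)
    return resample[:num_nonsmokers].mean() - resample[num_nonsmokers:].mean()

diffs = np.array([bootstrap_once() for _ in range(1000)])
p_value = (np.abs(diffs) > abs(observed_diff)).mean()
```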

Drawing Maps

To come.

Reference

Tables (datascience.tables)

Summary of methods for Table. Click a method to see its documentation.

One note about reading the method signatures on this page: each method is listed with its arguments, and optional arguments are specified in brackets. That is, a method that's documented like

Table.foo (first_arg, second_arg[, some_other_arg, fourth_arg])

means that the Table.foo method must be called with first_arg and second_arg and optionally some_other_arg and fourth_arg. That means the following are valid ways to call Table.foo:

some_table.foo(1, 2)
some_table.foo(1, 2, 'hello')
some_table.foo(1, 2, 'hello', 'world')
some_table.foo(1, 2, some_other_arg='hello')

But these are not valid:

some_table.foo(1) # Missing arg
some_table.foo(1, 2[, 'hi']) # SyntaxError
some_table.foo(1, 2[, 'hello', 'world']) # SyntaxError
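In plain Python terms, the bracketed arguments correspond to parameters with default values. A hypothetical foo illustrating why the calls above are valid or invalid:

```python
def foo(first_arg, second_arg, some_other_arg=None, fourth_arg=None):
    # The required args come first; the bracketed args have defaults,
    # so they may be omitted, passed positionally, or passed by keyword.
    return (first_arg, second_arg, some_other_arg, fourth_arg)

foo(1, 2)                          # valid: defaults fill in the rest
foo(1, 2, 'hello', 'world')        # valid: all four supplied
foo(1, 2, some_other_arg='hello')  # valid: keyword for an optional arg
```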

If that syntax is confusing, you can click the method name itself to get to the details page for that method. That page will have a more straightforward syntax.

At the time of this writing, most methods only have one or two sentences of documentation, so what you see here is all that you’ll get for the time being. We are actively working on documentation, prioritizing the most complicated methods (mostly visualizations).

Creation

Table.__init__([labels, _deprecated, formatter]) Create an empty table with column labels.
Table.empty([labels]) Creates an empty table.
Table.from_records(records) Create a table from a sequence of records (dicts with fixed keys).
Table.from_columns_dict(columns) Create a table from a mapping of column labels to column values.
Table.read_table(filepath_or_buffer, *args, ...) Read a table from a file or web address.
Table.from_df(df) Convert a Pandas DataFrame into a Table.
Table.from_array(arr) Convert a structured NumPy array into a Table.

Extension (does not modify original table)

Table.with_column(label, values) Return a new table with an additional or replaced column.
Table.with_columns(*labels_and_values) Return a table with additional or replaced columns.
Table.with_row(row) Return a table with an additional row.
Table.with_rows(rows) Return a table with additional rows.
Table.relabeled(label, new_label) Returns a new table with label specifying column label(s) replaced by corresponding new_label.

Accessing values

Table.num_columns Number of columns.
Table.columns Return a tuple of columns, each with the values in that column.
Table.column(index_or_label) Return the values of a column as an array.
Table.num_rows Number of rows.
Table.rows Return a view of all rows.
Table.row(index) Return a row.
Table.labels Return a tuple of column labels.
Table.column_index(column_label) Return the index of a column.
Table.apply(fn[, column_label]) Apply fn to each element of column_label.

Mutation (modifies table in place)

Table.set_format(column_label_or_labels, ...) Set the format of a column.
Table.move_to_start(column_label) Move a column to the first in order.
Table.move_to_end(column_label) Move a column to the last in order.
Table.append(row_or_table) Append a row or all rows of a table.
Table.append_column(label, values) Appends a column to the table or replaces a column.
Table.relabel(column_label, new_label) Changes the label(s) of column(s) specified by column_label to labels in new_label.

Transformation (creates a new table)

Table.copy(*[, shallow]) Return a copy of a Table.
Table.select(*column_label_or_labels) Returns a new Table with only the columns in column_label_or_labels.
Table.drop(*column_label_or_labels) Return a Table with only columns other than selected label or labels.
Table.take() Return a new Table with selected rows taken by index.
Table.exclude() Return a new Table without a sequence of rows excluded by number.
Table.where(column_or_label[, ...]) Return a new Table containing rows where value_or_predicate returns True for values in column_or_label.
Table.sort(column_or_label[, descending, ...]) Return a Table of rows sorted according to the values in a column.
Table.group(column_or_label[, collect]) Group rows by unique values in a column; count or aggregate others.
Table.groups(labels[, collect]) Group rows by multiple columns, count or aggregate others.
Table.pivot(columns, rows[, values, ...]) Generate a table with a column for each unique value in columns, with rows for each unique value in rows.
Table.stack(key[, labels]) Takes k original columns and returns two columns, with column labels in one and column values in the other.
Table.join(column_label, other[, other_label]) Creates a new table with the columns of self and other, containing rows for all values of a column that appear in both tables.
Table.stats([ops]) Compute statistics for each column and place them in a table.
Table.percentile(p) Returns a new table with one row containing the pth percentile for each column.
Table.sample([k, with_replacement, weights]) Returns a new table where k rows are randomly sampled from the original table.
Table.split(k) Returns a tuple of two tables where the first table contains k rows randomly sampled and the second contains the remaining rows.
Table.bin([select]) Group values by bin and compute counts per bin by column.
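Table.join is the one transformation here that the tutorial above does not demonstrate; it keeps rows whose join-column value appears in both tables, similar to an inner merge. A sketch of those semantics using pandas rather than the datascience API:

```python
import pandas as pd

# Two small tables sharing a 'letter' column.
left = pd.DataFrame({'letter': ['a', 'b', 'z'], 'points': [1, 2, 10]})
right = pd.DataFrame({'letter': ['a', 'b', 'c'], 'rank': [3, 2, 1]})

# Keep only rows whose 'letter' appears in both tables,
# which is the behavior Table.join describes.
joined = left.merge(right, on='letter', how='inner')
```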

Exporting / Displaying

Table.show([max_rows]) Display the table.
Table.as_text([max_rows, sep]) Format table as text.
Table.as_html([max_rows]) Format table as HTML.
Table.index_by(column_or_label) Return a dict keyed by values in a column that contains lists of rows corresponding to each value.
Table.to_array() Convert the table to a structured NumPy array.
Table.to_df() Convert the table to a Pandas DataFrame.
Table.to_csv(filename) Creates a CSV file with the provided filename.

Visualizations

Table.plot([column_for_xticks, select, ...]) Plot line charts for the table.
Table.bar([column_for_categories, select, ...]) Plot bar charts for the table.
Table.barh([column_for_categories, select, ...]) Plot horizontal bar charts for the table.
Table.pivot_hist(pivot_column_label, ...[, ...]) Draw histograms of each category in a column.
Table.hist([select, overlay, bins, counts, unit]) Plots one histogram for each column in the table.
Table.scatter(column_for_x[, select, ...]) Creates scatterplots, optionally adding a line of best fit.
Table.boxplot(**vargs) Plots a boxplot for the table.

Maps (datascience.maps)

Draw maps using folium.

class datascience.maps.Map(features=(), ids=(), width=960, height=500, **kwargs)[source]

A map from IDs to features. Keyword args are forwarded to folium.

color(values, ids=(), key_on='feature.id', palette='YlOrBr', **kwargs)[source]

Color map features by binning values.

values – a sequence of values or a table of keys and values
ids – an ID for each value; if none are provided, indices are used
key_on – attribute of each feature to match to ids
palette – one of the following color brewer palettes:

‘BuGn’, ‘BuPu’, ‘GnBu’, ‘OrRd’, ‘PuBu’, ‘PuBuGn’, ‘PuRd’, ‘RdPu’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, and ‘YlOrRd’.

Defaults from Folium:

threshold_scale: list, default None
Data range for D3 threshold scale. Defaults to the following range of quantiles: [0, 0.5, 0.75, 0.85, 0.9], rounded to the nearest order-of-magnitude integer. Ex: 270 rounds to 200, 5600 to 6000.
fill_opacity: float, default 0.6
Area fill opacity, range 0-1.
line_color: string, default ‘black’
GeoJSON geopath line color.
line_weight: int, default 1
GeoJSON geopath line weight.
line_opacity: float, default 1
GeoJSON geopath line opacity, range 0-1.
legend_name: string, default None
Title for data legend. If not passed, defaults to columns[1].
copy()[source]

Copies the current Map into a new one and returns it.

features
format(**kwargs)[source]

Apply formatting.

geojson()[source]

Render features as a FeatureCollection.

overlay(feature, color='Blue', opacity=0.6)[source]

Overlays feature on the map. Returns a new Map.

Args:
    feature: a Table of map features, a list of map features, a Map, a Region,
        or a circle marker map table. The features will be overlaid on the Map
        with the specified color.
    color (str): Color of feature. Defaults to 'Blue'.
    opacity (float): Opacity of the overlaid feature. Defaults to 0.6.

Returns:
    A new Map with the overlaid feature.

classmethod read_geojson(path_or_json_or_string)[source]

Read a geoJSON string, object, or file. Return a dict of features keyed by ID.

class datascience.maps.Marker(lat, lon, popup='', color='blue', **kwargs)[source]

A marker displayed with Folium’s simple_marker method.

popup – text that pops up when the marker is clicked
color – fill color

Defaults from Folium:

marker_icon: string, default ‘info-sign’
icon from (http://getbootstrap.com/components/) you want on the marker
clustered_marker: boolean, default False
boolean of whether or not you want the marker clustered with other markers
icon_angle: int, default 0
angle of icon
popup_width: int, default 300
width of popup
copy()[source]

Return a deep copy

format(**kwargs)[source]

Apply formatting.

geojson(feature_id)[source]

GeoJSON representation of the marker as a point.

lat_lons
classmethod map(latitudes, longitudes, labels=None, colors=None, radii=None, **kwargs)[source]

Return markers from columns of coordinates, labels, & colors.

The radii column is not applicable to markers, but sets circle radius.

classmethod map_table(table, **kwargs)[source]

Return markers from the columns of a table.

class datascience.maps.Circle(lat, lon, popup='', color='blue', radius=10, **kwargs)[source]

A marker displayed with Folium’s circle_marker method.

popup – text that pops up when the marker is clicked
color – fill color
radius – pixel radius of the circle

Defaults from Folium:

fill_opacity: float, default 0.6
Circle fill opacity

For example, to draw three circles:

t = Table().with_columns([
    'lat',    [37.8, 38, 37.9],
    'lon',    [-122, -122.1, -121.9],
    'label',  ['one', 'two', 'three'],
    'color',  ['red', 'green', 'blue'],
    'radius', [3000, 4000, 5000],
])

Circle.map_table(t)

class datascience.maps.Region(geojson, **kwargs)[source]

A GeoJSON feature displayed with Folium’s geo_json method.

copy()[source]

Return a deep copy

format(**kwargs)[source]

Apply formatting.

geojson(feature_id)[source]

Return GeoJSON with ID substituted.

lat_lons

A flat list of (lat, lon) pairs.

polygons

Return a list of polygons describing the region.

  • Each polygon is a list of linear rings, where the first describes the exterior and the rest describe interior holes.
  • Each linear ring is a list of positions where the last is a repeat of the first.
  • Each position is a (lat, lon) pair.
properties
type

The GEOJSON type of the regions: Polygon or MultiPolygon.

Formats (datascience.formats)

String formatting for table entries.

class datascience.formats.Formatter(min_width=None, max_width=None, etc=None)[source]

String formatter that truncates long values.

static convert(value)[source]

Identity conversion (override to convert values).

converts_values

Whether this Formatter also converts values.

etc = ' ...'
format_column(label, column)[source]

Return a formatting function that pads & truncates values.

static format_value(value)[source]

Pretty-print an arbitrary value.

max_width = 60
min_width = 4
class datascience.formats.NumberFormatter(decimals=2, decimal_point='.', separator=', ')[source]

Format numbers that may have delimiters.

convert(value)[source]

Convert string 93,000.00 to float 93000.0.

converts_values = True
format_value(value)[source]
class datascience.formats.CurrencyFormatter(symbol='$', *args, **vargs)[source]

Format currency and convert to float.

convert(value)[source]

Convert value to float. If value is a string, ensure that the first character matches symbol, i.e. that the value is in the currency this formatter represents.

converts_values = True
format_value(value)[source]

Format currency.

class datascience.formats.DateFormatter(format='%Y-%m-%d %H:%M:%S.%f', *args, **vargs)[source]

Format date & time and convert to UNIX timestamp.

convert(value)[source]

Convert 2015-08-03 to a Unix timestamp int.

converts_values = True
format_value(value)[source]

Format timestamp as a string.

class datascience.formats.PercentFormatter(decimals=2, *args, **vargs)[source]

Format a number as a percentage.

converts_values = False
format_value(value)[source]

Format number as percentage.
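As a rough illustration of what NumberFormatter.convert does, here is a pure-Python sketch (not the library's implementation) that strips the thousands separator before parsing:

```python
def convert_number(value, separator=','):
    # Sketch of NumberFormatter-style conversion: strip the thousands
    # separator from strings such as '93,000.00', then parse as float.
    if isinstance(value, str):
        value = value.replace(separator, '')
    return float(value)
```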

Utility Functions (datascience.util)

Utility functions

datascience.util.make_array(*elements)[source]

Returns an array containing all the arguments passed to this function. A simple way to make an array with a few elements.

As with any array, all arguments should have the same type.

>>> make_array(0)
array([0])
>>> make_array(2, 3, 4)
array([2, 3, 4])
>>> make_array("foo", "bar")
array(['foo', 'bar'],
      dtype='<U3')
>>> make_array()
array([], dtype=float64)
datascience.util.percentile(p, arr=None)[source]

Returns the pth percentile of the input array (the value that is at least as great as p% of the values in the array).

If arr is not provided, percentile returns itself curried with p

>>> percentile(74.9, [1, 3, 5, 9])
5
>>> percentile(75, [1, 3, 5, 9])
5
>>> percentile(75.1, [1, 3, 5, 9])
9
>>> f = percentile(75)
>>> f([1, 3, 5, 9])
5
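The definition above ("the value that is at least as great as p% of the values") can be sketched in pure Python; this is an illustration consistent with the doctests, not the library's implementation:

```python
import math

def percentile_sketch(p, arr):
    # Smallest value that is at least as great as p% of the values:
    # sort, then index the ceil(p% * n)-th smallest element.
    ordered = sorted(arr)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]
```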
datascience.util.plot_cdf_area(rbound=None, lbound=None, mean=0, sd=1)

Plots a normal curve with specified parameters and area below curve shaded between lbound and rbound.

Args:

rbound (numeric): right boundary of shaded region

lbound (numeric): left boundary of shaded region; by default is negative infinity

mean (numeric): mean/expectation of normal distribution

sd (numeric): standard deviation of normal distribution

datascience.util.plot_normal_cdf(rbound=None, lbound=None, mean=0, sd=1)[source]

Plots a normal curve with specified parameters and area below curve shaded between lbound and rbound.

Args:

rbound (numeric): right boundary of shaded region

lbound (numeric): left boundary of shaded region; by default is negative infinity

mean (numeric): mean/expectation of normal distribution

sd (numeric): standard deviation of normal distribution

datascience.util.table_apply(table, func, subset=None)[source]

Applies a function to each column and returns a Table.

Uses pandas apply under the hood, then converts back to a Table

Args:
    table : instance of Table
        The table to apply your function to.
    func : function
        Any function that will work with DataFrame.apply.
    subset : list | None
        A list of columns to apply the function to. If None, the function
        will be applied to all columns in table.

Returns:
    tab : instance of Table
        A table with the given function applied. Its shape will be either
        shape(table) or (1, table.shape[1]).
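Since table_apply uses pandas apply under the hood, its column-wise behavior can be sketched directly with a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# Apply the function to every column, as table_apply does via
# DataFrame.apply before converting the result back to a Table.
doubled = df.apply(lambda col: col * 2)
```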
datascience.util.proportions_from_distribution(table, label, sample_size, column_name='Random Sample')[source]

Adds a column named column_name containing the proportions of a random draw using the distribution in label.

This method uses np.random.multinomial to draw sample_size samples from the distribution in table.column(label), then divides by sample_size to create the resulting column of proportions.

Returns a new Table and does not modify table.

Args:

table: An instance of Table.

label: Label of column in table. This column must contain a
distribution (the values must sum to 1).

sample_size: The size of the sample to draw from the distribution.

column_name: The name of the new column that contains the sampled
proportions. Defaults to 'Random Sample'.
Returns:
A copy of table with a column column_name containing the sampled proportions. The proportions will sum to 1.
Throws:
ValueError: If the label is not in the table, or if
table.column(label) does not sum to 1.
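The underlying draw can be sketched with NumPy alone: np.random.multinomial produces counts for one sample, and dividing by sample_size yields proportions that sum to 1:

```python
import numpy as np

distribution = np.array([0.5, 0.3, 0.2])  # must sum to 1, like table.column(label)
sample_size = 1000

# One draw of category counts from the distribution, then normalize.
counts = np.random.multinomial(sample_size, distribution)
proportions = counts / sample_size
```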
datascience.util.minimize(f, start=None, smooth=False, log=None, array=False, **vargs)[source]

Minimize a function f of one or more arguments.

Args:

f: A function that takes numbers and returns a number

start: A starting value or list of starting values

smooth: Whether to assume that f is smooth and use first-order info

log: Logging function called on the result of optimization (e.g. print)

vargs: Other named arguments passed to scipy.optimize.minimize

Returns either:
  1. the minimizing argument of a one-argument function
  2. an array of minimizing arguments of a multi-argument function
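The vargs description above indicates that minimize delegates to scipy.optimize.minimize, but what "minimizing a one-argument function" means can be sketched with a dependency-free ternary search (valid for unimodal f; an illustration, not the library code):

```python
def ternary_min(f, lo, hi, iters=200):
    # Repeatedly shrink [lo, hi] toward the minimizer of a unimodal f.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2  # the minimizer lies in [lo, m2]
        else:
            lo = m1  # the minimizer lies in [m1, hi]
    return (lo + hi) / 2

best = ternary_min(lambda x: (x - 2) ** 2 + 1, -10, 10)  # minimized at x = 2
```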