Welcome to datascience’s documentation!¶
- Release: 0.17.6
- Date: Sep 24, 2023
The datascience package was written for use in Berkeley's DS 8 course and contains useful functionality for investigating and graphically displaying data.
Start Here: datascience Tutorial¶
This is a brief introduction to the functionality in datascience. For a complete reference guide, please see Tables (datascience.tables).
Getting Started¶
The most important functionality in the package is the Table
class, which is the structure used to represent columns of data. First, load
the class:
In [1]: from datascience import Table
In the IPython notebook, type Table. followed by the TAB key to see a list of members.
Note that for the Data Science 8 class we also import additional packages and settings for all assignments and labs. This is so that plots and other available packages mirror the ones in the textbook more closely. The exact code we use is:
# HIDDEN
import matplotlib
matplotlib.use('Agg')
from datascience import Table
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')
In particular, the lines involving matplotlib allow for plotting within the IPython notebook.
Creating a Table¶
A Table is a sequence of labeled columns of data.
A Table can be constructed from scratch by extending an empty table with columns.
In [2]: t = Table().with_columns(
...: 'letter', ['a', 'b', 'c', 'z'],
...: 'count', [ 9, 3, 3, 1],
...: 'points', [ 1, 2, 2, 10],
...: )
...:
In [3]: print(t)
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
More often, a table is read from a CSV file (or an Excel spreadsheet). Here’s the content of an example file:
In [4]: cat sample.csv
x,y,z
1,10,100
2,11,101
3,12,102
And this is how we load it in as a Table using read_table():
In [5]: Table.read_table('sample.csv')
Out[5]:
x | y | z
1 | 10 | 100
2 | 11 | 101
3 | 12 | 102
CSVs from URLs are also valid inputs to read_table():
In [6]: Table.read_table('https://www.inferentialthinking.com/data/sat2014.csv')
Out[6]:
State | Participation Rate | Critical Reading | Math | Writing | Combined
North Dakota | 2.3 | 612 | 620 | 584 | 1816
Illinois | 4.6 | 599 | 616 | 587 | 1802
Iowa | 3.1 | 605 | 611 | 578 | 1794
South Dakota | 2.9 | 604 | 609 | 579 | 1792
Minnesota | 5.9 | 598 | 610 | 578 | 1786
Michigan | 3.8 | 593 | 610 | 581 | 1784
Wisconsin | 3.9 | 596 | 608 | 578 | 1782
Missouri | 4.2 | 595 | 597 | 579 | 1771
Wyoming | 3.3 | 590 | 599 | 573 | 1762
Kansas | 5.3 | 591 | 596 | 566 | 1753
... (41 rows omitted)
It’s also possible to add columns from a dictionary, but this option is discouraged because dictionaries do not preserve column order.
In [7]: t = Table().with_columns({
...: 'letter': ['a', 'b', 'c', 'z'],
...: 'count': [ 9, 3, 3, 1],
...: 'points': [ 1, 2, 2, 10],
...: })
...:
In [8]: print(t)
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
Accessing Values¶
To access values of columns in the table, use column(), which takes a column label or index and returns an array. Alternatively, columns() returns a list of columns (arrays).
In [9]: t
Out[9]:
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
In [10]: t.column('letter')
Out[10]:
array(['a', 'b', 'c', 'z'],
dtype='<U1')
In [11]: t.column(1)
Out[11]: array([9, 3, 3, 1])
You can use bracket notation as a shorthand for this method:
In [12]: t['letter'] # This is a shorthand for t.column('letter')
Out[12]:
array(['a', 'b', 'c', 'z'],
dtype='<U1')
In [13]: t[1] # This is a shorthand for t.column(1)
Out[13]: array([9, 3, 3, 1])
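As mentioned above, columns collects every column at once. A minimal sketch (assuming a current version of the package, where columns is accessed as a property, so no parentheses are needed):
# Each element is one column's values as an array
t.columns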
To access values by row, row() returns a row by index. Alternatively, rows() returns a list-like Rows object that contains tuple-like Row objects.
In [14]: t.rows
Out[14]:
Rows(letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10)
In [15]: t.rows[0]
Out[15]: Row(letter='a', count=9, points=1)
In [16]: t.row(0)
Out[16]: Row(letter='a', count=9, points=1)
In [17]: second = t.rows[1]
In [18]: second
Out[18]: Row(letter='b', count=3, points=2)
In [19]: second[0]
Out[19]: 'b'
In [20]: second[1]
Out[20]: 3
To get the number of rows, use num_rows.
In [21]: t.num_rows
Out[21]: 4
Manipulating Data¶
Here are some of the most common operations on data. For the rest, see the reference (Tables (datascience.tables)).
Adding a column with with_column():
In [22]: t
Out[22]:
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
In [23]: t.with_column('vowel?', ['yes', 'no', 'no', 'no'])
Out[23]:
letter | count | points | vowel?
a | 9 | 1 | yes
b | 3 | 2 | no
c | 3 | 2 | no
z | 1 | 10 | no
In [24]: t # .with_column returns a new table without modifying the original
Out[24]:
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
In [25]: t.with_column('2 * count', t['count'] * 2) # A simple way to operate on columns
Out[25]:
letter | count | points | 2 * count
a | 9 | 1 | 18
b | 3 | 2 | 6
c | 3 | 2 | 6
z | 1 | 10 | 2
Selecting columns with select():
In [26]: t.select('letter')
Out[26]:
letter
a
b
c
z
In [27]: t.select(['letter', 'points'])
Out[27]:
letter | points
a | 1
b | 2
c | 2
z | 10
Renaming columns with relabeled():
In [28]: t
Out[28]:
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
In [29]: t.relabeled('points', 'other name')
Out[29]:
letter | count | other name
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
In [30]: t
Out[30]:
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
In [31]: t.relabeled(['letter', 'count', 'points'], ['x', 'y', 'z'])
Out[31]:
x | y | z
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
Selecting out rows by index with take() and conditionally with where():
In [32]: t
Out[32]:
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
In [33]: t.take(2) # the third row
Out[33]:
letter | count | points
c | 3 | 2
In [34]: t.take[0:2] # the first and second rows
Out[34]:
letter | count | points
a | 9 | 1
b | 3 | 2
In [35]: t.where('points', 2) # rows where points == 2
Out[35]:
letter | count | points
b | 3 | 2
c | 3 | 2
In [36]: t.where(t['count'] < 8) # rows where count < 8
Out[36]:
letter | count | points
b | 3 | 2
c | 3 | 2
z | 1 | 10
In [37]: t['count'] < 8 # .where actually takes in an array of booleans
Out[37]: array([False, True, True, True], dtype=bool)
In [38]: t.where([False, True, True, True]) # same as the last line
Out[38]:
letter | count | points
b | 3 | 2
c | 3 | 2
z | 1 | 10
Operate on table data with sort(), group(), and pivot():
In [39]: t
Out[39]:
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
In [40]: t.sort('count')
Out[40]:
letter | count | points
z | 1 | 10
b | 3 | 2
c | 3 | 2
a | 9 | 1
In [41]: t.sort('letter', descending = True)
Out[41]:
letter | count | points
z | 1 | 10
c | 3 | 2
b | 3 | 2
a | 9 | 1
# You may pass a reducing function into the collect arg
# Note the renaming of the points column because of the collect arg
In [42]: t.select(['count', 'points']).group('count', collect=sum)
Out[42]:
count | points sum
1 | 10
3 | 4
9 | 1
In [43]: other_table = Table().with_columns(
....: 'mar_status', ['married', 'married', 'partner', 'partner', 'married'],
....: 'empl_status', ['Working as paid', 'Working as paid', 'Not working',
....: 'Not working', 'Not working'],
....: 'count', [1, 1, 1, 1, 1])
....:
In [44]: other_table
Out[44]:
mar_status | empl_status | count
married | Working as paid | 1
married | Working as paid | 1
partner | Not working | 1
partner | Not working | 1
married | Not working | 1
In [45]: other_table.pivot('mar_status', 'empl_status', 'count', collect=sum)
Out[45]:
empl_status | married | partner
Not working | 1 | 2
Working as paid | 2 | 0
Visualizing Data¶
We’ll start with some data drawn at random from two normal distributions:
In [46]: normal_data = Table().with_columns(
....: 'data1', np.random.normal(loc = 1, scale = 2, size = 100),
....: 'data2', np.random.normal(loc = 4, scale = 3, size = 100))
....:
In [47]: normal_data
Out[47]:
data1 | data2
-1.73346 | 5.25333
2.05467 | 4.6461
2.24119 | -0.158949
-2.1273 | 3.70496
4.82042 | 4.9243
1.40334 | 0.276988
2.60555 | 9.7636
-2.49814 | -4.30364
1.57132 | 4.89357
-0.0973409 | 1.47964
... (90 rows omitted)
Draw histograms with hist():
In [48]: normal_data.hist()

In [49]: normal_data.hist(bins = range(-5, 10))

In [50]: normal_data.hist(bins = range(-5, 10), overlay = True)

If we treat the normal_data table as a set of x-y points, we can plot() and scatter():
In [51]: normal_data.sort('data1').plot('data1') # Sort first to make plot nicer

In [52]: normal_data.scatter('data1')

In [53]: normal_data.scatter('data1', fit_line = True)

Use barh() to display categorical data.
In [54]: t
Out[54]:
letter | count | points
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
In [55]: t.barh('letter')

Exporting¶
Exporting to CSV is the most common operation and can be done by first converting to a pandas dataframe with to_df():
In [56]: normal_data
Out[56]:
data1 | data2
-1.73346 | 5.25333
2.05467 | 4.6461
2.24119 | -0.158949
-2.1273 | 3.70496
4.82042 | 4.9243
1.40334 | 0.276988
2.60555 | 9.7636
-2.49814 | -4.30364
1.57132 | 4.89357
-0.0973409 | 1.47964
... (90 rows omitted)
# index = False prevents row numbers from appearing in the resulting CSV
In [57]: normal_data.to_df().to_csv('normal_data.csv', index = False)
An Example¶
We’ll recreate the steps in Chapter 12 of the textbook to see if there is a significant difference in birth weights between smokers and non-smokers using a bootstrap test.
For more examples, check out the TableDemos repo.
From the text:
The table baby contains data on a random sample of 1,174 mothers and their newborn babies. The column Birth Weight contains the birth weight of the baby, in ounces; Gestational Days is the number of gestational days, that is, the number of days the baby was in the womb. There is also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.
In [58]: baby = Table.read_table('https://www.inferentialthinking.com/data/baby.csv')
In [59]: baby # Let's take a peek at the table
Out[59]:
Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker
120 | 284 | 27 | 62 | 100 | False
113 | 282 | 33 | 64 | 135 | False
128 | 279 | 28 | 64 | 115 | True
108 | 282 | 23 | 67 | 125 | True
136 | 286 | 25 | 62 | 93 | False
138 | 244 | 33 | 62 | 178 | False
132 | 245 | 23 | 65 | 140 | False
120 | 289 | 25 | 62 | 125 | False
143 | 299 | 30 | 66 | 136 | True
140 | 351 | 27 | 68 | 120 | False
... (1164 rows omitted)
# Select out columns we want.
In [60]: smoker_and_wt = baby.select(['Maternal Smoker', 'Birth Weight'])
In [61]: smoker_and_wt
Out[61]:
Maternal Smoker | Birth Weight
False | 120
False | 113
True | 128
True | 108
False | 136
False | 138
False | 132
False | 120
True | 143
False | 140
... (1164 rows omitted)
Let’s compare the number of smokers to non-smokers.
In [62]: smoker_and_wt.select('Maternal Smoker').group('Maternal Smoker')
Out[62]:
Maternal Smoker | count
False | 715
True | 459
We can also compare the distribution of birthweights between smokers and non-smokers.
# Non smokers
# We do this by grabbing the rows that correspond to mothers that don't
# smoke, then plotting a histogram of just the birthweights.
In [63]: smoker_and_wt.where('Maternal Smoker', 0).select('Birth Weight').hist()
# Smokers
In [64]: smoker_and_wt.where('Maternal Smoker', 1).select('Birth Weight').hist()


What’s the difference in mean birth weight of the two categories?
In [65]: nonsmoking_mean = smoker_and_wt.where('Maternal Smoker', 0).column('Birth Weight').mean()
In [66]: smoking_mean = smoker_and_wt.where('Maternal Smoker', 1).column('Birth Weight').mean()
In [67]: observed_diff = nonsmoking_mean - smoking_mean
In [68]: observed_diff
Out[68]: 9.2661425720249184
Let’s do the bootstrap test on the two categories.
In [69]: num_nonsmokers = smoker_and_wt.where('Maternal Smoker', 0).num_rows
In [70]: def bootstrap_once():
....: """
....: Computes one bootstrapped difference in means.
....: The table.sample method lets us take random samples.
....: We then split according to the number of nonsmokers in the original sample.
....: """
....: resample = smoker_and_wt.sample(with_replacement = True)
....: bootstrap_diff = resample.column('Birth Weight')[:num_nonsmokers].mean() - \
....: resample.column('Birth Weight')[num_nonsmokers:].mean()
....: return bootstrap_diff
....:
In [71]: repetitions = 1000
In [72]: bootstrapped_diff_means = np.array(
....: [ bootstrap_once() for _ in range(repetitions) ])
....:
In [73]: bootstrapped_diff_means[:10]
Out[73]:
array([ 1.55503451, -0.48330058, 2.07701144, 1.40825754, -0.66256227,
1.5765102 , -0.29477581, 0.14779469, 0.78852781, -0.40793455])
In [74]: num_diffs_greater = (abs(bootstrapped_diff_means) > abs(observed_diff)).sum()
In [75]: p_value = num_diffs_greater / len(bootstrapped_diff_means)
In [76]: p_value
Out[76]: 0.0
Drawing Maps¶
To come.
Data 8 datascience Reference¶
This notebook serves as an interactive, Data 8-friendly reference for the datascience library.
Table Functions and Methods¶
Table()¶
Create an empty table, usually to extend with data.
[29]:
new_table = Table()
new_table
[29]:
[30]:
type(new_table)
[30]:
datascience.tables.Table
Table.read_table()¶
Table.read_table(filename)
Creates a table by reading the CSV file named filename (a string).
[31]:
trips = Table.read_table('https://raw.githubusercontent.com/data-8/textbook/gh-pages/data/trip.csv')
trips
[31]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | Civic Center BART (7th at Market) | 72 | 8/5/2015 8:36 | Townsend at 7th | 65 | 269 | Subscriber | 94518 |
459672 | 408 | 9/18/2014 17:11 | Harry Bridges Plaza (Ferry Building) | 50 | 9/18/2014 17:17 | Embarcadero at Sansome | 60 | 429 | Subscriber | 94111 |
903647 | 723 | 8/25/2015 7:26 | San Francisco Caltrain 2 (330 Townsend) | 69 | 8/25/2015 7:38 | Market at 10th | 67 | 631 | Subscriber | 94025 |
452829 | 409 | 9/15/2014 8:29 | Steuart at Market | 74 | 9/15/2014 8:36 | Market at 4th | 76 | 428 | Subscriber | 94925 |
491023 | 224 | 10/9/2014 16:13 | Santa Clara at Almaden | 4 | 10/9/2014 16:17 | San Jose Diridon Caltrain Station | 2 | 144 | Subscriber | 94117 |
723352 | 519 | 4/13/2015 17:04 | Howard at 2nd | 63 | 4/13/2015 17:12 | San Francisco Caltrain (Townsend at 4th) | 70 | 629 | Subscriber | 94061 |
524499 | 431 | 10/31/2014 16:36 | Townsend at 7th | 65 | 10/31/2014 16:43 | Civic Center BART (7th at Market) | 72 | 630 | Subscriber | 94706 |
518524 | 389 | 10/28/2014 8:48 | Market at Sansome | 77 | 10/28/2014 8:54 | 2nd at South Park | 64 | 458 | Subscriber | 94610 |
710070 | 11460 | 4/2/2015 18:13 | Powell Street BART | 39 | 4/2/2015 21:24 | Powell Street BART | 39 | 375 | Subscriber | 94107 |
793149 | 616 | 6/4/2015 5:26 | Embarcadero at Bryant | 54 | 6/4/2015 5:36 | Embarcadero at Sansome | 60 | 289 | Subscriber | 94105 |
... (99990 rows omitted)
tbl.with_column¶
tbl = Table()
tbl.with_column(name, values)
tbl.with_columns(n1, v1, n2, v2, ...)
Creates a new table by adding a column with name name and values values to another table. name should be a string, and values should have as many entries as there are rows in the original table. If values is a single value, then every row of that column has the value values.
In the examples below, we start by adding a column to the existing table trips, with values being an array we construct from existing columns.
[32]:
trips.with_column(
"Difference in terminal", abs(trips.column("Start Terminal") - trips.column("End Terminal"))
)
[32]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code | Difference in terminal |
---|---|---|---|---|---|---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | Civic Center BART (7th at Market) | 72 | 8/5/2015 8:36 | Townsend at 7th | 65 | 269 | Subscriber | 94518 | 7 |
459672 | 408 | 9/18/2014 17:11 | Harry Bridges Plaza (Ferry Building) | 50 | 9/18/2014 17:17 | Embarcadero at Sansome | 60 | 429 | Subscriber | 94111 | 10 |
903647 | 723 | 8/25/2015 7:26 | San Francisco Caltrain 2 (330 Townsend) | 69 | 8/25/2015 7:38 | Market at 10th | 67 | 631 | Subscriber | 94025 | 2 |
452829 | 409 | 9/15/2014 8:29 | Steuart at Market | 74 | 9/15/2014 8:36 | Market at 4th | 76 | 428 | Subscriber | 94925 | 2 |
491023 | 224 | 10/9/2014 16:13 | Santa Clara at Almaden | 4 | 10/9/2014 16:17 | San Jose Diridon Caltrain Station | 2 | 144 | Subscriber | 94117 | 2 |
723352 | 519 | 4/13/2015 17:04 | Howard at 2nd | 63 | 4/13/2015 17:12 | San Francisco Caltrain (Townsend at 4th) | 70 | 629 | Subscriber | 94061 | 7 |
524499 | 431 | 10/31/2014 16:36 | Townsend at 7th | 65 | 10/31/2014 16:43 | Civic Center BART (7th at Market) | 72 | 630 | Subscriber | 94706 | 7 |
518524 | 389 | 10/28/2014 8:48 | Market at Sansome | 77 | 10/28/2014 8:54 | 2nd at South Park | 64 | 458 | Subscriber | 94610 | 13 |
710070 | 11460 | 4/2/2015 18:13 | Powell Street BART | 39 | 4/2/2015 21:24 | Powell Street BART | 39 | 375 | Subscriber | 94107 | 0 |
793149 | 616 | 6/4/2015 5:26 | Embarcadero at Bryant | 54 | 6/4/2015 5:36 | Embarcadero at Sansome | 60 | 289 | Subscriber | 94105 | 6 |
... (99990 rows omitted)
We can also create a new table by adding two new columns, giving each column name followed by an array of values.
[33]:
cookies = Table()
cookies = cookies.with_columns(
"Cookie", make_array("Sugar cookies", "Chocolate chip", "Red velvet", "Oatmeal raisin", "Peanut butter"),
"Quantity", make_array(10, 15, 15, 10, 5)
)
cookies
[33]:
Cookie | Quantity |
---|---|
Sugar cookies | 10 |
Chocolate chip | 15 |
Red velvet | 15 |
Oatmeal raisin | 10 |
Peanut butter | 5 |
[34]:
prices = make_array(1.00, 1.50, 1.75, 1.25, 1.00)
cookies = cookies.with_column("Price ($)", prices)
cookies
[34]:
Cookie | Quantity | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
In the last example, we add a new column Delicious with the single value "yes", and we see that every row of that column has the same value.
[35]:
cookies.with_column("Delicious", "yes")
[35]:
Cookie | Quantity | Price ($) | Delicious |
---|---|---|---|
Sugar cookies | 10 | 1 | yes |
Chocolate chip | 15 | 1.5 | yes |
Red velvet | 15 | 1.75 | yes |
Oatmeal raisin | 10 | 1.25 | yes |
Peanut butter | 5 | 1 | yes |
tbl.column()¶
tbl.column(column_name_or_index)
Outputs an array of the values in the column column_name_or_index. column_name_or_index can be either a string giving the column's name or an integer giving the column's index.
In the examples below, we get an array of the Cookie column from the table cookies, first by the column name and then by the column's index.
[36]:
cookies.column("Cookie")
[36]:
array(['Sugar cookies', 'Chocolate chip', 'Red velvet', 'Oatmeal raisin',
'Peanut butter'], dtype='<U14')
[37]:
cookies.column(0)
[37]:
array(['Sugar cookies', 'Chocolate chip', 'Red velvet', 'Oatmeal raisin',
'Peanut butter'], dtype='<U14')
tbl.num_rows¶
Computes the number of rows in a table.
[38]:
trips.num_rows
[38]:
100000
[39]:
cookies.num_rows
[39]:
5
tbl.num_columns¶
Computes the number of columns in a table.
[40]:
trips.num_columns
[40]:
11
[41]:
cookies.num_columns
[41]:
3
tbl.labels¶
Outputs the column labels in a table.
[42]:
trips.labels
[42]:
('Trip ID',
'Duration',
'Start Date',
'Start Station',
'Start Terminal',
'End Date',
'End Station',
'End Terminal',
'Bike #',
'Subscriber Type',
'Zip Code')
[43]:
cookies.labels
[43]:
('Cookie', 'Quantity', 'Price ($)')
tbl.select()¶
tbl.select(col1, col2, ...)
Creates a copy of a table with only the selected columns. Each column is the column name as a string or the integer index of the column.
Suppose we want to select the Trip ID, Duration, Bike #, and Zip Code columns from the trips table.
[44]:
trips.select("Trip ID", "Duration", "Bike #", "Zip Code")
[44]:
Trip ID | Duration | Bike # | Zip Code |
---|---|---|---|
876419 | 413 | 269 | 94518 |
459672 | 408 | 429 | 94111 |
903647 | 723 | 631 | 94025 |
452829 | 409 | 428 | 94925 |
491023 | 224 | 144 | 94117 |
723352 | 519 | 629 | 94061 |
524499 | 431 | 630 | 94706 |
518524 | 389 | 458 | 94610 |
710070 | 11460 | 375 | 94107 |
793149 | 616 | 289 | 94105 |
... (99990 rows omitted)
Similarly, we can use indexes to select columns. Remember to start indexing at 0.
[45]:
trips.select(0, 1, 8, 10).show(5)
Trip ID | Duration | Bike # | Zip Code |
---|---|---|---|
876419 | 413 | 269 | 94518 |
459672 | 408 | 429 | 94111 |
903647 | 723 | 631 | 94025 |
452829 | 409 | 428 | 94925 |
491023 | 224 | 144 | 94117 |
... (99995 rows omitted)
tbl.drop()¶
tbl.drop(col1, col2, ...)
Creates a copy of a table without the specified columns. Each column is the column name as a string or an integer index.
[46]:
cookies.drop("Quantity")
[46]:
Cookie | Price ($) |
---|---|
Sugar cookies | 1 |
Chocolate chip | 1.5 |
Red velvet | 1.75 |
Oatmeal raisin | 1.25 |
Peanut butter | 1 |
[47]:
trips.drop("End Date", "Subscriber Type")
[47]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Station | End Terminal | Bike # | Zip Code |
---|---|---|---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | Civic Center BART (7th at Market) | 72 | Townsend at 7th | 65 | 269 | 94518 |
459672 | 408 | 9/18/2014 17:11 | Harry Bridges Plaza (Ferry Building) | 50 | Embarcadero at Sansome | 60 | 429 | 94111 |
903647 | 723 | 8/25/2015 7:26 | San Francisco Caltrain 2 (330 Townsend) | 69 | Market at 10th | 67 | 631 | 94025 |
452829 | 409 | 9/15/2014 8:29 | Steuart at Market | 74 | Market at 4th | 76 | 428 | 94925 |
491023 | 224 | 10/9/2014 16:13 | Santa Clara at Almaden | 4 | San Jose Diridon Caltrain Station | 2 | 144 | 94117 |
723352 | 519 | 4/13/2015 17:04 | Howard at 2nd | 63 | San Francisco Caltrain (Townsend at 4th) | 70 | 629 | 94061 |
524499 | 431 | 10/31/2014 16:36 | Townsend at 7th | 65 | Civic Center BART (7th at Market) | 72 | 630 | 94706 |
518524 | 389 | 10/28/2014 8:48 | Market at Sansome | 77 | 2nd at South Park | 64 | 458 | 94610 |
710070 | 11460 | 4/2/2015 18:13 | Powell Street BART | 39 | Powell Street BART | 39 | 375 | 94107 |
793149 | 616 | 6/4/2015 5:26 | Embarcadero at Bryant | 54 | Embarcadero at Sansome | 60 | 289 | 94105 |
... (99990 rows omitted)
[48]:
trips.drop(3, 6, 8, 9, 10)
[48]:
Trip ID | Duration | Start Date | Start Terminal | End Date | End Terminal |
---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | 72 | 8/5/2015 8:36 | 65 |
459672 | 408 | 9/18/2014 17:11 | 50 | 9/18/2014 17:17 | 60 |
903647 | 723 | 8/25/2015 7:26 | 69 | 8/25/2015 7:38 | 67 |
452829 | 409 | 9/15/2014 8:29 | 74 | 9/15/2014 8:36 | 76 |
491023 | 224 | 10/9/2014 16:13 | 4 | 10/9/2014 16:17 | 2 |
723352 | 519 | 4/13/2015 17:04 | 63 | 4/13/2015 17:12 | 70 |
524499 | 431 | 10/31/2014 16:36 | 65 | 10/31/2014 16:43 | 72 |
518524 | 389 | 10/28/2014 8:48 | 77 | 10/28/2014 8:54 | 64 |
710070 | 11460 | 4/2/2015 18:13 | 39 | 4/2/2015 21:24 | 39 |
793149 | 616 | 6/4/2015 5:26 | 54 | 6/4/2015 5:36 | 60 |
... (99990 rows omitted)
tbl.relabel()¶
tbl.relabel(old_label, new_label)
Modifies the table by changing the label of the column named old_label to new_label. old_label can be a string column name or an integer index.
[49]:
cookies
[49]:
Cookie | Quantity | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
[50]:
cookies.relabel("Quantity", "Amount remaining")
[50]:
Cookie | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
[51]:
cookies.relabel(0, "Type")
[51]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
[52]:
cookies
[52]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
tbl.show()¶
tbl.show(n)
Displays the first n rows of a table. If no n is provided, displays all rows.
[53]:
trips.show(5)
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | Civic Center BART (7th at Market) | 72 | 8/5/2015 8:36 | Townsend at 7th | 65 | 269 | Subscriber | 94518 |
459672 | 408 | 9/18/2014 17:11 | Harry Bridges Plaza (Ferry Building) | 50 | 9/18/2014 17:17 | Embarcadero at Sansome | 60 | 429 | Subscriber | 94111 |
903647 | 723 | 8/25/2015 7:26 | San Francisco Caltrain 2 (330 Townsend) | 69 | 8/25/2015 7:38 | Market at 10th | 67 | 631 | Subscriber | 94025 |
452829 | 409 | 9/15/2014 8:29 | Steuart at Market | 74 | 9/15/2014 8:36 | Market at 4th | 76 | 428 | Subscriber | 94925 |
491023 | 224 | 10/9/2014 16:13 | Santa Clara at Almaden | 4 | 10/9/2014 16:17 | San Jose Diridon Caltrain Station | 2 | 144 | Subscriber | 94117 |
... (99995 rows omitted)
tbl.sort()¶
tbl.sort(column_name_or_index, descending=False)
Sorts the rows in the table by the values in the column column_name_or_index, in ascending order by default. Set descending=True to sort in descending order. column_name_or_index can be a string column label or an integer index.
[54]:
cookies
[54]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
[55]:
cookies.sort("Price ($)")
[55]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Peanut butter | 5 | 1 |
Oatmeal raisin | 10 | 1.25 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
[56]:
# sort in descending order
cookies.sort("Amount remaining", descending = True)
[56]:
Type | Amount remaining | Price ($) |
---|---|---|
Red velvet | 15 | 1.75 |
Chocolate chip | 15 | 1.5 |
Oatmeal raisin | 10 | 1.25 |
Sugar cookies | 10 | 1 |
Peanut butter | 5 | 1 |
[57]:
# alphabetical order
cookies.sort(0)
[57]:
Type | Amount remaining | Price ($) |
---|---|---|
Chocolate chip | 15 | 1.5 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
Red velvet | 15 | 1.75 |
Sugar cookies | 10 | 1 |
tbl.where()¶
tbl.where(column, predicate)
Filters the table for rows where the predicate is true. predicate should be one of the provided are.<something> functions. column can be a string column label or an integer index. A list of available predicates can be found below.
[58]:
cookies.where("Amount remaining", are.above(10))
[58]:
Type | Amount remaining | Price ($) |
---|---|---|
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
[59]:
cookies.where(0, are.equal_to("Chocolate chip"))
[59]:
Type | Amount remaining | Price ($) |
---|---|---|
Chocolate chip | 15 | 1.5 |
[62]:
# if predicate is a value, look for rows where the column == the value
# equivalent to cookies.where(1, are.equal_to(15))
cookies.where(1, 15)
[62]:
Type | Amount remaining | Price ($) |
---|---|---|
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
[63]:
cookies.where("Price ($)", are.below(1.25))
[63]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Peanut butter | 5 | 1 |
tbl.take()¶
tbl.take(row_index, ...)
Returns a copy of the table with only the specified rows included. Rows are specified by their integer index, so 0 for the first row, 1 for the second, and so on.
[64]:
cookies
[64]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
Oatmeal raisin | 10 | 1.25 |
Peanut butter | 5 | 1 |
[65]:
cookies.take(0)
[65]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
[66]:
cookies.take(cookies.num_rows - 1)
[66]:
Type | Amount remaining | Price ($) |
---|---|---|
Peanut butter | 5 | 1 |
[67]:
cookies.take(0, 1, 2)
[67]:
Type | Amount remaining | Price ($) |
---|---|---|
Sugar cookies | 10 | 1 |
Chocolate chip | 15 | 1.5 |
Red velvet | 15 | 1.75 |
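take can also be given a collection of row indices. A minimal sketch (assuming current library behavior, where an array of indices works the same as listing them individually):
# keep the first, third, and fifth rows of cookies
cookies.take(make_array(0, 2, 4))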
Table Visualizations¶
[68]:
actors = Table().read_table("https://github.com/data-8/textbook/raw/gh-pages/data/actors.csv")
actors
[68]:
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross |
---|---|---|---|---|---|
Harrison Ford | 4871.7 | 41 | 118.8 | Star Wars: The Force Awakens | 936.7 |
Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4 |
Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9 |
Tom Hanks | 4340.8 | 44 | 98.7 | Toy Story 3 | 415 |
Robert Downey, Jr. | 3947.3 | 53 | 74.5 | The Avengers | 623.4 |
Eddie Murphy | 3810.4 | 38 | 100.3 | Shrek 2 | 441.2 |
Tom Cruise | 3587.2 | 36 | 99.6 | War of the Worlds | 234.3 |
Johnny Depp | 3368.6 | 45 | 74.9 | Dead Man's Chest | 423.3 |
Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9 |
Scarlett Johansson | 3341.2 | 37 | 90.3 | The Avengers | 623.4 |
... (40 rows omitted)
tbl.scatter()¶
tbl.scatter(x_column, y_column, fit_line=False)
Creates a scatter plot with x_column on the horizontal axis and y_column on the vertical axis. These labels can be column names as strings or integer indices. Set fit_line=True to include a line of best fit for the data. You can find more examples in the textbook.
[71]:
actors.scatter('Number of Movies', 'Total Gross')

[73]:
actors.scatter(2, 3, fit_line=True)

tbl.plot()¶
tbl.plot(x_column, y_column)
Plots a line graph with x_column on the horizontal axis and y_column on the vertical axis. Sorts the table in ascending order by the values in x_column first. x_column and y_column can be column names as strings or integer indices.
[74]:
movies_by_year = Table.read_table('https://github.com/data-8/textbook/raw/gh-pages/data/movies_by_year.csv')
movies_by_year.show(3)
Year | Total Gross | Number of Movies | #1 Movie |
---|---|---|---|
2015 | 11128.5 | 702 | Star Wars: The Force Awakens |
2014 | 10360.8 | 702 | American Sniper |
2013 | 10923.6 | 688 | Catching Fire |
... (33 rows omitted)
[75]:
movies_by_year.plot('Year', 'Number of Movies')

tbl.barh()¶
tbl.barh(categories)
tbl.barh(categories, values)
Plots a horizontal bar chart broken down by categories as the bars. If values is unspecified, one bar for each column of the table (except categories) is plotted. categories and values can be column names as strings or integer indices.
[76]:
cookies.barh("Type")

[77]:
cookies.barh("Type", "Amount remaining")

[78]:
cookies.barh(0, 2)

tbl.hist()¶
tbl.hist(column)
tbl.hist(column, bins=...)
Plots a histogram of the values in column. Defaults to 10 bins of equal width. If bins is specified, it can be a number of bins to use (e.g. bins=25 will produce a histogram with 25 bins) or an array of values to use as bin edges (e.g. bins=make_array(1, 3, 4) will produce 2 bins: \([1, 3)\) and \([3, 4)\)). column can be a column name as a string or an integer index.
[79]:
actors.hist(3)

[80]:
actors.hist("Gross")

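Explicit bin edges can also be supplied through bins. A minimal sketch (the edges below are arbitrary, chosen only to cover the Gross values shown above):
# one bin per 100 (millions of dollars) of gross, from 0 to 1000
actors.hist("Gross", bins=np.arange(0, 1001, 100))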
Table.interactive_plots()¶
Table.interactive_plots()
This function switches from static plots like the ones above to interactive plots made with plotly. If a plotting method has a plotly version, that version will be used instead.
[193]:
Table.interactive_plots()
actors.scatter("Total Gross", "Gross")
Table.static_plots()¶
Table.static_plots()
This function turns off plotly plots and switches back to static plots.
[194]:
Table.static_plots()
actors.scatter("Total Gross", "Gross")

Advanced Table Functions¶
tbl.apply()¶
tbl.apply(function, column)
tbl.apply(function, col1, col2, ...)
Applies the function function to each element of the column column and returns the results as an array. If function takes more than one argument, you can specify multiple columns to use for each argument, in order.
[65]:
actors.apply(np.average, "Number of Movies")
[65]:
array([41., 69., 61., 44., 53., 38., 36., 45., 58., 37., 38., 49., 60.,
43., 7., 31., 24., 50., 39., 79., 34., 63., 23., 34., 37., 36.,
17., 50., 39., 42., 35., 36., 17., 46., 40., 32., 17., 29., 39.,
26., 25., 36., 27., 43., 25., 21., 40., 35., 23., 25.])
[82]:
actors
[82]:
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross |
---|---|---|---|---|---|
Harrison Ford | 4871.7 | 41 | 118.8 | Star Wars: The Force Awakens | 936.7 |
Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4 |
Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9 |
Tom Hanks | 4340.8 | 44 | 98.7 | Toy Story 3 | 415 |
Robert Downey, Jr. | 3947.3 | 53 | 74.5 | The Avengers | 623.4 |
Eddie Murphy | 3810.4 | 38 | 100.3 | Shrek 2 | 441.2 |
Tom Cruise | 3587.2 | 36 | 99.6 | War of the Worlds | 234.3 |
Johnny Depp | 3368.6 | 45 | 74.9 | Dead Man's Chest | 423.3 |
Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9 |
Scarlett Johansson | 3341.2 | 37 | 90.3 | The Avengers | 623.4 |
... (40 rows omitted)
The example below calculates the average gross for each movie by actor by applying a function that takes in the value of Total Gross
and Number of Movies
and returns their quotient.
[83]:
def average_gross(total_gross, num_movies):
return total_gross / num_movies
actors.apply(average_gross, "Total Gross", "Number of Movies")
[83]:
array([118.82195122, 69.17101449, 73.25081967, 98.65454545,
74.47735849, 100.27368421, 99.64444444, 74.85777778,
57.78448276, 90.3027027 , 86.68421053, 66.9244898 ,
53.15666667, 73.8372093 , 451.84285714, 101.62580645,
131.2125 , 62.478 , 79.67435897, 39.00379747,
89.16764706, 46.70952381, 125.67826087, 84.86470588,
76.40540541, 78.38888889, 165.63529412, 56.316 ,
71.86153846, 65.12619048, 77.89428571, 75.425 ,
157.75882353, 58.28913043, 67.0225 , 83.15625 ,
154.96470588, 89.83103448, 66.72564103, 99.25384615,
102.308 , 70.82777778, 94.26666667, 58.65348837,
100.732 , 119.06190476, 61.5925 , 70.36 ,
106.86086957, 96.66 ])
tbl.group()¶
tbl.group(column_or_columns)
tbl.group(column_or_columns, func)
Groups a table by the values in column_or_columns. If column_or_columns is an array, groups by each unique combination of elements in those columns. If func is specified, it should be a function that takes in an array of values and returns a single value. If unspecified, the values default to the count of rows in each group.
[84]:
trips.group("Start Station")
[84]:
Start Station | count |
---|---|
2nd at Folsom | 2302 |
2nd at South Park | 2610 |
2nd at Townsend | 3904 |
5th at Howard | 2190 |
Adobe on Almaden | 165 |
Arena Green / SAP Center | 176 |
Beale at Market | 2377 |
Broadway St at Battery St | 2157 |
California Ave Caltrain Station | 127 |
Castro Street and El Camino Real | 339 |
... (60 rows omitted)
[85]:
trips.group("Start Station", np.mean).select(0,2)
[85]:
Start Station | Duration mean |
---|---|
2nd at Folsom | 512.887 |
2nd at South Park | 654.565 |
2nd at Townsend | 755.176 |
5th at Howard | 819.509 |
Adobe on Almaden | 2522.5 |
Arena Green / SAP Center | 1999.7 |
Beale at Market | 679.602 |
Broadway St at Battery St | 827.753 |
California Ave Caltrain Station | 4403.29 |
Castro Street and El Camino Real | 1221.86 |
... (60 rows omitted)
[86]:
trips.group("Start Station").sort("count", descending = True)
[86]:
Start Station | count |
---|---|
San Francisco Caltrain (Townsend at 4th) | 7426 |
San Francisco Caltrain 2 (330 Townsend) | 6114 |
Harry Bridges Plaza (Ferry Building) | 4795 |
Temporary Transbay Terminal (Howard at Beale) | 4212 |
Townsend at 7th | 3925 |
2nd at Townsend | 3904 |
Embarcadero at Sansome | 3900 |
Steuart at Market | 3872 |
Market at 10th | 3370 |
Market at Sansome | 3218 |
... (60 rows omitted)
[87]:
trips.group(['Start Station', 'End Station'])
[87]:
Start Station | End Station | count |
---|---|---|
2nd at Folsom | 2nd at Folsom | 22 |
2nd at Folsom | 2nd at South Park | 84 |
2nd at Folsom | 2nd at Townsend | 123 |
2nd at Folsom | 5th at Howard | 28 |
2nd at Folsom | Beale at Market | 34 |
2nd at Folsom | Broadway St at Battery St | 18 |
2nd at Folsom | Civic Center BART (7th at Market) | 13 |
2nd at Folsom | Clay at Battery | 70 |
2nd at Folsom | Commercial at Montgomery | 46 |
2nd at Folsom | Davis at Jackson | 8 |
... (1616 rows omitted)
tbl.pivot()¶
tbl.pivot(col1, col2)
tbl.pivot(col1, col2, values, collect)
Creates a pivot table with the values in col1 as columns and the values in col2 as rows. If values is unspecified, the values in the cells default to counts. If values is specified, it should be the label of a column whose values are passed as an array to collect, which should return a single value.
[88]:
more_cones = Table().with_columns(
'Flavor', make_array('strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'),
'Color', make_array('pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'),
'Price', make_array(3.55, 4.75, 5.25, 5.25, 5.25, 4.75)
)
more_cones
[88]:
Flavor | Color | Price |
---|---|---|
strawberry | pink | 3.55 |
chocolate | light brown | 4.75 |
chocolate | dark brown | 5.25 |
strawberry | pink | 5.25 |
chocolate | dark brown | 5.25 |
bubblegum | pink | 4.75 |
[89]:
more_cones.pivot('Flavor', 'Color')
[89]:
Color | bubblegum | chocolate | strawberry |
---|---|---|---|
dark brown | 0 | 2 | 0 |
light brown | 0 | 1 | 0 |
pink | 1 | 0 | 2 |
[90]:
more_cones.pivot('Flavor', 'Color', values='Price', collect=sum)
[90]:
Color | bubblegum | chocolate | strawberry |
---|---|---|---|
dark brown | 0 | 10.5 | 0 |
light brown | 0 | 4.75 | 0 |
pink | 4.75 | 0 | 8.8 |
[91]:
more_cones.pivot(0, 1)
[91]:
Color | bubblegum | chocolate | strawberry |
---|---|---|---|
dark brown | 0 | 2 | 0 |
light brown | 0 | 1 | 0 |
pink | 1 | 0 | 2 |
tbl.join()¶
tbl1.join(col1, tbl2)
tbl1.join(col1, tbl2, col2)
Performs a join of tbl1 on tbl2, where rows are only included if the value in col1 is present in both join columns. If col2 is unspecified, it is assumed to be the same label as col1.
[92]:
cones = Table().with_columns(
'Flavor', make_array('strawberry', 'vanilla', 'chocolate', 'strawberry', 'chocolate'),
'Price', make_array(3.55, 4.75, 6.55, 5.25, 5.75)
)
cones
[92]:
Flavor | Price |
---|---|
strawberry | 3.55 |
vanilla | 4.75 |
chocolate | 6.55 |
strawberry | 5.25 |
chocolate | 5.75 |
[95]:
ratings = Table().with_columns(
'Kind', make_array('strawberry', 'chocolate', 'vanilla', 'mint chip'),
'Stars', make_array(2.5, 3.5, 4, 3)
)
ratings
[95]:
Kind | Stars |
---|---|
strawberry | 2.5 |
chocolate | 3.5 |
vanilla | 4 |
mint chip | 3 |
[97]:
# Joins cones on ratings. Note that the mint chip flavor doesn't appear since it's not in cones
rated = cones.join('Flavor', ratings, 'Kind')
rated
[97]:
Flavor | Price | Stars |
---|---|---|
chocolate | 6.55 | 3.5 |
chocolate | 5.75 | 3.5 |
strawberry | 3.55 | 2.5 |
strawberry | 5.25 | 2.5 |
vanilla | 4.75 | 4 |
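When the join column carries the same label in both tables, col2 can be omitted. A minimal sketch using a hypothetical toppings table (not part of the data above):
# both tables share the label 'Flavor', so a single label is enough
toppings = Table().with_columns(
    'Flavor', make_array('strawberry', 'chocolate', 'vanilla'),
    'Topping', make_array('sprinkles', 'fudge', 'caramel')
)
cones.join('Flavor', toppings)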
tbl.sample()¶
tbl.sample(n, with_replacement=True)
Returns a new table with n rows randomly sampled from the original table. If with_replacement is True, sampling occurs with replacement. For sampling without replacement, set with_replacement=False.
[98]:
# if you rerun this cell, you should get different results since the sample is random
rated.sample(2)
[98]:
Flavor | Price | Stars |
---|---|---|
chocolate | 6.55 | 3.5 |
chocolate | 6.55 | 3.5 |
Notice how the table below has more rows for certain flavors than the original rated table. This is because we are sampling with replacement, so you could theoretically get 5 of the same flavor!
[99]:
sampled_with_replacement = rated.sample(5)
sampled_with_replacement
[99]:
Flavor | Price | Stars |
---|---|---|
strawberry | 5.25 | 2.5 |
strawberry | 3.55 | 2.5 |
strawberry | 3.55 | 2.5 |
chocolate | 6.55 | 3.5 |
vanilla | 4.75 | 4 |
[100]:
rated.sample(3, with_replacement = False)
[100]:
Flavor | Price | Stars |
---|---|---|
vanilla | 4.75 | 4 |
strawberry | 3.55 | 2.5 |
chocolate | 6.55 | 3.5 |
String Methods¶
str.split()¶
string.split(separator)
Splits the string string into a list on each occurrence of the substring separator. The occurrences of separator are removed from the resulting list.
For example, the code below splits the string Data 8hiishifun. on the substring hi.
[101]:
example_string = "Data 8hiishifun."
example_string.split("hi")
[101]:
['Data 8', 'is', 'fun.']
[104]:
# split on .
another_string = "the.secret.message.is.123"
another_string.split(".")
[104]:
['the', 'secret', 'message', 'is', '123']
str.join()¶
string.join(array)
Combines each element of array into one string, with string used to connect each element.
[105]:
fun_array = make_array("high", "great", "best")
"est ".join(fun_array)
[105]:
'highest greatest best'
[106]:
# you can join elements on the empty string to just merge the elements
some_strings = make_array("some", "list", "of", "strings")
"".join(some_strings)
[106]:
'somelistofstrings'
str.replace()¶
string.replace(old_string, new_string)
Replaces each occurrence of old_string in string with new_string.
[107]:
berkeley_string = "I saw 5 friends, 10 squirrels, and 20 people flyering on Sproul."
berkeley_string
[107]:
'I saw 5 friends, 10 squirrels, and 20 people flyering on Sproul.'
[108]:
berkeley_string.replace("friends", "frisbees")
[108]:
'I saw 5 frisbees, 10 squirrels, and 20 people flyering on Sproul.'
[110]:
# you can chain calls to .replace() since the return value is also a string
berkeley_string.replace("friends", "frisbees").replace("flyering on Sproul", "having a picnic on the Glade")
[110]:
'I saw 5 frisbees, 10 squirrels, and 20 people having a picnic on the Glade.'
Array Functions and Methods¶
[111]:
example_array = make_array(1, 3, 5, 7, 9)
example_array
[111]:
array([1, 3, 5, 7, 9])
max()¶
max(array)
Returns the maximum value of an array.
[112]:
max(example_array)
[112]:
9
min()¶
min(array)
Returns the minimum value of an array.
[113]:
min(example_array)
[113]:
1
sum()¶
sum(array)
Returns the sum of the values in an array.
[114]:
sum(example_array)
[114]:
25
[115]:
sum(make_array(1, 2, 0, -10))
[115]:
-7
abs()¶
abs(num)
abs(array)
Takes the absolute value of a number, or of each number in an array.
[118]:
abs(-1)
[118]:
1
[119]:
new_arr = make_array(-3, -1, 5.2, 0.25, -4.9)
abs(new_arr)
[119]:
array([3. , 1. , 5.2 , 0.25, 4.9 ])
round()¶
round(num)
round(num, d)
np.round(array)
np.round(array, d)
Rounds a number or an array of numbers to the nearest integer. If d is specified, rounds to d places after the decimal. Use np.round to round arrays.
[124]:
round(3.14159)
[124]:
3
[125]:
round(3.14159, 3)
[125]:
3.142
[130]:
np.round(new_arr, 1)
[130]:
array([-3. , -1. , 5.2, 0.2, -4.9])
len()¶
len(array)
Returns the length of an array.
[131]:
len(new_arr)
[131]:
5
make_array()¶
make_array(val1, val2, ...)
Creates a new array with the values passed in.
[132]:
new_array = make_array(25, 16, 9, 4, 1)
new_array
[132]:
array([25, 16, 9, 4, 1])
np.mean¶
np.mean(array)
np.average(array)
Returns the mean of the values in an array.
[134]:
np.mean(new_array)
[134]:
11.0
[133]:
np.average(new_array)
[133]:
11.0
np.std()¶
np.std(array)
Returns the standard deviation of the values in an array.
[150]:
np.std(new_array)
[150]:
8.648699324175862
np.diff()¶
np.diff(array)
Returns an array with the pairwise differences between elements in the input array. The output will have length len(array) - 1 and will have elements \(x_1 - x_0\), \(x_2 - x_1\), \(x_3 - x_2\), etc.
[135]:
np.diff(new_array)
[135]:
array([-9, -7, -5, -3])
[136]:
np.diff(make_array(1, 3, 5, 7))
[136]:
array([2, 2, 2])
np.sqrt()¶
np.sqrt(num)
np.sqrt(array)
Returns the square root of a number, or an array of the square roots of each element in the input array.
[137]:
np.sqrt(4)
[137]:
2.0
[138]:
np.sqrt(new_array)
[138]:
array([5., 4., 3., 2., 1.])
np.arange()¶
np.arange(stop)
np.arange(start, stop)
np.arange(start, stop, step)
Returns an array of evenly spaced values from start to stop, incrementing by step. If start is unspecified, it is assumed to be 0. If step is unspecified, it is assumed to be 1. The upper bound is exclusive, meaning that max(np.arange(10)) is 9.
[139]:
np.arange(0, 11)
[139]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
[140]:
np.arange(5)
[140]:
array([0, 1, 2, 3, 4])
[142]:
np.arange(0, 102, 2.7)
[142]:
array([ 0. , 2.7, 5.4, 8.1, 10.8, 13.5, 16.2, 18.9, 21.6, 24.3, 27. ,
29.7, 32.4, 35.1, 37.8, 40.5, 43.2, 45.9, 48.6, 51.3, 54. , 56.7,
59.4, 62.1, 64.8, 67.5, 70.2, 72.9, 75.6, 78.3, 81. , 83.7, 86.4,
89.1, 91.8, 94.5, 97.2, 99.9])
array.item()¶
array.item(num)
Returns the item at index num in an array (remember that Python indices start at 0!).
[143]:
np.arange(0, 102, 2).item(1)
[143]:
2
[146]:
new_array.item(2)
[146]:
9
[147]:
new_array.item(len(new_array) - 1)
[147]:
1
np.random.choice¶
np.random.choice(array)
np.random.choice(array, n, replace=True)
Picks one item, or n items, from an array at random. By default, sampling is done with replacement; set replace=False to sample without replacement.
[149]:
np.random.choice(new_array)
[149]:
25
[150]:
np.random.choice(new_array, 3)
[150]:
array([ 4, 4, 16])
[152]:
np.random.choice(np.arange(0, 102, 2), 10, replace=False)
[152]:
array([ 98, 22, 12, 56, 24, 54, 100, 52, 28, 88])
np.count_nonzero()¶
np.count_nonzero(array)
Returns the number of nonzero elements in an array. Because False values are considered zeros (as integers), this can also give you the number of True values in an array of booleans.
[153]:
another_array = make_array(0, 1, 2, 0, 4, 0, 1, 0, 0)
np.count_nonzero(another_array)
[153]:
4
[159]:
bools = make_array(True, False, True, False, False, True, False)
np.count_nonzero(bools)
[159]:
3
np.append()¶
np.append(array, item)
Returns a copy of the input array with item (which must be the same type as the other entries in the array) appended to the end.
[160]:
new_array
[160]:
array([25, 16, 9, 4, 1])
[161]:
np.append(new_array, 1000)
[161]:
array([ 25, 16, 9, 4, 1, 1000])
percentile()¶
percentile(percent, array)
Returns the value corresponding to the specified percentile of an array. percent should be in percentage form (i.e. 50, not 0.5).
[162]:
long_array = make_array(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
long_array
[162]:
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
[163]:
percentile(50, long_array)
[163]:
2
[164]:
percentile(90, long_array)
[164]:
3
Table.where Predicates¶
All of the predicates described below can be negated by preceding the name with not_. For example, we can find values not equal to a specific value using are.not_equal_to(value).
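A minimal sketch of negation, reusing the trips table loaded above:
# keep only rows whose Subscriber Type is not 'Subscriber'
trips.where("Subscriber Type", are.not_equal_to("Subscriber"))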
are.equal_to()¶
tbl.where(column, are.equal_to(value))
Filters the table, leaving only rows where the value in column is equal to value.
[166]:
trips.where("Duration", are.equal_to(519))
[166]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
723352 | 519 | 4/13/2015 17:04 | Howard at 2nd | 63 | 4/13/2015 17:12 | San Francisco Caltrain (Townsend at 4th) | 70 | 629 | Subscriber | 94061 |
824979 | 519 | 6/27/2015 15:02 | Japantown | 9 | 6/27/2015 15:11 | San Jose City Hall | 10 | 660 | Customer | nil |
439946 | 519 | 9/5/2014 12:38 | Yerba Buena Center of the Arts (3rd @ Howard) | 68 | 9/5/2014 12:47 | Civic Center BART (7th at Market) | 72 | 452 | Subscriber | 94105 |
788261 | 519 | 6/1/2015 9:21 | Powell at Post (Union Square) | 71 | 6/1/2015 9:30 | Steuart at Market | 74 | 575 | Subscriber | 94108 |
560479 | 519 | 11/28/2014 14:20 | South Van Ness at Market | 66 | 11/28/2014 14:29 | Powell at Post (Union Square) | 71 | 609 | Subscriber | 94108 |
653797 | 519 | 2/23/2015 8:47 | Market at 10th | 67 | 2/23/2015 8:55 | Yerba Buena Center of the Arts (3rd @ Howard) | 68 | 546 | Subscriber | 94102 |
887134 | 519 | 8/12/2015 17:29 | Civic Center BART (7th at Market) | 72 | 8/12/2015 17:38 | Mechanics Plaza (Market at Battery) | 75 | 313 | Subscriber | 94103 |
482225 | 519 | 10/3/2014 16:41 | Spear at Folsom | 49 | 10/3/2014 16:50 | Broadway St at Battery St | 82 | 209 | Subscriber | 94133 |
681697 | 519 | 3/14/2015 10:19 | Embarcadero at Sansome | 60 | 3/14/2015 10:28 | Harry Bridges Plaza (Ferry Building) | 50 | 368 | Customer | 95120 |
912821 | 519 | 8/31/2015 17:00 | Embarcadero at Folsom | 51 | 8/31/2015 17:09 | San Francisco Caltrain (Townsend at 4th) | 70 | 354 | Subscriber | 94085 |
... (115 rows omitted)
are.above()¶
tbl.where(column, are.above(value))
Filters the table, leaving only rows where the value in column is strictly greater than value.
[167]:
trips.where("Duration", are.above(1000))
[167]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
710070 | 11460 | 4/2/2015 18:13 | Powell Street BART | 39 | 4/2/2015 21:24 | Powell Street BART | 39 | 375 | Subscriber | 94107 |
589964 | 15097 | 1/3/2015 15:22 | Embarcadero at Sansome | 60 | 1/3/2015 19:33 | Golden Gate at Polk | 59 | 599 | Customer | 29225 |
831509 | 1057 | 7/2/2015 10:14 | 2nd at Folsom | 62 | 7/2/2015 10:31 | South Van Ness at Market | 66 | 631 | Subscriber | 94114 |
442750 | 6084 | 9/8/2014 13:32 | Embarcadero at Sansome | 60 | 9/8/2014 15:14 | Embarcadero at Sansome | 60 | 368 | Customer | 474454 |
608714 | 19799 | 1/18/2015 10:07 | University and Emerson | 35 | 1/18/2015 15:37 | San Francisco Caltrain (Townsend at 4th) | 70 | 686 | Customer | nil |
711961 | 1026 | 4/4/2015 7:07 | Davis at Jackson | 42 | 4/4/2015 7:24 | Harry Bridges Plaza (Ferry Building) | 50 | 189 | Subscriber | 94111 |
833071 | 2314 | 7/4/2015 11:13 | Market at 4th | 76 | 7/4/2015 11:52 | Washington at Kearny | 46 | 602 | Customer | 94806 |
570731 | 1218 | 12/8/2014 23:51 | MLK Library | 11 | 12/9/2014 0:12 | St James Park | 13 | 299 | Customer | 95033 |
853698 | 1048 | 7/20/2015 10:53 | Broadway St at Battery St | 82 | 7/20/2015 11:10 | Embarcadero at Sansome | 60 | 636 | Customer | 91436 |
787510 | 3670 | 5/31/2015 10:47 | Mountain View City Hall | 27 | 5/31/2015 11:48 | Castro Street and El Camino Real | 32 | 713 | Customer | 94041 |
... (11576 rows omitted)
are.above_or_equal_to()¶
tbl.where(column, are.above_or_equal_to(value))
Filters the table, leaving only rows where the value in column is greater than or equal to value.
[168]:
trips.where("Duration", are.above_or_equal_to(1000))
[168]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
710070 | 11460 | 4/2/2015 18:13 | Powell Street BART | 39 | 4/2/2015 21:24 | Powell Street BART | 39 | 375 | Subscriber | 94107 |
589964 | 15097 | 1/3/2015 15:22 | Embarcadero at Sansome | 60 | 1/3/2015 19:33 | Golden Gate at Polk | 59 | 599 | Customer | 29225 |
831509 | 1057 | 7/2/2015 10:14 | 2nd at Folsom | 62 | 7/2/2015 10:31 | South Van Ness at Market | 66 | 631 | Subscriber | 94114 |
442750 | 6084 | 9/8/2014 13:32 | Embarcadero at Sansome | 60 | 9/8/2014 15:14 | Embarcadero at Sansome | 60 | 368 | Customer | 474454 |
608714 | 19799 | 1/18/2015 10:07 | University and Emerson | 35 | 1/18/2015 15:37 | San Francisco Caltrain (Townsend at 4th) | 70 | 686 | Customer | nil |
711961 | 1026 | 4/4/2015 7:07 | Davis at Jackson | 42 | 4/4/2015 7:24 | Harry Bridges Plaza (Ferry Building) | 50 | 189 | Subscriber | 94111 |
833071 | 2314 | 7/4/2015 11:13 | Market at 4th | 76 | 7/4/2015 11:52 | Washington at Kearny | 46 | 602 | Customer | 94806 |
570731 | 1218 | 12/8/2014 23:51 | MLK Library | 11 | 12/9/2014 0:12 | St James Park | 13 | 299 | Customer | 95033 |
853698 | 1048 | 7/20/2015 10:53 | Broadway St at Battery St | 82 | 7/20/2015 11:10 | Embarcadero at Sansome | 60 | 636 | Customer | 91436 |
787510 | 3670 | 5/31/2015 10:47 | Mountain View City Hall | 27 | 5/31/2015 11:48 | Castro Street and El Camino Real | 32 | 713 | Customer | 94041 |
... (11597 rows omitted)
are.below()¶
tbl.where(column, are.below(value))
Filters the table, leaving only rows where the value in column is strictly less than value.
[170]:
trips.where("Duration", are.below(100))
[170]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
482797 | 65 | 10/4/2014 7:50 | San Francisco Caltrain (Townsend at 4th) | 70 | 10/4/2014 7:52 | San Francisco Caltrain (Townsend at 4th) | 70 | 430 | Subscriber | 95112 |
483052 | 81 | 10/4/2014 13:52 | Harry Bridges Plaza (Ferry Building) | 50 | 10/4/2014 13:53 | Harry Bridges Plaza (Ferry Building) | 50 | 306 | Customer | nan |
569620 | 84 | 12/8/2014 10:09 | Civic Center BART (7th at Market) | 72 | 12/8/2014 10:10 | Civic Center BART (7th at Market) | 72 | 326 | Subscriber | 94111 |
502332 | 79 | 10/16/2014 17:26 | Beale at Market | 56 | 10/16/2014 17:27 | Temporary Transbay Terminal (Howard at Beale) | 55 | 613 | Subscriber | 94602 |
604012 | 76 | 1/14/2015 15:18 | Davis at Jackson | 42 | 1/14/2015 15:19 | Broadway St at Battery St | 82 | 601 | Subscriber | 94107 |
704918 | 70 | 3/30/2015 22:51 | Broadway St at Battery St | 82 | 3/30/2015 22:52 | Broadway St at Battery St | 82 | 394 | Subscriber | 94107 |
513458 | 83 | 10/24/2014 8:50 | 2nd at Folsom | 62 | 10/24/2014 8:51 | Howard at 2nd | 63 | 569 | Subscriber | 94107 |
696725 | 94 | 3/25/2015 8:47 | Post at Kearny | 47 | 3/25/2015 8:49 | Washington at Kearny | 46 | 516 | Subscriber | 94109 |
829817 | 86 | 7/1/2015 9:27 | Market at Sansome | 77 | 7/1/2015 9:28 | 2nd at South Park | 64 | 292 | Subscriber | 94538 |
745895 | 73 | 4/29/2015 13:05 | Yerba Buena Center of the Arts (3rd @ Howard) | 68 | 4/29/2015 13:06 | Yerba Buena Center of the Arts (3rd @ Howard) | 68 | 380 | Subscriber | 94947 |
... (403 rows omitted)
are.below_or_equal_to()¶
tbl.where(column, are.below_or_equal_to(value))
Filters the table, leaving only rows where the value in column is less than or equal to value.
[171]:
trips.where("Duration", are.below_or_equal_to(100))
[171]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
482797 | 65 | 10/4/2014 7:50 | San Francisco Caltrain (Townsend at 4th) | 70 | 10/4/2014 7:52 | San Francisco Caltrain (Townsend at 4th) | 70 | 430 | Subscriber | 95112 |
483052 | 81 | 10/4/2014 13:52 | Harry Bridges Plaza (Ferry Building) | 50 | 10/4/2014 13:53 | Harry Bridges Plaza (Ferry Building) | 50 | 306 | Customer | nan |
569620 | 84 | 12/8/2014 10:09 | Civic Center BART (7th at Market) | 72 | 12/8/2014 10:10 | Civic Center BART (7th at Market) | 72 | 326 | Subscriber | 94111 |
502332 | 79 | 10/16/2014 17:26 | Beale at Market | 56 | 10/16/2014 17:27 | Temporary Transbay Terminal (Howard at Beale) | 55 | 613 | Subscriber | 94602 |
604012 | 76 | 1/14/2015 15:18 | Davis at Jackson | 42 | 1/14/2015 15:19 | Broadway St at Battery St | 82 | 601 | Subscriber | 94107 |
704918 | 70 | 3/30/2015 22:51 | Broadway St at Battery St | 82 | 3/30/2015 22:52 | Broadway St at Battery St | 82 | 394 | Subscriber | 94107 |
513458 | 83 | 10/24/2014 8:50 | 2nd at Folsom | 62 | 10/24/2014 8:51 | Howard at 2nd | 63 | 569 | Subscriber | 94107 |
696725 | 94 | 3/25/2015 8:47 | Post at Kearny | 47 | 3/25/2015 8:49 | Washington at Kearny | 46 | 516 | Subscriber | 94109 |
808199 | 100 | 6/15/2015 20:57 | Post at Kearny | 47 | 6/15/2015 20:58 | 2nd at South Park | 64 | 537 | Subscriber | 94107 |
829817 | 86 | 7/1/2015 9:27 | Market at Sansome | 77 | 7/1/2015 9:28 | 2nd at South Park | 64 | 292 | Subscriber | 94538 |
... (430 rows omitted)
are.between()¶
tbl.where(column, are.between(x, y))
Filters the table, leaving only rows where the value in column is greater than or equal to x and less than y (i.e. in the interval \([x, y)\)).
[172]:
trips.where("Duration", are.between(100, 200))
[172]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
437830 | 151 | 9/4/2014 9:13 | Grant Avenue at Columbus Avenue | 73 | 9/4/2014 9:15 | Commercial at Montgomery | 45 | 306 | Subscriber | 94104 |
436255 | 195 | 9/3/2014 11:53 | 2nd at Folsom | 62 | 9/3/2014 11:57 | Spear at Folsom | 49 | 403 | Subscriber | 94107 |
585884 | 151 | 12/26/2014 13:34 | Broadway St at Battery St | 82 | 12/26/2014 13:37 | Harry Bridges Plaza (Ferry Building) | 50 | 576 | Subscriber | 94107 |
548322 | 191 | 11/17/2014 20:10 | Yerba Buena Center of the Arts (3rd @ Howard) | 68 | 11/17/2014 20:13 | Market at Sansome | 77 | 29 | Subscriber | 94705 |
594999 | 185 | 1/7/2015 17:53 | San Antonio Caltrain Station | 29 | 1/7/2015 17:56 | San Antonio Shopping Center | 31 | 176 | Subscriber | 94040 |
468534 | 194 | 9/24/2014 19:08 | Mechanics Plaza (Market at Battery) | 75 | 9/24/2014 19:11 | Harry Bridges Plaza (Ferry Building) | 50 | 443 | Subscriber | 94107 |
873710 | 169 | 8/3/2015 17:20 | Broadway St at Battery St | 82 | 8/3/2015 17:23 | Embarcadero at Sansome | 60 | 532 | Subscriber | 94114 |
853087 | 168 | 7/20/2015 7:27 | Temporary Transbay Terminal (Howard at Beale) | 55 | 7/20/2015 7:30 | 2nd at Folsom | 62 | 418 | Subscriber | 94602 |
863019 | 162 | 7/27/2015 8:31 | Temporary Transbay Terminal (Howard at Beale) | 55 | 7/27/2015 8:34 | Mechanics Plaza (Market at Battery) | 75 | 504 | Subscriber | 94111 |
883134 | 173 | 8/10/2015 15:11 | Embarcadero at Folsom | 51 | 8/10/2015 15:14 | Beale at Market | 56 | 363 | Subscriber | 94117 |
... (5083 rows omitted)
are.between_or_equal_to()¶
tbl.where(column, are.between_or_equal_to(x, y))
Filter leaves rows only where the value in column is greater than or equal to x and less than or equal to y (i.e. in the interval \([x, y]\)).
[173]:
trips.where("Duration", are.between_or_equal_to(100, 200))
[173]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
437830 | 151 | 9/4/2014 9:13 | Grant Avenue at Columbus Avenue | 73 | 9/4/2014 9:15 | Commercial at Montgomery | 45 | 306 | Subscriber | 94104 |
436255 | 195 | 9/3/2014 11:53 | 2nd at Folsom | 62 | 9/3/2014 11:57 | Spear at Folsom | 49 | 403 | Subscriber | 94107 |
585884 | 151 | 12/26/2014 13:34 | Broadway St at Battery St | 82 | 12/26/2014 13:37 | Harry Bridges Plaza (Ferry Building) | 50 | 576 | Subscriber | 94107 |
548322 | 191 | 11/17/2014 20:10 | Yerba Buena Center of the Arts (3rd @ Howard) | 68 | 11/17/2014 20:13 | Market at Sansome | 77 | 29 | Subscriber | 94705 |
903735 | 200 | 8/25/2015 7:59 | Temporary Transbay Terminal (Howard at Beale) | 55 | 8/25/2015 8:02 | Steuart at Market | 74 | 453 | Subscriber | 94501 |
594999 | 185 | 1/7/2015 17:53 | San Antonio Caltrain Station | 29 | 1/7/2015 17:56 | San Antonio Shopping Center | 31 | 176 | Subscriber | 94040 |
468534 | 194 | 9/24/2014 19:08 | Mechanics Plaza (Market at Battery) | 75 | 9/24/2014 19:11 | Harry Bridges Plaza (Ferry Building) | 50 | 443 | Subscriber | 94107 |
873710 | 169 | 8/3/2015 17:20 | Broadway St at Battery St | 82 | 8/3/2015 17:23 | Embarcadero at Sansome | 60 | 532 | Subscriber | 94114 |
853087 | 168 | 7/20/2015 7:27 | Temporary Transbay Terminal (Howard at Beale) | 55 | 7/20/2015 7:30 | 2nd at Folsom | 62 | 418 | Subscriber | 94602 |
863019 | 162 | 7/27/2015 8:31 | Temporary Transbay Terminal (Howard at Beale) | 55 | 7/27/2015 8:34 | Mechanics Plaza (Market at Battery) | 75 | 504 | Subscriber | 94111 |
... (5180 rows omitted)
are.contained_in()¶
tbl.where(column, are.contained_in(string_or_array))
Filter leaves rows only where the value in column is a substring of string_or_array (if string_or_array is a string) or an element of string_or_array (if it is an array).
[176]:
trips.where("Start Station", are.contained_in("2nd at Folsom San Antonio Caltrain Station"))
[176]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
436255 | 195 | 9/3/2014 11:53 | 2nd at Folsom | 62 | 9/3/2014 11:57 | Spear at Folsom | 49 | 403 | Subscriber | 94107 |
831509 | 1057 | 7/2/2015 10:14 | 2nd at Folsom | 62 | 7/2/2015 10:31 | South Van Ness at Market | 66 | 631 | Subscriber | 94114 |
877160 | 306 | 8/5/2015 16:33 | 2nd at Folsom | 62 | 8/5/2015 16:39 | Beale at Market | 56 | 527 | Subscriber | 94602 |
768619 | 840 | 5/15/2015 11:35 | 2nd at Folsom | 62 | 5/15/2015 11:49 | Market at 10th | 67 | 604 | Subscriber | 94903 |
594999 | 185 | 1/7/2015 17:53 | San Antonio Caltrain Station | 29 | 1/7/2015 17:56 | San Antonio Shopping Center | 31 | 176 | Subscriber | 94040 |
701211 | 252 | 3/27/2015 16:26 | 2nd at Folsom | 62 | 3/27/2015 16:30 | Spear at Folsom | 49 | 405 | Subscriber | 94105 |
487432 | 561 | 10/7/2014 17:48 | 2nd at Folsom | 62 | 10/7/2014 17:58 | Commercial at Montgomery | 45 | 342 | Subscriber | 94107 |
610970 | 808 | 1/20/2015 13:28 | 2nd at Folsom | 62 | 1/20/2015 13:42 | Harry Bridges Plaza (Ferry Building) | 50 | 310 | Subscriber | 94025 |
753668 | 196 | 5/5/2015 11:48 | 2nd at Folsom | 62 | 5/5/2015 11:52 | Temporary Transbay Terminal (Howard at Beale) | 55 | 533 | Subscriber | 94973 |
466551 | 222 | 9/23/2014 18:12 | 2nd at Folsom | 62 | 9/23/2014 18:16 | 2nd at Townsend | 61 | 620 | Subscriber | 94107 |
... (2578 rows omitted)
[178]:
trips.where("Start Terminal", are.contained_in(make_array(62, 29)))
[178]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
436255 | 195 | 9/3/2014 11:53 | 2nd at Folsom | 62 | 9/3/2014 11:57 | Spear at Folsom | 49 | 403 | Subscriber | 94107 |
831509 | 1057 | 7/2/2015 10:14 | 2nd at Folsom | 62 | 7/2/2015 10:31 | South Van Ness at Market | 66 | 631 | Subscriber | 94114 |
877160 | 306 | 8/5/2015 16:33 | 2nd at Folsom | 62 | 8/5/2015 16:39 | Beale at Market | 56 | 527 | Subscriber | 94602 |
768619 | 840 | 5/15/2015 11:35 | 2nd at Folsom | 62 | 5/15/2015 11:49 | Market at 10th | 67 | 604 | Subscriber | 94903 |
594999 | 185 | 1/7/2015 17:53 | San Antonio Caltrain Station | 29 | 1/7/2015 17:56 | San Antonio Shopping Center | 31 | 176 | Subscriber | 94040 |
701211 | 252 | 3/27/2015 16:26 | 2nd at Folsom | 62 | 3/27/2015 16:30 | Spear at Folsom | 49 | 405 | Subscriber | 94105 |
487432 | 561 | 10/7/2014 17:48 | 2nd at Folsom | 62 | 10/7/2014 17:58 | Commercial at Montgomery | 45 | 342 | Subscriber | 94107 |
610970 | 808 | 1/20/2015 13:28 | 2nd at Folsom | 62 | 1/20/2015 13:42 | Harry Bridges Plaza (Ferry Building) | 50 | 310 | Subscriber | 94025 |
753668 | 196 | 5/5/2015 11:48 | 2nd at Folsom | 62 | 5/5/2015 11:52 | Temporary Transbay Terminal (Howard at Beale) | 55 | 533 | Subscriber | 94973 |
466551 | 222 | 9/23/2014 18:12 | 2nd at Folsom | 62 | 9/23/2014 18:16 | 2nd at Townsend | 61 | 620 | Subscriber | 94107 |
... (2578 rows omitted)
are.containing()¶
tbl.where(column, are.containing(value))
Filter leaves rows only where the value in column contains the substring value.
[180]:
trips.where("End Station", are.containing("at"))
[180]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
876419 | 413 | 8/5/2015 8:29 | Civic Center BART (7th at Market) | 72 | 8/5/2015 8:36 | Townsend at 7th | 65 | 269 | Subscriber | 94518 |
459672 | 408 | 9/18/2014 17:11 | Harry Bridges Plaza (Ferry Building) | 50 | 9/18/2014 17:17 | Embarcadero at Sansome | 60 | 429 | Subscriber | 94111 |
903647 | 723 | 8/25/2015 7:26 | San Francisco Caltrain 2 (330 Townsend) | 69 | 8/25/2015 7:38 | Market at 10th | 67 | 631 | Subscriber | 94025 |
452829 | 409 | 9/15/2014 8:29 | Steuart at Market | 74 | 9/15/2014 8:36 | Market at 4th | 76 | 428 | Subscriber | 94925 |
491023 | 224 | 10/9/2014 16:13 | Santa Clara at Almaden | 4 | 10/9/2014 16:17 | San Jose Diridon Caltrain Station | 2 | 144 | Subscriber | 94117 |
723352 | 519 | 4/13/2015 17:04 | Howard at 2nd | 63 | 4/13/2015 17:12 | San Francisco Caltrain (Townsend at 4th) | 70 | 629 | Subscriber | 94061 |
524499 | 431 | 10/31/2014 16:36 | Townsend at 7th | 65 | 10/31/2014 16:43 | Civic Center BART (7th at Market) | 72 | 630 | Subscriber | 94706 |
518524 | 389 | 10/28/2014 8:48 | Market at Sansome | 77 | 10/28/2014 8:54 | 2nd at South Park | 64 | 458 | Subscriber | 94610 |
793149 | 616 | 6/4/2015 5:26 | Embarcadero at Bryant | 54 | 6/4/2015 5:36 | Embarcadero at Sansome | 60 | 289 | Subscriber | 94105 |
681771 | 895 | 3/14/2015 11:46 | Market at 10th | 67 | 3/14/2015 12:01 | Market at 4th | 76 | 416 | Subscriber | 94107 |
... (78805 rows omitted)
are.strictly_between()¶
tbl.where(column, are.strictly_between(x, y))
Filter leaves rows only where the value in column is strictly greater than x and less than y (i.e. in the interval \((x, y)\)).
[181]:
trips.where("Duration", are.strictly_between(100, 200))
[181]:
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
---|---|---|---|---|---|---|---|---|---|---|
437830 | 151 | 9/4/2014 9:13 | Grant Avenue at Columbus Avenue | 73 | 9/4/2014 9:15 | Commercial at Montgomery | 45 | 306 | Subscriber | 94104 |
436255 | 195 | 9/3/2014 11:53 | 2nd at Folsom | 62 | 9/3/2014 11:57 | Spear at Folsom | 49 | 403 | Subscriber | 94107 |
585884 | 151 | 12/26/2014 13:34 | Broadway St at Battery St | 82 | 12/26/2014 13:37 | Harry Bridges Plaza (Ferry Building) | 50 | 576 | Subscriber | 94107 |
548322 | 191 | 11/17/2014 20:10 | Yerba Buena Center of the Arts (3rd @ Howard) | 68 | 11/17/2014 20:13 | Market at Sansome | 77 | 29 | Subscriber | 94705 |
594999 | 185 | 1/7/2015 17:53 | San Antonio Caltrain Station | 29 | 1/7/2015 17:56 | San Antonio Shopping Center | 31 | 176 | Subscriber | 94040 |
468534 | 194 | 9/24/2014 19:08 | Mechanics Plaza (Market at Battery) | 75 | 9/24/2014 19:11 | Harry Bridges Plaza (Ferry Building) | 50 | 443 | Subscriber | 94107 |
873710 | 169 | 8/3/2015 17:20 | Broadway St at Battery St | 82 | 8/3/2015 17:23 | Embarcadero at Sansome | 60 | 532 | Subscriber | 94114 |
853087 | 168 | 7/20/2015 7:27 | Temporary Transbay Terminal (Howard at Beale) | 55 | 7/20/2015 7:30 | 2nd at Folsom | 62 | 418 | Subscriber | 94602 |
863019 | 162 | 7/27/2015 8:31 | Temporary Transbay Terminal (Howard at Beale) | 55 | 7/27/2015 8:34 | Mechanics Plaza (Market at Battery) | 75 | 504 | Subscriber | 94111 |
883134 | 173 | 8/10/2015 15:11 | Embarcadero at Folsom | 51 | 8/10/2015 15:14 | Beale at Market | 56 | 363 | Subscriber | 94117 |
... (5056 rows omitted)
Miscellaneous Functions¶
sample_proportions()¶
sample_proportions(sample_size, model_proportions)
Samples sample_size objects from the distribution specified by model_proportions. sample_size should be an integer, and model_proportions should be an array of probabilities that sum to 1. It returns an array with the same size as model_proportions; each item in the array is the proportion of the sample_size draws that fell into the corresponding category.
[182]:
sample_proportions(100, [.5, .3, .2])
[182]:
array([0.32, 0.32, 0.36])
minimize()¶
minimize(function)
Returns an array of values that minimize function. function should take some number of numeric arguments and return a number. The array returned by minimize is ordered so that, if its values are passed into function as arguments (in order), they minimize the output value of function.
[190]:
def f(x, y):
    return 0.47 * x**2 + 1.23 * np.log(y)

minimize(f)
[190]:
array([ 5.17585792, -0.58835469])
Reference¶
Tables (datascience.tables)¶
Summary of methods for Table. Click a method to see its documentation.
One note about reading the method signatures for this page: each method is listed with its arguments. However, optional arguments are specified in brackets. That is, a method that’s documented like
Table.foo(first_arg, second_arg[, some_other_arg, fourth_arg])
means that the Table.foo method must be called with first_arg and second_arg, and optionally with some_other_arg and fourth_arg. That means the following are valid ways to call Table.foo:
some_table.foo(1, 2)
some_table.foo(1, 2, 'hello')
some_table.foo(1, 2, 'hello', 'world')
some_table.foo(1, 2, some_other_arg='hello')
But these are not valid:
some_table.foo(1) # Missing arg
some_table.foo(1, 2[, 'hi']) # SyntaxError
some_table.foo(1, 2[, 'hello', 'world']) # SyntaxError
If that syntax is confusing, you can click the method name itself to get to the details page for that method. That page will have a more straightforward syntax.
Creation
- Create an empty table with column labels.
- Create a table from a sequence of records (dicts with fixed keys).
- Create a table from a mapping of column labels to column values.
- Read a table from a file or web address.
- Convert a Pandas DataFrame into a Table.
- Convert a structured NumPy array into a Table.
Extension (does not modify original table)
- Return a new table with an additional or replaced column.
- Return a table with additional or replaced columns.
- Return a table with an additional row.
- Return a table with additional rows.
- Return a new table with …
Accessing values
- Number of columns.
- Return a tuple of columns, each with the values in that column.
- Return the values of a column as an array.
- Computes the number of rows in a table.
- Return a view of all rows.
- Return a row.
- Return a tuple of column labels.
- Return the zeroth item in a column.
- Return the last item in a column.
- Return data in self as a numpy array.
- Return the index of a column by looking up its label.
- Apply …
Mutation (modifies table in place)
- Set the pretty print format of a column(s) and/or convert its values.
- Move a column to be the first column.
- Move a column to be the last column.
- Append a row or all rows of a table in place.
- Appends a column to the table or replaces a column.
- Changes the label(s) of column(s) specified by …
- Removes a row or multiple rows of a table in place (row number is 0 indexed).
Transformation (creates a new table)
- Return a copy of a table.
- Return a table with only the columns in …
- Return a Table with only columns other than selected label or labels.
- Return a new Table with selected rows taken by index.
- Return a new Table without a sequence of rows excluded by number.
- Returns a new table with specified column moved to the specified column index.
- Return a new …
- Return a Table of rows sorted according to the values in a column.
- Group rows by unique values in a column; count or aggregate others.
- Group rows by multiple columns, count or aggregate others.
- Generate a table with a column for each unique value in …
- Takes k original columns and returns two columns, with col. …
- Creates a new table with the columns of self and other, containing rows for all values of a column that appear in both tables.
- Compute statistics for each column and place them in a table.
- Return a new table with one row containing the pth percentile for each column.
- Return a new table where k rows are randomly sampled from the original table.
- Return a new table where all the rows are randomly shuffled from the original table.
- Return a new table with the same number of rows and a new column.
- Return a tuple of two tables where the first table contains …
- Group values by bin and compute counts per bin by column.
- Form a table with columns formed by the unique tuples in pivot_columns containing counts per bin of the values associated with each tuple in the value_column.
- Return a new table with …
- Return a table with an additional row.
- Return a table with additional rows.
- Return a new table with an additional or replaced column.
- Return a table with additional or replaced columns.
Exporting / Displaying
- Display the table.
- Format table as text.
- Format table as HTML.
- Return a dict keyed by values in a column that contains lists of …
- Convert the table to a structured NumPy array.
- Convert the table to a Pandas DataFrame.
- Creates a CSV file with the provided filename.
Visualizations
- Plot line charts for the table.
- Plot bar charts for the table.
- Plot a bar chart for the table.
- Plot horizontal bar charts for the table.
- Plot a horizontal bar chart for the table.
- Draw histograms of each category in a column.
- Plots one histogram for each column in columns.
- Plots one count-based histogram for each column in columns.
- Creates scatterplots, optionally adding a line of best fit.
- Convenience wrapper for …
- Plots a boxplot for the table.
- Redirects …
- Turns off redirection of …
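Taken together, these categories describe a typical workflow: create or read a table, transform it, then display or plot the result. Below is a minimal sketch of that flow; the file name trips.csv and the Duration and Start Station column labels are placeholders for illustration, not part of the package.
from datascience import Table
from datascience.predicates import are

trips = Table.read_table('trips.csv')                         # Creation
short = trips.where('Duration', are.below_or_equal_to(100))  # Transformation: filter rows
by_station = short.group('Start Station')                     # Transformation: adds a 'count' column
by_station.sort('count', descending=True).show(5)             # Exporting / Displaying
Each transformation returns a new table, so steps can be chained freely without modifying the original.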
Maps (datascience.maps)¶
Draw maps using folium.
- class datascience.maps.Circle(lat, lon, popup='', color='blue', area=314.1592653589793, **kwargs)[source]¶
A marker displayed with either Folium’s circle_marker or circle methods.
The circle_marker method draws circles that stay the same size regardless of map zoom, whereas the circle method draws circles that have a fixed radius in meters. To toggle between them, use the radius_in_meters flag in the draw_on function.
popup – text that pops up when marker is clicked
color – fill color
area – pixel-squared area of the circle
Defaults from Folium:
- fill_opacity: float, default 0.6
Circle fill opacity
More options can be passed into kwargs by following the attributes listed in https://leafletjs.com/reference-1.4.0.html#circlemarker or https://leafletjs.com/reference-1.4.0.html#circle.
For example, to draw three circles with circle_marker:
t = Table().with_columns([
    'lat', [37.8, 38, 37.9],
    'lon', [-122, -122.1, -121.9],
    'label', ['one', 'two', 'three'],
    'color', ['red', 'green', 'blue'],
    'area', [3000, 4000, 5000],
])
Circle.map_table(t)
To draw three circles with the circle methods, replace the last line with:
Circle.map_table(t, radius_in_meters=True)
- class datascience.maps.Map(features=(), ids=(), width=960, height=500, **kwargs)[source]¶
A map from IDs to features. Keyword args are forwarded to folium.
- color(values, ids=(), key_on='feature.id', palette='YlOrBr', **kwargs)[source]¶
Color map features by binning values.
values – a sequence of values or a table of keys and values
ids – an ID for each value; if none are provided, indices are used
key_on – attribute of each feature to match to ids
palette – one of the following color brewer palettes:
‘BuGn’, ‘BuPu’, ‘GnBu’, ‘OrRd’, ‘PuBu’, ‘PuBuGn’, ‘PuRd’, ‘RdPu’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, and ‘YlOrRd’.
Defaults from Folium:
- threshold_scale: list, default None
Data range for D3 threshold scale. Defaults to the following range of quantiles: [0, 0.5, 0.75, 0.85, 0.9], rounded to the nearest order-of-magnitude integer. Ex: 270 rounds to 200, 5600 to 6000.
- fill_opacity: float, default 0.6
Area fill opacity, range 0-1.
- line_color: string, default ‘black’
GeoJSON geopath line color.
- line_weight: int, default 1
GeoJSON geopath line weight.
- line_opacity: float, default 1
GeoJSON geopath line opacity, range 0-1.
- legend_name: string, default None
Title for data legend. If not passed, defaults to columns[1].
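For example, a minimal sketch of binning and coloring regions by value. The file name us-states.json, the state IDs, and the rate values are placeholders; the GeoJSON is assumed to be loaded with Map.read_geojson, and the values table is assumed to list keys first and values second.
from datascience import Table
from datascience.maps import Map

states = Map.read_geojson('us-states.json')   # hypothetical GeoJSON of regions
rates = Table().with_columns(
    'geo',  ['CA', 'NY', 'TX'],               # keys matched against each feature's id
    'rate', [0.12, 0.09, 0.15],               # values to bin and color
)
states.color(rates, key_on='feature.id', palette='YlGnBu')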
- copy()[source]¶
Copies the current Map into a new one and returns it. Note: this only copies rendering attributes; the underlying map is NOT deep-copied, due to a limitation in Folium. Ref: https://github.com/python-visualization/folium/issues/1207
- property features¶
- overlay(feature, color='Blue', opacity=0.6)[source]¶
Overlays feature on the map. Returns a new Map.
- Args:
feature: a Table of map features, a list of map features, a Map, a Region, or a circle marker map table. The features will be overlaid on the Map with the specified color.
color (str): Color of feature. Defaults to ‘Blue’.
opacity (float): Opacity of the overlaid feature. Defaults to 0.6.
- Returns:
A new Map with the overlaid feature.
- class datascience.maps.Marker(lat, lon, popup='', color='blue', **kwargs)[source]¶
A marker displayed with Folium’s simple_marker method.
popup – text that pops up when marker is clicked
color – The color of the marker. You can use: [‘red’, ‘blue’, ‘green’, ‘purple’, ‘orange’, ‘darkred’, ’lightred’, ‘beige’, ‘darkblue’, ‘darkgreen’, ‘cadetblue’, ‘darkpurple’, ‘white’, ‘pink’, ‘lightblue’, ‘lightgreen’, ‘gray’, ‘black’, ‘lightgray’] to use standard folium icons. If a hex color code is provided (color must start with ‘#’), a folium.plugin.BeautifyIcon will be used instead.
Defaults from Folium:
- marker_icon: string, default ‘info-sign’
icon from (http://getbootstrap.com/components/) you want on the marker
- clustered_marker: boolean, default False
boolean of whether or not you want the marker clustered with other markers
- icon_angle: int, default 0
angle of icon
- popup_width: int, default 300
width of popup
The icon can be further customized by passing attributes into kwargs, using the attributes listed in https://python-visualization.github.io/folium/modules.html#folium.map.Icon.
- property lat_lons¶
Sequence of lat_lons that describe a map feature (for zooming).
- classmethod map(latitudes, longitudes, labels=None, colors=None, areas=None, other_attrs=None, clustered_marker=False, **kwargs)[source]¶
Return markers from columns of coordinates, labels, & colors.
The areas column is not applicable to markers, but sets circle areas.
Arguments: (TODO) document all options
- index_map: list of integers, default None (when not applicable)
list of indices that maps each marker to a corresponding label at the index in cluster_labels (only applicable when multiple marker clusters are being used)
- cluster_labels: list of strings, default None (when not applicable)
list of labels used for each cluster of markers (only applicable when multiple marker clusters are being used)
- colorbar_scale: list of floats, default None (when not applicable)
list of cutoffs used to indicate where the bins are for each color (only applicable when colorscale gradient is being used)
- include_color_scale_outliers: boolean, default None (when not applicable)
boolean of whether or not outliers are included in the colorscale gradient for markers (only applicable when colorscale gradient is being used)
- radius_in_meters: boolean, default False
boolean of whether or not Circles should have their radii specified in meters, scales with map zoom
- clustered_marker: boolean, default False
boolean of whether or not you want the marker clustered with other markers
- other_attrs: dictionary of (key) property names to (value) property values, default None
A dictionary that lists any other attributes that the Marker/Circle class should have
- classmethod map_table(table, clustered_marker=False, include_color_scale_outliers=True, radius_in_meters=False, **kwargs)[source]¶
Return markers from the columns of a table.
The first two columns of the table must be the latitudes and longitudes (in that order), followed by ‘labels’, ‘colors’, ‘color_scale’, ‘radius_scale’, ‘cluster_by’, ‘area_scale’, and/or ‘areas’ (if applicable) in any order with columns explicitly stating what property they are representing.
- Args:
cls: Type of marker being drawn on the map {Marker, Circle}.
table: Table of data to be made into markers. The first two columns of the table must be the latitudes and longitudes (in that order), followed by ‘labels’, ‘colors’, ‘cluster_by’, ‘color_scale’, ‘radius_scale’, ‘area_scale’, and/or ‘areas’ (if applicable) in any order, with columns explicitly stating which property they represent. Additional columns for marker-specific attributes such as ‘marker_icon’ for the Marker class can be included as well.
clustered_marker: Boolean indicating if markers should be clustered with folium.plugins.MarkerCluster.
include_color_scale_outliers: Boolean indicating if outliers should be included in the color scale gradient or not.
radius_in_meters: Boolean indicating if circle markers should be drawn to map scale or zoom scale.
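For example, a minimal sketch of map_table; the coordinates, labels, and colors below are made-up illustration data, and the column names follow the convention described above (latitudes and longitudes first, then named property columns).
from datascience import Table
from datascience.maps import Marker

points = Table().with_columns(
    'lat',    [37.8716, 37.8044, 37.7749],        # latitudes first
    'lon',    [-122.2727, -122.2712, -122.4194],  # longitudes second
    'labels', ['Berkeley', 'Oakland', 'San Francisco'],
    'colors', ['blue', 'green', 'red'],
)
Marker.map_table(points)                          # one marker per row
Marker.map_table(points, clustered_marker=True)   # cluster nearby markers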
- class datascience.maps.Region(geojson, **kwargs)[source]¶
A GeoJSON feature displayed with Folium’s geo_json method.
- property lat_lons¶
A flat list of (lat, lon) pairs.
- property polygons¶
Return a list of polygons describing the region.
Each polygon is a list of linear rings, where the first describes the exterior and the rest describe interior holes.
Each linear ring is a list of positions where the last is a repeat of the first.
Each position is a (lat, lon) pair.
- property properties¶
- property type¶
The GEOJSON type of the regions: Polygon or MultiPolygon.
- datascience.maps.get_coordinates(table, replace_columns=False, remove_nans=False)[source]¶
Adds latitude and longitude coordinates to table based on other location identifiers. Must be in the United States.
Takes a table with columns named "zip code" or "city" and/or "county" and "state" and adds the columns "lat" and "lon". If a county is not found inside the dataset, that row's latitude and longitude coordinates are replaced with np.nans. The 'replace_columns' flag indicates if the "city", "county", "state", and "zip code" columns should be removed afterwards. The 'remove_nans' flag indicates if rows with nan latitudes and longitudes should be removed. Lookup is robust to capitalization in city and county names. If a row's location matches multiple zip codes, the latitude and longitude pair assigned to the row will correspond to the smallest of those zip codes.
Dataset was acquired on July 2, 2020 from https://docs.gaslamp.media/download-zip-code-latitude-longitude-city-state-county-csv. Found in geocode_datasets/geocode_states.csv. Modified column names and made city/county columns all in lowercase.
- Args:
table: A table with locations that need to be mapped to coordinates
replace_columns: A boolean that indicates if the "county", "city", "state", and "zip code" columns should be removed
remove_nans: A boolean that indicates if rows with invalid longitudes and latitudes should be removed
- Returns:
Table with latitude and longitude coordinates
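A minimal sketch follows; the city/state pairs are placeholders, and it is assumed here that the bundled geocoding dataset recognizes two-letter state abbreviations.
from datascience import Table
from datascience.maps import get_coordinates

cities = Table().with_columns(
    'city',  ['Berkeley', 'Chicago', 'Houston'],
    'state', ['CA', 'IL', 'TX'],
)
# Adds 'lat' and 'lon' columns; rows that cannot be geocoded are dropped.
located = get_coordinates(cities, remove_nans=True)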
Predicates (datascience.predicates)¶
Predicate functions.
- class datascience.predicates.are[source]¶
Predicate functions. The class is named “are” for calls to where.
For example, given a table, predicates can be used to pick rows as follows.
>>> from datascience import Table
>>> t = Table().with_columns([
...     'Sizes', ['S', 'M', 'L', 'XL'],
...     'Waists', [30, 34, 38, 42],
... ])
>>> t.where('Sizes', are.equal_to('L'))
Sizes | Waists
L     | 38
>>> t.where('Waists', are.above(38))
Sizes | Waists
XL    | 42
>>> t.where('Waists', are.above_or_equal_to(38))
Sizes | Waists
L     | 38
XL    | 42
>>> t.where('Waists', are.below(38))
Sizes | Waists
S     | 30
M     | 34
>>> t.where('Waists', are.below_or_equal_to(38))
Sizes | Waists
S     | 30
M     | 34
L     | 38
>>> t.where('Waists', are.strictly_between(30, 38))
Sizes | Waists
M     | 34
>>> t.where('Waists', are.between(30, 38))
Sizes | Waists
S     | 30
M     | 34
>>> t.where('Waists', are.between_or_equal_to(30, 38))
Sizes | Waists
S     | 30
M     | 34
L     | 38
>>> t.where('Sizes', are.equal_to('L'))
Sizes | Waists
L     | 38
>>> t.where('Waists', are.not_above(38))
Sizes | Waists
S     | 30
M     | 34
L     | 38
>>> t.where('Waists', are.not_above_or_equal_to(38))
Sizes | Waists
S     | 30
M     | 34
>>> t.where('Waists', are.not_below(38))
Sizes | Waists
L     | 38
XL    | 42
>>> t.where('Waists', are.not_below_or_equal_to(38))
Sizes | Waists
XL    | 42
>>> t.where('Waists', are.not_strictly_between(30, 38))
Sizes | Waists
S     | 30
L     | 38
XL    | 42
>>> t.where('Waists', are.not_between(30, 38))
Sizes | Waists
L     | 38
XL    | 42
>>> t.where('Waists', are.not_between_or_equal_to(30, 38))
Sizes | Waists
XL    | 42
>>> t.where('Sizes', are.containing('L'))
Sizes | Waists
L     | 38
XL    | 42
>>> t.where('Sizes', are.not_containing('L'))
Sizes | Waists
S     | 30
M     | 34
>>> t.where('Sizes', are.contained_in('MXL'))
Sizes | Waists
M     | 34
L     | 38
XL    | 42
>>> t.where('Sizes', are.contained_in('L'))
Sizes | Waists
L     | 38
>>> t.where('Sizes', are.not_contained_in('MXL'))
Sizes | Waists
S     | 30
Formats (datascience.formats)¶
String formatting for table entries.
- class datascience.formats.CurrencyFormatter(symbol='$', *args, **vargs)[source]¶
Format currency and convert to float.
- class datascience.formats.DateFormatter(format='%Y-%m-%d %H:%M:%S.%f', *args, **vargs)[source]¶
Format date & time and convert to UNIX timestamp.
- class datascience.formats.DistributionFormatter(decimals=2, *args, **vargs)[source]¶
Normalize a column and format as percentages.
- class datascience.formats.Formatter(min_width=None, max_width=None, etc=None)[source]¶
String formatter that truncates long values.
- property converts_values¶
Whether this Formatter also converts values.
- etc = ' ...'¶
- max_width = 60¶
- min_width = 4¶
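Formatters are typically attached to a column with Table.set_format (listed under Mutation in the Table summary above). A minimal sketch with CurrencyFormatter follows; the table and its salary strings are made up for illustration.
from datascience import Table
from datascience.formats import CurrencyFormatter

staff = Table().with_columns(
    'Name',   ['Ana', 'Ben'],
    'Salary', ['$52000.00', '$61500.00'],
)
# Display 'Salary' as currency; CurrencyFormatter also converts the
# '$...' strings to floats so the column can be used in computations.
staff.set_format('Salary', CurrencyFormatter)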
Utility Functions (datascience.util)¶
Utility functions
- datascience.util.is_non_string_iterable(value)[source]¶
Returns a boolean value representing whether a value is an iterable other than a string.
- datascience.util.make_array(*elements)[source]¶
Returns an array containing all the arguments passed to this function. A simple way to make an array with a few elements.
As with any array, all arguments should have the same type.
- Args:
elements (variadic): elements
- Returns:
A NumPy array of the same length as the provided variadic argument elements
>>> make_array(0)
array([0])
>>> make_array(2, 3, 4)
array([2, 3, 4])
>>> make_array("foo", "bar")
array(['foo', 'bar'], dtype='<U3')
>>> make_array()
array([], dtype=float64)
- datascience.util.minimize(f, start=None, smooth=False, log=None, array=False, **vargs)[source]¶
Minimize a function f of one or more arguments.
- Args:
f: A function that takes numbers and returns a number
start: A starting value or list of starting values
smooth: Whether to assume that f is smooth and use first-order info
log: Logging function called on the result of optimization (e.g. print)
vargs: Other named arguments passed to scipy.optimize.minimize
- Returns either:
the minimizing argument of a one-argument function
an array of minimizing arguments of a multi-argument function
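For example, with made-up objective functions:
from datascience.util import minimize

# One argument: returns the single minimizing value (about 3.0 here).
minimize(lambda x: (x - 3) ** 2)

# Two arguments: returns an array of minimizing values (about [1.0, -2.0]).
minimize(lambda x, y: (x - 1) ** 2 + (y + 2) ** 2)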
- datascience.util.percentile(p, arr=None)[source]¶
Returns the pth percentile of the input array (the value that is at least as great as p% of the values in the array).
If arr is not provided, percentile returns itself curried with p
>>> percentile(74.9, [1, 3, 5, 9])
5
>>> percentile(75, [1, 3, 5, 9])
5
>>> percentile(75.1, [1, 3, 5, 9])
9
>>> f = percentile(75)
>>> f([1, 3, 5, 9])
5
- datascience.util.plot_cdf_area(rbound=None, lbound=None, mean=0, sd=1)¶
Plots a normal curve with specified parameters and the area below the curve shaded between lbound and rbound.
- Args:
rbound (numeric): right boundary of shaded region
lbound (numeric): left boundary of shaded region; by default is negative infinity
mean (numeric): mean/expectation of normal distribution
sd (numeric): standard deviation of normal distribution
- datascience.util.plot_normal_cdf(rbound=None, lbound=None, mean=0, sd=1)[source]¶
Plots a normal curve with specified parameters and the area below the curve shaded between lbound and rbound.
- Args:
rbound (numeric): right boundary of shaded region
lbound (numeric): left boundary of shaded region; by default is negative infinity
mean (numeric): mean/expectation of normal distribution
sd (numeric): standard deviation of normal distribution
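For example, assuming a plotting backend is active (e.g. an inline matplotlib backend in a notebook), the call below shades the standard normal curve from negative infinity up to 1.96:
from datascience.util import plot_normal_cdf

plot_normal_cdf(rbound=1.96)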
- datascience.util.proportions_from_distribution(table, label, sample_size, column_name='Random Sample')[source]¶
Adds a column named column_name containing the proportions of a random draw using the distribution in label.
This method uses np.random.Generator.multinomial to draw sample_size samples from the distribution in table.column(label), then divides by sample_size to create the resulting column of proportions.
- Args:
table: An instance of Table.
label: Label of column in table. This column must contain a distribution (the values must sum to 1).
sample_size: The size of the sample to draw from the distribution.
column_name: The name of the new column that contains the sampled proportions. Defaults to 'Random Sample'.
- Returns:
A copy of table with a column column_name containing the sampled proportions. The proportions will sum to 1.
- Throws:
ValueError: If the label is not in the table, or if table.column(label) does not sum to 1.
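A minimal sketch follows; the category labels and proportions are made up for illustration, and the 'Eligible' column sums to 1 as required.
from datascience import Table
from datascience.util import proportions_from_distribution

panel = Table().with_columns(
    'Ethnicity', ['Asian', 'Black', 'Latino', 'White', 'Other'],
    'Eligible',  [0.15, 0.18, 0.12, 0.54, 0.01],
)
# Draw 1000 samples from the 'Eligible' distribution and append the
# sampled proportions in a new 'Random Sample' column.
with_sample = proportions_from_distribution(panel, 'Eligible', 1000)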
- datascience.util.sample_proportions(sample_size: int, probabilities)[source]¶
Return the proportion of random draws for each outcome in a distribution.
This function is similar to np.random.Generator.multinomial, but returns proportions instead of counts.
- Args:
sample_size: The size of the sample to draw from the distribution.
probabilities: An array of probabilities that forms a distribution.
- Returns:
An array with the same length as probabilities that sums to 1.
- datascience.util.table_apply(table, func, subset=None)[source]¶
Applies a function to each column and returns a Table.
- Args:
table: The table to apply your function to.
func: The function to apply to each column.
subset: A list of columns to apply the function to; if None, the function will be applied to all columns in table.
- Returns:
A table with the given function applied. Its shape will either match the shape of table, or be (1, table.shape[1]).
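A minimal sketch with made-up numeric columns; applying np.mean to each column is expected to collapse the table to a single summary row.
import numpy as np
from datascience import Table
from datascience.util import table_apply

scores = Table().with_columns(
    'midterm', [71, 84, 90],
    'final',   [65, 88, 95],
)
# One row containing the mean of each column.
table_apply(scores, np.mean)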