link experiment

This commit is contained in:
David Beazley
2020-05-26 09:21:19 -05:00
parent 9aec32e1a9
commit b5244b0e61
24 changed files with 4265 additions and 2 deletions

# Overview
A simple definition of *Iteration*: Looping over items.
```python
a = [2,4,10,37,62]
# Iterate over a
for x in a:
    ...
```
This is a very common pattern. Loops, list comprehensions, etc.
Most programs do a huge amount of iteration.

# 6.1 Iteration Protocol
This section looks at the process of iteration.
### Iteration Everywhere
Many different objects support iteration.
```python
a = 'hello'
for c in a:      # Loop over characters in a
    ...

b = { 'name': 'Dave', 'password': 'foo' }
for k in b:      # Loop over keys in dictionary
    ...

c = [1,2,3,4]
for i in c:      # Loop over items in a list/tuple
    ...

f = open('foo.txt')
for x in f:      # Loop over lines in a file
    ...
```
### Iteration: Protocol
Let's take an inside look at the `for` statement.
```python
for x in obj:
    # statements
```
What happens under the hood?
```python
_iter = obj.__iter__()        # Get iterator object
while True:
    try:
        x = _iter.__next__()  # Get next item
    except StopIteration:     # No more items
        break
    # statements ...
```
All of the objects that work with the `for` loop implement this low-level iteration protocol.
Example: Manual iteration over a list.
```python
>>> x = [1,2,3]
>>> it = x.__iter__()
>>> it
<listiterator object at 0x590b0>
>>> it.__next__()
1
>>> it.__next__()
2
>>> it.__next__()
3
>>> it.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration
>>>
```
### Supporting Iteration
Knowing about iteration is useful if you want to add it to your own objects.
For example, making a custom container.
```python
class Portfolio(object):
    def __init__(self):
        self.holdings = []

    def __iter__(self):
        return self.holdings.__iter__()
    ...

port = Portfolio()
for s in port:
    ...
```
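In practice, you rarely call the dunder methods directly. The built-in functions `iter()` and `next()` invoke them for you and are the idiomatic way to drive the protocol by hand. A minimal sketch:

```python
a = [2, 4, 10]
it = iter(a)           # Same as a.__iter__()
print(next(it))        # Same as it.__next__() -> 2
print(next(it))        # -> 4
print(next(it, None))  # next() with a default suppresses StopIteration -> 10
print(next(it, None))  # Exhausted -> None
```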
## Exercises
### (a) Iteration Illustrated
Create the following list:
```python
a = [1,9,4,25,16]
```
Manually iterate over this list. Call `__iter__()` to get an iterator and
call the `__next__()` method to obtain successive elements.
```python
>>> i = a.__iter__()
>>> i
<listiterator object at 0x64c10>
>>> i.__next__()
1
>>> i.__next__()
9
>>> i.__next__()
4
>>> i.__next__()
25
>>> i.__next__()
16
>>> i.__next__()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
>>>
```
The `next()` built-in function is a shortcut for calling
the `__next__()` method of an iterator. Try using it on a file:
```python
>>> f = open('Data/portfolio.csv')
>>> f.__iter__() # Note: This returns the file itself
<_io.TextIOWrapper name='Data/portfolio.csv' mode='r' encoding='UTF-8'>
>>> next(f)
'name,shares,price\n'
>>> next(f)
'"AA",100,32.20\n'
>>> next(f)
'"IBM",50,91.10\n'
>>>
```
Keep calling `next(f)` until you reach the end of the
file. Watch what happens.
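If you would rather not see the `StopIteration` traceback at end-of-file, note that `next()` accepts a default value that is returned instead of raising. A small self-contained sketch (here `io.StringIO` stands in for the real `Data/portfolio.csv` file, so the same loop works without the course data):

```python
import io

# Stand-in for open('Data/portfolio.csv'), so the example is self-contained
f = io.StringIO('name,shares,price\n"AA",100,32.20\n')

while True:
    line = next(f, None)    # Returns None at end-of-file instead of raising
    if line is None:
        break
    print(line, end='')
```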
### (b) Supporting Iteration
On occasion, you might want to make one of your own objects support
iteration--especially if your object wraps around an existing
list or other iterable. In a new file `portfolio.py`, define the
following class:
```python
# portfolio.py
class Portfolio(object):
    def __init__(self, holdings):
        self._holdings = holdings

    @property
    def total_cost(self):
        return sum([s.cost for s in self._holdings])

    def tabulate_shares(self):
        from collections import Counter
        total_shares = Counter()
        for s in self._holdings:
            total_shares[s.name] += s.shares
        return total_shares
```
This class is meant to be a layer around a list, but with some
extra methods such as the `total_cost` property. Modify the `read_portfolio()`
function in `report.py` so that it creates a `Portfolio` instance like this:
```python
# report.py
...

import fileparse
from stock import Stock
from portfolio import Portfolio

def read_portfolio(filename):
    '''
    Read a stock portfolio file into a Portfolio of Stock objects with
    name, shares, and price attributes.
    '''
    with open(filename) as file:
        portdicts = fileparse.parse_csv(file,
                                        select=['name','shares','price'],
                                        types=[str,int,float])

    portfolio = [ Stock(d['name'], d['shares'], d['price']) for d in portdicts ]
    return Portfolio(portfolio)
...
```
Try running the `report.py` program. You will find that it fails spectacularly because
`Portfolio` instances aren't iterable.
```python
>>> import report
>>> report.portfolio_report('Data/portfolio.csv', 'Data/prices.csv')
... crashes ...
```
Fix this by modifying the `Portfolio` class to support iteration:
```python
class Portfolio(object):
    def __init__(self, holdings):
        self._holdings = holdings

    def __iter__(self):
        return self._holdings.__iter__()

    @property
    def total_cost(self):
        return sum([s.shares*s.price for s in self._holdings])

    def tabulate_shares(self):
        from collections import Counter
        total_shares = Counter()
        for s in self._holdings:
            total_shares[s.name] += s.shares
        return total_shares
```
After you've made this change, your `report.py` program should work again. While you're
at it, fix up your `pcost.py` program to use the new `Portfolio` object. Like this:
```python
# pcost.py
import report

def portfolio_cost(filename):
    '''
    Computes the total cost (shares*price) of a portfolio file
    '''
    portfolio = report.read_portfolio(filename)
    return portfolio.total_cost
...
```
Test it to make sure it works:
```python
>>> import pcost
>>> pcost.portfolio_cost('Data/portfolio.csv')
44671.15
>>>
```
### (c) Making a more proper container
If making a container class, you often want to do more than just
iteration. Modify the `Portfolio` class so that it has some other
special methods like this:
```python
class Portfolio(object):
    def __init__(self, holdings):
        self._holdings = holdings

    def __iter__(self):
        return self._holdings.__iter__()

    def __len__(self):
        return len(self._holdings)

    def __getitem__(self, index):
        return self._holdings[index]

    def __contains__(self, name):
        return any([s.name == name for s in self._holdings])

    @property
    def total_cost(self):
        return sum([s.shares*s.price for s in self._holdings])

    def tabulate_shares(self):
        from collections import Counter
        total_shares = Counter()
        for s in self._holdings:
            total_shares[s.name] += s.shares
        return total_shares
```
Now, try some experiments using this new class:
```python
>>> import report
>>> portfolio = report.read_portfolio('Data/portfolio.csv')
>>> len(portfolio)
7
>>> portfolio[0]
Stock('AA', 100, 32.2)
>>> portfolio[1]
Stock('IBM', 50, 91.1)
>>> portfolio[0:3]
[Stock('AA', 100, 32.2), Stock('IBM', 50, 91.1), Stock('CAT', 150, 83.44)]
>>> 'IBM' in portfolio
True
>>> 'AAPL' in portfolio
False
>>>
```
One important observation about this--code is generally considered
"Pythonic" if it speaks the common vocabulary of how the rest of
Python normally works. For container objects, supporting iteration,
indexing, containment, and other kinds of operators is an important
part of this.
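One detail you may have noticed in the session above: slicing (`portfolio[0:3]`) returned a plain list, not a `Portfolio`. If you wanted slices to preserve the container type, one possible refinement (a sketch, not required by the exercise) is to check for a `slice` object inside `__getitem__`:

```python
class Portfolio(object):
    def __init__(self, holdings):
        self._holdings = holdings

    def __iter__(self):
        return iter(self._holdings)

    def __len__(self):
        return len(self._holdings)

    def __getitem__(self, index):
        # Re-wrap slices so that port[0:3] is also a Portfolio
        if isinstance(index, slice):
            return Portfolio(self._holdings[index])
        return self._holdings[index]

# Hypothetical holdings, just to illustrate the behavior
port = Portfolio(['AA', 'IBM', 'CAT', 'MSFT'])
print(type(port[0:3]).__name__)   # Portfolio
print(port[1])                    # IBM
```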
[Next](02_Customizing_iteration)

# 6.2 Customizing Iteration
This section looks at how you can customize iteration using a generator.
### A problem
Suppose you wanted to create your own custom iteration pattern.
For example, a countdown.
```python
>>> for x in countdown(10):
... print(x, end=' ')
...
10 9 8 7 6 5 4 3 2 1
>>>
```
There is an easy way to do this.
### Generators
A generator is a function that defines iteration.
```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1
```
For example:
```python
>>> for x in countdown(10):
... print(x, end=' ')
...
10 9 8 7 6 5 4 3 2 1
>>>
```
A generator is any function that uses the `yield` statement.
The behavior of generators is different from that of a normal function.
Calling a generator function creates a generator object; it does not immediately execute the function body.
```python
def countdown(n):
    # Added a print statement
    print('Counting down from', n)
    while n > 0:
        yield n
        n -= 1
```
```python
>>> x = countdown(10)    # Notice: NO print output appears here
>>> x                    # x is a generator object
<generator object at 0x58490>
>>>
```
The function only starts executing on the first `__next__()` call.
```python
>>> x = countdown(10)
>>> x
<generator object at 0x58490>
>>> x.__next__()
Counting down from 10
10
>>>
```
`yield` produces a value, but suspends the function execution.
The function resumes on next call to `__next__()`.
```python
>>> x.__next__()
9
>>> x.__next__()
8
```
When the generator function finally returns, the iteration raises a `StopIteration` exception and stops.
```python
>>> x.__next__()
1
>>> x.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration
>>>
```
*Observation: A generator function implements the same low-level protocol that the `for` statement uses on lists, tuples, dicts, files, etc.*
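You can verify this observation directly: a generator object supplies both `__iter__()` (returning itself) and `__next__()`, so it is its own iterator.

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1

g = countdown(3)
print(iter(g) is g)   # True: a generator is its own iterator
print(next(g))        # 3
print(list(g))        # [2, 1] -- list() consumes the remaining values
```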
## Exercises
### (a) A Simple Generator
If you ever find yourself wanting to customize iteration, you should
always think generator functions. They're easy to write---make
a function that carries out the desired iteration logic and use `yield`
to emit values.
For example, try this generator that searches a file for lines containing
a matching substring:
```python
>>> def filematch(filename, substr):
...     with open(filename, 'r') as f:
...         for line in f:
...             if substr in line:
...                 yield line
...
>>> for line in open('Data/portfolio.csv'):
...     print(line, end='')
...
name,shares,price
"AA",100,32.20
"IBM",50,91.10
"CAT",150,83.44
"MSFT",200,51.23
"GE",95,40.37
"MSFT",50,65.10
"IBM",100,70.44
>>> for line in filematch('Data/portfolio.csv', 'IBM'):
...     print(line, end='')
...
"IBM",50,91.10
"IBM",100,70.44
>>>
```
This is kind of interesting--the idea that you can hide a bunch of
custom processing in a function and use it to feed a for-loop.
The next example looks at a more unusual case.
### (b) Monitoring a streaming data source
Generators can be an interesting way to monitor real-time data sources
such as log files or stock market feeds. In this part, we'll
explore this idea. To start, follow the next instructions carefully.
The program `Data/stocksim.py` simulates stock market data. As output,
it constantly writes real-time data to a file `stocklog.csv`. In a
separate command window go into the `Data/` directory and run this program:
```bash
bash % python3 stocksim.py
```
If you are on Windows, just locate the `stocksim.py` program and
double-click on it to run it. Now, forget about this program (just
let it run). Using another window, look at the file
`Data/stocklog.csv` being written by the simulator. You should see
new lines of text being added to the file every few seconds. Again,
just let this program run in the background---it will run for several
hours (you shouldn't need to worry about it).
Once the above program is running, let's write a little program to
open the file, seek to the end, and watch for new output. Create a
file `follow.py` and put this code in it:
```python
# follow.py
import os
import time

f = open('Data/stocklog.csv')
f.seek(0, os.SEEK_END)      # Move file pointer 0 bytes from end of file

while True:
    line = f.readline()
    if line == '':
        time.sleep(0.1)     # Sleep briefly and retry
        continue
    fields = line.split(',')
    name = fields[0].strip('"')
    price = float(fields[1])
    change = float(fields[4])
    if change < 0:
        print(f'{name:>10s} {price:>10.2f} {change:>10.2f}')
```
If you run the program, you'll see a real-time stock ticker. Under the hood,
this code is kind of like the Unix `tail -f` command that's used to watch a log file.
Note: The use of the `readline()` method in this example is
somewhat unusual in that it is not the usual way of reading lines from
a file (normally you would just use a `for`-loop). However, in
this case, we are using it to repeatedly probe the end of the file to
see if more data has been added (`readline()` will either
return new data or an empty string).
### (c) Using a generator to produce data
If you look at the code in part (b), the first part of the code is producing
lines of data whereas the statements at the end of the `while` loop are consuming
the data. A major feature of generator functions is that you can move all
of the data production code into a reusable function.
Modify the code in part (b) so that the file-reading is performed by
a generator function `follow(filename)`. Make it so the following code
works:
```python
>>> for line in follow('Data/stocklog.csv'):
...     print(line, end='')
... Should see lines of output produced here ...
```
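As a hint, one possible way to structure such a generator (a sketch only; your version may differ) is to move the seek/readline/sleep logic from part (b) behind a `yield`:

```python
# follow.py
import os
import time

def follow(filename):
    '''
    Generator that yields lines appended to a file, like Unix `tail -f`.
    '''
    with open(filename) as f:
        f.seek(0, os.SEEK_END)     # Start at the end of the file
        while True:
            line = f.readline()
            if line == '':
                time.sleep(0.1)    # No new data; sleep briefly and retry
                continue
            yield line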
Modify the stock ticker code so that it looks like this:
```python
if __name__ == '__main__':
    for line in follow('Data/stocklog.csv'):
        fields = line.split(',')
        name = fields[0].strip('"')
        price = float(fields[1])
        change = float(fields[4])
        if change < 0:
            print(f'{name:>10s} {price:>10.2f} {change:>10.2f}')
```
### (d) Watching your portfolio
Modify the `follow.py` program so that it watches the stream of stock
data and prints a ticker showing information for only those stocks
in a portfolio. For example:
```python
if __name__ == '__main__':
    import report
    portfolio = report.read_portfolio('Data/portfolio.csv')

    for line in follow('Data/stocklog.csv'):
        fields = line.split(',')
        name = fields[0].strip('"')
        price = float(fields[1])
        change = float(fields[4])
        if name in portfolio:
            print(f'{name:>10s} {price:>10.2f} {change:>10.2f}')
```
Note: For this to work, your `Portfolio` class must support the
`in` operator. See the last exercise and make sure you implement the
`__contains__()` method.
### Discussion
Something very powerful just happened here. You moved an interesting iteration pattern
(reading lines at the end of a file) into its own little function. The `follow()` function
is now this completely general purpose utility that you can use in any program. For
example, you could use it to watch server logs, debugging logs, and other similar data sources.
That's kind of cool.
[Next](03_Producers_consumers)

# 6.3 Producers, Consumers and Pipelines
Generators are a useful tool for setting up various kinds of producer/consumer
problems and dataflow pipelines. This section discusses them.
### Producer-Consumer Problems
Generators are closely related to various forms of *producer-consumer* problems.
```python
# Producer
def follow(f):
    ...
    while True:
        ...
        yield line            # Produces the value in `line` below
        ...

# Consumer
for line in follow(f):        # Consumes the value from `yield` above
    ...
```
`yield` produces values that `for` consumes.
### Generator Pipelines
You can use this aspect of generators to set up processing pipelines (like Unix pipes).
*producer* &rarr; *processing* &rarr; *processing* &rarr; *consumer*
Processing pipes have an initial data producer, some set of intermediate processing stages and a final consumer.
**producer** &rarr; *processing* &rarr; *processing* &rarr; *consumer*
```python
def producer():
    ...
    yield item
    ...
```
The producer is typically a generator, although it could also be a list or some other sequence.
`yield` feeds data into the pipeline.
*producer* &rarr; *processing* &rarr; *processing* &rarr; **consumer**
```python
def consumer(s):
    for item in s:
        ...
```
Consumer is a for-loop. It gets items and does something with them.
*producer* &rarr; **processing** &rarr; **processing** &rarr; *consumer*
```python
def processing(s):
    for item in s:
        ...
        yield newitem
        ...
```
Intermediate processing stages simultaneously consume and produce items.
They might modify the data stream.
They can also filter (discarding items).
*producer* &rarr; *processing* &rarr; *processing* &rarr; *consumer*
```python
def producer():
    ...
    yield item          # yields the item that is received by `processing`
    ...

def processing(s):
    for item in s:      # Comes from the `producer`
        ...
        yield newitem   # yields a new item
        ...

def consumer(s):
    for item in s:      # Comes from `processing`
        ...
```
Code to set up the pipeline:
```python
a = producer()
b = processing(a)
c = consumer(b)
```
You will notice that data incrementally flows through the different functions.
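The wiring above can be seen end-to-end with a tiny self-contained pipeline. The data and the stage bodies here are hypothetical, purely to illustrate the dataflow:

```python
def producer():
    for item in [1, 2, 3, 4, 5]:
        yield item                 # Feeds data into the pipeline

def processing(s):
    for item in s:
        if item % 2:               # Filter: keep only the odd items
            yield item * 10        # Transform each surviving item

def consumer(s):
    return [item for item in s]    # Pull everything through the pipeline

a = producer()
b = processing(a)
print(consumer(b))                 # [10, 30, 50]
```

No work happens until the consumer starts pulling; each item flows through all stages one at a time.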
## Exercises
For this exercise the `stocksim.py` program should still be running in the background.
You're going to use the `follow()` function you wrote in the previous exercise.
### (a) Setting up a simple pipeline
Let's see the pipelining idea in action. Write the following
function:
```python
>>> def filematch(lines, substr):
for line in lines:
if substr in line:
yield line
>>>
```
This function is almost exactly the same as the first generator
example in the previous exercise except that it's no longer
opening a file--it merely operates on a sequence of lines given
to it as an argument. Now, try this:
```python
>>> from follow import follow
>>> lines = follow('Data/stocklog.csv')
>>> ibm = filematch(lines, 'IBM')
>>> for line in ibm:
...     print(line)
... wait for output ...
```
It might take a while for output to appear, but eventually you
should see some lines containing data for IBM.
### (b) Setting up a more complex pipeline
Take the pipelining idea a few steps further by performing
more actions.
```python
>>> from follow import follow
>>> import csv
>>> lines = follow('Data/stocklog.csv')
>>> rows = csv.reader(lines)
>>> for row in rows:
...     print(row)
['BA', '98.35', '6/11/2007', '09:41.07', '0.16', '98.25', '98.35', '98.31', '158148']
['AA', '39.63', '6/11/2007', '09:41.07', '-0.03', '39.67', '39.63', '39.31', '270224']
['XOM', '82.45', '6/11/2007', '09:41.07', '-0.23', '82.68', '82.64', '82.41', '748062']
['PG', '62.95', '6/11/2007', '09:41.08', '-0.12', '62.80', '62.97', '62.61', '454327']
...
```
Well, that's interesting. What you're seeing here is that the output of the
`follow()` function has been piped into the `csv.reader()` function and we're
now getting a sequence of split rows.
### (c) Making more pipeline components
Let's extend the whole idea into a larger pipeline. In a separate file `ticker.py`,
start by creating a function that reads a CSV file as you did above:
```python
# ticker.py
from follow import follow
import csv

def parse_stock_data(lines):
    rows = csv.reader(lines)
    return rows

if __name__ == '__main__':
    lines = follow('Data/stocklog.csv')
    rows = parse_stock_data(lines)
    for row in rows:
        print(row)
```
Write a new function that selects specific columns:
```python
# ticker.py
...

def select_columns(rows, indices):
    for row in rows:
        yield [row[index] for index in indices]
...

def parse_stock_data(lines):
    rows = csv.reader(lines)
    rows = select_columns(rows, [0, 1, 4])
    return rows
```
Run your program again. You should see output narrowed down like this:
```
['BA', '98.35', '0.16']
['AA', '39.63', '-0.03']
['XOM', '82.45', '-0.23']
['PG', '62.95', '-0.12']
...
```
Write generator functions that convert data types and build dictionaries.
For example:
```python
# ticker.py
...

def convert_types(rows, types):
    for row in rows:
        yield [func(val) for func, val in zip(types, row)]

def make_dicts(rows, headers):
    for row in rows:
        yield dict(zip(headers, row))
...

def parse_stock_data(lines):
    rows = csv.reader(lines)
    rows = select_columns(rows, [0, 1, 4])
    rows = convert_types(rows, [str, float, float])
    rows = make_dicts(rows, ['name', 'price', 'change'])
    return rows
...
```
Run your program again. You should now see a stream of dictionaries like this:
```
{ 'name':'BA', 'price':98.35, 'change':0.16 }
{ 'name':'AA', 'price':39.63, 'change':-0.03 }
{ 'name':'XOM', 'price':82.45, 'change': -0.23 }
{ 'name':'PG', 'price':62.95, 'change':-0.12 }
...
```
### (d) Filtering data
Write a function that filters data. For example:
```python
# ticker.py
...
def filter_symbols(rows, names):
    for row in rows:
        if row['name'] in names:
            yield row
```
Use this to filter stocks to just those in your portfolio:
```python
import report
portfolio = report.read_portfolio('Data/portfolio.csv')
rows = parse_stock_data(follow('Data/stocklog.csv'))
rows = filter_symbols(rows, portfolio)
for row in rows:
    print(row)
```
### (e) Putting it all together
In the `ticker.py` program, write a function `ticker(portfile, logfile, fmt)`
that creates a real-time stock ticker from a given portfolio, logfile,
and table format. For example:
```python
>>> from ticker import ticker
>>> ticker('Data/portfolio.csv', 'Data/stocklog.csv', 'txt')
Name Price Change
---------- ---------- ----------
GE 37.14 -0.18
MSFT 29.96 -0.09
CAT 78.03 -0.49
AA 39.34 -0.32
...
>>> ticker('Data/portfolio.csv', 'Data/stocklog.csv', 'csv')
Name,Price,Change
IBM,102.79,-0.28
CAT,78.04,-0.48
AA,39.35,-0.31
CAT,78.05,-0.47
...
```
### Discussion
Some lessons learned: You can create various generator functions and
chain them together to perform processing involving data-flow
pipelines. In addition, you can create functions that package a
series of pipeline stages into a single function call (for example,
the `parse_stock_data()` function).
[Next](04_More_generators)

# 6.4 More Generators
This section introduces a few additional generator-related topics, including
generator expressions and the `itertools` module.
### Generator Expressions
A generator version of a list comprehension.
```python
>>> a = [1,2,3,4]
>>> b = (2*x for x in a)
>>> b
<generator object at 0x58760>
>>> for i in b:
... print(i, end=' ')
...
2 4 6 8
>>>
```
Differences from a list comprehension:
* It does not construct a list.
* Its only useful purpose is iteration.
* Once consumed, it can't be reused.
General syntax.
```python
(<expression> for i in s if <conditional>)
```
It can also serve as a function argument.
```python
sum(x*x for x in a)
```
It can be applied to any iterable.
```python
>>> a = [1,2,3,4]
>>> b = (x*x for x in a)
>>> c = (-x for x in b)
>>> for i in c:
... print(i, end=' ')
...
-1 -4 -9 -16
>>>
```
The main use of generator expressions is in code that performs some
calculation on a sequence, but only uses the result once. For
example, strip all comments from a file.
```python
f = open('somefile.txt')
lines = (line for line in f if not line.startswith('#'))
for line in lines:
    ...
f.close()
```
With generators, the code runs faster and uses little memory. It's like a filter applied to a stream.
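You can get a rough feel for the memory difference with `sys.getsizeof()`: a generator expression is a small fixed-size object no matter how much data it will produce, while a list comprehension materializes everything up front. (The exact byte counts vary by Python version, so none are shown here.)

```python
import sys

nums = range(1_000_000)
as_list = [x * x for x in nums]    # Builds the full million-element list
as_gen = (x * x for x in nums)     # A small generator object, nothing computed yet

print(sys.getsizeof(as_list))      # Millions of bytes
print(sys.getsizeof(as_gen))       # A couple hundred bytes
```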
### Why Generators
* Many problems are much more clearly expressed in terms of iteration.
* Looping over a collection of items and performing some kind of operation (searching, replacing, modifying, etc.).
* Processing pipelines can be applied to a wide range of data processing problems.
* Better memory efficiency.
* Only produce values when needed.
* Contrast to constructing giant lists.
* Can operate on streaming data
* Generators encourage code reuse
* Separates the *iteration* from code that uses the iteration
* You can build a toolbox of interesting iteration functions and *mix-n-match*.
### `itertools` module
`itertools` is a library module with various functions designed to help with iterators and generators.
```python
itertools.chain(s1, s2)
itertools.count(n)
itertools.cycle(s)
itertools.dropwhile(predicate, s)
itertools.groupby(s, key)
itertools.islice(s, start, stop)
itertools.filterfalse(predicate, s)
itertools.starmap(function, s)
itertools.repeat(item, n)
itertools.tee(s, ncopies)
itertools.zip_longest(s1, ..., sN)
```
All functions process data iteratively.
They implement various kinds of iteration patterns.
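A quick taste of a few of these (note that `count()` is infinite, so it is paired with `islice()` to take a finite prefix):

```python
import itertools

# chain: iterate over several sequences as if they were one
print(list(itertools.chain([1, 2], [3, 4])))          # [1, 2, 3, 4]

# count + islice: take the first 5 values of an infinite counter
print(list(itertools.islice(itertools.count(10), 5))) # [10, 11, 12, 13, 14]

# groupby: group consecutive equal items (input must already be sorted/grouped)
data = ['AA', 'AA', 'IBM', 'IBM', 'IBM']
for name, group in itertools.groupby(data):
    print(name, len(list(group)))                     # AA 2, then IBM 3
```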
More information at [Generator Tricks for Systems Programmers](http://www.dabeaz.com/generators/) tutorial from PyCon '08.
## Exercises
In the previous exercises, you wrote some code that followed lines being written to a log file and parsed them into a sequence of rows.
This exercise continues to build upon that. Make sure the `Data/stocksim.py` program is still running.
### (a) Generator Expressions
Generator expressions are a generator version of a list comprehension.
For example:
```python
>>> nums = [1, 2, 3, 4, 5]
>>> squares = (x*x for x in nums)
>>> squares
<generator object <genexpr> at 0x109207e60>
>>> for n in squares:
... print(n)
...
1
4
9
16
25
```
Unlike a list comprehension, a generator expression can only be used once.
Thus, if you try another for-loop, you get nothing:
```python
>>> for n in squares:
... print(n)
...
>>>
```
### (b) Generator Expressions in Function Arguments
Generator expressions are sometimes placed into function arguments.
It looks a little weird at first, but try this experiment:
```python
>>> nums = [1,2,3,4,5]
>>> sum([x*x for x in nums]) # A list comprehension
55
>>> sum(x*x for x in nums) # A generator expression
55
>>>
```
In the above example, the second version using generators would
use significantly less memory if a large list was being manipulated.
In your `portfolio.py` file, you performed a few calculations
involving list comprehensions. Try replacing these with
generator expressions.
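For example, the `total_cost` calculation sums over a list comprehension; the same computation works with the brackets removed, avoiding the intermediate list. A sketch, using a minimal stand-in for the `Stock` class (the real one lives in your `stock.py`):

```python
class Stock:
    # Minimal stand-in for the Stock class from stock.py
    def __init__(self, name, shares, price):
        self.name = name
        self.shares = shares
        self.price = price

holdings = [Stock('AA', 100, 32.2), Stock('IBM', 50, 91.1)]

total_list = sum([s.shares * s.price for s in holdings])  # List comprehension
total_gen = sum(s.shares * s.price for s in holdings)     # Generator expression
print(total_list == total_gen)                            # True: same result
```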
### (c) Code simplification
Generator expressions are often a useful replacement for
small generator functions. For example, instead of writing a
function like this:
```python
def filter_symbols(rows, names):
    for row in rows:
        if row['name'] in names:
            yield row
```
You could write something like this:
```python
rows = (row for row in rows if row['name'] in names)
```
Modify the `ticker.py` program to use generator expressions
as appropriate.