10: Dictionaries and Sets
Learning Outcomes
- How to use dictionary / set objects and how they differ from list / tuple
- How to access dictionary objects in iteration
- Utility of dictionaries as hashmaps
- Utility of sets in dimensionality reduction
Sets
Sets only contain unique items
>>> trade_ids = [12342, 324562, 12342, 36452, 54767]
>>> set(trade_ids)
{12342, 36452, 54767, 324562}
We can also iterate a set like:
>>> for i in set(trade_ids):
...     print(i, end=',')
12342,36452,54767,324562,
Set differences
Sets are also denoted by braces {}. Sets are a mathematical construct and python also supports some set logic, such as set differences
>>> trade_ids_expected = {12342, 36452, 54767, 324569} # shorter way of defining sets
>>> unexpected_trade_ids = set(trade_ids) - trade_ids_expected
>>> unexpected_trade_ids
{324562}
We can also do it the other way round to look for missing trades:
>>> missing_trade_ids = trade_ids_expected - set(trade_ids)
>>> missing_trade_ids
{324569}
These two operations can be particularly useful when validating the inputs to functions.
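For example, here is a minimal sketch of that kind of validation (validate_trades is just an illustrative name, not a library function), reusing the variables defined above:

>>> def validate_trades(trade_ids, expected_ids):
...     """Raise if any expected trade ids are missing (illustrative only)."""
...     missing = set(expected_ids) - set(trade_ids)
...     if missing:
...         raise ValueError('missing trade ids: {}'.format(sorted(missing)))
>>> validate_trades(trade_ids, trade_ids_expected)
Traceback (most recent call last):
  ...
ValueError: missing trade ids: [324569]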
Set items must be immutable
We can also iterate sets in the same way that we iterate lists and tuples. Objects can also be part of sets as long as they are immutable - i.e. unchanging. Recall that lists are mutable and tuples are immutable.
This means that we can have a set of tuples
>>> {(1, 2), (3, 4)}
{(1, 2), (3, 4)}
but not a set of lists
>>> {[1, 2], [3, 4]}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-a0ff115cb325> in <module>()
----> 1 {[1, 2], [3, 4]}
TypeError: unhashable type: 'list'
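As a small aside (not needed for the rest of this section): if you really do need set-like items inside a set, the immutable frozenset type works where list does not:

>>> {frozenset([1, 2]), frozenset([3, 4])}
{frozenset({1, 2}), frozenset({3, 4})}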
Dictionaries
Dictionaries are python's version of what is known as a hash map or hash table in other languages. If anyone in an interview asks you for a hash map in python, you'll know they just mean a dict (also, they probably don't really know python that well!)
All this jargon means is a key-value lookup where each key is unambiguously unique. Think VLOOKUP in Excel, but where no two keys can be identical.
We set up a dictionary with key-value pairs as follows
>>> d = {
... 'akey': 'avalue',
... 'anotherkey': 'avalue'
... }
Values can be anything.
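Values are then retrieved by key, and .get lets us supply a default when the key is absent (the key names here are just illustrative):

>>> d['akey']
'avalue'
>>> d.get('missingkey', 'some default')
'some default'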
I actually wrote a load of stuff about this but I deleted it, because I think you should get used to looking at the python documentation now that you are more familiar with the language.
See the official python guide on dict - don't bother with dict comprehensions yet, as we will come onto those, but have a read of the dictionaries and looping techniques sections.
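As a taster of those looping techniques, .items() yields each key-value pair in turn (dicts preserve insertion order in Python 3.7+, so the output follows the order we defined):

>>> for key, value in d.items():
...     print(key, value)
akey avalue
anotherkey avalue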
Example: Trades by Asset Class
Let's assume we have the following data, which we read from a csv into a pandas DataFrame. Let's also assume that your credit and commodity desks for some reason give you the trade_id as a str - this is very annoying for you, but a typical problem.
Finally, for any more advanced readers, this example is focused on dict and not pandas, so we shall avoid using pandas for now.
>>> import pandas as pd
>>> data = [['rates', 346455, 568789.345],
... ['rates', 3467457, 4568679.345],
... ['rates', 56858, -6578965789.45],
... ['fx', 93875, 67896789.34],
... ['fx', 34896, -3464754.456],
... ['fx', 30986, 0.3456457],
... ['credit', '234537', 45765.456],
... ['credit', '457568', -3455436.213],
... ['credit', '3467457', 456546.034],
... ['commodities', '93875', -34563456.23235],
... ['commodities', '34457', 4560456.4567],
... ['commodities', '457478', 4575678.345346],
... ['equities', 3466, -457567.345],
... ['equities', 564756, -12.93045],
... ['equities', 457568, 546636.438996]]
>>> df = pd.DataFrame(data, columns=['risk', 'trade_id', 'dv01'])
How many trades are there per asset class with delta risk?
>>> trade_by_asset_class = dict()
>>> for asset_class, trade_id in df.values:
...     trade_by_asset_class[asset_class] = trade_id
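Running this raises something along the following lines (abbreviated here; the full traceback points an arrow at the offending line):

ValueError: too many values to unpack (expected 2)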
Let's now figure out what went wrong here… Remembering the stack trace method, we see that there are too many values to unpack and that the arrow is on the for line (if you are using PyCharm - you know who you are - then you may have no arrow!)
With iteration errors it is often easiest to index the first element to see why we couldn’t unpack it:
>>> df.values[0]
array(['rates', 346455, 568789.345], dtype=object)
Here we can see there are three items and we are trying to unpack to two elements, asset_class and trade_id, therefore we need a third element even if we don't currently care about the delta! A standard way of creating throwaway elements is to use _ like
>>> trade_by_asset_class = dict()
>>> for asset_class, trade_id, _ in df.values:
...     trade_by_asset_class[asset_class] = trade_id
but this doesn't really help, because on each iteration we have overwritten the value!
>>> trade_by_asset_class
{'rates': 56858,
'fx': 30986,
'credit': '3467457',
'commodities': '457478',
'equities': 457568}
We therefore need to create a list as the value and then append to that list - this is one of the most common dictionary structures.
>>> trade_by_asset_class = dict()
>>> for asset_class, trade_id, _ in df.values:
...     if asset_class not in trade_by_asset_class:
...         trade_by_asset_class[asset_class] = []
...     trade_by_asset_class[asset_class].append(trade_id)
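As an aside, dict.setdefault collapses the check-then-create pattern above into a single line (same result, just more compact):

>>> trade_by_asset_class = dict()
>>> for asset_class, trade_id, _ in df.values:
...     trade_by_asset_class.setdefault(asset_class, []).append(trade_id)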
Think about these operations if you have a large number of rows: the following should be quicker - have a think about why this might be the case…
>>> trade_by_asset_class = dict()
>>> for ac in set(df['risk']):
...     trade_by_asset_class[ac] = []
>>> for asset_class, trade_id, _ in df.values:
...     trade_by_asset_class[asset_class].append(trade_id)
which gives
>>> trade_by_asset_class
{'rates': [346455, 3467457, 56858],
'fx': [93875, 34896, 30986],
'commodities': ['93875', '34457', '457478'],
'equities': [3466, 564756, 457568],
'credit': ['234537', '457568', '3467457']}
we now have a structure for answering the question:
>>> for a, t in trade_by_asset_class.items():
...     print('risk: {:12s} trades: {:2d}'.format(a, len(t)))
risk: rates        trades:  3
risk: fx           trades:  3
risk: commodities  trades:  3
risk: equities     trades:  3
risk: credit       trades:  3
For more information on string padding, see: https://pyformat.info/#string_pad_align
Simplifying iterations with dictionaries
Let's imagine that the credit trading PnL system for some reason prepends zeros ('0's) to all database ids under a length of 8, because some lunatic decided it looked nice in the 90s. To link your PnL you will have to also prepend zeros to every trade_id, whilst cussing out Diana Bloggs, who retired last year after a distinguished trading career but who also royally screwed you with one decision she made as a grad on a drizzly Friday morning in 1999.
We can zero pad integers to a length of 8 like
>>> str(346).zfill(8)
'00000346'
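Equivalently, the same zero padding can be written with the format mini-language (the same mechanism as the string padding link in the previous section):

>>> '{:08d}'.format(346)
'00000346'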
A naive way of doing this would be just to call the following
>>> df['trade_id_pad'] = df['trade_id'].astype(str).str.zfill(8)
This example shows us two new operations: firstly, that we can call .astype on a pandas.Series (a series is a single column of a DataFrame); secondly, if a pandas series is a str type then we can call .str to access operations that are normally found within str object types.
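To make the .str accessor concrete, here is a tiny illustrative series (not part of the trade data):

>>> s = pd.Series(['36', '45sd76'])
>>> s.str.len()
0    2
1    6
dtype: int64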
Imagine now that this dataframe is $10^6$x larger.
WARNING: if your laptop is awful you may not want to run this next section.
>>> df = pd.DataFrame(data * 10**6, columns=['risk', 'trade_id', 'dv01'])
Timing this for me took about 10 seconds!
In [100]: %timeit df['trade_id'].astype(str).str.zfill(8)
10.5 s ± 464 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Let's now simplify the process by using the hash map
In [101]: %%timeit
     ...: trade_ids = df['trade_id'].unique()  # pandas way to get unique items is fast
     ...: lookup = {}
     ...: for trade_id in trade_ids:
     ...:     lookup[trade_id] = str(trade_id).zfill(8)  # pad once per unique id
     ...: df['trade_id_pad'] = df['trade_id'].apply(lookup.get)
     ...:
2.38 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here we see the .apply method in action. This is pandas' version of a map. A map iterates a single function across an array of items. map actually exists in python as a built-in function and we can call it like:
>>> list(map(str, [3456, 4576, 7343]))
['3456', '4576', '7343']
and
>>> list(map(len, ['36', '45sd76', '7343']))
[2, 6, 4]
pandas.Series.apply works in the same way and in this example iterates the .get method of lookup across every item in the column.
Here the lookup method is exceedingly fast, and creating the lookup only requires us to use the far slower line str(trade_id).zfill(8) 15 times instead of 15 million times!
Exercises
Exercise 10.1: Dimensionality reduction
This example aims to build on previous examples to reinforce the idea of hash maps for reducing complexity.
You are working on an end-of-day regulatory risk model that requires the revaluation of all trades (e.g. Basel III: FRTB Sensitivities Based Approach).
You have been instructed to calculate the present value (PV) as a new column in an Excel sheet. Someone else has done this and complained it was impossibly long and took over 40 hours to calculate. They have requested access to a compute grid, costing $10k per year, to speed up their Excel sheet.
Assume your pricing function to get the PV of the trade is this:
>>> import time
>>> import numpy as np
>>> np.random.seed(42)
>>> def my_pricing_function(trade_id):
...     """Takes the given trade_id and returns a random pv"""
...     time.sleep(.1)
...     return 2e9 * np.random.random() - 1e9
and it is called in Excel something like =MY_PRICING_FUNCTION($B3), where $B3 references the trade_id and the formula is dragged down column B for all of the rows.
Assume we have already read the Excel sheet with python and it gives us a dataframe like below
>>> import pandas as pd
>>> data = [['rates', 346455, 568789.345],
... ['rates', 3467457, 4568679.345],
... ['rates', 56858, -6578965789.45],
... ['fx', 93875, 67896789.34],
... ['fx', 34896, -3464754.456],
... ['fx', 30986, 0.3456457],
... ['credit', '234537', 45765.456],
... ['credit', '457568', -3455436.213],
... ['credit', '3467457', 456546.034],
... ['commodities', '93875', -34563456.23235],
... ['commodities', '34457', 4560456.4567],
... ['commodities', '457478', 4575678.345346],
... ['equities', 3466, -457567.345],
... ['equities', 564756, -12.93045],
... ['equities', 457568, 546636.438996]]
>>> df = pd.DataFrame(data * 10000, columns=['risk', 'trade_id', 'dv01'])
Currently this pricing function is being called like
>>> df['pv'] = df['trade_id'].apply(my_pricing_function)
Use your knowledge of dictionaries to reduce the problem set and claim a portion of the cost savings for your bonus.
# Solve me