10: Dictionaries and Sets
Learning Outcomes
- How to use dictionary / set objects and how they differ from list / tuple
- How to access dictionary objects in iteration
- Utility of dictionaries as hashmaps
- Utility of sets in dimensionality reduction
Sets
Sets only contain unique items
>>> trade_ids = [12342, 324562, 12342, 36452, 54767]
>>> set(trade_ids)
{12342, 36452, 54767, 324562}
We can also iterate a set like:
>>> for i in set(trade_ids):
...     print(i, end=',')
12342,36452,54767,324562,
Set differences
Sets are also denoted by braces {}. Sets are a mathematical construct and python also supports some set logic, such as set differences
>>> trade_ids_expected = {12342, 36452, 54767, 324569} # shorter way of defining sets
>>> unexpected_trade_ids = set(trade_ids) - trade_ids_expected
>>> unexpected_trade_ids
{324562}
We can also do it the other way round to look for missing trades:
>>> missing_trade_ids = trade_ids_expected - set(trade_ids)
>>> missing_trade_ids
{324569}
These two operations can be particularly useful when validating the inputs to functions.
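For example, here is a minimal sketch of that kind of validation (validate_trades is just an illustrative name, not a library function), reusing the variables defined above:

>>> def validate_trades(trade_ids, expected_ids):
...     """Raise if any expected trade ids are missing (illustrative only)."""
...     missing = set(expected_ids) - set(trade_ids)
...     if missing:
...         raise ValueError('missing trade ids: {}'.format(sorted(missing)))
>>> validate_trades(trade_ids, trade_ids_expected)
Traceback (most recent call last):
  ...
ValueError: missing trade ids: [324569]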
Set items must be immutable
We can also iterate sets in the same way that we iterate lists and tuples. Objects can also be part of sets as long as they are immutable - i.e. unchanging. Recall that lists are mutable and tuples are immutable.
This means that we can have a set of tuples
>>> {(1, 2), (3, 4)}
{(1, 2), (3, 4)}
but not a set of lists
>>> {[1, 2], [3, 4]}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-a0ff115cb325> in <module>()
----> 1 {[1, 2], [3, 4]}
TypeError: unhashable type: 'list'
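As a small aside (not needed for the rest of this section): if you really do need set-like items inside a set, the immutable frozenset type works where list does not:

>>> {frozenset([1, 2]), frozenset([3, 4])}
{frozenset({1, 2}), frozenset({3, 4})}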
Dictionaries
Dictionaries are python's version of what is known as a hash map or hash table in other languages. If anyone in an interview asks you for a hash map in python, you'll know they just mean a dict (also, they probably don't really know python that well!)
All this jargon means is a key-value lookup where each key is unambiguously unique. Think VLOOKUP in Excel, but where no two keys can be identical.
We set up a dictionary with key-value pairs as follows
>>> d = {
... 'akey': 'avalue',
... 'anotherkey': 'avalue'
... }
Values can be anything.
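Values are then retrieved by key, and .get lets us supply a default when the key is absent (the key names here are just illustrative):

>>> d['akey']
'avalue'
>>> d.get('missingkey', 'some default')
'some default'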
I actually wrote a load of stuff about this but I deleted it, because I think you should get used to looking at the python documentation now that you are more familiar with the language.
See the official python guide on dict - don't bother with dict comprehensions yet, as we will come onto those, but have a read of the dictionaries and looping techniques sections.
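As a taster of those looping techniques, .items() yields each key-value pair in turn (dicts preserve insertion order in Python 3.7+, so the output follows the order we defined):

>>> for key, value in d.items():
...     print(key, value)
akey avalue
anotherkey avalue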
Example: Trades by Asset Class
Let's assume we have the following data, which we read from a csv into a pandas DataFrame. Let's also assume that your credit and commodity desks for some reason give you the trade_id as a str - this is very annoying for you, but a typical problem.
Finally, for any more advanced readers, this example is focused on dict and not pandas, so we shall avoid using pandas for now.
>>> import pandas as pd
>>> data = [['rates', 346455, 568789.345],
... ['rates', 3467457, 4568679.345],
... ['rates', 56858, -6578965789.45],
... ['fx', 93875, 67896789.34],
... ['fx', 34896, -3464754.456],
... ['fx', 30986, 0.3456457],
... ['credit', '234537', 45765.456],
... ['credit', '457568', -3455436.213],
... ['credit', '3467457', 456546.034],
... ['commodities', '93875', -34563456.23235],
... ['commodities', '34457', 4560456.4567],
... ['commodities', '457478', 4575678.345346],
... ['equities', 3466, -457567.345],
... ['equities', 564756, -12.93045],
... ['equities', 457568, 546636.438996]]
>>> df = pd.DataFrame(data, columns=['risk', 'trade_id', 'dv01'])
How many trades are there per asset class with delta risk?
>>> trade_by_asset_class = dict()
>>> for asset_class, trade_id in df.values:
...     trade_by_asset_class[asset_class] = trade_id
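Running this raises something along the following lines (abbreviated here; the full traceback points an arrow at the offending line):

ValueError: too many values to unpack (expected 2)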
Let's now figure out what went wrong here… Remembering the stack trace method, we see that there are too many values to unpack and that the arrow is on the for line (if you are using PyCharm - you know who you are - then you may have no arrow!)
With iteration errors it is often easiest to index the first element to see why we couldn’t unpack it:
>>> df.values[0]
array(['rates', 346455, 568789.345], dtype=object)
Here we can see there are three items and we are trying to unpack to two elements, asset_class and trade_id, therefore we need a third element even if we don't currently care about the delta! A standard way of creating throwaway elements is to use _ like
>>> trade_by_asset_class = dict()
>>> for asset_class, trade_id, _ in df.values:
...     trade_by_asset_class[asset_class] = trade_id
but this doesn't really help, because on each iteration we have overwritten the value!
>>> trade_by_asset_class
{'rates': 56858,
'fx': 30986,
'credit': '3467457',
'commodities': '457478',
'equities': 457568}
We therefore need to create a list as the value and then append to that list - this is one of the most common dictionary structures.
>>> trade_by_asset_class = dict()
>>> for asset_class, trade_id, _ in df.values:
...     if asset_class not in trade_by_asset_class:
...         trade_by_asset_class[asset_class] = []
...     trade_by_asset_class[asset_class].append(trade_id)
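As an aside, dict.setdefault collapses the check-then-create pattern above into a single line (same result, just more compact):

>>> trade_by_asset_class = dict()
>>> for asset_class, trade_id, _ in df.values:
...     trade_by_asset_class.setdefault(asset_class, []).append(trade_id)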
Think about these operations if you have a large number of rows: the following should be quicker - have a think about why this might be the case…
>>> trade_by_asset_class = dict()
>>> for ac in set(df['risk']):
...     trade_by_asset_class[ac] = []
>>> for asset_class, trade_id, _ in df.values:
...     trade_by_asset_class[asset_class].append(trade_id)
which gives
>>> trade_by_asset_class
{'rates': [346455, 3467457, 56858],
'fx': [93875, 34896, 30986],
'commodities': ['93875', '34457', '457478'],
'equities': [3466, 564756, 457568],
'credit': ['234537', '457568', '3467457']}
we now have a structure for answering the question:
>>> for a, t in trade_by_asset_class.items():
...     print('risk: {:12s} trades: {:2d}'.format(a, len(t)))
risk: rates        trades:  3
risk: fx           trades:  3
risk: commodities  trades:  3
risk: equities     trades:  3
risk: credit       trades:  3
For more information on string padding, see: https://pyformat.info/#string_pad_align
Simplifying iterations with dictionaries
Let's imagine that the credit trading PnL system for some reason prepends zeros ('0's) to all database ids under a length of 8, because some lunatic decided it looked nice in the 90s. To link your PnL you will have to also prepend zeros to every trade_id, whilst cussing out Diana Bloggs, who retired last year after a distinguished trading career but who also royally screwed you with one decision she made as a grad on a drizzly Friday morning in 1999.
We can zero pad integers to a length of 8 like
>>> str(346).zfill(8)
'00000346'
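Equivalently, the same zero padding can be written with the format mini-language (the same mechanism as the string padding link in the previous section):

>>> '{:08d}'.format(346)
'00000346'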
A naive way of doing this would be just to call the following
>>> df['trade_id_pad'] = df['trade_id'].astype(str).str.zfill(8)
This example shows us two new operations: firstly, that we can call .astype on a pandas.Series (a series is a single column of a DataFrame); secondly, if a pandas series is a str type then we can call .str to access operations that are normally found within str object types.
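To make the .str accessor concrete, here is a tiny illustrative series (not part of the trade data):

>>> s = pd.Series(['36', '45sd76'])
>>> s.str.len()
0    2
1    6
dtype: int64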
Imagine now that this dataframe is $10^6$x larger.
WARNING: if your laptop is awful you may not want to run this next section.
>>> df = pd.DataFrame(data * 10**6, columns=['risk', 'trade_id', 'dv01'])
Timing this for me took about 10 seconds!
In [100]: %timeit df['trade_id'].astype(str).str.zfill(8)
10.5 s ± 464 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Let's now simplify the process by using the hash map
In [101]: %%timeit
     ...: trade_ids = df['trade_id'].unique()  # pandas way to get unique items is fast
     ...: lookup = {}
     ...: for trade_id in trade_ids:
     ...:     lookup[trade_id] = str(trade_id).zfill(8)  # pad once per unique id
     ...: df['trade_id_pad'] = df['trade_id'].apply(lookup.get)
     ...:
2.38 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here we see the .apply method in action. This is pandas' version of a map. A map iterates a single function across an array of items. map actually exists in python as a built-in function and we can call it like:
>>> list(map(str, [3456, 4576, 7343]))
['3456', '4576', '7343']
and
>>> list(map(len, ['36', '45sd76', '7343']))
[2, 6, 4]
pandas.Series.apply works in the same way and in this example iterates the .get method of lookup across every item in the column.
Here the lookup method is exceedingly fast, and creating the lookup only requires us to use the far slower line str(trade_id).zfill(8) 15 times instead of 15 million times!
Exercises
Exercise 10.1: Dimensionality reduction
This example aims to build on previous examples to reinforce the idea of hash maps for reducing complexity.
You are working on an end-of-day regulatory risk model that requires the revaluation of all trades (e.g. Basel III: FRTB Sensitivities Based Approach).
You have been instructed to calculate the present value (PV) as a new column in an Excel sheet. Someone else has done this and complained it was impossibly long and took over 40 hours to calculate. They have requested access to a compute grid, costing $10k per year, to speed up their Excel sheet.
Assume your pricing function to get the PV of the trade is this:
>>> import time
>>> import numpy as np
>>> np.random.seed(42)
>>> def my_pricing_function(trade_id):
...     """Takes the given trade_id and returns a random pv"""
...     time.sleep(.1)
...     return 2e9 * np.random.random() - 1e9
and it is called in Excel something like =MY_PRICING_FUNCTION($B3), where $B3 references the trade_id and the formula is dragged down column B for all of the rows.
Assume we have already read the Excel sheet with python and it gives us a dataframe like below
>>> import pandas as pd
>>> data = [['rates', 346455, 568789.345],
... ['rates', 3467457, 4568679.345],
... ['rates', 56858, -6578965789.45],
... ['fx', 93875, 67896789.34],
... ['fx', 34896, -3464754.456],
... ['fx', 30986, 0.3456457],
... ['credit', '234537', 45765.456],
... ['credit', '457568', -3455436.213],
... ['credit', '3467457', 456546.034],
... ['commodities', '93875', -34563456.23235],
... ['commodities', '34457', 4560456.4567],
... ['commodities', '457478', 4575678.345346],
... ['equities', 3466, -457567.345],
... ['equities', 564756, -12.93045],
... ['equities', 457568, 546636.438996]]
>>> df = pd.DataFrame(data * 10000, columns=['risk', 'trade_id', 'dv01'])
Currently this pricing function is being called like
>>> df['pv'] = df['trade_id'].apply(my_pricing_function)
Use your knowledge of dictionaries to reduce the problem set and claim a portion of the cost savings for your bonus.
# Solve me