 Ananta Almost a Computer Engineer, author of Go Woogle, I write about tech and tutorials.

# 10: Dictionaries and Sets

## Learning Outcomes

• How to use dictionary / set objects and how they differ from list / tuple
• How to access dictionary objects in iteration
• Utility of dictionaries as hashmaps
• Utility of sets in dimensionality reduction

## Sets

Sets only contain unique items:

``````
>>> trade_ids = [12342, 324562, 12342, 36452, 54767]
>>> set(trade_ids)
{12342, 36452, 54767, 324562}
``````

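We can also iterate a set in the same way as a list. For comparison, iterating the original list still prints the duplicate:

``````
>>> for i in trade_ids:
...     print(i, end=',')
12342,324562,12342,36452,54767,
``````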

### Set differences

Sets are denoted by braces `{}`. They are a mathematical construct, and `python` supports some set logic such as set differences:

``````
>>> trade_ids_expected = {12342, 36452, 54767, 324569}  # shorter way of defining sets
>>> set(trade_ids) - trade_ids_expected
{324562}
``````

We can also do it the other way round to look for missing trades:

``````
>>> trade_ids_expected - set(trade_ids)
{324569}
``````

These two operations can be particularly useful when validating the inputs to functions.
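As a minimal sketch of that kind of validation, the two differences can be wrapped in a small helper. The function name `check_trade_ids` and its return format are assumptions made for illustration, not part of the original material.

```python
def check_trade_ids(received, expected):
    """Compare two collections of trade ids using set differences.

    Hypothetical helper: the name and return format are illustrative only.
    """
    received, expected = set(received), set(expected)
    unexpected = received - expected   # ids we got but did not expect
    missing = expected - received      # ids we expected but did not get
    return unexpected, missing


unexpected, missing = check_trade_ids(
    [12342, 324562, 12342, 36452, 54767],
    {12342, 36452, 54767, 324569},
)
print(unexpected)  # {324562}
print(missing)     # {324569}
```

A function could raise an error or log a warning when either of the two sets is non-empty.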

### Set items must be immutable

We can also iterate sets in the same way that we iterate lists and tuples. Objects can also be part of sets as long as they are immutable - i.e. unchanging. Recall that lists are mutable and tuples are immutable.

This means that we can have a set of tuples:

``````
>>> {(1, 2), (3, 4)}
{(1, 2), (3, 4)}
``````

but not a set of lists:

``````
>>> {[1, 2], [3, 4]}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-a0ff115cb325> in <module>()
----> 1 {[1, 2], [3, 4]}

TypeError: unhashable type: 'list'
``````
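If the data arrives as lists, one possible workaround is to convert each list to a tuple before adding it to a set. The sketch below uses made-up `rows` data and also shows `frozenset`, the built-in immutable counterpart of `set`, which can be nested inside a set for the same reason.

```python
rows = [[1, 2], [3, 4], [1, 2]]  # illustrative data, not from the tutorial

# Lists are mutable and therefore unhashable, so convert each one to a tuple.
unique_rows = set()
for row in rows:
    unique_rows.add(tuple(row))
print(len(unique_rows))  # 2

# frozenset is immutable, so frozensets can themselves be items of a set.
nested = {frozenset({1, 2}), frozenset({2, 1}), frozenset({3})}
print(len(nested))  # 2
```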

## Dictionaries

Dictionaries are python’s version of what is known as a hash map or hash table in other languages. If anyone in an interview asks you for a hash map in `python` you’ll know they just mean a `dict` (also they probably don’t really know `python` that well!)

All this jargon means is a key-value lookup where the key is unambiguously unique. Think `VLOOKUP` in Excel, but where no two keys can be identical.

We set up a dictionary with key-value pairs as follows:

``````
>>> d = {
...     'akey': 'avalue',
...     'anotherkey': 'avalue'
... }
``````

Values can be anything.
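A short sketch of what that means in practice follows. The extra keys `'trades'` and `'limits'` are invented for illustration.

```python
d = {
    'akey': 'avalue',
    'anotherkey': 'avalue',   # values may repeat, keys may not
}

# Values can be any object: numbers, lists, even other dictionaries.
d['trades'] = [12342, 36452]
d['limits'] = {'rates': 1e6}

# Assigning to an existing key overwrites the old value instead of adding a duplicate key.
d['akey'] = 'newvalue'

print(d['akey'])         # newvalue
print(d.get('missing'))  # None: .get returns a default instead of raising KeyError
print(len(d))            # 4
```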

I actually wrote a load of stuff about this but I deleted it because I think you should get used to looking at `python` documentation now that you are more familiar with the language.

See the official python guide on `dict` - don’t bother with dict comprehensions yet as we will come onto those, but have a read of the dictionaries and looping techniques sections.
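As a quick taste of those looping techniques, here is a minimal sketch of the three common ways to iterate a dictionary. The dictionary `dv01_by_desk` and its values are made up for illustration.

```python
dv01_by_desk = {'rates': 568789.345, 'fx': 67896789.34, 'credit': 45765.456}

keys = []
for desk in dv01_by_desk:                 # iterating a dict yields its keys
    keys.append(desk)

pairs = []
for desk, dv01 in dv01_by_desk.items():   # .items() yields (key, value) tuples
    pairs.append((desk, dv01))

largest = max(dv01_by_desk.values())      # .values() yields just the values

print(keys)      # ['rates', 'fx', 'credit']
print(pairs[0])  # ('rates', 568789.345)
print(largest)   # 67896789.34
```

Dictionaries keep insertion order in Python 3.7 and later, which is why the keys come back in the order they were written.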

### Example: Trades by Asset Class

Let’s assume we have the following data, which we read in from a csv into a pandas DataFrame. Let’s also assume that your credit and commodity desks for some reason give you the `trade_id` as a `str` - this is very annoying for you but a typical problem.

Finally, for any more advanced readers: this example is focused on `dict` and not `pandas`, so we shall avoid using `pandas` features for now.

``````
>>> import pandas as pd
>>> data = [['rates', 346455, 568789.345],
...         ['rates', 3467457, 4568679.345],
...         ['rates', 56858, -6578965789.45],
...         ['fx', 93875, 67896789.34],
...         ['fx', 34896, -3464754.456],
...         ['fx', 30986, 0.3456457],
...         ['credit', '234537', 45765.456],
...         ['credit', '457568', -3455436.213],
...         ['credit', '3467457', 456546.034],
...         ['commodities', '93875', -34563456.23235],
...         ['commodities', '34457', 4560456.4567],
...         ['commodities', '457478', 4575678.345346],
...         ['equities', 3466, -457567.345],
...         ['equities', 564756, -12.93045],
...         ['equities', 457568, 546636.438996]]
>>> df = pd.DataFrame(data, columns=['risk', 'trade_id', 'dv01'])
``````

#### How many trades are there per asset class with delta risk?

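``````
>>> trade_by_asset_class = {}
>>> for asset_class, trade_id in df.values:
...     trade_by_asset_class[asset_class] = trade_id
ValueError: too many values to unpack (expected 2)
``````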

Let’s now figure out what went wrong here… Remembering the stack method, we see that there are too many values to unpack and that the arrow is on the `for` line (if you are using PyCharm - you know who you are - then you may have no arrow!)

With iteration errors it is often easiest to index the first element to see why we couldn’t unpack it:

``````
>>> df.values[0]
array(['rates', 346455, 568789.345], dtype=object)
``````

Here we can see there are three items but we are trying to unpack them into two names, `asset_class` and `trade_id`. We therefore need a third name even if we don’t currently care about the delta! A standard way of creating throwaway names is to use `_`:

``````
>>> trade_by_asset_class = {}
>>> for asset_class, trade_id, _ in df.values:
...     trade_by_asset_class[asset_class] = trade_id
``````

but this doesn’t really help, because on each iteration we overwrite the previous value for that key!

``````
>>> trade_by_asset_class
{'rates': 56858,
 'fx': 30986,
 'credit': '3467457',
 'commodities': '457478',
 'equities': 457568}
``````

We therefore need to create a `list` as the value and then append to that list - this is one of the most common dictionary structures.

``````
>>> trade_by_asset_class = {}
>>> for asset_class, trade_id, _ in df.values:
...     if asset_class not in trade_by_asset_class:
...         trade_by_asset_class[asset_class] = []
...     trade_by_asset_class[asset_class].append(trade_id)
``````
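The "create an empty list if the key is missing, then append" pattern is common enough that the standard library offers shortcuts for it. The sketch below is an alternative, not the approach used in this tutorial: it uses a plain list of `rows` (a few rows in the same shape as the example data) so that it runs without `pandas`.

```python
from collections import defaultdict

# A few rows in the same (risk, trade_id, dv01) shape as the example data.
rows = [
    ['rates', 346455, 568789.345],
    ['rates', 3467457, 4568679.345],
    ['fx', 93875, 67896789.34],
    ['credit', '234537', 45765.456],
]

# defaultdict(list) creates an empty list the first time a key is accessed.
trade_by_asset_class = defaultdict(list)
for asset_class, trade_id, _ in rows:
    trade_by_asset_class[asset_class].append(trade_id)

# dict.setdefault does the same job with a plain dict.
plain = {}
for asset_class, trade_id, _ in rows:
    plain.setdefault(asset_class, []).append(trade_id)

print(dict(trade_by_asset_class))
# {'rates': [346455, 3467457], 'fx': [93875], 'credit': ['234537']}
```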

Think about these operations if you have a large number of rows. The following should be quicker; have a think about why this might be the case…

``````
>>> trade_by_asset_class = {}
>>> for ac in set(df['risk']):
...     trade_by_asset_class[ac] = []
>>> for asset_class, trade_id, _ in df.values:
...     trade_by_asset_class[asset_class].append(trade_id)
``````

which gives

``````
>>> trade_by_asset_class
{'rates': [346455, 3467457, 56858],
 'fx': [93875, 34896, 30986],
 'commodities': ['93875', '34457', '457478'],
 'equities': [3466, 564756, 457568],
 'credit': ['234537', '457568', '3467457']}
``````

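We now have a structure for answering the question:

``````
>>> for a, t in trade_by_asset_class.items():
...     print('risk: {:12s} trades: {:2d}'.format(a, len(t)))
risk: rates        trades:  3
risk: fx           trades:  3
risk: commodities  trades:  3
risk: equities     trades:  3
risk: credit       trades:  3
``````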

#### Simplifying iterations with dictionaries

Let’s imagine that the credit trading PnL system for some reason prepends zeros to all database ids shorter than 8 characters because some lunatic decided it looked nice in the 90s.

To link your PnL you will also have to prepend zeros to every `trade_id`.

_Whilst cussing out Diana Bloggs, who retired last year after a distinguished trading career but who also royally screwed you with one decision she made as a grad on a drizzly Friday morning in 1999._

We can zero pad integers to a length of 8 like

``````
>>> str(346).zfill(8)
'00000346'
``````

A naive way of doing this would be just to call the following

``````
>>> df['trade_id'].astype(str).str.zfill(8)
``````

This example shows us two new operations: Firstly that we can call `.astype` on a `pandas.Series` (a series is a single column of a DataFrame). Secondly, if a `pandas` series is a `str` type then we can call `.str` to access operations that are normally found within `str` object types.

Imagine now that this dataframe is $10^6$ times larger.

*WARNING: if your laptop is awful you may not want to run this next section.*

``````
>>> df = pd.DataFrame(data * 10**6, columns=['risk', 'trade_id', 'dv01'])  # a list can only be repeated by an int, not 1e6
``````

Timing this took about 10 seconds for me!

``````
In : %timeit df['trade_id'].astype(str).str.zfill(8)
10.5 s ± 464 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
``````

Let’s now simplify the process by using the hashmap

``````
In : %%timeit
...: trade_ids = df['trade_id'].unique()  # pandas way to get unique items is fast
...: lookup = {}
...: for trade_id in trade_ids:
...:     lookup[trade_id] = str(trade_id).zfill(8)
...: df['trade_id'].apply(lookup.get)
...:
2.38 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
``````

Here we see the `.apply` method in action. This is the `pandas` version of a `map`. A map applies a single function to every item of an iterable. `map` actually exists in `python` as a built-in function and we can call it like:

``````
>>> list(map(str, [3456, 4576, 7343]))
['3456', '4576', '7343']
``````

and

``````
>>> list(map(len, ['36', '45sd76', '7343']))
[2, 6, 4]
``````

`pandas.Series.apply` works in the same way and in this example iterates the `.get` method of `lookup` across every item in the dataframe.

Here the lookup method is exceedingly fast and creating it only requires us to use the far slower line `str(trade_id).zfill(8)` 15 times instead of 15 million times!
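The same idea can be checked on a small scale without `pandas`. The sketch below uses a made-up list of 3000 ids with only 3 distinct values, and the built-in `map` in place of `Series.apply`. It is an illustration of the technique, not a benchmark.

```python
trade_ids = [346455, '234537', 3466] * 1000   # 3000 rows, only 3 distinct ids

# Build the lookup once per distinct id: the slow str/zfill work happens 3 times.
lookup = {}
for trade_id in set(trade_ids):
    lookup[trade_id] = str(trade_id).zfill(8)

# Every row then becomes a cheap dictionary lookup.
padded = list(map(lookup.get, trade_ids))

print(len(lookup))  # 3
print(padded[:3])   # ['00346455', '00234537', '00003466']
```

The saving grows with the ratio of rows to distinct keys, so the approach pays off most when the same ids repeat many times.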

### Exercises

#### Exercise 10.1: Dimensionality reduction

This example aims to build on previous examples to reinforce the idea of hash maps for reducing complexity.

You are working on an end-of-day regulatory risk model that requires the revaluation of all trades (e.g. Basel III: FRTB Sensitivities Based Approach).

You have been instructed to calculate the present value (PV) as a new column in an Excel sheet. Someone else has done this and complained it was impossibly long and took over 40 hours to calculate. They have requested access to a compute grid, worth $10k per year, to speed up their Excel sheet.

Assume your pricing function to get the PV of the trade is this:

``````
>>> import time
>>> import numpy as np
>>> np.random.seed(42)
>>> def my_pricing_function(trade_id):
...     """Takes a trade_id and returns a random PV."""
...     time.sleep(.1)
...     return 2e9 * np.random.random() - 1e9
``````

and it is called in Excel something like `=MY_PRICING_FUNCTION($B3)` where `$B3` references the `trade_id` and is dragged down the column `B` for all 10000 rows.

Assume we have already read the Excel sheet with python and it gives us a dataframe like below

``````
>>> import pandas as pd
>>> data = [['rates', 346455, 568789.345],
...         ['rates', 3467457, 4568679.345],
...         ['rates', 56858, -6578965789.45],
...         ['fx', 93875, 67896789.34],
...         ['fx', 34896, -3464754.456],
...         ['fx', 30986, 0.3456457],
...         ['credit', '234537', 45765.456],
...         ['credit', '457568', -3455436.213],
...         ['credit', '3467457', 456546.034],
...         ['commodities', '93875', -34563456.23235],
...         ['commodities', '34457', 4560456.4567],
...         ['commodities', '457478', 4575678.345346],
...         ['equities', 3466, -457567.345],
...         ['equities', 564756, -12.93045],
...         ['equities', 457568, 546636.438996]]
>>> df = pd.DataFrame(data * 10000, columns=['risk', 'trade_id', 'dv01'])
``````

Currently this pricing function is being called like

``````1
``````1