Tuesday, May 2, 2017

A R named list is not a python dictionary

In the previous post, mpiktas posted a comment that included this question "Isn't Python dictionary a named list in R?". The short answer is no, but the reason why is worth some more explanation.

At first sight, a named list does look a little bit as a python dictionary, it allows to retrieve values inside a list using names (i.e. strings) instead of by numerical index. The code below is also a good example on how it is possible to mix R and python code inside a Jupyter notebook

In [1]:
%load_ext rpy2.ipython
In [2]:
%R RL <- list('one'=1, 'two'=2, 'three'=3) ; RL[['one']] 
Out[2]:
array([ 1.])
In [3]:
PD = {'one': 1, 'two':2, 'three': 3} ; PD['one']
Out[3]:
1

But a dictionary in python can do more, the keys can be any immutable type, so you can use numbers and tuples for keys, even in one dictionary

In [4]:
PD = {'one': 1, 2:2, (3,'three') :  3 } ; PD[(3,'three')]
Out[4]:
3

And you can add by name into an existing dictionary.

In [5]:
PD['four'] = 4 ; PD
Out[5]:
{'one': 1, 2: 2, (3, 'three'): 3, 'four': 4}

You need a two step approach in R, or at least I don't know a single step approach to add a named element

In [6]:
%R RL[4] <- 4 ; names(RL)[4] <- 'four' ; names(RL)
Out[6]:
array(['one', 'two', 'three', 'four'], 
      dtype='<U5')

But there is a more fundamental difference, the python dictionary uses a hash table implementation so that access to a key is independent of the size of the dictionary, while the named list in R seems to use a sorted table of names. This makes access by names (very) slow compared to access by index in R. Here is an example, that also shows how to mix intelligently R and python.

The code builds (large) equivalent structures, and a random subset is defined using R (easier in R) and passed to the python code. Then we measure the access time using different approaches. In python, we use both a list and a dictionary.

In [7]:
import time
import pandas
nBits = [18,19,20,21]
columns = ["Size", "Python dict by key", "Python list by index", "R list by index", "R list by key"]
results = pandas.DataFrame(columns=columns, index=nBits)
nSamples = 1000
for nBit in [18,19,20,21]:
    n = (1<<nBit)
    results["Size"][nBit] = n
    %R -i nSamples -o samples -i n   samples <- sample(1:(n-1), nSamples, replace = FALSE) 
    %R -o keys                       keys <- as.character(samples)
    keys = list(keys)
    PL = list(range(n))
    start = time.perf_counter()
    S = sum([PL[i] for i in samples ])
    stop = time.perf_counter()
    results["Python list by index"][nBit] = (stop - start) / nSamples * 1e6
    PD = { str(x):x for x in PL}
    start = time.perf_counter()
    S = sum([PD[k] for k in keys])
    stop = time.perf_counter()
    results["Python dict by key"][nBit] = (stop - start) / nSamples * 1e6
    %R -i n -o RL RL <- as.list(1:n)
    %R            start <- Sys.time() ; S <- 0 ; for (i in samples) { S <- S + RL[[i]] } ; stop <- Sys.time()
    %R -o idelay  idelay <- stop - start
    %R            names(RL) <- as.character(RL)
    %R            start <- Sys.time() ; S <- 0 ; for (k in keys ) { S <- S + RL[[k]] } ; stop <- Sys.time()
    %R -o kdelay  kdelay <- stop - start  
    results["R list by index"][nBit] = idelay[0] / nSamples * 1e6
    results["R list by key"][nBit] = kdelay[0] / nSamples * 1e6
results
Out[7]:
Size Python dict by key Python list by index R list by index R list by key
18 262144 0.447617 0.257533 0.998974 1298
19 524288 0.659162 0.341844 2.00105 3028.03
20 1048576 0.758547 0.357429 8.03399 6471.99
21 2097152 0.857677 0.415936 1.02687 13096

The results have sometimes outliers, but the main result is that access by key in R is both slow and increasing with the list size.

There is more to say, in particular there are data structures in R that are based on hash tables, but they remain less flexible than a python dictionary. I am not the first person making this type of analysis, you can check e.g.

2 comments:

  1. To add a named element in a single step just do PD$four=4

    ReplyDelete
  2. R is notorious for inefficient looping and assignment. I think a good portion of your time measurement is taken up by things other than subsetting.

    ReplyDelete