A R named list is not a python dictionary¶
In the previous post, mpiktas posted a comment that included this question "Isn't Python dictionary a named list in R?". The short answer is no, but the reason why is worth some more explanation.
At first sight, a named list does look a little bit as a python dictionary, it allows to retrieve values inside a list using names (i.e. strings) instead of by numerical index. The code below is also a good example on how it is possible to mix R and python code inside a Jupyter notebook
%load_ext rpy2.ipython
%R RL <- list('one'=1, 'two'=2, 'three'=3) ; RL[['one']]
PD = {'one': 1, 'two':2, 'three': 3} ; PD['one']
But a dictionary in python can do more, the keys can be any immutable type, so you can use numbers and tuples for keys, even in one dictionary
PD = {'one': 1, 2:2, (3,'three') : 3 } ; PD[(3,'three')]
And you can add by name into an existing dictionary.
PD['four'] = 4 ; PD
You need a two step approach in R, or at least I don't know a single step approach to add a named element
%R RL[4] <- 4 ; names(RL)[4] <- 'four' ; names(RL)
But there is a more fundamental difference, the python dictionary uses a hash table implementation so that access to a key is independent of the size of the dictionary, while the named list in R seems to use a sorted table of names. This makes access by names (very) slow compared to access by index in R. Here is an example, that also shows how to mix intelligently R and python.
The code builds (large) equivalent structures, and a random subset is defined using R (easier in R) and passed to the python code. Then we measure the access time using different approaches. In python, we use both a list and a dictionary.
import time
import pandas
nBits = [18,19,20,21]
columns = ["Size", "Python dict by key", "Python list by index", "R list by index", "R list by key"]
results = pandas.DataFrame(columns=columns, index=nBits)
nSamples = 1000
for nBit in [18,19,20,21]:
n = (1<<nBit)
results["Size"][nBit] = n
%R -i nSamples -o samples -i n samples <- sample(1:(n-1), nSamples, replace = FALSE)
%R -o keys keys <- as.character(samples)
keys = list(keys)
PL = list(range(n))
start = time.perf_counter()
S = sum([PL[i] for i in samples ])
stop = time.perf_counter()
results["Python list by index"][nBit] = (stop - start) / nSamples * 1e6
PD = { str(x):x for x in PL}
start = time.perf_counter()
S = sum([PD[k] for k in keys])
stop = time.perf_counter()
results["Python dict by key"][nBit] = (stop - start) / nSamples * 1e6
%R -i n -o RL RL <- as.list(1:n)
%R start <- Sys.time() ; S <- 0 ; for (i in samples) { S <- S + RL[[i]] } ; stop <- Sys.time()
%R -o idelay idelay <- stop - start
%R names(RL) <- as.character(RL)
%R start <- Sys.time() ; S <- 0 ; for (k in keys ) { S <- S + RL[[k]] } ; stop <- Sys.time()
%R -o kdelay kdelay <- stop - start
results["R list by index"][nBit] = idelay[0] / nSamples * 1e6
results["R list by key"][nBit] = kdelay[0] / nSamples * 1e6
results
The results have sometimes outliers, but the main result is that access by key in R is both slow and increasing with the list size.
There is more to say, in particular there are data structures in R that are based on hash tables, but they remain less flexible than a python dictionary. I am not the first person making this type of analysis, you can check e.g.