Tuesday, May 2, 2017

An R named list is not a Python dictionary

In a comment on the previous post, mpiktas asked: "Isn't Python dictionary a named list in R?". The short answer is no, but the reason why deserves a longer explanation.

At first sight, a named list does look a little like a Python dictionary: it lets you retrieve values from a list by name (i.e. by string) rather than by numerical index. The code below is also a good example of how to mix R and Python code inside a Jupyter notebook.

In [1]:
%load_ext rpy2.ipython
In [2]:
%R RL <- list('one'=1, 'two'=2, 'three'=3) ; RL[['one']] 
Out[2]:
array([ 1.])
In [3]:
PD = {'one': 1, 'two':2, 'three': 3} ; PD['one']
Out[3]:
1

But a Python dictionary can do more: the keys can be of any immutable (more precisely, hashable) type, so you can use numbers and tuples as keys, even mixed within one dictionary.

In [4]:
PD = {'one': 1, 2:2, (3,'three') :  3 } ; PD[(3,'three')]
Out[4]:
3

And you can add a new entry to an existing dictionary simply by assigning to a new key.

In [5]:
PD['four'] = 4 ; PD
Out[5]:
{'one': 1, 2: 2, (3, 'three'): 3, 'four': 4}

In R, adding a named element here takes two steps, adding the value by position and then setting its name (a single-step alternative is shown after the next cell):

In [6]:
%R RL[4] <- 4 ; names(RL)[4] <- 'four' ; names(RL)
Out[6]:
array(['one', 'two', 'three', 'four'], 
      dtype='<U5')
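
For completeness, here is a quick sketch of the single-step version (plain base R, not run in this notebook):

RL[["five"]] <- 5     # or equivalently: RL$five <- 5
names(RL)             # "one" "two" "three" "four" "five"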

But there is a more fundamental difference: a Python dictionary is implemented as a hash table, so access by key is essentially independent of the size of the dictionary, whereas a named list in R appears to search its names sequentially. This makes access by name in R (very) slow compared to access by index. Here is an example, which also shows how to mix R and Python intelligently.

The code builds (large) equivalent structures; a random subset of indices is drawn in R (easier to do there) and passed to the Python code. We then measure the access time, reported in microseconds per access, for different approaches. In Python, we use both a list and a dictionary.

In [7]:
import time
import pandas
nBits = [18, 19, 20, 21]
columns = ["Size", "Python dict by key", "Python list by index", "R list by index", "R list by key"]
results = pandas.DataFrame(columns=columns, index=nBits)
nSamples = 1000
# all times below are in microseconds per access
for nBit in nBits:
    n = (1<<nBit)
    results.loc[nBit, "Size"] = n
    # draw the random indices in R and also export their string form as keys
    %R -i nSamples -o samples -i n   samples <- sample(1:(n-1), nSamples, replace = FALSE) 
    %R -o keys                       keys <- as.character(samples)
    keys = list(keys)
    # Python list, access by integer index
    PL = list(range(n))
    start = time.perf_counter()
    S = sum([PL[i] for i in samples])
    stop = time.perf_counter()
    results.loc[nBit, "Python list by index"] = (stop - start) / nSamples * 1e6
    # Python dict, access by string key
    PD = {str(x): x for x in PL}
    start = time.perf_counter()
    S = sum([PD[k] for k in keys])
    stop = time.perf_counter()
    results.loc[nBit, "Python dict by key"] = (stop - start) / nSamples * 1e6
    # R list, access by integer index, then by name after naming the elements
    %R -i n -o RL RL <- as.list(1:n)
    %R            start <- Sys.time() ; S <- 0 ; for (i in samples) { S <- S + RL[[i]] } ; stop <- Sys.time()
    %R -o idelay  idelay <- stop - start
    %R            names(RL) <- as.character(RL)
    %R            start <- Sys.time() ; S <- 0 ; for (k in keys ) { S <- S + RL[[k]] } ; stop <- Sys.time()
    %R -o kdelay  kdelay <- stop - start
    results.loc[nBit, "R list by index"] = idelay[0] / nSamples * 1e6
    results.loc[nBit, "R list by key"] = kdelay[0] / nSamples * 1e6
results
Out[7]:
        Size  Python dict by key  Python list by index  R list by index  R list by key
18    262144            0.447617              0.257533         0.998974        1298
19    524288            0.659162              0.341844         2.00105         3028.03
20   1048576            0.758547              0.357429         8.03399         6471.99
21   2097152            0.857677              0.415936         1.02687        13096

The results occasionally show outliers, but the main result is that access by key in R is both slow and grows with the list size.

There is more to say; in particular, R does have data structures based on hash tables, notably environments, but they remain less flexible than a Python dictionary. I am not the first person to make this kind of analysis; similar comparisons have been published elsewhere.
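
To give an idea of what the hash-based alternative looks like, here is a minimal sketch (not part of the benchmark above) using a base R environment as a dictionary-like container; keys must be strings, but access by name stays fast:

# environments use a hash table internally
E <- new.env(hash = TRUE)
for (i in 1:100000) assign(as.character(i), i, envir = E)
get("54321", envir = E)                        # fast lookup by name, returns 54321
exists("765432", envir = E, inherits = FALSE)  # FALSE, this key was never added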

Sunday, March 5, 2017

Python and R for code development

The previous post touched on why I now prefer Python for writing code, including for a module like logopt. This post explains in more detail some specific differences where I prefer one of the two languages:
  • 0-based indexing in Python versus 1-based indexing in R.  This may seem like a small difference, but for me 0-based indexing is more natural and results in fewer off-by-one errors.  No less than Dijkstra agrees with me on 0-based indexing.
  • = versus <- for assignment.  I like R's approach here, and I would like to see more languages do the same.  I still sometimes end up using = where I wanted ==.  If only R would allow <- in call arguments.
  • CRAN versus PyPI
    • CRAN is much better for the user: the CRAN Task Views are a gold mine, and in general CRAN is a better repository, with higher quality packages.
    • But publishing on CRAN is simply daunting, and that is the reason logopt remains on R-Forge only.  The manual explaining how to write extensions is 178 pages long.
  • Python has better data structures; the dictionary in particular is something I miss whenever I write R.  Python has no native dataframe, but importing pandas easily takes care of that.
  • Object orientation is conceptually clean and almost easy to use in Python, less so in R.
  • Plotting is better in R.  There are some efforts to make Python better in that area, especially for ease of use.  Matplotlib is powerful but difficult to master.
  • lm is a gem in R: the simplicity with which you can express the models you want to fit is incredible (see the short sketch after this list).
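
As a minimal illustration of that formula interface (a toy example using the built-in mtcars data set, nothing to do with logopt):

# fit mpg against weight and horsepower, including their interaction,
# in one readable line
fit <- lm(mpg ~ wt * hp, data = mtcars)
summary(fit)$coefficients
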
All in all, I prefer coding in Python.  This is a personal opinion of course, and R remains important because of some of its packages, but for more general purpose tasks Python is simpler to use, and that translates into being more productive.

Monday, February 20, 2017

Rebooting with Python and Jupyter

This blog has been inactive for a long time for essentially two reasons:

  • I was not very happy with the quality of the results
    • The source code did not display very nicely
    • It was difficult to get a nice display, including for pictures and mathematical expressions
  • I started to use Python almost exclusively
    • R is a nice language, but it is not a general purpose language; some tasks are hard in R compared to Python
    • On the other hand, Python has steadily improved in the area of data processing, with pandas providing an equivalent to the R dataframe
But now there is a good way to solve both problems: Jupyter notebooks combined with the ability to include HTML directly in a post.  So it is time for a reboot.

Jupyter was originally known as IPython but has evolved to support many programming languages, including R.  It is now possible to develop a notebook, possibly based on multiple languages, then convert it for posting, while keeping the original notebook available for people who want a more interactive experience.  The development process is much simpler that way than it used to be for earlier posts.

As an example, the rest of this post is this notebook converted to HTML.  Note that the notebook contains both R and Python code interacting in an almost seamless way.  How to achieve that result will be explained in later posts.

In [1]:
%load_ext rpy2.ipython
In [2]:
%%R -o x -o xik -o n -o pik
# figure 8.1 of Cover "Universal Portfolios"

library(logopt)
data(nyse.cover.1962.1984)
n <- nyse.cover.1962.1984
x <- coredata(nyse.cover.1962.1984)
xik <- x[,c("iroqu","kinar")]
nDays <- dim(xik)[1]
Days <- 1:nDays
pik <- apply(xik,2,cumprod)
plot(Days, pik[,"iroqu"], col="blue", type="l", 
     ylim=range(pik), main = '"iroqu" and "kinar"', ylab="")
lines(Days, pik[,"kinar"], col="red")
grid()
legend("topright",c('"iroqu"','"kinar"'),
       col=c("blue","red"),lty=c(1,1))
In [3]:
# x and pik were computed in the R cell above and exported through the -o flags
print(x)
print(type(x))
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.ion()
plt.figure(figsize=(6,4))
plt.plot(pik)   # cumulative products of the price relatives, same data as the R plot
plt.grid()
[[ 1.01515  1.02765  1.04183 ...,  1.00578  0.99697  0.99752]
 [ 1.01493  1.04036  0.98905 ...,  1.00958  0.99088  1.00248]
 [ 1.       0.97629  0.97786 ...,  1.       1.02761  0.99752]
 ..., 
 [ 0.99029  0.9966   0.99605 ...,  0.99216  1.00461  0.99273]
 [ 0.99265  1.00683  1.      ...,  0.99209  1.02752  1.00366]
 [ 0.99753  1.00339  1.01984 ...,  1.01195  1.       0.99635]]
<class 'numpy.ndarray'>