Part 1, Ihaka’s recommendations for R.
http://www.stat.auckland.ac.nz/~ihaka/downloads/Compstat-2008.pdf
http://www.stat.auckland.ac.nz/~ihaka/downloads/JSM-2010.pdf
-Intro. 2 parts, questions, blog, work, contact.
-Ihaka wrote 2 papers criticizing R and recommending a language update.
-I’m new to R, found it a bit confusing and frustrating at first, then it seemed easy but limited compared to other languages.
-Given recent commercial and technical developments, Ihaka’s recommendations look well worth considering, and they are on my mind given my exposure to R after working in other languages. I am very glad to get this opportunity to hear your feedback on Ihaka’s recommendations.
-Big data, machine learning, mobile, and the still-developing web app ecosystem all show extremely strong commercial and technical development.
-Example of safe road app. Endless potential bio-medical examples.
-Despite studying and working in these areas, I was largely unaware of the power and importance of R until attending this meetup. Why is R relatively little understood or even known outside of the statistical community in comparison to other technologies like Hadoop?
-Hadoop business work seems to mostly be about simple word counting and similar so far. Some business people I talked to recently were struggling to find a good use for even word counting, and completely unaware of R and what it can do.
-R and statistics seem like the perfect fit for Big Data, but my impression is that there is a surprisingly low usage level of R and statistics in business IT. There is a seemingly small number of statistics mavens involved, a huge number of IT people, a huge number of spreadsheet users, and very few R users with a statistical skill level between the IT people and statistics mavens. This general problem area is often talked about on Big Data business blogs.
-My guess is that it’s because of the unusualness of R, the limitations explained by Ihaka, and the related lack of closer connection and communication with other programming language communities, including in business.
–
Ihaka’s concerns:
-His 2 main concerns could be summed up as: R is too slow, and the design of the language strictly limits its current and future use 1. with big data, and 2. for innovation in statistical programming.
-Ihaka: ‘easy to get answers from R, more difficult to learn to get answers in efficient way.’
-Ihaka asks whether it is a system we should be using for the indefinite future, and says that in his considered opinion the answer is no.
–
Ihaka on R’s limitations:
-No scalars, everything is a vector:
http://www.cyclismo.org/tutorial/R/types.html
> a <- 3
> a
[1] 3
> a[0]
numeric(0)
> b <-c(a,2,3)
> b
[1] 3 2 3
> b[0]
numeric(0)
> trees <- read.csv("/rdata/treedata.csv")
> trees[0]
data frame with 0 columns and 126 rows
>
> trees[1]
C
1 1
2 1
3 1
4 1
5 1
6 1
7 1
> attributes(trees)
$names
[1] "C" "N" "CHBR" "REP" "LFBM" "STBM" "RTBM" "LFNCC"
[9] "STNCC" "RTNCC" "LFBCC" "STBCC" "RTBCC" "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC" "STKCC" "RTKCC" "LFMGCC" "STMGCC" "RTMGCC" "LFPCC" "STPCC"
[25] "RTPCC" "LFSCC" "STSCC" "RTSCC"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54
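A minimal sketch of the same point, runnable in a fresh R session: a single number is a numeric vector of length one, and because R indexes from 1, index 0 selects nothing. The variable name a follows the transcript above; nothing here depends on the tree data.

```r
a <- 3
is.vector(a)   # TRUE: a "scalar" is just a vector
length(a)      # 1
a[1]           # 3, indexing starts at 1
length(a[0])   # 0: index 0 selects no elements
is.na(a[2])    # TRUE: reading past the end gives NA, not an error
```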
-I found this confusing at first, after speeding through the tutorials and assuming it was like every other language in having both scalars and arrays. I also found it time consuming to connect to a database, and in dealing with spaces in the names of columns.
-R is convenient for easy and fast statistical work, but has become hardened in its current state due to the ongoing focus on library development.
-Question: examples of what Ihaka is referring to about the limitations of vectors.
His answer from the 2010 paper:
A second problem is that R has no scalar data types and users spend a good deal of time
looking for “cute” ways to vectorise problems which are much more naturally expressed in
a scalar way. The following code carries out a matching technique, returning the values in
the vector x closest to the values in y.
xymatch =
    function(x, y) {
        x[apply(outer(x, y, function(u, v) abs(u - v)),
                1, function(d) which(d == min(d))[1])]
    }
This is certainly a “cute” one-line solution, but the use of the generalised outer product
function outer expands the problem from an m + n scale to an m × n one. While it works
well for small problems, it is not a good way to solve large ones. It would be much more
efficient to use loops and scalar computation.
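As a rough sketch of the loop-and-scalar approach Ihaka mentions (this is not his code): the hypothetical function xymatch_loop below follows the prose description, returning for each value in y the closest value in x. It still makes m × n comparisons, but it never builds the m × n matrix, so its memory use stays on the m + n scale. It assumes x is non-empty.

```r
# Hypothetical loop-based variant: for each y[j], return the closest value in x.
xymatch_loop <- function(x, y) {
  result <- numeric(length(y))
  for (j in seq_along(y)) {
    best <- x[1]
    best_dist <- abs(x[1] - y[j])
    for (i in seq_along(x)) {
      d <- abs(x[i] - y[j])
      if (d < best_dist) {   # strict "<" keeps the first of any ties
        best <- x[i]
        best_dist <- d
      }
    }
    result[j] <- best
  }
  result
}

xymatch_loop(c(1, 5, 10), c(4.2, 9, 0))   # 5 10 1
```

In interpreted R a double loop like this is itself slow, which is part of Ihaka's argument for compiled scalar code.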
-Unboxing values in a vector is extremely computationally expensive.
-Pass by value, which means functions (at least conceptually) copy their arguments to work with them, thus taking up more computer time and memory.
-In addition, because R does not have a stream type, arguments must be limited in size to what can fit in memory. This is a decisive disadvantage in working with and within Big Data systems.
-Lack of compilation to machine code, which is a large limit on the speed of R code. Moving parts of R to C is limited in how much it can alleviate that issue, because of the design of the language. R to byte code compilation has helped by a factor of 5, but more speed is needed.
-Lack of ability to specify types, which is highly important for creating bug free library code, certain important types of code analysis, fast compilation, and the creation of mathematically well defined software, which is being significantly developed technically and commercially by the Haskell community.
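On the point above that arguments must fit in memory: R has no stream type, but for simple cases the limit can be worked around by hand. The sketch below assumes a text file with one number per line and reads it through a connection in fixed-size chunks, so only one chunk is in memory at a time. The names chunked_sum and chunk_size are illustrative, and the temporary file stands in for a file too big to load.

```r
# Write a small example file; it stands in for a file too large for memory.
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:10000), path)

# Sum the numbers in a file, reading chunk_size lines at a time.
chunked_sum <- function(path, chunk_size = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))            # always release the connection
  total <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break   # end of file reached
    total <- total + sum(as.numeric(lines))
  }
  total
}

chunked_sum(path)   # same value as sum(1:10000)
```

This only works because a sum can be accumulated piece by piece; computations that need all the data at once get no help from it.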
—-
add sum of numbers example:
Implementation                  Time      Performance factor relative to slowest
R interpreted                   945.71      1
Python interpreted              385.19      2.50
Python reduce() function        122.10      7.75
Lisp no type declarations        65.99     14.33
Python built-in sum()            49.26     19.20
R built-in sum()                 11.2      84.40
Lisp with type declarations       2.49    379.80
Java                              1.66    569.70
C                                 1.66    569.70
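The gap between the "R interpreted" and "R built-in sum()" rows can be tried out with a small sketch like the one below. The function name loop_sum and the vector size are my own choices, not Ihaka's benchmark setup, and timings will vary by machine and R version (recent R versions byte-compile functions by default, which narrows the gap).

```r
x <- runif(1e6)   # one million random numbers

# Explicit loop: every addition goes through the R evaluator.
loop_sum <- function(x) {
  s <- 0
  for (v in x) s <- s + v
  s
}

system.time(loop_sum(x))   # explicit R loop
system.time(sum(x))        # built-in sum(), implemented in C

# The two results agree up to floating point rounding.
all.equal(loop_sum(x), sum(x))
```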
—
random walk example:
Lisp     R        Machine characteristics
0.215    6.572    2.33GHz/3GB Intel, Mac OS X
0.279    7.513    2.4GHz/32GB AMD Opteron, Linux
0.488    8.304    1GHz/2GB AMD Athlon, Linux
And to do this for 50 values of β would take at least 4 days! This assumes that the computation will complete and not run out of memory.
The Lisp version takes 212 seconds for 10,000 iterations of 10,000 steps for a given β. So for 50 values of β, the expected completion time is 3 hours in total.
—
Biham-Middleton-Levine traffic model
-Ihaka says Lisp outperformed R+C.
–
example of in memory issue:
The following example serves to illustrate the kinds of problem that R has. Dataframes
are a fundamental data structure which R uses to hold the traditional case/variable layouts
common in statistics. The problem, brought to me by a colleague, concerns the process of
updating the rows of a dataframe in a loop. A highly simplified version of this problem is
presented below.
n = 60000
r = 10000
d = data.frame(w = numeric(n), x = numeric(n),
               y = numeric(n), z = numeric(n))
value = c(1, 2, 3, 4)
system.time({
    for(i in 1:r) {
        j = sample(n, 1)
        d[j,] = value
    }
})
This code fragment creates a dataframe with 60000 cases and 4 variables. A loop is run to
update the values in 10000 randomly selected rows of the dataframe.
On my machine, this computation runs in roughly 100 seconds. This is a long time for a
computation on such a small amount of data. On the scale that my colleague was working,
the performance was unacceptably slow.
Knowing a little about the internal working of R, I was able to advise my colleague to
simply take the variables out of the dataframe and update them one-by-one.
n = 60000
r = 10000
w = x = y = z = numeric(n)
value = c(1, 2, 3, 4)
system.time({
    for(i in 1:r) {
        j = sample(n, 1)
        w[j] = value[1]
        x[j] = value[2]
        y[j] = value[3]
        z[j] = value[4]
    }
})
This change reduces the computation time from 100 seconds to .2 seconds and my colleague
found the 500-fold performance boost acceptable.
The problem in this case is caused by a combination of factors. R uses “call-by-value”
semantics which means that arguments to functions are (at least conceptually) copied before
any changes are made to them. The update
d[j,] = value
is carried out by a function call and so the dataframe d is copied. In fact it is copied multiple
times for each update.
This kind of behaviour is completely unacceptable for a data structure as fundamental
as data frames. The current approach to fixing this kind of problem is to move the update
process into C code, where copying can be controlled. This does alleviate the symptom,
but it does not fix the underlying problem which is likely to exhibit itself elsewhere.
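The call-by-value behaviour described above can be seen with a plain vector. In this small sketch (set_first is an illustrative name, not from the paper), the function changes its argument, yet the caller's vector is untouched, because the function works on its own copy.

```r
set_first <- function(v) {
  v[1] <- 99   # modifies the function's own copy of v
  v
}

original <- c(1, 2, 3)
modified <- set_first(original)

original   # 1 2 3: the caller's vector is unchanged
modified   # 99 2 3
```

R delays the actual copy until the first modification, which is why the paper says the copying happens "at least conceptually". The cost shows up when, as with d[j,] = value, a large object is modified over and over.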
–
Ihaka’s recommended answer:
-A new type of R language running on Common Lisp.
-CL can compile to fast machine code.
-CL is a well developed, flexible, powerful and highly capable programming language.
-It has been used for advanced projects like plane ticket scheduling and booking by ITA, which was purchased by Google for $700 million.
-CL can be made to use either pass by value or pass by reference.
-CL can optionally specify types.
-CL is already under continual development by long-time programming experts as a development platform, which saves a lot of precious time for the small statistical community.
-CL has a stream type, thus allowing out of memory computation.
-CL can mix compilation and interpretation, and has facilities for working with binaries from other languages, thus allowing a mixed approach to working with the current version of interpreted R and its libraries. A limited or even full version of the current R could also be created for CL.
-A new R syntax that he partially developed on CL, because he perceived CL’s syntax as being hard to deal with.
defun sum(x)
{
    local s = 0
    do i = 1, n {
        s = s + x[i]
    }
    s
}
-The same function in CL:
(defun sum (x)
  (let ((s 0))
    (doloop (i 1 (length x))
      (setf s (+ s (elt x i))))
    s))
-Version with types:
defun sum(double[*] x)
{
    local s = 0
    do i = 1, n {
        s = s + x[i]
    }
    s
}
-The same function in CL, with type declarations:
(defun sum (x)
  (declare
    (type (simple-array double(*))
          x))
  (let ((s 0))
    (declare (type double s))
    (doloop (i 1 (length x))
      (setf s (+ s (elt x i))))
    s))
-5 minute break for questions.
——-
Part 2, Thoughts on Ihaka’s recommendations.
-From the perspective of someone used to programming in other programming languages, a new and backwards compatible version of R in CL would have saved a lot of time, and for me at least would be more flexible and powerful to work with, based on my familiarity with CL alone.
-His arguments on the need for at least optional pass by reference, scalars, streams, typing and compilation would probably be considered fairly strong by most language experts, as most modern languages either have or can have those features, and they are generally considered to be important and powerful features.
-I have a couple of additional reasons for liking the idea of a new version of R based on Lisp:
-Lisp is very amenable to functional programming and parallelism because its code is written in its own native data structure: data and code both use the same format. That makes it easier to program the programming language, and to get code into a form that reveals more opportunities for parallelization.
-A lot of respected work is being done with functional programming in Haskell, Scala and Lisp. Functional programming helps greatly in reducing bugs and reducing the size of programs. Many stock trading businesses now use Haskell because of its precision, analyzability, speed and shortness of code.
-Creating a new R in Lisp would be ideal for connecting with multiple other programming language communities, because virtually any language can be emulated in Lisp, and Lisp also has powerful facilities for connecting with other languages. For example, a version of Lisp called Gambit Scheme can output C code. A version of Lisp called Typed Scheme can be made to work in ways similar to Haskell. A version of Scheme called Kawa runs on the JVM. A few versions of Scheme run in Haskell.
-Getting the Haskell, Scala and Python communities involved with the R community would be of massive benefit to all involved, because each community has very large and specific resources to offer the others, both technically and commercially. In particular, the advanced and routine statistical programming I have seen in the R community is glaringly lacking in those other language communities. They need R and R-using statisticians.
-In my opinion, Scheme is a better choice than Common Lisp.
-Common Lisp is like a massive and complex tank. Scheme is like a bicycle that can be upgraded to a tank when needed.
-Scheme is much smaller and simpler than Common Lisp, and has an established academically oriented education program for it. Common Lisp was designed for, and is still used by, advanced commercial programmers for complex and varied projects.
-Scheme is an extension of a theoretical programming model called ‘the untyped lambda calculus.’ Haskell is an extension of the typed lambda calculus. Untyped languages are generally easier for programming solutions to new, not yet well-defined problems; typed languages are generally better for avoiding bugs when programming solutions to well-defined problems. In the past, programmers sometimes got caught up in arguments about which one was ‘better.’ Each has an important place, and the ability to use whichever one is needed is ideal. Being aware of this situation is important in dealing with various programming languages and their communities.
-The ideal Scheme to start with is called Racket:
http://racket-lang.org/
-A great book and a related summer workshop are available to learn Racket Scheme:
http://racket-lang.org/learning.html
-Racket Scheme is ideal for the blended approach of an R interpreter, compiled code, Scheme interpreted code, compilation to other languages, FFI connections to other languages, and interpreters for other languages. It’s the ideal cross language glue because of its Lisp aspects and small size. Racket Scheme supports both untyped and typed versions of Scheme.
-Attending the summer workshop would be an extremely valuable experience in working to connect the R community with other language and commercial communities that have a large and little understood need for statistical help.
-I’ll likely also be attending it this year.
-I’m working on a Scheme related Big Data platform that has a significant cost competitive advantage over Hadoop, and networking for clients for it.
-I’m working on a new version of R in Scheme. It’s a secondary priority to my Big Data platform, so it’s not ready for release yet. It will look similar to this:
(R data <- c(1,2,3))
(R plot(data))
-data is a traditional R vector, but it can be picked apart by Scheme code, and can work with streams and big data:
(RN (+ data hstream))
or be converted to scalars:
(first (second data))
it can be typed:
(RNT (+ '(3 4 5) ([integer] data)))
-This format allows the blending of both interpreted and compiled Scheme, old R, new R, and other languages like C, Haskell, Java and eventually Python.
-Email me to be notified when it’s released.
-What are your thoughts on this New_R format?
-I am building the plotting part of it on D3.
-D3 is built on existing web standards, is simple and human readable, and since it uses HTML5, is also good for use on mobile devices. These aspects make it ideal not only for RStudio-like use on the web, but also for the creation of web and mobile apps that use R, and R libraries for use with web and mobile apps.
http://d3js.org/
Resize circles in a symbol map with a staggered delay:
d3.selectAll("circle").transition()
    .duration(750)
    .delay(function(d, i) { return i * 10; })
    .attr("r", function(d) { return Math.sqrt(d * scale); });
-What are your thoughts on plotting standards and LaTeX?