A mixed experience with Julia and saving data frames

21 August 2021

After using Julia successfully in a few new projects, my latest came to a sticky end because of how it saves large data files.

I have used more languages than I can remember for numerical work. About a year ago, I wrote a blog post titled "Which programming language is best for economic research: Julia, Matlab, Python or R?", in which I recommended Julia.

I am a fan, having just used it successfully in a project I ported from Python/Pandas/Numpy. Julia was three times as fast as Python, and the code was a third of the size. Fantastic.

So, being all fired up, we tried Julia on a new project involving CRSP data. The CRSP data on my computer is about 1 gb of compressed csv, and we will run a sequence of different estimation methods on it, each creating files of about 24 million rows and perhaps 7 columns, three integer and the rest floating point. It might take a few hours, even days, to create each of these.

The next thing is to process the data, perhaps computing summary statistics by month, and this is where Julia shines. Doing the aggregate calculation with R's data.table took 2.4 seconds, while Julia's DataFrames needed only 1.18 seconds.

The R code

library(data.table)

df[, .SD[MARKET_CAP < quantile(MARKET_CAP, 0.10),
    list(length(alpha), mean(alpha), median(alpha), mean(A),
    median(A), sd(RET), mean(MARKET_CAP))],
    by = month
]

and Julia

using DataFrames, Statistics

combine(groupby(df, [:month])) do sdf
    dfs = sdf[sdf.MARKET_CAP .< quantile(sdf.MARKET_CAP, 0.1), :]
    d = DataFrame(N = length(dfs.alpha), alpha_mean = mean(dfs.alpha),
        alpha_median = median(dfs.alpha), A_mean = mean(dfs.A), A_median = median(dfs.A),
        sd = std(dfs.RET), MARKET_CAP = mean(dfs.MARKET_CAP))
    return d
end

24 million rows and 7 columns is a lot of data and results in an uncompressed CSV file of around 2 gb, which must be synced to various computers and backed up.

The file size is important, especially when the files are this large. Supposing they get used for a paper, I will need to be confident I can read the files years, even decades, into the future.

So for any method of storing my results, I have three criteria: whether it compresses, what the file size is, and whether I am confident I can read the file in the distant future.

One criterion missing is speed. When it takes 24 hours to generate some research output, it makes no difference whether saving the results takes one second or one minute.

RData files look good on all three criteria. I have some going back a decade, and they always open. The same is true of CSV files. But csv is not the best file format for numerical data, as I will discuss below.

What about Julia? My data is in data frames, and I used the methods discussed in the excellent dataframe tutorial by Bogumił Kamiński and compared the various alternatives by file size and research suitability.

Language   Command       Compression   Confidence   File size
R          CSV           none          very high    1.9 gb
R          CSV           gzip          very high    734 mb
R          CSV           xz            very high    220 mb
Julia      CSV           none          very high    2.1 gb
Julia      CSV           xz            very high    233 mb
R          save()        gzip          high         228 mb
R          save()        xz            high         132 mb
Julia      serialize()   none          very low     1.3 gb
Julia      serialize()   xz            very low     135 mb
Julia      Arrow.jl      none          high         1.3 gb
Julia      Arrow.jl      xz            high         128 mb
Julia      Parquet.jl    none          high         634 mb
Julia      Parquet.jl    xz            high         206 mb
Julia      JDF.jl        gzip          low          524 mb
Julia      JLD.jl        none          low          1.3 gb
Julia      JLD2.jl       none          low          1.3 gb
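
For reference, here is a rough sketch of the kind of write commands behind the Julia rows, assuming the results sit in a data frame called df (the file names are just placeholders; JDF, JLD and JLD2 have their own save functions, omitted here):

using CSV, DataFrames, Serialization, Arrow

CSV.write("results.csv", df)       # plain text csv
serialize("results.jls", df)       # Julia's internal binary representation
Arrow.write("results.arrow", df)   # Arrow format via Arrow.jl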

The file format with the highest confidence is uncompressed CSV, but that took 1.9 gb.

So it makes sense to compress. We tried all of the common methods, and at least for our data, xz is by far the best, beating gzip, bz2 and zip by a large margin.

An xz compressed R data frame, written with save(), is one of the smallest, at 132 mb. Since save() and load() are built into the language, I am pretty confident these files will be readable for the foreseeable future.

Julia serialize() just writes the internal data frame representation to disk and is not guaranteed to be readable across Julia versions, hence the very low confidence. It does not offer compression, but by compressing externally with xz, the file size was reduced from 1.3 gb to 135 mb.
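
As a sketch of what I mean by compressing externally, assuming the xz command line tool is installed and with placeholder file names:

using Serialization

serialize("results.jls", df)      # write the internal representation to disk
run(`xz -9 results.jls`)          # compress externally; produces results.jls.xz

# and to read it back later
run(`xz -dk results.jls.xz`)      # decompress, keeping the compressed copy
df2 = deserialize("results.jls")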

Julia CSV is a good choice, but the files are huge at 2.1 gb, and even externally xz compressed they are 233 mb. gzip would be much worse.

Julia Arrow and Parquet might be a good choice, with high confidence, but the files are large. Arrow isn't compressed, and while Parquet is supposed to be an efficient storage format, at least for my data it isn't. One could, of course, compress them afterwards. The xz compressed Arrow file was the smallest of all, at 128 mb.

JLD and JLD2 do not appear to compress the files, and they seem to use unique/customized file formats and/or are sporadically maintained. Therefore, I would not trust my data to them.

JDF does compress, but not by a huge amount. Also, uniquely among all the methods, it writes a folder with one file per column and one with metadata. I could certainly see a use case, but the folders are annoying to me, and given the talk about format changes on the GitHub page, I am not confident about future readability.

So what to use? The easiest and safest would just be to do it all in R.

If I were to use Julia for this project, I would have to use CSV.write() or Arrow with external (command line) xz compression.

It is tempting to use one of JLD/JLD2/JDF with external compression. I will not do that. If there are three competing packages for saving Julia data, I have no confidence the one I pick will be around and supported in a decade.

I think it would be important for Julia adoption if it allowed one to save compressed (maybe both gzip and xz) serialized variables that are guaranteed to be readable for the life of Julia as a language. This should not be provided in a third party package, which comes with additional risk, but should be a part of the core language, just like R save().

So why serialise rather than just writing the data as csv? Three reasons.

The first is that if the csv writer uses full numerical precision, the files become very large, and if it rounds like R's fwrite() then one loses data (which made our life difficult a few weeks ago). You can see the difference between the R and Julia csv output here, showing the header and first data line from each:

R:     PERMNO,date,alpha,A,RET,MARKET_CAP,X.m1
       10241,19290510,2.98480615495697,6.21303834131563e-07,0.04287,368626,0.038961
Julia: PERMNO,date,alpha,A,RET,MARKET_CAP,X_m1
       10241.0,1.929051e7,2.9848061549569698,6.213038341315627e-7,0.04287,368626.0,0.038961

Second, a data frame contains data in Julia's internal format, which could be integer, float, string, complex, or anything else one can put into a data frame, today and in the future. I would not be confident that such data would survive a round-trip to CSV unchanged.

And finally, I keep a lot of metadata in dictionaries. So the data I work with is a dictionary containing the data frames, along with the variables and versions used to generate the data, the time, the computer, and other relevant information. In R one can just use save(List, file="x.RData", compress="xz"). At least for my workflow, that is helpful.
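
The closest Julia equivalent I can see is to serialize the whole dictionary in one go, with all the caveats about serialize() discussed above. A minimal sketch, with made-up metadata fields and a placeholder file name:

using Serialization, Dates, Sockets

results = Dict(
    "df"        => df,                      # the estimation output
    "generated" => Dates.now(),             # when it was created
    "julia"     => string(VERSION),         # Julia version used
    "machine"   => Sockets.gethostname()    # which computer produced it
)
serialize("results.jls", results)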

csv just isn’t an efficient format for numerical data. You can see that by the 2.1 gb. file size for Julia’s csv export vs. 1.3 gb. for serialize. Neither is compressed. The R csv becomes 734mb and 220gb when compressed with gzip and xz, respectively. Still almost the double size of their serialised counterparts.

csv is useful for sharing data between applications and archival storage, but it is not an optimal format if one is only using a single application. Arrow and parquet might be better for dataframes, but don’t help with Dictionaries.

So, it would be best if Julia could create xz compressed stable serialised files of its variables.
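
For illustration, this is roughly what that could look like today with the third-party CodecXz.jl package, writing the results dictionary from the sketch above; of course, going through a package carries exactly the third-party risk discussed earlier:

using Serialization, CodecXz

# write an xz compressed serialised file
stream = XzCompressorStream(open("results.jls.xz", "w"))
serialize(stream, results)
close(stream)

# read it back
stream = XzDecompressorStream(open("results.jls.xz"))
results2 = deserialize(stream)
close(stream)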

And finally, if I were to make one recommendation to the excellent Julia development team, it would be to ensure that a Google query for “how to save data in Julia” delivers only the one recommended way, just like “how to save data in R” does. The confusing message from googling “how to save data in Julia” is why I wrote this blog.

© All rights reserved, Jon Danielsson, 2021