One weird trick to getting column types right with read_csv
Using read_csv from the tidyverse is so easy that I didn’t bother to look at the readr documentation for a long time. However, I’m glad I did, because there is, as they say in the click bait world, one weird trick to get your column types right with read_csv. read_csv (or the other delimited file reading functions like read_tsv) does a brilliant job guessing what column types things are but by default it only looks at 1000 rows. Fine for most datasets, but actually I have more than one dataset where the first 1000 rows are missing, which doesn’t help the parser at all. So do it manually and get it right. But what a pain, all that typing, right? Wrong. Just do this:
<pre class="brush: r; title: ; notranslate" title="">
testSpec = read_csv("masterTest.csv")
And you’ll get this output automatically:
<pre class="brush: r; title: ; notranslate" title="">
Parsed with column specification:
cols(
TeamN = col_character(),
Time = col_integer(),
TeamC = col_double(),
Division = col_integer(),
Directorate = col_integer(),
Contacts = col_integer(),
HIS = col_character(),
Inpatient = col_character(),
District = col_character(),
SubDistrict = col_character(),
fftCategory = col_character()
)
You’re supposed to copy and paste that into a new call, putting right any mistakes. And in fact there is one, in this very spreadsheet, the parser incorrectly guesses that Inpatient is character when it is in fact integer- because the first 1000 rows are missing.
So just copy all that into a new call and fix the mistake, like this:
<pre class="brush: r; title: ; notranslate" title="">
testSpec = read_csv("masterTest.csv",
col_types =
cols(TeamN = col_character(),
Time = col_integer(),
TeamC = col_double(),
Division = col_integer(),
Directorate = col_integer(),
Contacts = col_integer(),
HIS = col_character(),
Inpatient = col_integer(),
District = col_character(),
SubDistrict = col_character(),
fftCategory = col_character()
))
If you’re still having problems, you can have a look using problems(testSpec).
Absolute pure genius. The more I use the tidyverse, the more I know about it, and the more I know about it, the more I love it.