Archive: original code library post #
Many recent reports, most recently the Goldacre Report, have highlighted the need for analytic code generated within the health and social care system to be shared. There are many benefits to such a move: ending the practice of “duplicative working behind closed doors” will improve efficiency by allowing code to be reused, and will increase the transparency of the analysis and methods employed.
It is notable, however, that there are pockets of good practice all over the health and social care system, and many great pieces of analytic work are already shared as code, for example the PHE-produced COVID dashboard and the RCPCH Digital Growth Charts API Server. Many analysts and teams also share smaller pieces of work and useful packages for R and Python that lend themselves to reuse. The NHS-R community GitHub repo “demos and how to’s”, for instance, contains many snippets of code and longer scripts for a wide variety of purposes, including common reporting tasks with RMarkdown and several scripts that interact with nationally held APIs of interest to healthcare analysts.
The Problem Being Solved #
The problem in the system at the moment is not simply that individuals do not share their code; many do. A more fundamental problem is that there has been no concerted effort around the knowledge management processes needed to deliver one of the recommendations of the Goldacre Report: “a curated national open library of NHS analyst code”. Three main improvements are needed to produce a useful library of code that finds wide adoption in health and care analytic teams. Firstly, existing code that is shared with an open licence needs to be found, and its content (and the authors of that content) made more prominent in a nationally visible resource. Secondly, analysts who are already sharing code (and those who are not) need to be encouraged to share code that is of high quality and sufficiently intelligible and generic to be worthy of reuse. Code of this kind is likely to be of a higher standard regardless of whether it is shared, and this goal should be presented to analysts as something that will often produce better code, in their own interests. Thirdly, analysts who are already in the habit of writing analytic code and reusing open code (as well as those who are not) need to be encouraged to be critical consumers of the resource: using the code, finding fault with it, filing issues, making pull requests where appropriate, and so on.
The Proposed Activity #
Producing an open library of code would create a highly useful resource for training as well as everyday practice, and would not be resource intensive. The work would take 12 months of full-time effort from one data scientist, with adequate supervision from a senior data scientist. It could be carried out entirely by one individual or divided among several individuals working part time, depending on preference.
There would be several components to the work. The first phase would be wide engagement, firstly with individuals creating open source code and materials, and secondly with individuals who do, or might in the future, make use of shared code (there would, of course, be overlap between these two groups). This phase would attempt to collate and summarise existing material, and would ask analytic teams what other resources might be useful, what languages they should be written in, the possible formats in which they could be presented, and so on.
With the first engagement phase complete, work could begin on the resource itself. It is likely to contain code written in at least two languages (R and Python) and may also include other languages (Julia, SQL, Bash…). It is highly likely that it would be presented as an online book, with chapters, in a similar way to The Turing Way https://the-turing-way.netlify.app/welcome but there are other formats and standards that could be considered (for example, audio or video segments that link out to individual GitHub repos on particular subjects). The format and content of the work should be considered in detail during the engagement phase. As the content begins to take shape, further engagement work could take place to ensure that the content is comprehensible and useful, and to encourage analysts to use the work as well as to contribute to it.
A larger piece of work could also take place to find gaps in the library and to commission solutions or find existing ones. Where no solution can be found, the data scientist working on the library could write the code or, if it is too technical or difficult, could attempt to persuade an individual or team to produce it.
The Change It Would Deliver (how things would be better afterwards) #
This piece of work would bring about a very concrete change in that it would produce a highly useful tool for learning and training as well as for day-to-day work. The engagement and work needed to produce a diverse library from a range of individuals working in health and care would also be likely to lead to better communication between analysts, better visibility of open work and the individuals engaged in it, and hopefully lasting cultural change. Individuals engaged in producing and using the library would be more likely to write better, more generic code, and to seek open code solutions for their day-to-day tasks rather than writing their own. With the right support, the library could itself help to encourage other collaborative pieces of work based on the shared interests of the users and creators of the library (who, of course, overlap). This work would, it is hoped, feed back into the library too, creating a virtuous circle of open material and collaboration leading to larger pieces of collaborative analytic work, which themselves can be written into the library. When the funding ends, it is likely that community projects such as NHS-R community and NHS Pycom would be able to take over the curation and promotion of such a resource, and could even accept fixed amounts of funding for special interest pieces within the library (for example, three months of funding to produce a special chapter on modern reproducible reporting methods).
Approximate costs (could be in FTEs, e.g. “0.6FTE of an ABC for x months”) #
1 FTE of a B7/8A data scientist for 12 months (supervision for the member of staff by a senior data scientist could be provided pro bono, e.g. by a member of the NHS-R community).