Opening data

Word cloud for open data

World Bank Group. Sources: Wikipedia.org; opendatahandbook.org; worldbank.org (image cropped)

What are open data?

The Open Knowledge Foundation defines data as open ‘…if anyone is free to access, use, modify, and share it — subject, at most, to measures that preserve provenance and openness.’

The open data movement has roots in open access reforms reaching back to Ancient Greece and, more recently, in the open science movement that started in the 1950s, but it took its modern technological form only in this millennium. Open data share deep philosophical roots with other open movements, including the open source, open access, and open science movements. These movements hold that putting more resources and work in the public domain, for others to use freely in a manner consistent with the Open Knowledge Foundation’s definition, will accelerate research and development on a global scale.

The movement took a quantum leap forward in the early 2000s as technology thought leaders contributed to the open government movement.

Why make data open to the public?

Advocates argue that, in addition to improving government efficiency and transparency, open data reduce corruption and advance public policy analysis and formation by enabling citizens to participate. Open data also spur innovation and the development of new or improved products and services in the private sector.

A study by McKinsey & Company found that open data have the potential to generate more than $3 trillion a year in economic value across the education, health care, and transportation sectors, among others.

In 2009, on the first day of his first term, US President Barack Obama issued his Memorandum on Transparency and Open Government. This marked his commitment to ‘an unprecedented level of openness in Government’, which would eventually include the launch of data.gov as a public repository for federal government data and the passing of the DATA Act, which focuses on transparency in federal expenditure data. Within a similar timeframe, the United Kingdom (UK) launched data.gov.uk, providing another example of a progressive government setting a standard for data transparency and accessibility.

Development of open data

Principles of open government data

In late 2007, thirty open government advocates with global interests, including the technology and government policy notables Tim O’Reilly and Lawrence Lessig, met in the United States to formulate the 8 principles of open government data, which provided a major catalyst and framework for the open data movement:

  1. Complete: All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.
  2. Primary: Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.
  3. Timely: Data is made available as quickly as necessary to preserve the value of the data.
  4. Accessible: Data is available to the widest range of users for the widest range of purposes.
  5. Machine processable: Data is reasonably structured to allow automated processing.
  6. Non-discriminatory: Data is available to anyone, with no requirement of registration.
  7. Non-proprietary: Data is available in a format over which no entity has exclusive control.
  8. License-free: Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.

Follow-up action

Since 2007, thousands of organisations, including national and local governments, non-governmental organisations, international governing bodies, research organisations, and special interest groups, have embraced the open data movement. Internationally adopted open data standards and collective commitments, such as the G8 Open Data Charter, show that opening data is a shared priority worldwide.

Despite the potential of open data, a 2017 report by the World Wide Web Foundation found that only seven governments included a statement on open data by default in their policies, that just one in four datasets had an open license, and that only half of all datasets were machine-readable. Intellectual property, technology, and data hygiene pose significant barriers to adopting and implementing open data initiatives. Intellectual property restrictions increase alongside advances in data-sharing processes.

In the health sector, the complexities of protected health information and sensitive personal data add a layer of difficulty that slows its adoption of open data principles.

Open data in the health sector

In the health sector, the open data movement has grown in parallel with the concept of big data. Open data systems promise opportunities ranging from generating early warning for outbreaks and pandemics, through offering personalised medicine to individuals, to supporting health system management.

Degrees of openness

There are varying degrees of openness of health data, namely:

  • Open data files which anyone can freely download and analyse
  • Restricted files which people must request permission to download and use
  • Data that users can only interrogate using an analytic tool available on the website.

The most restrictive categories apply to data sets that consist of individual health-related records of disease incidence/prevalence, treatment, compliance and outcomes.

Openness for health and health-related data

Data providers remove individual identifiers before rendering the data available to external users. Health data may be:

  • Anonymised survey or research records of people, health events, specimens, households, facilities, resources and so on
  • Linked anonymised patient records and specimens from health facilities and registries
  • Aggregated data such as mortality rates or numbers of health workers per hospital, district or country
  • Assorted information gathered and linked through social media or crowd-sourcing platforms.
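The first and third kinds of data in the list above can be illustrated with a minimal sketch, assuming a small set of invented patient records. The field names in `DIRECT_IDENTIFIERS` are hypothetical; a real dataset would define its own list, and removing direct identifiers is only one step towards anonymisation.

```python
from collections import Counter

# Hypothetical field names; a real dataset defines its own list of identifiers.
DIRECT_IDENTIFIERS = {"name", "national_id", "phone"}


def strip_identifiers(record):
    """Return a copy of the record without the direct identifier fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def count_by(records, field):
    """Aggregate records into counts per value of one field."""
    return Counter(r[field] for r in records)


# Invented example records, for illustration only.
records = [
    {"name": "A. Patient", "national_id": "001", "phone": "555-0101",
     "district": "North", "outcome": "recovered"},
    {"name": "B. Patient", "national_id": "002", "phone": "555-0102",
     "district": "North", "outcome": "died"},
    {"name": "C. Patient", "national_id": "003", "phone": "555-0103",
     "district": "South", "outcome": "recovered"},
]

deidentified = [strip_identifiers(r) for r in records]
cases_per_district = count_by(deidentified, "district")
print(deidentified[0])     # {'district': 'North', 'outcome': 'recovered'}
print(cases_per_district)  # Counter({'North': 2, 'South': 1})
```

The original records are left unchanged, so a data provider could keep the identified file under restricted access and publish only the stripped or aggregated output.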

Health-related open data are available from different sectors, for example census data, economic, employment and education survey data, and climate data. Files of open health data are available on data.gov websites, academic journal websites, institutional websites, United Nations agency websites, or general purpose websites, for example:

Monitoring and surveillance of infectious diseases

Public Health England publishes the monthly number of methicillin-resistant Staphylococcus aureus (MRSA) infections in UK hospitals on data.gov.uk. Using these data, hospitals can compare figures and share best practices.

Linked clinical data

The Danish National Patient Registry (DNPR) links patient data and publishes them for research, under strict conditions of individual confidentiality. DNPR collects longitudinal administrative and clinical data for patients discharged from Danish hospitals, covering, for example, over 8 million people between 1977 and 2012.

Cross-sectional government health surveys

Countries that maintain data.gov websites usually publish national health survey data for researchers to analyse. For example, the US Behavioral Risk Factor Surveillance System undertakes telephone surveys of US residents about their risk behaviours, chronic health conditions, and use of preventive healthcare services.

Cross-sectional data from multiple international sites

The USAID-funded Demographic and Health Surveys (DHS) Program has collaborated with over 90 countries to undertake more than 300 cross-sectional surveys over 30 years. Every survey uses the same set of questionnaire modules, with common metadata and statistical analyses. Datasets are freely available on completion of a short registration form, and the DHS website offers a customized tool to analyse aggregated indicators within or across surveys.

Longitudinal survey data from multiple international sites

The International Network for the Demographic Evaluation of Populations and Their Health (INDEPTH) has created a data repository which includes harmonized longitudinal datasets of health and demographic events in geographically defined populations studied by the network’s research centres in 20 countries across Africa, Asia and the Pacific region.

Kostkova et al propose that: Ultimately, healthcare policymakers at international level need to develop a shared policy and regulatory framework supporting a balanced agenda that safeguards personal information, limits business exploitations, and gives out a clear message to the public while enabling the use of data for research and commercial use.

One such example is the International Code of Conduct for genomic and health-related data sharing. The Code comprises core elements including: transparency; accountability; data security and quality; privacy, data protection and confidentiality; minimising harm and maximising benefits; recognition and attribution; sustainability; and accessibility and dissemination.

The open data progression model

We have developed the Open Data Progression Model to provide stages for governments and organisations to follow in making their data open.

Although there is consensus about best practices for an effective open data programme, there is less agreement about the sequence in which to develop one. There are compelling arguments as to why one stage could precede another, and many of these stages overlap or cycle between each other, but in our experience the Open Data Progression Model minimizes repetition and maximizes the utility of the data.

Stage 1 – Collect the data

Data collection is the foundation on which to build an open data programme. The success of any downstream use of the data depends on their quality and completeness.

Other topics on this website describe methods for collecting health data for specific purposes. We emphasize the additional information that investigators need to collect and provide to assist others to use their data, bearing in mind that they may not be subject specialists. For example, investigators must make sure that they capture data fields that potential users need to understand and validate the data, and use common data standards and schemas whenever possible.

Open source data solutions

Several significant open source solutions provide tools that make data collection and storage easier and more efficient. These tools are built on open source code, which a community of developers, implementers, and users continually improves and develops. They include built-in collection forms and surveys, combined with data storage and data collection on mobile devices that can synchronise and aggregate data to a central server. For example:

The Open Data Kit community produces free and open-source software for collecting, managing, and using data in resource-constrained environments.

KoBoToolbox is a suite of open source tools for field data collection for use in challenging environments.

Epi Info™ is a public domain suite of interoperable software tools designed for the global community of public health practitioners and researchers. It provides for easy data entry form and database construction, a customized data entry experience, and data analyses with epidemiologic statistics, maps, and graphs for public health professionals who may lack an information technology background.

District Health Information Software 2 (DHIS2) is an open source, web-based health management information system platform designed to assist governments and other organizations in their decision-making.

Stage 2 – Document the data

People who work with open data commonly complain that documentation does not provide sufficient description of context, making it difficult to understand a dataset and to determine if it is useful. Providing metadata – or information about data – is critical to helping people understand and validate data, and to encourage usage. The following represent the most critical context issues to capture and share:

Provenance

What is the origin and source of the data? Who collected and aggregated them? Has anyone changed the data since their original collection? By whom? When? How? What is the lineage of the data?

License

Who claims ownership of the dataset?  What is the license of the dataset?  What are its terms of use? Publishing the license clearly alongside the dataset is an absolute must.

Collection Methodology

How did enumerators collect these data?  Did they capture the data using an electronic system or manually?  What was the population from which they collected the data?  Over what time-period?

Database schema

How have data managers organised the data?  If there are multiple files in the dataset, what is the relationship among the files?

Data Dictionary

What does each item of data mean?  What do key abbreviations mean?  Do identifier codes need to be translated?
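One way to make these five context areas concrete is to publish them as a machine-readable metadata record alongside the dataset. The following is a minimal sketch, not a standard: the key names, the dataset, and all values are invented for illustration.

```python
import json

# The five context areas named above; the key names are illustrative.
REQUIRED_SECTIONS = ["provenance", "license", "collection_methodology",
                     "schema", "data_dictionary"]


def missing_sections(metadata):
    """List the required sections that are absent or empty."""
    return [s for s in REQUIRED_SECTIONS if not metadata.get(s)]


# Invented metadata for a hypothetical household survey dataset.
metadata = {
    "provenance": {
        "source": "Example Ministry of Health",
        "collected_by": "district survey teams",
        "changes": ["2019-03-01: removed duplicate rows"],
    },
    "license": "CC BY 4.0",
    "collection_methodology": {
        "mode": "tablet-based household interview",
        "population": "adults aged 18 and over in two districts",
        "period": "2018-06 to 2018-09",
    },
    "schema": {
        "files": {
            "households.csv": "one row per household",
            "members.csv": "one row per person, joined on household_id",
        }
    },
    "data_dictionary": {},  # left empty on purpose to show the check
}

print(missing_sections(metadata))  # ['data_dictionary']
print(json.dumps(metadata["provenance"], indent=2))
```

A check like `missing_sections` could run before publication so that a dataset is not released without its license or data dictionary.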

Stage 3 – Open the data

There are two dimensions to making the data open:

Publishing the data

The two primary criteria to use when choosing where to publish online are:

Visibility: Topical or geographical open data portals often have the infrastructure to release data rapidly and with high visibility. General purpose options include data.world, which has a broad catalogue of open data on different topics and a large community of users, while CKAN, Socrata, and OpenDataSoft specialise in helping organisations build and manage their own custom open data portals.

Utility: The functionality of the platform is key to helping consumers understand, access, and work with the data. Consider whether the open data portal has any capabilities for consumers to explore data quickly, or whether the platform offers Application Programming Interface (API) access. API access enables consumers to pull the data programmatically into the software tools that they use. APIs are increasingly the means to transfer data at scale among tools and systems, and are a big part of what makes the data genuinely accessible in a technical sense.
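As an illustration of API access, the sketch below assumes a portal that runs CKAN and exposes its Action API `package_search` endpoint. The address `portal.example.org` is a placeholder, and the canned response is invented so that the example runs without a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(portal, query, rows=5):
    """Build a CKAN package_search URL for a portal base address."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def summarise(response):
    """Pull title, license and resource formats out of a search response."""
    return [
        {
            "title": d.get("title"),
            "license": d.get("license_id"),
            "formats": sorted({r.get("format", "") for r in d.get("resources", [])}),
        }
        for d in response["result"]["results"]
    ]


def search(portal, query, rows=5):
    """Query a live portal (needs network access and a real CKAN address)."""
    with urlopen(search_url(portal, query, rows)) as resp:
        return summarise(json.load(resp))


# A canned response in the shape CKAN returns, so the sketch runs offline.
sample = {"success": True, "result": {"count": 1, "results": [
    {"title": "Hospital infections", "license_id": "cc-by",
     "resources": [{"format": "CSV", "url": "https://portal.example.org/a.csv"},
                   {"format": "JSON", "url": "https://portal.example.org/a.json"}]}]}}

print(search_url("https://portal.example.org/", "malaria"))
print(summarise(sample))
```

Other platforms offer different APIs, so the URL pattern and response fields here apply only to CKAN-based portals.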

Selecting the license

The absence of a license or the selection of a restrictive or custom license are among the main reasons why open data programmes fail to have their potential impact. Owners should either clearly relinquish all rights to their datasets and dedicate them to the public domain, by noting ‘public domain’ alongside the datasets, or select a recognised open license for all their datasets. Licenses developed by the Creative Commons are now the licenses of choice among dataset owners given their breadth of adoption, their applicability to databases, and how they facilitate collaboration. The Creative Commons website provides a tool for choosing the appropriate license depending on the purpose of the dataset.

When analysts combine datasets from various sources, the most restrictive license in the combination becomes the license for the enhanced dataset or derivative work. All derivative works that use the dataset, even if it is a very small part of the derivative work, are then constrained by that license. Work that involves datasets from multiple sources often faces a complex analysis of how different licenses may conflict with, restrict, or even prohibit certain types of work output.

Stage 4 – Engage the community of data users

According to the Africa Data Consensus: ‘A data community refers to a group of people who share a social, economic or professional interest across the entire data value chain – spanning production, management, dissemination, archiving and use.’

A data community is likely composed of a broad range of people and entities with differing skill sets, including, for example, large organisations such as non-governmental organizations and government agencies as well as independent researchers, non-technical subject-matter experts, and citizen data scientists. A vibrant community is a force multiplier of an open data programme, creating value through three dimensions:

Feedback

The community can provide feedback on what data they are interested in and details of the metadata and context that would be most useful for them. The community can indicate not only what data to invest in collecting but also how to collect and publish them.

Contribution

Community members can help to clean, annotate, and enhance the data, whether this is improving the data dictionary or building schemas and ontologies that can help contextualise the data within a specific field or topic.

Collaboration

Good data work is inherently social, and the global effort for progress benefits not only from the work others have done cleaning and preparing the data, but also from the exploratory analysis, visualisation, and other derivative works others have created from those data.

It is important that the community creates mechanisms to work together efficiently, for example, by naming an owner of a dataset who engages with the community to answer their questions, proactively seek their feedback, and capture their user stories.

Stage 5 – Ensure interoperability

Interoperability is the ability to exchange and use information between systems. Important issues to consider when optimising interoperability are:

Prepare the data

Prepare the data so that they are structured or machine-readable as opposed to unstructured data meant to be read by people. Think about the difference between a word processor document and a spreadsheet. Both might contain statistical data, but users need to read the document to pull data out, whereas they can query the data in a spreadsheet using software.

Use open formats and standards

It is best to publish structured data in open formats and standards, as opposed to proprietary, closed formats. A growing number of open and commercial software programs support open formats and standards. Such software allows consumers of the data to more easily interpret and convert the data within their regular tools. Proprietary formats, on the other hand, often rely on commercial software that consumers would need to purchase or open software based on unpublished specifications, and may have licensing or usage restrictions that make them unsuitable for many projects.

Use tidy data

Use tidy data that provide a standard way to organise data connecting their meaning to their structure – such that a data consumer can easily discover what the columns, rows, and cell values represent. Consider a situation in which an enumerator interviews ten individuals and asks each of them their age, gender, and where they live. A tidy dataset will consist of ten rows (one for each individual) and three columns (one for each variable or type of observation); each cell will contain the value of the corresponding variable (column) for the corresponding individual (row).
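The interview example can be sketched in code. The following assumes an untidy starting layout with one row per variable and one column per person, uses three invented individuals rather than ten to stay short, and adds a `person` identifier column to the three variable columns described above.

```python
import csv
import io

# An untidy layout: one row per variable, one column per person (invented values).
untidy = [
    ["variable", "person_1", "person_2", "person_3"],
    ["age", "34", "27", "51"],
    ["gender", "F", "M", "F"],
    ["residence", "Nairobi", "Kisumu", "Mombasa"],
]


def tidy(table):
    """Transpose so each row is one person and each column one variable."""
    header, *rows = table
    variables = [row[0] for row in rows]
    people = header[1:]
    return [
        {"person": person, **{var: row[i + 1] for var, row in zip(variables, rows)}}
        for i, person in enumerate(people)
    ]


tidy_rows = tidy(untidy)

# Write the tidy table as CSV text, an open and machine-readable format.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["person", "age", "gender", "residence"],
                        lineterminator="\n")
writer.writeheader()
writer.writerows(tidy_rows)
print(out.getvalue())
```

In the tidy form, a question such as "what is the age of person_2?" becomes a lookup of one row and one column, without any knowledge of how the enumerator laid out the original sheet.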

Use standard vocabularies, codes, and taxonomies

Controlled vocabularies ensure that multiple observations for the same variable use the same coding system, supporting comparison and aggregation. It is preferable to use a standard code for values that have a commonly understood meaning. Where there are several common taxonomies for a concept, crosswalk data can map values from one taxonomy to another – allowing data using either one to be joined.
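A crosswalk can be as simple as a lookup table. The sketch below assumes two invented datasets that identify countries with different ISO 3166-1 codes (two-letter and three-letter) and joins them after recoding one of them; the figures are made up for illustration.

```python
# Crosswalk between two country code systems (ISO 3166-1 alpha-2 to alpha-3).
ALPHA2_TO_ALPHA3 = {"KE": "KEN", "GH": "GHA", "IN": "IND"}

# Two invented datasets that code the same concept differently.
health_workers = [{"country": "KE", "nurses": 120}, {"country": "GH", "nurses": 95}]
population = [{"country": "KEN", "population": 1000},
              {"country": "GHA", "population": 800}]


def harmonise(rows, crosswalk, field="country"):
    """Recode one field through the crosswalk; fail loudly on unknown codes."""
    out = []
    for row in rows:
        code = row[field]
        if code not in crosswalk:
            raise KeyError(f"no crosswalk entry for {code!r}")
        out.append({**row, field: crosswalk[code]})
    return out


def join(left, right, field="country"):
    """Inner join two lists of dicts on a shared field."""
    index = {r[field]: r for r in right}
    return [{**l, **index[l[field]]} for l in left if l[field] in index]


joined = join(harmonise(health_workers, ALPHA2_TO_ALPHA3), population)
print(joined)
```

Raising an error on an unmapped code is a deliberate choice in this sketch: silently dropping rows would hide gaps in the crosswalk.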

Stage 6 – Link data

The possibilities of open health data become most fully realized at the final stage in the progression model, when the data are linked. Linking makes data more interoperable, which in turn significantly improves discoverability and facilitates collaboration.

The health research community was one of the earliest adopters of linked data. The pharmaceutical industry has benefited from creating a body of knowledge around particular drug compounds. DrugBank and RxNorm, for example, link individual drugs to clinical trials, drug-drug interaction data, and manufacturer information. This allows pharmaceutical researchers to see where a new drug may be successfully applied or where dangerous side effects may arise if combined with other medications.

The four principles

Tim Berners-Lee outlined four principles that would maximise the potential of linked data, following similar principles to the World Wide Web:

Principle 1: Use Uniform Resource Identifiers (URIs) as names for things;

Principle 2: Use HTTP URIs so that people can look up those names;

Principle 3: When someone looks up a URI, provide useful information about the data in standardised ways (RDF and the query language SPARQL);

Principle 4: Include links to other URIs to discover more things.

The four principles share three aims: 1) to facilitate the organisation of information; 2) to enable linkage to related concepts; and 3) to make it easier for machines and humans to follow those linkages.
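To make the principles more tangible, here is a minimal sketch that names a drug with an HTTP URI, describes it with RDF triples, and links it to another URI. The `example.org` and `example.net` addresses are hypothetical; the `rdfs:label` and `owl:sameAs` property URIs are standard vocabulary terms. The hand-written serialiser handles only URIs and plain string literals, and a real project would more likely use an RDF library.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical HTTP URIs that name things (principles 1 and 2).
DRUG = "http://data.example.org/drug/aspirin"
OTHER = "http://other.example.net/compound/123"

# Each triple is (subject, predicate, object); objects are ("uri", x) or ("literal", x).
triples = [
    (DRUG, RDFS_LABEL, ("literal", "Aspirin")),
    (DRUG, OWL_SAME_AS, ("uri", OTHER)),  # principle 4: link to another URI
]


def to_ntriples(triples):
    """Serialise triples as N-Triples, a line-based RDF syntax (principle 3)."""
    lines = []
    for s, p, (kind, value) in triples:
        if kind == "uri":
            obj = f"<{value}>"
        else:
            # Escape backslashes and double quotes inside the literal.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


def links_from(triples, subject):
    """URIs a client could follow from a subject to discover more."""
    return [v for s, _, (kind, v) in triples if s == subject and kind == "uri"]


print(to_ntriples(triples))
print(links_from(triples, DRUG))
```

Publishing the output at the drug's own URI would let both people and software look up the name and follow the `owl:sameAs` link to a second source.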

Ontologies provide a powerful way of leveraging these linked concepts and the relationships between them. Ontologies extend the idea of using standard identifiers and taxonomies for concepts by modelling the relationships themselves and the logical connections between them.

Challenges

Data sharing is widely regarded as best practice. But there are many difficulties, particularly in sharing individual health records, for example:

Alter and Vardigan point out there are ‘ethical issues that arise when researchers conducting projects in low- and middle-income countries seek to share the data they produce.’ Concerns relate to ethics of informed consent, data management, and intellectual property and ownership of personal data.

Wyber et al observe ‘sheer size increases both the potential risks and potential benefits of [data sharing]. The approach may have most value in low-resource settings. But it is also most vulnerable to fragmentation and misuse in such settings.’

Kostkova et al acknowledge that whereas the potential of opening healthcare data and sharing big datasets is enormous, the challenges and barriers to achieve this goal are similarly enormous, and are largely ethical, legal and political in nature. A balance needs to be struck between the interests of government, businesses, health care providers and the public.

Significant barriers to global progress are lack of data visibility and poor connectedness among people and institutions seeking to solve similar problems. A sustained open data revolution that lowers these barriers would accelerate collaboration and problem-solving on a global scale. This would provide a key to solving some of the world’s biggest challenges in global health.

Additional resources

The complete chapter on which we based this page:

Laessig M., Jacob B., AbouZahr C. (2019) Opening Data for Global Health. In: Macfarlane S., AbouZahr C. (eds) The Palgrave Handbook of Global Health Data Methods for Policy and Practice. Palgrave Macmillan, London.

Other resources

Chignard S. A Brief History of Open Data.

The Open Data Barometer. This site actively tracks and scores the progress and quality of over 100 open data programmes.

G8 Open Data Charter and Technical Annex. In this policy paper published in June 2013, G8 members agree to follow a set of five principles as the foundation for access to, and the release and re-use of, data made available by their governments.

How Linked Data creates data-driven cultures (in business and beyond). This white paper describes the potential of linked data and provides tips for its practical adoption.

The IHME Global Health Data Exchange provides a list of country open data sites.
