“Although we have more data than ever before, a smaller percentage of that data is available to researchers than ever before.”
Academics, journalists, civil society organizations, and regulators have increasingly been trying to answer a range of different questions about the role of online platforms in enabling, facilitating, and amplifying the spreading of disinformation online. To what extent are states and nonstate actors actually trying to manipulate elections or democratic debates online? What is the exact role of the infrastructure of some of these platforms—such as ad-targeting mechanisms or content recommendation algorithms—in amplifying and disseminating disinformation? And to what extent are any of the voluntary measures that platforms are implementing to counter disinformation actually effective?
Access to platform data is seen by many of these stakeholders as a necessary condition to not only start answering some of these questions but also to—where relevant—formulate policy interventions that would hold platforms accountable if and when they are responsible for facilitating or exacerbating harms caused by disinformation. Importantly, access to data has been highlighted as a key issue by European politicians and regulators about platform governance in general.
However, in general, platforms have been reluctant to provide such data voluntarily. Platforms have at times invoked Europe’s General Data Protection Regulation (GDPR) as a key obstacle that prevents them from sharing platform data, including personal data, with independent researchers. In general, their argument rests on the assumption that the GDPR lacks clarity regarding whether and how companies might share data with independent researchers, which results in a conservative attitude to making such data available, given the penalties that can be incurred for violating the GDPR. Additionally, the release of such data could result in evidence-based criticism, public outcry, and regulatory action, which means there is currently little to no incentive for platforms to hand over such data.
This article argues that there is a pressing societal need for a legally binding mechanism that provides independent researchers with access to a range of different types of platform data. Such a mechanism would tilt the incentive structure toward more disclosure and would correct the enormous information asymmetry between platforms and third parties. It would also present a first step toward an auditing regime for platforms and would provide more legal certainty for platforms to hand over personal data to researchers in compliance with Europe’s privacy laws.
The first draft of this paper was published in November 2020 and recommended that the EU create a legal obligation for platforms to make platform data available to independent researchers in order to achieve this goal. Since the publication of that draft paper, the European Commission published its proposal for a Digital Services Act (DSA), which included such a legal provision. Article 31 of the DSA would oblige specific categories of platforms to make data available to independent researchers who act on behalf of a regulator. Such access would be required to monitor and assess the compliance of platforms with new obligations that the EU intends to impose on those platforms in the DSA.
Section I of this article will outline why access to a wide range of platform data for scientific researchers is vital to analyze and mitigate the potential harms resulting from disinformation. Section II will highlight the limits of self-regulation in this context and describe why voluntary mechanisms created by platforms to provide such data have not been sufficient. Section III will highlight how the GDPR currently would allow platforms to share personal data based on consent or legitimate interest, while Section IV argues that the creation of a separate compelled-data-access regime as proposed in the DSA would add a crucial legal framework to secure more legal certainty for both platforms and researchers. However, this framework for compelling platforms to make data available does not specify what the technical, logistical, and legal parameters are for ensuring that such data access protects user privacy and adheres to best practices for ethical and responsible research. A code of conduct on GDPR-compliant access could solve a number of concerns in this area and prove useful as a model outside the EU as well.
I. Access to Data for Scientific Researchers is Vital to Analyze and Mitigate the Harms Resulting from Disinformation
Over the past couple of years, independent researchers whose work aims to analyze, detect, expose, or prevent the creation, facilitation, dissemination, and impact of disinformation online have systematically requested access to a wide variety of platform data in order to move research efforts from primarily descriptive research to causal research. A broad set of actors could benefit substantially from increased access. Journalists, fact-checkers, NGOs, or digital forensics experts perform vital watchdog functions in our democracies, and their ability to improve the public understanding of the challenges of—and potential solutions to—disinformation would be significantly enhanced as a result. In Europe, however, the GDPR carves out specific derogations for “scientific research purposes,” which especially put academic researchers in a privileged position to receive access to platform data.
Different types of platform data
Platform data—or just “data”—is often used as an amorphous all-encompassing term that hides at least three different categories of data, which in the EU would be afforded different levels of legal protection under the GDPR.
The first category consists of predominantly nonpersonal data, which is often proprietary; this can consist of numerical data, metrics, classifiers, or other types of data that relate to the technical functionalities, design choices, or policy decisions of a particular platform. These can include, for instance, statistics about A/B testing; information about microtargeting options used by advertisers; engagement and impression data of watched videos; data that enables network analysis of a video recommendation system; or internal categorizations of specific pieces of content or metrics regarding the enforcement of specific violations of terms of service. This category also includes data about the parameters that determine the ranking, sequencing, rating, or review mechanisms of content, visual highlights, or other saliency tools. In the EU, platforms are already required to outline these parameters “in order to improve predictability” for business users (such as restaurants, travel agents, or shops) to allow them to “better understand the functioning of the ranking mechanism and to enable them to compare the ranking practices of various providers.” The GDPR does not apply to this data insofar as it is not “personal” data.
A second category of data consists of personal data, which according to the GDPR refers to any information relating to an identified or identifiable natural person (‘data subject’). Essentially, personal data is any type of data that relates to a natural person or can reveal the identity of a user of a specific platform, for example, the account information or IP addresses of individual users. The sharing of such data is subject to a number of rules and safeguards.
A third category of data consists of “sensitive” or “special category” personal data, which according to the GDPR refers to the processing of data that reveals someone’s “racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership.” It also includes health data or data “concerning a natural person’s sex life or sexual orientation.” The processing or sharing of such special category data is in principle prohibited, unless an exception to this general principle applies.
Researchers need access to these different categories of data for the purpose of conducting scientific research that can contribute to a number of goals, including, but not limited to: (i) the identification of organized disinformation campaigns by state actors; (ii) the identification and understanding of how platform users engage with disinformation and what societal risks this poses; (iii) the identification and understanding of how platforms’ own policies and products can contribute to societal risks posed by disinformation; and (iv) the testing of the effectiveness of voluntary interventions by platforms to counter disinformation on their platforms.
Identifying organized disinformation campaigns
The field of disinformation studies bridges the fields of sociology, media studies, and political science, with more technical domains of data science and cybersecurity. This field “crystalizes around a shared goal of revealing how malicious actors utilize information and communication technologies.” One very specific purpose for which access to data is needed by some experts in this field is to attribute organized disinformation campaigns to specific, well-funded, state or nonstate actors. Data that allows for the attribution of information operations is crucial to formulate proportionate (counter) responses that might deter similar behavior from taking place in the future.
Research that focuses on attribution is essentially an extended field of research for cybersecurity scholars and practitioners, who have been very critical about the lack of data from platforms. The detection of manipulative actors who engage knowingly and covertly in what Camille François dubbed “viral deception campaigns” relies on a “cat-and-mouse game of a) identifying threat actors willing and able to covertly manipulate public discourse and b) keeping those actors from leveraging social media to do so, as they refine their strategies to evade detection.” François, in her briefing for the U.S. House of Representatives Committee on Science, Space, and Technology, highlighted the “dramatic asymmetry of information” between the platforms on whose infrastructure these influence operations play out, and the rest of the world. There she lambasted some platforms’ community standards or terms of service, which she said either “indirectly prevent the type of external research that may lead to detecting and exposing distortive behaviors (e.g., when existing and important safeguards also prevent researchers from collecting the data they’d need to analyze distortive behaviors) or directly seek to prevent it (e.g., with rules explicitly preventing the use of data in order to perform detection of deceptive behavior).”
François received access to nonpublic platform data on behalf of the U.S. Senate Intelligence Committee in order to scrutinize the Russian campaign targeting the American public around the 2016 presidential election. Her conclusion after seven months of intense research, even with that unprecedented level of access, was that “[t]here remain critical data blind spots ... which undermine our preparation for the threats ahead.” James Pamment points out that the security teams at the major platforms already have an “open, trusted channel for sharing intelligence on disinformation leads and threat actor tactics, techniques, and procedures”—in contrast to other business areas, which indicates in his view at least the “feasibility of collaborating on a shared repository of analytics and campaign-wide data for ... the research community.”
Identifying which platform users are engaging with disinformation and why
A recent paper in the Harvard Kennedy School Misinformation Review describes research that could hypothetically be conducted if platform data were more readily available. Researchers describe that they are currently unable to scrutinize the “true reach” of disinformation that is being shared because impression data is not made available. Access to personal data would enable researchers to scrutinize “motivations for sharing misinformation” and enable analyses of “the extent to which people believe misinformation as they share or receive it, as mediated by user characteristics and the sharer-receiver relationship.” Access to special category data could enable researchers to understand “to what degree does misinformation play on emotions stemming from ideological, political, racial, or religious biases.”
Access to such data could also enable researchers to assess the potential societal risks that are the result of exposure to misinformation, for instance, whether such exposure is “negatively associated with less participation in public discussion” or “less involvement in civic and political activities.” In announcing one of Facebook’s latest data-sharing efforts, Nick Clegg, vice president for global affairs and communications, and Chaya Nayak, head of Facebook’s general research and transparency team, admitted that there is a need for “more objective, dispassionate, empirically grounded research.” According to Clegg and Nayak, “We need to better understand whether social media makes us more polarized as a society, or if it largely reflects the divisions that already exist; if it helps people to become better informed about politics, or less; or if it affects people’s attitudes towards government and democracy, including whether and how they vote.”
Identifying how platforms enable, facilitate, or amplify disinformation
Platforms have taken a number of measures in the past five years to counter various symptoms of the disinformation problem, including by updating their terms of service to outline which specific activities are prohibited on their platforms. However, most of these interventions focus on the behavior of “bad” external actors; they rarely address the core enabling infrastructure of the platforms that can facilitate the spreading and amplification of disinformation.
This core architecture consists of a series of built-in optimization rules, engagement algorithms, and incentive structures that are intended features of platforms like Facebook or YouTube. Active design choices are made about this core infrastructure with the primary goal of increasing our engagement with content on the site, which increases the delivery of targeted ads. These features determine access to information, its amplification, and ultimately the reach of specific pieces of content. These are the same features that are used—and gamed—by every influencer or online marketer that wants to push a specific message. “Bad actors” in that sense are often using the same tools that legitimate business users have at their disposal.
This is illustrated by the fact that despite all the voluntary actions to counter COVID-19-related misinformation by the platforms, that misinformation is still rampant—perhaps because content promotion and amplification systems are largely left untouched by these actions. According to a recent study published in BMJ Global Health, over 25 percent of the most viewed YouTube videos about coronavirus contain false or misleading information, which has led to suggestions that COVID-19-related misinformation is being pushed more frequently than verified health content. Another investigation by the Institute for Strategic Dialogue (ISD) and the BBC in 2020 found that websites known to host disinformation about COVID-19 had received more than 80 million interactions on public Facebook pages since the start of the year. As a benchmark, in the same period, links to the Centers for Disease Control and Prevention and World Health Organization websites gathered around 12 million interactions combined. In another area, The New York Times suggested that Facebook’s recommendations systems, designed to prioritize the growth of closed groups, “most likely supercharged the QAnon community—exposing scores of people to the conspiracy theory.”
When evidence emerges about the enabling role of this core infrastructure in facilitating the amplification and dissemination of disinformation, no action by the platform ensues—unless undeniable evidence is made public. When an internal report at Facebook, obtained by The Wall Street Journal, found that 64 percent of people who joined an extremist group on Facebook only did so because the company’s algorithm recommended it to them, no follow-up action ensued. But when The Markup published a story about how Facebook allowed advertisers to target people who were interested in “pseudoscience,” the company removed this targeting option. Hence researchers want access to any relevant platform data, which includes proprietary data but also personal and sensitive data, in order to properly scrutinize the role of this core architecture in amplifying and disseminating disinformation.
Testing the effectiveness of voluntary interventions by platforms to counter disinformation
Another crucial reason disinformation scholars want access to data is to assess the effectiveness and proportionality of specific, voluntary interventions by the platforms to counter disinformation. Facebook’s fact-checking partners, for instance, “don’t know how well their efforts perform at reducing the spread of misinformation.” As Gordon Pennycook and David Rand argue: “Experimental investigation of interventions, rather than implementation based on intuitive appeal, is essential for effectively meeting the misinformation challenge.” Specific interventions that researchers would want to investigate include labeling news headlines with fact-checking warnings and prompts that nudge users to consider accuracy before sharing. Fact-checkers earlier complained that they don’t know the effect of platform strategies, such as reducing the reach of a debunked piece of content or automatically displaying related articles, in dissuading people from sharing a link that contains debunked content. As Michelle Amazeen and Chris Vargo conclude: “Researchers need visibility into these actions to assess how political ideology, media use, and media literacy interact with the steps platforms are taking to correct misinformation.”
A report by the ISD on the activities of social media platforms to counter COVID-19-related misinformation illustrates how platforms have responded quickly to the challenges posed by bad actors, including by setting up information hubs that share verified updates from trusted sources such as the WHO; labeling, downranking and/or removing content flagged as false or misleading by experts; and prohibiting ads that aim to profiteer off the pandemic. Unfortunately, there is no way to independently assess the effectiveness of these interventions. Statistics given by platforms need to be taken at face value, because claimed successes cannot be independently verified. As the ISD notes:
Without better access to data and insight on companies’ decision-making systems, both human- and machine-led, we cannot determine with certainty why some areas of policy appear more effective or better enforced than others. ... Without such data any conclusions drawn about the response of these platforms must rely on some element of extrapolation and inference.
Voluntary mechanisms created by the platforms to provide access to data are not sufficient
Platforms have always made certain types of data available to a limited extent to the research community via public APIs, or application programming interfaces. These APIs have been regarded as “perhaps the most valuable resource that platforms have offered to third-party researchers.” They allow third parties to request machine-readable data in bulk on a range of relevant topics. Twitter, for instance, has for years provided relatively generous access to public tweets in real-time via its open streaming API and to historical data via its search API, which explains why many studies in the field of disinformation disproportionately focus on the use of that platform. YouTube has a similar access regime, which allows the downloading of public YouTube data about channels, videos, and searches. In the wake of the Cambridge Analytica scandal, Facebook shut down Instagram’s public API and substantially reduced the functionalities of Facebook APIs that provided data on public activities on its public events, groups, and pages. Alexander Sängerlaub, for instance, argued that after this reduction he was unable to replicate his own research on the 2017 German federal elections. From October 2019, Facebook also made data available to pre-vetted researchers via its CrowdTangle API. The data available to researchers through CrowdTangle describes aggregated interactions with Facebook and Instagram posts from public pages, public groups, or public people, including engagement data such as the number of user reactions, shares, comments, and comparisons to a benchmark that illustrates overperformance or underperformance for each kind of interaction.
These APIs all provide access to public data sets, which pose limited GDPR concerns. As pointed out above, however, researchers would require access to more granular data to investigate a wide range of topics. Interestingly, from 2018 onwards, these requests started to resonate among political stakeholders in the EU. In general, platforms’ voluntary commitments to make data available to researchers have been characterized by legal and technical snafus. Unilateral changes to earlier APIs with little to no notice to independent researchers have seriously hindered research that would have a clear public interest. Despite the goodwill and genuine efforts from a number of platform representatives and the research community to make data access mechanisms a success, the results have been patchy at best and illustrative of the major power asymmetries between the platforms and the independent researchers.
II. The Limits of Self-Regulation: The EU’s Code of Practice on Disinformation and Facebook’s Political Ads Archive
The EU’s interest in countering disinformation has for a long time focused almost exclusively on countering narratives and false information originating from Russia. This started to change in official documents from 2018 onwards, when the EU announced the creation of a “code of practice on disinformation” that would commit online platforms and the advertising industry—among other things—to provide academia with “access to platform data (notably via application programming interfaces), while respecting user privacy, trade secrets, and intellectual property.” This would enable researchers “to better understand the functioning of related algorithms and better analyze and monitor disinformation dynamics and their impact on society.”
However, this broad intention did not survive the drafting process of the code. In the final Code of Practice, the signatories “acknowledged” the importance to “take the necessary measures to enable privacy-compliant access to data for fact-checking and research activities” and to “cooperate by providing relevant data on the functioning of their services, including data for independent investigation by academic researchers and general information on algorithms.” Hence relevant signatories committed to “support good faith independent efforts to track disinformation and understand its impact” and stated explicitly that “this will include sharing privacy protected datasets, undertaking joint research, or otherwise partnering with academics and civil society organizations if relevant and possible.” Relevant signatories also committed not to “prohibit or discourage good faith research into disinformation and political advertising on their platforms.” The signatories only committed to undertake those actions that correspond to the product or service they offer and “their technical capabilities.”
In practice, these promises boiled down to providing access to nonpersonal data via APIs to a limited data set, in this case, data about political ads that was made available via an ad archive. Of the three platforms that set up political ad archives (Google, Twitter, and Facebook), Facebook’s example acts as a cautionary tale about the limits of self-regulation in this area.
According to Facebook, the ad library lets people “easily see how many political and issues ads were run in a given country—as well as aggregated advertiser spend and top searched keywords in the Ad Library.” However, while ordinary users could indeed relatively quickly look up individual ads to a certain extent, research access to the library’s data was more problematic.
Seventy-five independent researchers, brought together by the Mozilla Foundation, argued that many meaningful data points were missing from the archive, including targeting criteria, impression data, engagement data, and data about microtargeting. Independent researchers and civil society organizations accused Facebook of “blocking the ability of independent researchers to effectively study how political disinformation flows across its ad platform.” In response, and in private, Facebook has “responded to concerns raised about its ad API’s limits by saying it cannot provide researchers with more fulsome data about ads—including the targeting criteria for ads—because doing so would violate its commitments under the EU’s [GDPR] framework.”
Moreover, the tool Facebook built was so flawed that it was seen as “effectively useless as a way to track political messaging.” Researchers that originally set out to track political advertising ahead of the European elections in May 2019 “instead ended up documenting problems with Facebook’s library after managing to download the information they needed on only two days in a six-week span because of bugs and technical issues, all of which they reported to Facebook.” A study by the French government confirmed the same findings. Crucially, the ad library was missing information about the targeting data that determines who sees the ads. When researchers at New York University built a browser extension that collected both the content of the political ads and the targeting data, Facebook eventually sent a cease-and-desist letter to the researchers. The letter ordered them not only to shut down the browser extension but also to delete all the data they had collected.
It is important to dwell on this topic since the ad library and its API have been touted many times by senior executives as an example of how Facebook was a responsible corporate citizen that could be trusted to fix its own problems. In response to Twitter’s announcement that it would cease all political ads on its platforms, Facebook’s Chief Operating Officer Sheryl Sandberg responded that Facebook does not have to cease political advertising because the platform is “focused and leading on transparency,” explicitly citing Facebook’s ad archive efforts. By leaving it up to the platforms to decide how they should implement their obligations under the code of practice, the access to data regime for researchers remained an empty box.
By January 2019, the European Commission itself started to admit that it was “deeply concerned by the platforms’ failure to provide specific benchmarks to measure progress, by the lack of detail on the actual results of the measures already taken and lack of detail showing that new policies and tools are deployed timely and with sufficient resources across all EU Member States.” In March 2019, European Commissioner Julian King “urged the platforms to do more to improve independent scrutiny” and to ensure that the platforms are “not just marking their own homework.”
In its final assessment of the code of practice, the commission singled out the lack of access to data as “a fundamental shortcoming” of the code. The provision of data and search tools required to detect and analyze disinformation cases was seen as “episodic and arbitrary,” and did “not respond to the full range of research needs.” The voluntary nature of existing data-sharing mechanisms was seen as insufficient and hence “a more structured model for cooperation between platforms and the research community should be developed.”
The limits of data access partnerships
To its credit, Facebook has been experimenting with more bespoke and targeted schemes for scholarly data access. The most prominent one is Social Science One, which was announced in April 2018 as a new partnership between independent researchers and Facebook “to study the impact of social media on democracy and elections, generate insights to inform policy at the intersection of media, technology, and democracy, and advance new avenues for future research.” This “new approach for industry-academic partnerships” has been pioneered by Gary King (Harvard) and Nate Persily (Stanford), who decided to set up a trusted “third party”—Social Science One—whose members consist of a commission of senior academics who came to an agreement with Facebook on the scope of a research project.
Social Science One’s press release in July 2018 initially promised the release of “about a petabyte of data with almost all public URLs Facebook users globally have clicked on, when, and by what types of people, including many links judged to be intentionally false news stories by third party fact checkers.” Ultimately, in September 2019, it released a limited data set of 7 gigabytes, comprising 32 million “misinformation” URLs that have been shared on Facebook. The long delay resulted in significant criticism from the original funders of Social Science One, who expressed frustration that the 83 independent scholars whose proposals were selected for funding received “access to only a portion of what they were told they could expect, and this has made it difficult or, in some cases, impossible for them to complete the approved research.” A year and a half after the initial announcement of Social Science One, in December 2019, its co-chairs and its European Advisory Committee stated that “Facebook has still not provided academics with anything approaching adequate data access.” They argued:
The current situation is untenable. Heated public and political discussions are waged over the role and responsibilities of platforms in today’s societies, and yet researchers cannot make fully informed contributions to these discussions. We are mostly left in the dark, lacking appropriate data to assess potential risks and benefits. This is not an acceptable situation for scientific knowledge. It is not an acceptable situation for our societies.
The reasons for the delay were twofold. Initially, Social Science One thought that researchers could do their research using Facebook’s systems, but according to Persily, its co-chair, “The company did not have structures that could be readily adapted to give parties access to specific data.” Hence a customized “unprecedented research infrastructure” had to be built from scratch in order to secure the privacy of the data sets, using differential privacy techniques.
The key rationale put forward by Facebook, however, was that they were constrained by restrictions in both the GDPR and the “consent decree” it operates under with the U.S. Federal Trade Commission. Facebook argued that those restrictions “prevent researchers from analyzing individual level data, even if de-identified or aggregated.” Despite these being the key rationales to prevent access to such data, no detailed reasons were provided at that time, which led the Social Science One co-chairs and its European advisory committee to urge the “major digital platforms to offer formal, written analyses of any legal barriers they claim prevent them from providing access for academic research, including with regards to the European Union’s [GDPR].” The academics argued that the GDPR supported “a more permissive interpretation with respect to academic data sharing for public good.”
The end result was a compromise, where Social Science One complied “with Facebook’s interpretation of the applicable privacy laws,” which resulted in applying differential privacy techniques to the URL data set. This introduced statistical noise and censoring into the data set, which meant that the usefulness of the data for research was significantly curtailed. Despite the good faith efforts of many people involved in this project, who genuinely wanted to make it a success, the history of Social Science One is an example of the limits of voluntary data-sharing approaches between platforms and researchers, where access to data is determined exclusively by the interpretation of the GDPR by one dominant platform.
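To make the trade-off between privacy protection and research value concrete, the following is a minimal sketch of how noise and censoring can degrade aggregate counts. It is an illustration only and not the mechanism actually applied to the URL data set: the function names, the Laplace noise model, the threshold, and the example URLs and counts are all assumptions introduced here. It further assumes that the list of URLs is fixed and public and that each user contributes to at most one count.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred on zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release_counts(true_counts, epsilon, threshold, rng):
    """Add noise to every count, then censor noisy values below the threshold."""
    # Hypothetical setup: each user changes a single count by at most 1,
    # so the noise scale is 1 / epsilon.
    scale = 1.0 / epsilon
    released = {}
    for url, count in true_counts.items():
        noisy = count + laplace_noise(scale, rng)
        if noisy >= threshold:  # censoring is applied after noise is added
            released[url] = noisy
    return released


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up share counts for three made-up URLs.
    shares = {"example.org/a": 12000, "example.org/b": 40, "example.org/c": 3}
    for eps in (1.0, 0.01):
        print(eps, release_counts(shares, eps, threshold=10, rng=rng))
```

In this sketch a smaller epsilon means stronger privacy protection but noisier counts, so rarely shared URLs are either distorted or dropped altogether. That is one way to see why researchers interested in low-volume content or small subgroups can find such a data set much less useful than the raw counts.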
Most recently, Facebook introduced a U.S. 2020 election research initiative, which pairs a group of outside scholars with internal Facebook employees to study Facebook’s and Instagram’s impact on four key outcomes that have dominated public and academic attention: political participation, political polarization, knowledge and misperceptions, and trust in U.S. democratic institutions. Only Facebook employees will be able to “touch” and analyze the raw data, but “both teams will work together to devise appropriate monitoring systems for assuring the scientific integrity of the research.” This is an important, and promising, new development, but nevertheless there are still limitations to this voluntary approach. Most importantly, Donovan has highlighted how these arrangements do not solve the “unevenly distributed access” to data for researchers. She highlights how this creates “several problems for the advancement of social science, including fracturing development of a shared replication methodology, thus making evaluation of findings impossible. It also encourages secret transactions between researchers and companies, contradicting the scientific principles of openness, public good, and peer-review.”
III. Does the GDPR Prevent Platforms from Providing Access to Platform Data to Independent Researchers?
Until now, this paper has highlighted the valid reasons that independent researchers have to request access to platform data, including personal data. These requests have been acknowledged and supported by various officials and bodies of the EU, which are equally interested in having access to platform data to understand platforms’ role in facilitating and amplifying disinformation, as part of a new regulatory framework. Partly as a result of this political pressure, some platforms have set up voluntary mechanisms that give access to a limited subset of predominantly public or anonymized platform data, including via public APIs or data-access grants and partnerships. Both researchers and European policymakers consider these mechanisms insufficient, for a variety of reasons.
One crucial reason a number of platforms have invoked for this limited appetite to expand the sharing of (personal) data is their obligations under the GDPR, which contains a basic premise: Personal data can only be processed by a “data controller” if the processing activity falls under one of six legal categories. In this case, the data controllers are the platforms that organize, collate, and determine the means and purposes of processing. By disseminating, transmitting, and making such data available to independent researchers, those data controllers will be engaging in an act of processing.
A very strict interpretation of these obligations by the platforms argues that the GDPR would not allow the sharing of personal data from the platform to a researcher, while a more nuanced interpretation argues that the GDPR would not allow the sharing of personal data without taking precautions that in practice result in platforms only making available anonymous data or aggregated data. Whether those interpretations are true or not is subject to valid arguments on both sides. However, the fact that it requires an argument is in itself a barrier to handing over data. This uncertainty is undesirable for both researchers and the platforms.
This paper argues that the GDPR currently would allow platforms to share personal data based on consent or legitimate interest, and in the future, Article 31 of the DSA would allow the sharing of such data, and would in fact compel the sharing of such data for “vetted” academic researchers for the “purpose of conducting research that contributes to the identification and understanding of systemic risks,” such as the dissemination of illegal content; inauthentic, automated, or otherwise intentional manipulation; and “negative effects” on the “exercise of fundamental rights.”
Consent as a legal basis to provide access to personal data
Personal data could be shared with independent researchers if the data subject has given his or her consent to the platform to share their personal data. Consent under the GDPR has specific requirements and a high threshold to be valid. The GDPR defines consent as “freely given, specific and informed, and granted by an unambiguous affirmative action.” That definition is augmented by specific “conditions for consent” in Article 7 of the GDPR. Note that one of the main bases for processing special category data is “explicit” consent.
In addition to the threshold difficulties associated with consent, relying on consent could pose a number of other difficulties for platforms. For example, including a clause in the nonnegotiable terms of service of a platform that states that “by using this platform you agree that your data will be shared for research purposes” is unlikely to be “freely given” according to the GDPR, due to the power imbalance between the user and the platform, and hence a platform would likely not be able to rely on this form of consent as a legal basis to share personal data with an independent researcher. Moreover, Article 7(2) of the GDPR means that consent cannot be bundled with other terms, and Article 7(3) of the GDPR means the data subject must be able to withdraw consent as easily as it was provided.
Furthermore, users of a specific platform would need to fully understand in advance the purpose of the research in order for their consent to be “specific and informed.” This isn’t always feasible, especially in the sort of big data research that characterizes the field of disinformation scholars. Research usually happens after the data is collected, which makes it hard to provide “informed” consent a priori. The GDPR accommodates this reality of academic research via the research exemption in Article 5.1.b, and users could in principle also consent to a platform sharing their data for more general research purposes or on an ad hoc basis, where a platform asks a selection of their users whether they would want to contribute their data for a specific and well-delineated research project.
The practical implications of relying on consent, however, are not trivial. Researchers would need to be able to accommodate requests of users who withdraw their consent, which under the GDPR should be possible at any time during the research, and in a way that is as easy as giving consent. This can have a significant impact on the research process and its conclusions. Requesting consent can also reduce the ultimate utility of a data set for researchers, since users that opt into a study may not be representative of the relevant population the researcher is intending to study. In addition, users who consent in this way may change their behaviors precisely because they know those behaviors are being studied.
Legitimate interest as a legal basis to provide access to data
Personal data could also be shared for the purposes of the legitimate interests pursued by the platform or by the researcher. In a 2017 case, the Court of Justice of the EU held that if a controller (in this case, the platform) wants to rely on legitimate interests as a legal basis, it has to satisfy a challenging three-part test.
First, the platform has to demonstrate that the interest that is being pursued is legitimate. Carrying out scientific research is considered to be a legitimate interest, but this interest must be “sufficiently clearly articulated to allow the balancing test to be carried out against the interests and fundamental rights of the data subject.” It could be argued that it would be difficult for platforms to balance these interests without having detailed knowledge in advance about the research purpose, aims, and analytical methods of a specific research request. However, requiring this information from a researcher in advance can jeopardize the independence of the researchers, and the legitimacy of any resulting study, since a platform would in theory be able to deny access if it does not like a particular topic of a research project.
Even less desirable, the platform would also need to show that a specific research methodology is “necessary” to achieve those aims.
Finally, the balancing test would require the platform to analyze to what extent the processing activity of the researcher might pose a risk to the fundamental rights of the data subject, which would require inter alia knowledge of the procedural, contractual, and other safeguards that researchers are planning to put in place to safeguard the personal data and identities of the data subjects.
In summary, platforms could rely on legitimate interest as a legal basis to transfer data, but relying on this legal basis requires measures that could result in platforms acting as de facto gatekeepers that are able to decide on the validity of specific research proposals and methods, which would be an undesirable situation. This concern also applies to reliance on consent as a legal basis for platforms to provide access to data, since the platforms would still need to assess the necessity and legitimacy of the purpose for the research design.
IV. How the Digital Services Act Creates a New Legal Obligation for Platforms to Hand over Data to Researchers
While consent and legitimate interest are two legal mechanisms that platforms could use to provide access to data to researchers, both have problems. This section looks at another potential legal mechanism: A platform would be able to share data with independent researchers if this sharing is “necessary to comply with a legal obligation” to which the platform is subject in EU law or a national member state law. There is currently no such EU-level legal obligation that would force platforms to share (personal) data for general research or auditing purposes, but the EU has envisaged such an obligation in Article 31 of the DSA. Article 31.2 states that “upon a reasoned request from the Digital Services Coordinator of establishment or the Commission, very large online platforms shall, within a reasonable period, as specified in the request, provide access to data to vetted researchers for the sole purpose of conducting research that contributes to the identification and understanding of systemic risks as set out in Article 26(1).” Let’s unpack what this means.
The DSA will regulate the obligations of digital services that act as intermediaries in their role of connecting consumers with goods, services, and content. As such, it updates the e-Commerce Directive, which dated from 2000. According to the European Commission, these new rules are “an important step in defending European values in the online space,” which will contribute to “setting a benchmark for a regulatory approach to online intermediaries also at the global level.” The proposals from the European Commission provide the starting point for both the European Parliament and the European member states to adopt legislation at the EU level. As co-legislators, they will first amend the proposals in line with their preferences before agreeing on a compromise text. This procedure is expected to last up until the summer of 2022 at the earliest.
The supervised risk assessment approach of the Digital Services Act
The most innovative part of the DSA proposal can be found in Chapter III, where the EU introduces a range of new due diligence obligations that are adapted to the type and nature of the intermediary service concerned. The proposal sets up a “supervised risk management approach,” in which certain substantive obligations are imposed only on “very large platforms,” which have an average of 45 million active users in the EU, and which due to their reach have acquired “a central, systemic role in facilitating the public debate and economic transactions.”
The European Commission argues that once a platform reaches this threshold, the systemic risks it poses can have a disproportionately negative impact on our societies given their reach and ability to facilitate public debate and disseminate information online. The European Commission argues that in the absence of effective regulation and enforcement, these platforms can design their services to “optimise their often advertising-driven business models,” without “effectively identifying and mitigating the risks and the societal and economic harm they can cause.”
Hence, the European Commission imposes a duty on these platforms to conduct risk assessments on the systemic risks stemming from the functioning and use of their services, as well as potential misuses by the recipients of the services, and then take appropriate risk mitigating measures. The proposal lists three broad categories of systemic risks, including (a) the dissemination of illegal content through their services, (b) any negative impact of the services on the exercise of fundamental rights, and (c) the intentional manipulation of their services that can have a foreseeable negative effect on a range of public policy goals. These include “the protection of public health, minors, civic discourse, or actual or foreseeable effects related to electoral processes and public security.” After having identified those risks, platforms should deploy “reasonable, proportionate and effective” means to mitigate those risks. These include a broader range of actions than removing content and can include changes to algorithmic recommendation systems or discontinuing advertising revenue for specific content.
Enforcement of the DSA will work on three different levels. The default rule is that ensuring adequate oversight and enforcement should, in principle, be attributed to the member states. However, the European member state in which the “main establishment” of the provider of the intermediary services is located, shall have jurisdiction over the due diligence obligations for platforms, which would typically be an Irish regulator. This regulator is called the “digital services coordinator of establishment.” Where systemic risks emerge across the EU posed by “very large online platforms,” the proposed regulation provides for supervision and enforcement at the EU level, mainly via the European Commission.
These risk assessments and mitigating measures are subject to independent audits that can assess the effectiveness of these measures. Where the outcome of the audit is not positive, operational recommendations will be made regarding specific measures to achieve compliance with the company’s obligations under the DSA. Ultimately, if a platform doesn’t sufficiently address these recommendations, the European Commission may further investigate this as a potential infringement of the DSA, impose interim measures, and ultimately impose fines of up to six percent of a company’s total turnover in the preceding financial year. Supplying “incorrect, incomplete or misleading information in response to a request” can lead to fines not exceeding one percent of the total turnover in the preceding financial year.
Mandatory access for vetted researchers
The European Commission acknowledges that independent researchers can play a crucial role in helping auditors, the “digital services coordinator of establishment,” and the European Commission with their respective oversight roles to supervise whether platforms comply with their obligations under the DSA. Hence, the DSA provides a framework for “compelling access to data from very large online platforms to vetted researchers.”
Upon a “reasoned request” of the Commission or the digital services coordinator of establishment, very large online platforms “shall provide access to data to vetted researchers ... for the sole purposes of conducting research that contributes to the identification and understanding of systemic risks as set out in Article 26(1).” Researchers need to fulfill four main criteria in order to be “vetted” and be part of a pool of researchers that can undertake research on behalf of these actors. They need to: (i) be affiliated with academic institutions, (ii) be independent from commercial interests, (iii) have proven records of expertise in the fields related to the risks investigated or related research methodologies; and (iv) commit and be in a capacity to preserve the specific data security and confidentiality requirements corresponding to each request. It is not clear how such a vetting process would work in practice, for instance through a professional licensing regime or via certification bodies.
The DSA includes a nonexhaustive list of three categories of data that should be provided by platforms, through online databases or APIs, including:
- the data necessary to assess the risks and possible harms brought about by the platform’s systems,
- data on the accuracy, functioning, and testing of algorithmic systems for content moderation, recommender systems, or advertising systems, or
- data on processes and outputs of content moderation or of internal complaint-handling systems within the meaning of this regulation.
Article 31 is crucial for the EU’s supervised risk management approach for three reasons: (1) researchers can act as a counterweight to the platforms’ own risk assessment analysis, (2) researchers can act as pathfinders that dig up evidence of systemic risks, and (3) researchers can assess the effectiveness of platforms’ proposed risk mitigating measures.
Researchers would get an unprecedented level of access to data, but the proposal takes into account legitimate interests of the platforms as well. A platform can ask to amend the request to provide data if it does not have access to the data or when giving access could in its view “lead to significant vulnerabilities for the security of its service or the protection of confidential information, in particular trade secrets.” However, given the crucial role for pre-vetted researchers to identify systemic risks that need to be mitigated, and their role as independent information providers to auditors, this derogation is overly broad. Recital 60 of the DSA states explicitly that very large online platforms should give the auditor “access to all relevant data necessary to perform the audit properly” and the auditors should also “be able to make use of other sources of objective information, including studies by vetted researchers.” It further says that auditors should “guarantee the confidentiality, security and integrity of information, such as trade secrets.” Pre-vetted researchers must live up to the same standards, and their vetting process should be conditional upon their ability to live up to those standards, but security reasons and trade secrets should not be a ground for a platform to refuse access to data a priori. There should be no extra reasons for platforms to refuse access to data for researchers as compared to auditors.
The proposal also states that all requirements for access to data under this framework should be “proportionate and appropriately protect the rights and legitimate interests, ... including the recipients of the service.” As Donovan points out, “researchers can unintentionally expose research subjects to a range of harms, including identity theft, financial fraud, harassment, abuse, or reidentification.” Potential GDPR violations can also be added to this list. This is where potential GDPR concerns about sharing of data enter the DSA framework. The European Commission stated in its European Democracy Action Plan from December 2020 that “the GDPR does not a priori and across the board prohibit the sharing of personal data by platforms with researchers,” and opened the door in the DSA to solve these challenges at a later stage. It further announced that it would adopt a separate delegated act to “lay down the technical conditions under which platforms should share data and the purposes for which the data may be used.” This delegated act will elaborate the “specific conditions under which such sharing of data with vetted researchers can take place” in compliance with the GDPR.
One point of criticism of this proposal is that it limits access to data to academic researchers. From a GDPR perspective, it does make sense to limit access to data to scientific researchers, but this does not necessarily need to be translated as academic researchers. This is one issue that could potentially be solved through a code of conduct on GDPR-compliant access to platform data.
Establish a code of conduct on GDPR-compliant access to platform data to facilitate sharing of data
The DSA would provide a much-welcomed legal basis for platforms to hand over access to platform data to academic researchers. However, this framework for compelling platforms to make data available does not solve the question of how researchers could get access outside this specific framework, without the intervention of the European Commission or the digital services coordinator of establishment. Also, as the Commission already acknowledged, establishing this legal basis is only a first step to enable GDPR-compliant access to data.
Once that legal basis is known, the platform, in its role as controller, also needs to take into account a number of key data protection principles before it can share data. These principles include that the platform has to ensure that the data is shared in accordance with the purposes for which they were collected (“purpose limitation”), and ensure that technical and organizational measures are put in place in order to respect the principles of data minimization, accuracy, storage limitation, and preservation of the integrity and confidentiality of the data. The platforms also would need to adhere to a number of transparency obligations.
The drafters of the GDPR implicitly conceded that adhering to all these principles might be difficult in a research context, and hence introduced a number of derogations to some of these principles. Importantly, the sharing of special category data may be allowed if this is seen as necessary for archiving purposes in the public interest, scientific, or historical research purposes, if the processing complies with the requirements imposed by Article 89 (1) of the GDPR. Additionally, there are three other crucial derogations: (1) the GDPR presumes that sharing data for research purposes is considered to be compatible with the initial purposes of data collection, (2) personal data used for research may be exempt from the exercise of individual rights, such as the right to request access to your data, and (3) platforms are exempt from providing data subjects with specific pieces of information.
While the GDPR sets out the general principles and rights that need to be respected by data processors and data controllers, it does not spell out in detail how these rights and principles should be applied in very specific contexts. Neither does the DSA. To solve this situation, the GDPR has encouraged the creation of codes of conduct under Article 40 of the GDPR, where associations and other bodies representing categories of controllers or processors can “contribute to the proper application” of the GDPR, “taking account of the specific features of the various processing sectors.” Such a code should aim to codify how the GDPR shall apply in a “specific, practical and precise manner.”
Associations of academics in a specific field, trade associations that represent platforms, consortia of associations, or any other body that represents a relevant sector in this field may (jointly) prepare such codes; they become the “code owners.” Together these actors can develop and agree upon technical standards, logistical protocols and joint interpretations of the GDPR that demonstrate how both the platform and the researcher comply with specific parts or principles of the GDPR. In specifying the application of the GDPR provisions to a processing activity or a sector, the code should provide sufficient added value by using, for example, sector-specific terminology and offering use cases and best practices.
The European Data Protection Supervisor (EDPS) encouraged the adoption of a code of conduct for facilitating access by researchers. The EDPS specifically added that such a code can include “the provision by private companies, particularly tech platforms, of data to independent researchers for specific projects, such as examining online manipulation and the dissemination of misinformation.” Similarly, the European Commission has been keen to support “an effective data disclosure for research on disinformation ... by developing a framework in line with applicable regulatory requirements and based on the involvement of all relevant stakeholders (and independent from political influence).”
Importantly, the drafters of this code need to submit a draft to the relevant national data protection authority, which needs to approve it, provided that it complies with the GDPR. For transnational codes, which cover data processing activities in the territory of multiple EU member states, the competent supervisory authority must subsequently refer the code to the European Data Protection Board (EDPB) for a decision, and from there, it proceeds to the European Commission, which can decide that the approved code of conduct has general validity within the EU. The code of conduct then ends up as an implementing act in EU law.
Monitoring adherence to the code: Limited mandates vs. broader mandates
One of the most challenging aspects of establishing a code in this context is that Article 41 of the GDPR requires the establishment of a monitoring body, which scrutinizes the compliance of the platforms and research associations with the code and which carries out reviews of the code’s operations. The EDPB specifies that this body needs to have at its disposal effective oversight mechanisms, including “random and unannounced audits, annual inspections,” and “reporting requirements, clear and transparent complaint handling and dispute resolution procedures, concrete sanctions and remedies in cases of violations of the code, as well as policies for reporting breaches of its provisions.” Importantly, the monitoring body can sanction those members who break the rules of the code.
Such a body can have a limited function, and only act as a monitoring body, but in theory, it could also be part of a larger organization with a broader mandate. Such “trusted third parties” could provide a secure environment in which companies can make personal data available to trusted (or vetted) third parties for further processing for research purposes. These don’t necessarily need to be limited to academic researchers. The Belgian implementation of the GDPR allows recourse to such a trusted third party if it is independent from both the initial and the subsequent controllers. Similar mechanisms also exist in the U.K. and in France, where the Centre for Secure Access to Data focuses on organizing and implementing secure access services for confidential data for nonprofit research, study, evaluation, or innovation.
Jef Ausloos and others correctly identify the strengths of such an independent institution, which can act as a bridge between those holding the data and those wishing to get access to it. Such a trusted third party can act as a “neutral arbiter in deciding on requests for confidentiality from the disclosing party and in periodically auditing disclosing parties to verify the accuracy of disclosures.” It can “maintain relevant access infrastructure” and “verify and pre-process corporate data in order to ensure it is suitable for disclosure.” Ultimately, it could even play a role in “ensuring otherwise unavailable or uninterpretable data to be made accessible” because “the fact that data is not readily available and/or produced by the respective platforms, should not be a reason to discard including that data into the access regime.” In the platform context, Ausloos and others suggest that a “more centralized EU-level institution” could be advisable given “the political-economic power and multinational dimensions of platform operators.”
Topics that could be clarified in a code of conduct on access to platform data
Currently, platforms are reluctant to hand over personal data, let alone special category data, to independent researchers because they perceive a number of uncertainties related to the research exemptions of the GDPR in particular. In the following paragraphs, this essay will attempt to highlight five topics which could be clarified in a code of conduct.
Provide clarity on which actors can qualify as researchers
One obstacle that some platforms have brought up in discussions is that they should not be placed in a position to define whether a specific activity qualifies as “research” under the GDPR. The GDPR does not explicitly define research but simply states that it should be interpreted “in a broad manner,” including, for example, “technological development and demonstration, fundamental research, applied research and privately funded research.” The EU has always used a broad interpretation of the notion of research, illustrated by Article 179 of the Treaty on the Functioning of the European Union that specifies that “the Union shall have the objective of strengthening its scientific and technological bases by achieving a European research area in which researchers, scientific knowledge and technology circulate freely.”
Both the EDPS and the EDPB have clarified that the application of the special data protection regime for scientific research applies to research where “relevant sectoral standards of methodology and ethics apply,” or where a research project is “set up in accordance with relevant sector-related methodological and ethical standards, in conformity with good practice.” The code could lay out what those specific methodological and ethical standards are, and clarify whether affiliation with an academic institution is a prerequisite to meet those standards. This is not only relevant from a GDPR perspective but from a broader equity perspective. The current voluntary mechanisms to share data tend to privilege established actors working in this space, which creates asymmetries among institutions and researchers that are likely to only increase over time.
Provide clarity on safeguards to receive platform data …
Under Article 5(1)(b) of the GDPR, researchers can receive data that were initially collected by the platform for different purposes, in order to subsequently use these for research. However, to benefit from this relaxed rule the receiving researcher needs to implement “appropriate safeguards” to protect the rights and freedoms of the data subject. According to the EDPS, this doesn’t provide “a general authorization” to further process data in all cases for historical, statistical or scientific purposes. Instead, “each case must be considered on its own merits and circumstances.” Some platforms have argued that they don’t want to be the arbiter that decides which types of technical, organizational, procedural, or contractual safeguards should be implemented in a research project before data can be shared. This is a legitimate concern. For example, the EDPB has specified that in scientific research, data minimization can be achieved “through the requirement of specifying the research questions and assessing the type and amount of data necessary to properly answer these research questions.” Platforms shouldn’t be making that assessment as it could have an impact on the independence of the research.
… And send platform data
Anonymization of personal data is encouraged by the GDPR as a way to mitigate risks for the data subject before data is shared. Some platforms have argued that the guidance from the Article 29 Data Protection Working Party on anonymization techniques contains contradictions and is confusing. On the one hand, the document sets a very high bar to achieve anonymity; it states that data are seen as anonymous data if the process to strip the data of identifiable elements is “irreversible.” At the same time, however, it also allows for a lower bar, suggesting that an anonymization process “is sufficiently robust” if “identification has become reasonably impossible.” A code of conduct could clarify how these two standards could be seen as mere examples of how anonymization can be achieved, or can highlight which anonymization techniques are best suited for specific research purposes.
Clarify transparency requirements for data subjects
The GDPR requires platforms to provide their users with relevant information about the purposes of the processing of their data and the recipients of any personal data collected. Despite an explicit GDPR exception to that principle when further processing is for research purposes, some platforms are uncertain that these provisions apply to them. A code of conduct could clarify the extent, timing, and scope of the information about the research protocol that needs to be given to both the data subjects and the broader public. Moreover, the code could cover the extent to which data subjects can exercise their rights in the context of the potential limitations of those rights.
Clarify the application of data subject rights
One of the most challenging aspects of the GDPR in general is that it leaves considerable discretion to the EU member states in implementing the law. Different member states have different rules regarding how special-category data and personal data are allowed to be processed for research purposes. The research exceptions that exclude the application of some of the data subject rights, such as rights to access or rectification, can also vary from member state to member state. In practice, this means that personal data made available by a platform for research purposes could be subject to different regimes for individuals to exercise their rights, depending on which member state law applies. The code could provide guidance or best practices on this topic as well.
The EU is a trailblazer in proposing a legally binding obligation for platforms to provide access to platform data to independent researchers as part of its ongoing legislative efforts to establish additional responsibilities for online platforms. Such an obligation is necessary to understand the role and impact of platforms in our societies, which in turn can enable evidence-based policies to address perceived systemic risks to our societies. The need for greater access to data is illustrated in this essay by focusing on one specific key policy discussion: the role of social media platforms in enabling, facilitating, and amplifying disinformation.
However, this framework for compelling platforms to make data available does not solve the question of how researchers could get access outside the specific framework of the DSA, and more guidance is needed to allow the sharing of data in a privacy-proof manner. The adoption of a code of conduct on access to platform data in the EU could provide more legal clarity in this area, as it would facilitate the interpretation of the GDPR in this specific context for both researchers and platforms. It has the additional benefit that its work can have a potential impact for researchers outside the EU, as it can provide a source of inspiration to the broader field of disinformation researchers. Finally, the European Commission’s approach—building data-access requirements into broader platform transparency measures—holds significant potential in the U.S. as well. This strategy is flexible enough to allow data access provisions to be incorporated into almost any federal legislative package related to digital platforms—from data privacy to election integrity, cybersecurity to online advertising legislation. While the EU code of conduct is based on an article within the GDPR, which is without parallel in the U.S., the U.S. has a long-standing tradition of incorporating independent codes of conduct into federal and state regulatory frameworks.
The author is grateful to Alex Abdo, Jef Ausloos, Cristina Blasi Casagran, Jameel Jaffer, Julian Jaursch, Amy Kapczynski, Laureline Lemoine, Paddy Leerssen, Claudia Prettner, and Rebekah Tromble for helpful guidance, feedback, and conversations on this topic and early drafts. Special thanks to Ravi Naik and Daphne Keller for providing detailed comments on earlier drafts. All errors remain my own. This research was financially supported by the Mozilla Foundation and Reset.
© 2021, Mathias Vermeulen.
Mathias Vermeulen is the public policy director at AWO, a new data rights agency, and an affiliated researcher at the Centre for Law, Science, Technology and Society at the Vrije Universiteit Brussel.