
Ethics in Data Mining

Written by Said Zazai

Introduction

IT takes many forms: laptops, smartphones, the Internet, cloud gaming, mobile applications… The daily life of individuals thus increasingly depends on technological artefacts. What many individuals do not know is that such artefacts continuously record, communicate, synthesize and organize any data judged useful about them (Payne & Landry, 2004). Every action you take generates a trail of data that is collected and stored through the use of IT. Financial transactions, visited internet pages, GPS positions… there are many examples of daily generated data. As this gathering process becomes systematic and omnipresent, moral questions arise: Who owns the (or my) data? Who controls it? Where is it being stored? How is it being analyzed? For which purpose is it going to be used? To what extent are the data and their analysis accurate? What power does the customer have left over his own data? Will private data become public? Such questions about the aggregation, access and control of data are at the heart of the moral challenges surrounding the use of information technology.

In the recent past, big data has grown exponentially in structured and unstructured formats. Organizations of all sizes and industries have realized the importance of creating, collecting, recording, storing and extracting meaning from data in all shapes and formats. Big data is a new phenomenon, only about a decade old, but the variety, volume and velocity characteristics of this massive data (Davis & Patterson, 2012) have pushed the major stakeholders of data to invent and innovate ways to make sense of what has been collected and continues to grow. Data scientists estimate that almost 90% of the data generated and collected so far was produced in the last two years alone (Conway, 2012). But the characteristics, impact and growth of big data are not the issues I want to discuss in this paper. The issues I want to focus on arise in the process of data mining and in the type of data that is being mined. The majority of big data comprises transactional data (Davis & Patterson, 2012), which is data about entities (people, products, services etc.), and that is when it becomes important to monitor how data is being mined, why it is being mined, and what conclusions and predictions are drawn from it.

Definitions

Ethics is often confused with law, since both appeal to concepts of “right and wrong”. Yet there is a crucial distinction between the two notions: whereas the law defines what you have the right to do, ethics defines what is right to do. Ethics not only underpins the law, but supplements it.

Data mining is an analytic process of identifying hidden patterns and systematic relationships within data. Its ultimate goal is prediction, and predictive data mining is the most common type and the one with the most direct business applications.

The data mining process involves business understanding, data understanding, data preparation, model building, testing and evaluation, deployment and, finally, business decision making (Stylianou, Winter, Niu, Giacalone, & Campbell, 2012). For simplicity, these steps can be grouped into three stages: Collect, Manipulate and Present.
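
As a rough illustration of these stages, here is a minimal sketch that trains a simple predictive model on synthetic customer data. The dataset, the features and the use of scikit-learn are assumptions made for the example, not a prescription from the process models cited above.

```python
# A minimal sketch of the Collect -> Manipulate -> Present stages of
# predictive data mining. Everything here (synthetic data, features,
# logistic regression) is an illustrative assumption.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Collect: in practice this would come from transaction logs; here we
# synthesize 1,000 customers with two behavioral attributes.
X = rng.normal(size=(1000, 2))  # e.g. monthly spend, visit frequency
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Manipulate: prepare the data and build a predictive model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Present: evaluate and report, the basis for a business decision.
print(f"test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```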

The Legislation

From a legal perspective, most countries have not formalized legislation prescribing regulations for the collection and use of individuals’ personal data. In the US, Obama’s Consumer Privacy Bill of Rights and the Federal Trade Commission (FTC) provide protection for consumers’ data and call for “privacy by design”, “simplified choice” and “greater transparency”. The European Parliament, the Council and the European Commission have drafted a regulation by the name of General Data Protection Regulation (GDPR), which protects individuals’ data within the European Union (EU). It also addresses the export of personal data outside of the EU. The regulation was adopted on April 27, 2016 and will enter into application on May 25, 2018.

The Analysis

Data mining is not a technology; it is a set of processes performed to achieve business objectives (Turban, Sharda, Delen, & King, 2010). Unfortunately, no state has placed any laws on data mining processes themselves. These processes have not been defined as legal or illegal, nor can they be classified entirely as either, but their outcomes may cross boundaries on issues such as individual privacy, security and discrimination, at the very least. Individual laws do exist to address those issues, but some outcomes of data mining are harmful without being declared illegal, for example social profiling, stereotyping, information diffusion and de-individualization. I will discuss each of the major ethical issues of data mining in greater detail and examine the business and social implications each might have.


Exhibit 1: The three stages of the data mining process

Social profiling: The process of customer profiling lies at the heart of data mining. In order to better understand their customers’ behaviors, businesses use the collected data to draw social profiles of customers. This enables them to better understand which product or service attracts which segment of customers, and thus to design a targeted marketing offer. Data mining uses data for classification, which then leads to prediction. A customer profile is based on two types of information: factual data (who the customer is) and transactional data (what the customer does). Creating customer profiles from organized data has unleashed several ethical dilemmas, which emerge at different stages of the process.
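
To make the two information types concrete, here is a minimal sketch of what a customer profile record might look like; all field names are hypothetical, not an industry schema.

```python
# A sketch of a customer profile combining factual data (who the customer
# is) with transactional data (what the customer does). Field names are
# illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class CustomerProfile:
    # Factual data
    customer_id: str
    age_band: str                # e.g. "25-34"
    region: str
    # Transactional data
    purchases: list = field(default_factory=list)  # product categories bought
    visits_per_month: int = 0

profile = CustomerProfile("c-001", "25-34", "QC", ["books", "groceries"], 12)
print(profile)
```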

Privacy, Consent & Ownership: Privacy, consent and ownership are three interrelated concepts when it comes to users’ data. Privacy is generally defined as “the state of being free from public attention” (Oxford, 2014). There is thus a distinction between the private and the public sphere. The border between the two spheres is considered violated when an individual’s ownership of his or her data is breached and the data publicly exposed without consent.

In data mining, the kind of ownership that is considered violated is ownership of the customer’s private information. Whether used by private companies or by public institutions, the information of individuals is often gathered and utilized in ways to which they did not explicitly consent (Landry & Payne, 2012). As a matter of fact, most of the time the data are not directly supplied by the customer.

In the private sector, for example, where companies use personally identifiable information (PII) to draw purchasing profiles, the customer is rarely aware of the entire process of collection and transfer of his own personal data. With the revolution of technology, it is now possible to access private information such as a person’s interests or purchasing preferences. Each time you visit an internet page, for example, a cookie is created on your computer in order to keep track of your website preferences. The cookie is linked to an identification number, which is then stored in companies’ databases. At no point during the process is the customer aware that some of his private information is being used for another entity’s purposes (Atterer, Wnuk, & Schmidt, 2006). Data gathering by companies is thus fundamentally problematic in terms of privacy. You have no control over when, how, where and by whom your own information is collected. But as we have seen, collection is only the first step of the data mining profiling process. The use of this data, the second step, also represents a large privacy issue. For what purpose will my private information be used? How can I control its use? How can I be assured that it will not be publicly disclosed? Companies often argue that the data are gathered in a public sphere (the Internet, according to them), and that they will only be used internally (to assign the customer to a specific purchasing segment). But even in that case, ownership and consent are fundamentally violated, and this represents a huge ethical issue. It may be legal, but it seems very unethical: a perfect illustration of the split between law and ethics.
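
The sketch below shows how a server might assign such a tracking cookie on a first visit. The use of Flask and the cookie name tracking_id are assumptions made for the example, not a description of any particular company’s system.

```python
# A minimal sketch of cookie-based tracking: on the first visit the server
# assigns an identification number that the browser sends back on every
# later request, letting page views be linked into a browsing profile.
# Flask and the cookie name are illustrative assumptions.
import uuid
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/")
def index():
    tracking_id = request.cookies.get("tracking_id")
    resp = make_response("page content")
    if tracking_id is None:
        resp.set_cookie("tracking_id", str(uuid.uuid4()),
                        max_age=365 * 24 * 3600)  # persists for a year
    return resp

if __name__ == "__main__":
    app.run()
```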

The privacy issue is not limited to the use of data mining by businesses attempting to draw consumption profiles. Many examples prove that ethical dilemmas related to data mining have also appeared in the public sector. In the United States, following the 9/11 attacks, Congress passed the Patriot Act (Jonas & Harper, 2006), giving the government the right to conduct broad and intensive surveillance of individuals within the US and abroad, in an attempt to prevent terrorist plots. This included the use of metadata collection, with huge implications for privacy rights. In 2013, Edward Snowden, a contractor for the National Security Agency (NSA), leaked documents to the Guardian newspaper revealing the institution’s domestic surveillance practices, which involved spying on millions of innocent American citizens (Guardian, 2013). Following this, in December 2013, plaintiffs accused the NSA of making unconstitutional use of data mining by collecting domestic phone records. Personal data on virtually every citizen were systematically recorded and collected to detect any potential threat of an imminent attack (Greenwald, MacAskill, & Poitras, 2013). This was considered contrary to the fundamental right to privacy in the American Constitution. The case is still being examined.

Discrimination: Discrimination is the “unjust or prejudicial treatment of different categories of people, especially on the grounds of race, age, or sex” (Oxford, 2014). In most developed countries, it is illegal and punished by law. In Europe, for example, numerous areas such as employment, access to housing, health care and adoption are regulated, and denying an individual access to one of them on criteria judged discriminatory is punished by law. Such criteria include race, ethnicity, religion, nationality, gender, sexuality, disability, marital status, genetic features, language and age (B. Custers et al., 2013).

In data mining, discrimination arises at the second stage of profiling: the use of the collected information to make decisions. As we have seen, large sets of data are used to reveal hidden patterns and draw profiles of a population, whether for private companies or for public sector purposes. However, these patterns and profiles can lead to misguided and discriminatory decisions.

There are two types of discrimination: direct and indirect (Hajian & Domingo-Ferrer, 2013). Direct discrimination takes place when two individuals are treated differently based on one of the protected criteria defined above (gender, religion and so on). Direct discrimination rarely occurs in data mining, because regulations are very strict about it; moreover, since its goal is to target specific purchasing behavior, a company gains nothing from setting aside a portion of its customers because of their nationality, for example. Indirect discrimination occurs when a final decision unjustly treats a portion of a population even though it was not based on the sensitive attributes denounced in direct discrimination. An example of indirect discrimination would be financial institutions refusing mortgages or insurance to people from run-down areas. These institutions use data mining to draw a profile of urban areas: a certain zip code refers to a category of area, and based on this information, mortgages and insurance can be refused. Here, starting from apparently neutral information (a zip code), the procedure leads to a final decision that obviously raises ethical concerns about discrimination.
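
The following sketch, on purely synthetic data, illustrates how this can happen: the decision rule never sees the protected attribute, yet a correlated proxy (a stand-in for the zip code) reproduces the bias.

```python
# Indirect discrimination through a proxy variable, on synthetic data:
# the decision rule only uses the "neutral" zip-code area, but because
# the area correlates with group membership, outcomes differ by group.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

group = rng.integers(0, 2, size=n)  # protected attribute, never used below
zip_area = np.where(rng.random(n) < 0.9, group, 1 - group)  # 90% correlated proxy
approved = zip_area == 0            # decision based on the zip area only

for g in (0, 1):
    print(f"approval rate for group {g}: {approved[group == g].mean():.2f}")
# The approval rates differ sharply even though `group` never enters
# the decision rule.
```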

In the public sector, discriminatory decisions resulting from data mining technology are also emerging. Take the example of an educational institution that targets its admission program advertisement toward applicants with the most likely successful profile. Suppose the top profile is Asian males with a certain level of income who finished in the top 10% of their high school class, and the institution decides to target this specific segment. But what about diversity? The whole process is based on a certain notion of success, which could evolve over time, and it is discriminatory toward other students who could have succeeded at this institution.

These two examples underline the fact that drawing profiles to serve as a basis for decisions is deeply dependent on subjective notions and assumptions. Such assumptions could turn out to be mistaken and lead to indirect discrimination. What if I have chosen to live in a run-down area for specific reasons, while being perfectly able to pay the insurance premiums? Data mining can lead to misguided generalizations and thus discriminate against certain people on the basis of inaccurate interpretations of data. In the same way that privacy is deeply related to consent and ownership, discrimination is linked to the accuracy of the profile interpretations derived from data analyses.

De-individualization: De-individualization is the consequence of profiling decisions being based on group characteristics. You are not judged on your individual profile, but “categorized” as belonging to a certain cluster identified through data analysis (B. H. M. Custers, 2010). Your value is considered not according to your individuality, but according to the group you belong to.

De-individualization can also lead to stigmatization and stereotyping when information about your group membership is publicly disclosed. Let us return to the example of financial institutions using zip codes to decide whether to grant a loan. If people learn that individuals from a particular area are always rejected by those institutions, they automatically conclude that the area is a run-down neighbourhood.

Information asymmetry: Information asymmetry is the consequence of the power that control over data gives to private and public institutions. When it comes to data, the individual is powerless, especially when the information was gathered without him even being aware of the process. Companies can exploit this unbalanced knowledge by developing common strategies with other businesses to take advantage of the customer; unfair practices such as discriminatory pricing are often denounced. Governments can also gain from asymmetries by spying on potential future political competitors, who do not possess the capacity to use IT to their advantage.

Inaccuracy: Inaccuracy as an issue does not concern the truthfulness of the data; rather, it covers a set of erroneous processes and predictions. The data mining process involves business understanding, data understanding, data preparation, model building, testing and evaluation, deployment and, finally, business decision making (Stylianou, Winter, Niu, Giacalone, & Campbell, 2012). Every step is prone to error, especially business and data understanding, model building and business decision making. The extent of the inaccuracy is reflected in the false positive and false negative rates, which measure how badly the prediction was misguided. An ethical dilemma arises when predictions are used for decision making even though they are obviously inaccurate.
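
As a small worked example of these two rates, the sketch below computes them from a confusion matrix on made-up labels; the numbers are purely illustrative.

```python
# False positive and false negative rates from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # ground truth (made up)
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0])  # predictions (made up)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)  # e.g. innocent individuals wrongly flagged
fnr = fn / (fn + tp)  # e.g. actual cases the model misses
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```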

Trust between customer and business: The relationship between firms and individuals in data collection and data mining is very strong, and any misbehaviour can trigger unexpected and unpleasant reactions on both sides. One such reaction from individuals is the diffusion and deviation of information. This concept has been heavily discussed in the context of social media networks and their effects, but it applies to data collection and data mining as well, not only because data mining uses social media data, but also because people are starting to realize the gap between the protection of individuals’ data and the data mining practices revealed in the recent past (Bakshy, Rosenn, Marlow, & Adamic, 2012).

Solutions

The purpose of this literature research is to identify unethical practices and implications of data mining, and also to propose recommendations that will not only improve bilateral trust between individuals and businesses, but also reduce the margin of error in the data mining process, help businesses make better decisions and reduce the risk of a bad brand image.

These recommendations not only address the issues discussed in the previous section but also highlight some good practices that could be incorporated to improve the data mining process.

Anonymity means allowing users or customers to be anonymous whenever possible. Data anonymity can be introduced at the stage of data collection or at the stage of prediction and analysis disclosure (Spinello, 1998). At the collection stage, the required data attributes should be identified, and only those attributes about an individual should be collected or used in the predictive analysis. For example, in predicting child mortality rates in different regions of a country, the parents’ names, social insurance numbers and other personally identifiable information should not be collected, or should not be disclosed once the analysis has been performed, as they are not relevant to this type of research. There are, however, situations where personally identifiable information must be included in the analysis: predicting the causes of, and trends in, the kidnappings of aboriginal women in different regions of Canada, for example, would require all of their personally identifiable information in order to analyze the data properly and make better decisions. That said, anonymizing individuals’ data is not a fully effective way of addressing privacy, security and profiling issues, because two individually unidentifiable attributes in a dataset can be combined to identify an individual; still, it can certainly provide some protection for users’ data. Online businesses should give shoppers the option of shopping anonymously if they so choose.
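
A minimal sketch of anonymization at collection time, assuming a pandas-style table with hypothetical column names: direct identifiers are dropped outright, and quasi-identifiers are generalized so they are harder to combine for re-identification.

```python
# Drop direct identifiers and generalize quasi-identifiers before analysis.
# Column names and the generalization rules are illustrative assumptions.
import pandas as pd

records = pd.DataFrame({
    "name":        ["Alice", "Bob", "Carol"],
    "sin":         ["111-222-333", "444-555-666", "777-888-999"],
    "postal_code": ["H3A 1B2", "H3A 2C4", "M5V 3L9"],
    "age":         [34, 37, 29],
    "outcome":     [1, 0, 1],
})

anon = records.drop(columns=["name", "sin"])       # remove direct identifiers
anon["postal_code"] = anon["postal_code"].str[:3]  # generalize to area prefix
anon["age"] = (anon["age"] // 10 * 10).astype(str) + "s"  # bucket ages
print(anon)
```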

Disclosure is the process of developing a privacy policy that explains to users what will be done when data about them is collected: which actions will and will not be performed, whether the collected data will be sold to third parties, whether data about users will be bought from third parties, and so on. Davis and Patterson, in their book Ethics of Big Data, summarize the policies of 50 Fortune 500 companies and find that not all firms clearly and explicitly explain their actions on user data: 10 policies did not state whether users’ data would or would not be shared with third parties, and 11 policies stated that personal information would be bought or obtained from third parties. These findings reveal the gap between users’ data and firms’ data handling processes (Davis & Patterson, 2012). For ethical data mining, therefore, it is very important that users are aware that their data is being collected, what the firm will do with it, and what the firm will not do with it. Only after obtaining users’ consent and a clear understanding should firms consider analyzing the data. Although current practice suggests that most users are willing to give away their personal information, when personally identifiable information is involved, firms should clearly and immediately inform users of how the data will be used, should use simple language for consent, and must not steer users toward a desired outcome.

Choice is associated with the freedom users have to take part in a firm’s data collection process or not, and with the control users have over their data. The current market practice is the opt-out model, a default configuration that requires an additional action from users to unsubscribe or de-register themselves (Spinello, 1998). Users are not given the choice of subscribing in the first place, but rather the choice of staying subscribed or unsubscribing, which is in essence a second-degree choice. This model can frustrate users and impair the effectiveness and usefulness of the user–firm interaction, ultimately affecting the data mining process negatively.
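
The difference between the two models comes down to the default value of a consent flag, as this minimal sketch illustrates; the class and field names are hypothetical.

```python
# Opt-out vs. opt-in consent defaults. A user who never touches the
# setting is tracked under opt-out but not under opt-in.
from dataclasses import dataclass

@dataclass
class OptOutConsent:
    collect_data: bool = True    # current market practice: on by default

@dataclass
class OptInConsent:
    collect_data: bool = False   # nothing collected until the user agrees

print(OptOutConsent().collect_data, OptInConsent().collect_data)  # True False
```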

Another issue that revolves around privacy policies and user consent is the length of time for which a firm may have access to user data. Current practices do not mention any time limit on the use of the data, allowing firms to use users’ data indefinitely. A data retention limit should be stated in privacy policies, allowing firms to process data for the necessary period, and automatic processes should be in place to discard the data once that period has passed.
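
A minimal sketch of such an automatic discard process, assuming a 365-day retention period and a hypothetical record layout:

```python
# Automatically discard records older than the retention period stated
# in the privacy policy. The period and record layout are assumptions.
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)
now = datetime.now(timezone.utc)

records = [
    {"user": "c-001", "collected_at": now - timedelta(days=400)},  # expired
    {"user": "c-002", "collected_at": now - timedelta(days=30)},   # still valid
]

def purge_expired(records, now):
    """Keep only records still inside the retention window."""
    return [r for r in records if now - r["collected_at"] <= RETENTION]

records = purge_expired(records, now)
print([r["user"] for r in records])  # only "c-002" survives
```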

Ensure higher accuracy in data mining: The industry standard breaks the data mining process into six phases: business understanding, data understanding, data preparation, model building, evaluation, and deployment (Turban et al., 2010). Each step is susceptible to error. It is not a simple task to understand the business requirements and the data stored in various formats and shapes. Separating and cleansing good data from bad is an error-prone process, followed by computer algorithms processing and analyzing the data, which carry a margin of error depending on the type and size of the dataset. Predictors and variables can also be selected incorrectly. Most importantly, business decision making, which comes after the six phases, is also prone to misjudgment. Accuracy is a salient feature of data mining: it not only affects customers’ behaviour toward firms and their control over their data, but can also have major financial consequences. Target’s recent decision to identify pregnant customers and serve them targeted advertisements backfired and resulted in a bad brand image. Higher accuracy should be ensured so that users are not denied service because of inaccurate predictions.

Ethical data mining revolves around bilateral trust between customer and business. All of the recommendations above will improve customers’ trust in firms, and improved bilateral trust will in turn reduce information diffusion and improve accuracy. One step that could improve users’ trust is for firms to develop privacy policies and standards for the use of customer data. Firms should also perform internal audits to identify unethical practices in the use of customer data. This will not only improve the trust relationship but will also prevent a bad brand image and expensive lawsuits arising from public outrage and legal action against the firm.

Although most countries do not have a legal framework to regulate unethical practices around data and data mining, legality has its limitations as well: it cannot on its own guarantee ethical data mining practices or prevent unethical ones from occurring. Laws are restricted to geographic boundaries; US laws cannot be enforced on data stored on servers in South America or Asia. There is therefore a need for a global or regional data mining governing body or association that provides standards and frameworks for firms to follow. The ISO/IEC 27001:2005 and ITIL 2011 standards are examples of globally accepted and practiced standards for information security management systems and IT service management, respectively. Such a body would not only define a set of standards to follow but would also raise public awareness about users’ data, its ownership and its use.

Conclusion

While the benefits of data mining and the concerns of unethical data mining affect both customers and firms, it would be unfair if standardized or legalized processes were not in place to protect the interests of both. The purpose of proposing ethical considerations and enforcing them is not just for the sake of ethical discussion, but to actually evaluate the financial and social impact of data mining on customers and firms. The recommendations proposed in this paper could not only mitigate the otherwise unavoidable concerns of unethical data mining in the absence of a legal framework, but could also be incorporated into the proposed data mining governing body as standards and good practices for firms to follow when dealing with customers’ data. Above all, the notion is to treat human data as human, and to respect it as such.

References

Atterer, R., Wnuk, M., & Schmidt, A. (2006). Knowing the User’s Every Move: User Activity Tracking for Website Usability Evaluation and Implicit Interaction. In Proceedings of the 15th International Conference on World Wide Web (pp. 203–212). New York, NY, USA: ACM. doi:10.1145/1135777.1135811

Bakshy, E., Rosenn, I., Marlow, C., & Adamic, L. (2012). The role of social networks in information diffusion. In Proceedings of the 21st international conference on World Wide Web (pp. 519–528).

Conway, R. (2012). Where angels will tread. The Economist. Retrieved from http://www.economist.com/node/21537967

Custers, B., Calders, T., Schermer, B., & Zarsky, T. (2013). Discrimination and Privacy in the Information Society: Data Mining and Profiling in Large Databases.

Custers, B. H. M. (2010). Data Mining with Discrimination Sensitive and Privacy Sensitive Attributes. Proceedings of ISP, 12–14.

Davis, K., & Patterson, D. (2012). Ethics of Big Data. (J. Steele & C. Nash, Eds.). Sebastopol, CA: O’Reilly.

Friedman, W. H. (2005). Privacy: Danger and Protection. In Information Security and Ethics: Tools and Applications.

Govtrack.us. (2011). Do-Not-Track Online Act of 2011. Govtrack.us. Retrieved from https://www.govtrack.us/congress/bills/112/s913

Greenwald, G., MacAskill, E., & Poitras, L. (2013). Edward Snowden: the whistleblower behind the NSA surveillance revelations. The Guardian, 9.

The Guardian. (2013). NSA collecting phone records of millions of Verizon customers daily. Retrieved from http://www.theguardian.com/world/2013/jun/06/nsa-phone-records-verizon-court-order

Hajian, S., & Domingo-Ferrer, J. (2013). A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, 25(7), 1445–1459.

Jonas, J., & Harper, J. (2006). Effective counterterrorism and the limited role of predictive data mining. Cato Institute.

Oxford. (2014). Oxford Dictionaries. Oxford University Press. Retrieved from http://www.oxforddictionaries.com

Payne, D., & Landry, B. J. L. (2004). A Composite Strategy for the Legal and Ethical Use of Data Mining. International Journal of Management, Knowledge and Learning, 1(1), 27–43.

Peters, R. S. (1970). Ethics and Education (Vol. 18). London: Allen and Unwin.

Spinello, R. A. (1998). Privacy rights in the information economy. Business Ethics Quarterly, 8(4), 723–742.

Stylianou, A. C., Winter, S., Niu, Y., Giacalone, R. A., & Campbell, M. (2012). Understanding the Behavioral Intention to Report Unethical Information Technology Practices: The Role of Machiavellianism, Gender, and Computer Expertise. Journal of Business Ethics, 117(2), 333–343. doi:10.1007/s10551-012-1521-1

Turban, E., Sharda, R., Delen, D., & King, D. (2010). Business Intelligence: A Managerial Approach (2nd ed.). (E. Svendsen, Ed.). New Jersey: Prentice Hall.

Federal Trade Commission. (2010). Protecting Consumer Privacy in an Era of Rapid Change: Recommendations for Businesses and Policymakers. Retrieved from http://www.ftc.gov/sites/default/files/documents
