Efficient Trajectory Data Privacy Protection Scheme Based on Laplace's Differential Privacy



Introduction

Background
With the rapid development of computers and networks, data mining and analysis play an increasingly important role in social life. Huge amounts of data (big data) enable many application services that draw on trajectory (location) data, health and food data, traffic safety data, and so on. Trajectory data is position information that is large in scale, changes quickly and is widely accepted; it mainly comes from vehicular networks, mobile devices, social networks and similar sources. Many applications of trajectory data already facilitate people's daily lives, so trajectory data services are regarded as a new kind of mobile computing service. The key to developing such services is being able to learn from and understand position information [1].

However, trajectory data is mainly collected and disseminated by mobile equipment, and many mobile devices and mobile communication technologies integrate geographical data with individual information. This individual information may include private data such as personal health status, social status and behavior habits, so mining and publishing trajectory data may divulge sensitive individual information and disturb people's normal lives [2,3,4]. The key problem of trajectory data privacy protection is therefore how to protect sensitive trajectory data while still providing trajectory information services based on data mining. For example, if mined data is published fully openly without being processed and protected, it may reveal a user's privacy and affect the user's normal life. Mining and using trajectory data is thus a double-edged sword, and we must find a compromise between service and protection.

However, many existing privacy protection schemes cannot balance utility and protection. For example, the generalization method [5] cannot protect data effectively, and the anonymous grouping method [6] is not efficient enough. Furthermore, because the records of trajectory data are discrete in the database, some existing privacy protection schemes have difficulty protecting trajectory data. Therefore, in this paper we focus on finding an efficient privacy protection scheme for trajectory data.

Our contributions
In this paper, we propose a trajectory data privacy protection scheme based on Laplace's differential privacy mechanism. In the proposed scheme, the algorithm first selects the protected points from the user's trajectory data. Second, it builds polygons from the protected points and the adjacent, frequently accessed points selected from the accessed point database, and calculates the polygon centroids. Finally, it adds noise to the polygon centroids using Laplace's differential privacy method, replaces the protected points with the new polygon centroids, and then constructs and issues the new trajectory data. The experiments show that the proposed algorithms run fast, that the privacy protection of the scheme is effective, and that the data usability of the scheme is high.

Outline
The rest of this paper is organized as follows. In Section 2, we discuss related work on trajectory data privacy protection. In Section 3, we review the definitions and theorems that we employ. In Section 4, we propose an efficient trajectory data privacy protection scheme based on Laplace's differential privacy mechanism. In Section 5, we analyze the efficiency of the proposed scheme through experiments. Finally, we draw our conclusions in Section 6.

Related work
Many privacy protection schemes are currently used in fields such as secure communication, social networks and data mining. The works [5,6] first proposed the k-anonymity model to protect social networks; its anonymity protection methods mainly include generalization [7,8], compression, decomposition [9], replacement [10] and interference. Based on the works [5,6], many other k-anonymous protection methods [11,12,13,14,15,16,17,18,19,20,21] were also proposed. However, the works [20,21,22] proved that some anonymous protection methods cannot protect sensitive data very well. Additionally, Cristofaro et al. [23] proposed a privacy-encrypted protection scheme. Although their scheme can ensure data security, it decreases data utility.

Current location data privacy protection methods [1,24] fall mainly into three categories: heuristic privacy-measure methods, probability-based privacy inference methods, and private information retrieval methods. The heuristic privacy-measure methods [25,26,27,28], such as k-anonymity [25], t-closeness [26], m-invariance [27] and l-diversity [28], mainly provide a privacy protection measure for users without high requirements. The private information retrieval methods can achieve perfect privacy protection, but because released data always contains more or less private information, these methods may result in no data being released at all, and they have high overhead. The probability-based privacy inference methods can protect data and achieve better data utility under certain conditions, but their effectiveness depends on the availability of the original data. Further, the three kinds of methods are based on a unified attack model [1], which depends on certain background knowledge to protect location data. As attackers obtain more background knowledge, these methods cannot always protect location data effectively. The works [5,6,11,12,13,14,15,16,17,18,19] showed the shortcomings of the relationship-privacy protection methods. Ting et al. [29] analyzed a variety of privacy threat models and tried to optimize the utility of the obtained data while preventing different types of reasoning attacks. Bugra et al. [30] proposed the first effective location-privacy preserving mechanism (LPPM) that enables a designer to find the optimal LPPM for an LBS (location-based service), given the user's service quality constraints, against an adversary implementing the optimal inference algorithm. Such an LPPM maximizes the expected distortion (error) that the optimal adversary incurs in reconstructing the actual location of a user, while fulfilling the user's service-quality requirement.

The key to protecting location data is therefore a privacy protection method that is not sensitive to background knowledge. Differential privacy satisfies exactly this requirement: it is a strong privacy protection method that is not sensitive to background knowledge. However, because location data is sparse and heterogeneous, many differential privacy protection methods are not efficient enough. He et al. [31] proposed a synthetic system based on GPS paths, which provides a strong differential privacy protection mechanism. The system uses a hierarchical reference method to separate the original trajectories into trajectories of different speeds, and then protects these speed trajectories. Chatzikokolakis et al. [32] proposed a predictive differentially-private mechanism for location privacy, which offers substantial improvements over independently applied noise. Their work showed that correlations in the trace can be exploited by a prediction function that tries to guess the new location from the previously reported locations. Their mechanism tests the quality of the predicted location using a private test; if the test succeeds, the prediction is reported, otherwise the location is sanitized with new noise. Chatzikokolakis et al. [33] also presented a formal notion of privacy that protects the user's exact location, "geo-indistinguishability", and proposed two mechanisms to protect the user's privacy when dealing with location-based services. They also extended their mechanisms to location traces, and provided a method to limit the degradation of the privacy guarantees caused by the correlation between the points.

Li et al. [34] proposed a compressive mechanism for differential privacy, which is based on compressed sensing theory. Their mechanism treats every data item as a single individual, so it undermines the relationships among data items and is not suitable for protecting location data. Jia et al. [1] proposed a differential privacy-based transaction data publishing scheme. Their method establishes the relationships among transaction data items with a query tree and adds noise to the query tree based on the compressive mechanism and Laplace's mechanism. However, it is difficult to measure the effectiveness of their method on privacy protection. Zhang et al. [35] proposed an accurate method for mining top-k frequent data records under differential privacy. In their scheme, the exponential mechanism is used to sample the top-k frequent data records, and then Laplace's mechanism is used to generate noise that distorts the original data. Although the privacy protection effectiveness of their method can be measured accurately, the method neglects the relationships among transaction data items.

Differential privacy
Differential privacy protection achieves its privacy goal by distorting data, where the common approach is to add noise to query results. Its purpose is to minimize privacy leakage while maximizing data utility [36,37]. Currently, differential privacy protection has two main mechanisms [38,39]: Laplace's mechanism and the exponential mechanism.
Dwork et al. [39] proposed a protection method for the sensitivity of private data, which is based on Laplace's mechanism. Their method distorts sensitive data by adding Laplace-distributed noise to the original data. It may be described as follows: the algorithm $M$ is the privacy protection algorithm based on Laplace's mechanism, the set $S$ is the noisy output set of $M$, and the input parameters are the data set $D$, the function $Q$, the function sensitivity $\Delta Q$ and the privacy parameter $\varepsilon$, where the deviation of $S$ from $Q(D)$ approximately follows the Laplace distribution with scale $\Delta Q/\varepsilon$ and mean zero, as shown in formula (1):

$$S - Q(D) \sim Lap\left(\frac{\Delta Q}{\varepsilon}\right) \qquad (1)$$

In their method, the probability density function of the added Laplace noise is given by formula (2):

$$\Pr(x, \lambda) = \frac{1}{2\lambda}\, e^{-|x|/\lambda} \qquad (2)$$

where $\lambda = \Delta Q/\varepsilon$; that is, the added noise is independent of the data set and depends only on the function sensitivity and the privacy parameter. The main idea of their method is to add Laplace-distributed noise to the output so that the sensitive data is distorted and thereby protected. For example, let $Q(D)$ be the query function for the top-k access count; then the output of the algorithm $M$ can be represented by formula (3):

$$M(D) = Q(D) + Lap\left(\frac{\Delta Q}{\varepsilon}\right) \qquad (3)$$

In formula (4) of Definition 3.1, $\Pr$ represents the randomness of the algorithm $M$ on $D$ and $D'$, namely the risk probability of privacy disclosure, and $\varepsilon$ represents the privacy protection level: the bigger $\varepsilon$ is, the lower the degree of privacy protection; conversely, the smaller $\varepsilon$ is, the higher the degree of privacy protection.

Definition 3.2 (Data Sensitivity): Data sensitivity is divided into global sensitivity and local sensitivity. Let $Q$ be a query function; then the global sensitivity of $Q$ is defined as follows:

$$\Delta Q = \max_{D, D'} \lVert Q(D) - Q(D') \rVert_1$$

where $D$ and $D'$ are adjacent data sets, $Q(D)$ is the output of the function $Q$ on the data set $D$, and $\Delta Q$ is the sensitivity, which represents the maximum difference between the outputs.
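As an illustration of the mechanism described above, the following minimal Python sketch adds zero-mean Laplace noise with scale ∆Q/ε to a query answer. The function names `laplace_noise` and `laplace_mechanism` are hypothetical, and the sampling method (the difference of two independent exponential variables, which is Laplace-distributed) is one convenient choice rather than the implementation used in the paper.

```python
import random


def laplace_noise(scale, rng):
    """Draw one zero-mean Laplace sample with the given scale (lambda).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace(0, scale) distribution.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return Q(D) + Lap(sensitivity / epsilon) for a numeric query answer."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


# Example: a count query with sensitivity 1, released under epsilon = 0.5.
rng = random.Random(42)  # fixed seed only to make the sketch repeatable
noisy_count = laplace_mechanism(100.0, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller ε gives a larger scale and therefore more distortion, which matches the statement that a smaller ε means a higher degree of protection.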
Additionally, because an ε-differential privacy protection scheme may be applied several times in different stages of data processing, it also needs to satisfy the following composition theorems.

Theorem 3.1: For the same data set, suppose the whole privacy protection process is divided into different privacy protection algorithms $M_1, M_2, \ldots, M_n$ whose privacy protection levels are $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$. Then the whole process satisfies differential privacy with privacy protection level $\sum_{i=1}^{n} \varepsilon_i$.

Theorem 3.2: For disjoint data sets, suppose the whole privacy protection process is divided into different privacy protection algorithms $M_1, M_2, \ldots, M_n$ whose privacy protection levels are $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$. Then the whole process satisfies differential privacy with privacy protection level $\max\{\varepsilon_i\}$.
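The two theorems amount to simple bookkeeping of the privacy parameter. The sketch below is only an illustration of that arithmetic; the helper names `sequential_budget` and `parallel_budget` are hypothetical and do not come from the paper.

```python
def _check(epsilons):
    """Validate a non-empty sequence of positive privacy levels."""
    eps = list(epsilons)
    if not eps or any(e <= 0 for e in eps):
        raise ValueError("need at least one positive epsilon")
    return eps


def sequential_budget(epsilons):
    """Theorem 3.1: algorithms applied to the same data set add up."""
    return sum(_check(epsilons))


def parallel_budget(epsilons):
    """Theorem 3.2: algorithms applied to disjoint data sets cost the maximum."""
    return max(_check(epsilons))


# Three stages with levels 0.5, 1.0 and 1.5.
total_same_data = sequential_budget([0.5, 1.0, 1.5])
total_disjoint_data = parallel_budget([0.5, 1.0, 1.5])
```

In practice this means that a scheme which perturbs the same data several times must split its overall ε across the stages, whereas stages that touch disjoint parts of the data can each use the full ε.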

Trajectory data privacy protection scheme
In this section, we propose a trajectory data privacy protection scheme that employs Laplace's differential privacy method to protect the user's trajectory data. In the proposed scheme, the algorithm first selects the protected points from the user's trajectory data. Second, it builds polygons from the protected points and the adjacent, frequently accessed points selected from the accessed point database, and calculates the polygon centroids. Finally, it adds noise to the polygon centroids using Laplace's differential privacy method, replaces the protected points with the noisy polygon centroids, and then constructs and issues the new trajectory data. The procedure of the proposed scheme is as follows:

(1) Input the trajectory data $I$, the related and historic point data set $D$ (see footnote 3), the radius $r$, and the differential privacy protection parameters $\varepsilon$ and min_count (see footnote 4);
(2) Select the protected point set $A$ from the trajectory data $I$; then, for each point $f \in A$, select its adjacent points from $D$, where the adjacent points lie within the circle centered at $f$ with radius $r$ and their access counts are no less than min_count; together these form the point set $B$;
(3) Traverse the set $B$ and build the corresponding polygons from each point $f$ and its adjacent points in $B$, where only one point in every polygon belongs to the trajectory data $I$; then calculate the polygon centroids and form the polygon centroid set $J$, where $j_i(x, y) \in J$ is a polygon centroid (see Section 4.2 for more details);
(4) Use Laplace's mechanism to add the noise $Lap(k \cdot \Delta Q/\varepsilon)$ to the polygon centroids in the set $J$, and thereby generate the set $G$ (see Section 4.3 for more details);
(5) Use the modified polygon centroids from $G$ to replace the corresponding protected points $f \in A$, and then issue the new trajectory data $I'$.
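Steps (4) and (5) of the procedure can be sketched in Python as follows. This is a simplified illustration under several assumptions that the text does not fix: coordinates are planar, independent noise is drawn for the x and y coordinates, the noise is always added (the Laplace distribution is symmetric, so the sign rule of the scheme is not modelled), and the multiplier `k` is passed in as a plain parameter because its value is not defined here. The names `perturb_centroids` and `publish_trajectory`, and the convention of keying centroids by the index of the protected point, are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Zero-mean Laplace sample: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_centroids(centroids, k, delta_q, epsilon, rng):
    """Step (4): add Lap(k * delta_q / epsilon) noise to every centroid.

    `centroids` maps the trajectory index of a protected point to the
    centroid (x, y) of the polygon built around it (assumed layout).
    """
    scale = k * delta_q / epsilon
    return {
        idx: (x + laplace_noise(scale, rng), y + laplace_noise(scale, rng))
        for idx, (x, y) in centroids.items()
    }


def publish_trajectory(trajectory, noisy_centroids):
    """Step (5): replace each protected point by its noisy centroid."""
    return [noisy_centroids.get(i, point) for i, point in enumerate(trajectory)]


# Toy trajectory f1..f5 in which the second and fourth points are protected.
rng = random.Random(7)
trajectory = [(0.0, 0.0), (10.0, 5.0), (20.0, 8.0), (30.0, 12.0), (40.0, 15.0)]
centroids = {1: (11.0, 6.0), 3: (29.0, 11.0)}
released = publish_trajectory(
    trajectory, perturb_centroids(centroids, k=1.0, delta_q=3.0, epsilon=5.0, rng=rng)
)
```

Only the protected points change in the released trajectory; all other points are published as they are, which is what keeps the modified trajectory usable.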

Processing trajectory data
This section describes how to select the related data from the trajectory data $I$ and the related and historic point data set $D$. The proposed algorithm selects each protected point $f \in A$ and its adjacent points from $D$. Figure 1 shows the procedure of selecting the related data. In Figure 1, a random trajectory of one user is shown, where the red circles and the red arrows show the trajectory, and the green circles denote the accessed historic location points, which build the related and historic point data set $D$.

Figure 1 Processing Trajectory Data

According to Figure 1, the procedure of selecting the related data may be described as follows:
- The proposed algorithm inputs the trajectory data $I$ of one user, the related and historic point data set $D$ and the related privacy protection parameters $r$, $\varepsilon$ and min_count;
- The algorithm selects the protected point set $A$ from the trajectory data $I$;
- The proposed algorithm forms the point set $B$ from each point $f_i \in A$ and its corresponding adjacent points from $D$, where the adjacent points lie within the circle centered at $f_i$ with radius $r$, and the access counts of the adjacent points are no less than min_count.

Footnote 3: The related and historic point data include the historic location points accessed by people and the corresponding access counts. For the trajectory data, we may save the historic trajectory data and the related information (including access time and access count) to the database, and then classify the data statistically to form the set $D$.

Footnote 4: Our proposed scheme focuses on frequently accessed location data so as to distort the attacker's target. The setting of min_count therefore improves the efficiency of the proposed scheme.
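The selection of the set B described above can be sketched in Python as follows. This is a minimal illustration, assuming planar coordinates with Euclidean distance (real longitude and latitude data would need a geodesic distance) and assuming that each historic point is stored as a tuple `(x, y, count)`. The names `select_neighbors` and `build_set_b` are hypothetical.

```python
import math


def select_neighbors(f, history, r, min_count):
    """Return the historic points within radius r of f whose count >= min_count."""
    fx, fy = f
    return [
        (x, y, count)
        for (x, y, count) in history
        if count >= min_count and math.hypot(x - fx, y - fy) <= r
    ]


def build_set_b(protected_points, history, r, min_count):
    """Map every protected point f in A to its qualifying adjacent points from D."""
    return {f: select_neighbors(f, history, r, min_count) for f in protected_points}


# Toy data: one protected point and four historic points with access counts.
history = [(1.0, 1.0, 80), (2.0, -1.0, 55), (0.5, 0.5, 10), (50.0, 50.0, 90)]
set_b = build_set_b([(0.0, 0.0)], history, r=5.0, min_count=50)
```

In the toy data, the third point is dropped because its access count is below min_count and the fourth because it lies outside the circle, so only the first two points remain candidates for the polygon.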

Building polygon model
This section describes how to build the polygon model and compute the polygon centroids. The proposed algorithm builds the polygons from the protected points $f \in A$ and the corresponding adjacent points from $D$. Figure 2 shows the procedure of building a polygon.

In Figure 2, the trajectory of one user is $f_1, f_2, \ldots, f_5 \in I$, and the points $h_1, h_2, \ldots, h_{13}$ with access counts come from $D$, where $f_2, f_4 \in A$ are the protected points. In the green circle centered at $f_2$ with radius $r$, the points $h_1$, $h_2$ and $h_4$ (which belong to $D$ and have access counts ≥ 50) and the point $f_2$ are used to form a polygon. Then the proposed algorithm computes the polygon centroid $j_1$ (noise is added to $j_1$ to generate a new point $g_1$). Similarly, the algorithm traverses the set $B$ to build the other polygons. Two remarks are in order. First, the points $h_1$, $h_2$ and $h_4$ are near the point $f_2$, so using them to build the polygon maintains the usability of the modified trajectory. Second, we set min_count to 50, so points in the green circle whose access counts are less than 50 are not used to build the polygon; this may distort the attacker's target and improves the efficiency of the proposed scheme. The procedure of building the polygon model may be described as follows:
- The algorithm traverses the set $B$, and then selects the relevant and max-sized points to build the polygons according to the distance. For example, for a potential polygon, the algorithm selects $N$ points from $B$ as vertices, whose coordinates are $P(x_i, y_i)$ with $i = 1, 2, 3, \ldots, N$, where one of the $N$ points is in the original trajectory and the other points are near that point;
- The algorithm computes the polygon centroids from the vertices of the formed polygons. The formula is as follows:

$$j_i(x, y) = \left( \frac{1}{|P_i|} \sum_{k=1}^{|P_i|} x_k,\; \frac{1}{|P_i|} \sum_{k=1}^{|P_i|} y_k \right)$$

where $P_i(x_k, y_k)$ is the coordinate of the $k$-th vertex of the $i$-th polygon, $|P_i|$ is the number of vertices of the $i$-th polygon, and $j_i(x, y)$ is the coordinate of the $i$-th polygon centroid.
- The polygon centroids form the set $J$, where $j_i(x, y) \in J$.
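The centroid computation can be sketched in Python as follows, reading the centroid as the average of the polygon's vertices, since the description divides by the number of vertices. The second helper computes a sensitivity value in the sense used for the noise in Section 4.3, interpreted here as the largest Euclidean distance between any vertex and the centroid of its polygon; that interpretation, the planar coordinates, and the names `polygon_centroid` and `max_vertex_distance` are assumptions of this sketch.

```python
import math


def polygon_centroid(vertices):
    """Average the vertex coordinates of one polygon (list of (x, y) tuples)."""
    if not vertices:
        raise ValueError("a polygon needs at least one vertex")
    n = len(vertices)
    return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)


def max_vertex_distance(polygons):
    """Largest vertex-to-centroid distance over all polygons (assumed Delta Q)."""
    largest = 0.0
    for vertices in polygons:
        cx, cy = polygon_centroid(vertices)
        for x, y in vertices:
            largest = max(largest, math.hypot(x - cx, y - cy))
    return largest


# Toy polygon: the protected point (0, 0) plus three nearby historic points.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
centroid = polygon_centroid(square)
delta_q = max_vertex_distance([square])
```

Because the protected point is itself one of the vertices, the centroid stays close to the original position whenever the adjacent points are close, which is the usability argument made for Figure 2.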

Adding noises based on the Laplace's mechanism
In this section, we show how to use Laplace's mechanism to add the noise $Lap(k \cdot \Delta Q/\varepsilon)$ (see footnote 7) to the set $J$. The main steps of the algorithm are as follows:
- Input the privacy protection level $\varepsilon$ and the polygon centroid set $J$, and then generate the noise $Lap(k \cdot \Delta Q/\varepsilon)$ satisfying the probability $\Pr(j(x, y), \lambda)$, where

$$\Pr(j(x, y), \lambda) = \frac{1}{2\lambda}\, e^{-|j(x, y)|/\lambda}$$

In the above formula, the variable $j(x, y)$ denotes the coordinate of the polygon centroid and $\lambda = k \cdot \Delta Q/\varepsilon$.
- Add the noise $Lap(k \cdot \Delta Q/\varepsilon)$ to the set $J$ so as to disturb the polygon centroids (see footnote 8):

$$g_i(x, y) = j_i(x, y) \pm Lap\left(\frac{k \cdot \Delta Q}{\varepsilon}\right)$$

where $j_i \in J$, $j_i(x, y)$ denotes the coordinate of the $i$-th polygon centroid, and $Lap(k \cdot \Delta Q/\varepsilon)$ is each round of independent noise following the probability $\Pr(j(x, y), \lambda)$. Finally, the algorithm generates the set $G$ of modified centroids $g_i$.
- Use the modified polygon centroids from $G$ to replace the corresponding protected points $f \in A$, and then issue the new trajectory data $I'$. For example, as Figure 2 shows, noise is added to $j_1$ to generate a new point $g_1$, and then $g_1$ is used to replace the point $f_2$, which modifies the original trajectory.

Footnote 7: $\Delta Q$ is the sensitivity of the query function $Q$, where we set $\Delta Q = \max\left\{ \sqrt{(P_i.x_k - j_i.x)^2 + (P_i.y_k - j_i.y)^2} \right\}$ with $i = 1, 2, \ldots, |N_P|$ and $k = 1, 2, \ldots, |P_i|$; $|N_P|$ is the number of polygons and $|P_i|$ is the number of vertices of the $i$-th polygon.

Footnote 8: If the formed polygon is on the left of the protected point from the trajectory data $I$, the operation "+" is used; otherwise (the formed polygon is on the right of the protected point), the operation "−" is used.

Experiment and efficiency analysis of the proposed scheme

In this section, our experiments evaluate the efficiency of the proposed scheme from two aspects: the first is the running time of the proposed algorithms, namely the time of extracting the available data; the second is the effectiveness of the proposed algorithms, whose indexes are the trajectory deviation rate and the trajectory accurate rate. The original test data set comes from a simulation on the Baidu map (see footnote 9) and is similar to the Gowalla data set. It contains user_id, access time, longitude, latitude and so on, and covers a period of about one month. All proposed algorithms are implemented in C++ using Code::Blocks. The related parameters for the test are set as in Table 1.

Footnote 9: Baidu is a network company in China. The Baidu map is one of the network services provided by the company, and it offers many APIs for programmers to develop their applications on the map.

Running time analysis
In this section, we test the running time of the proposed algorithms mainly through the time of extracting the available data; that is, we test the efficiency of computing all the polygon centroids from the available data. In the tests, we set r = 70 and ε = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 respectively; the resulting time of extracting the available data is given in Table 2.
From Table 2, we can see that extracting the available data is very fast, and that the efficiency of computing all the polygon centroids from the available data increases with ε within a certain range.

Protection effectiveness analysis
In this section, we test the protection effectiveness of the proposed algorithms mainly through the trajectory deviation rate and the trajectory accurate rate. The trajectory deviation rate is the angle θ formed by the modified polygon centroid and the original trajectory points, as shown in Figure 3; if the trajectory deviation rate is bigger within a certain range, then the protection effectiveness is higher. The trajectory accurate rate Z is used to test the protection effectiveness and usability of the noise-added data; if the trajectory accurate rate is smaller within a certain range, then the usability is higher.
In the test, we compute the trajectory accurate rate as follows: 1) set the coordinate $(a_i, b_i)$ of the polygon centroid; 2) compute the hypotenuse determined by $(a_i, b_i)$, where $c_i$ is the original hypotenuse and $c_i'$ is the noise-added hypotenuse. We set ε = 5, 10, 15 and r = 40, 50, 60, 70, 80, 90, 100, 110 respectively, and Tables 3, 4 and 5 show the deviation rate and accurate rate of the trajectory data. From Table 3, when ε = 5 and r < 90, the polygon centroid does not change as r increases, so the deviation rate θ and the accurate rate Z do not change either. This shows that in the range r < 90 no new points are selected to build a new polygon, so the polygon is not modified. When r ≥ 90, new points are selected to build a new polygon, so the polygon centroid is recomputed and the deviation rate θ and the accurate rate Z change. This shows that the deviation rate θ can become bigger as r increases, while the data usability becomes smaller. From Table 4 and Table 5, when ε = 10 and ε = 15, we get results similar to those of Table 3.

Additionally, when we fix r = 70 and set ε = 1, 2, 3, 4, ..., 15 respectively, Table 6 shows the deviation rate and accurate rate of the trajectory data. From Table 6, the deviation rate θ and the accurate rate Z always increase with ε. That is because the constraint condition in the differential privacy mechanism becomes smaller as ε increases. However, this also shows that as the deviation rate θ becomes bigger, the data usability becomes smaller.

Conclusions
Because the records of trajectory data are discrete in the database, some existing privacy protection schemes have difficulty protecting trajectory data. In this paper, we propose a trajectory data privacy protection scheme based on Laplace's differential privacy mechanism. In the proposed scheme, the algorithm first selects the protected points from the user's trajectory data. Second, it builds polygons from the protected points and the adjacent, frequently accessed points selected from the accessed point database, and calculates the polygon centroids. Finally, it adds noise to the polygon centroids using the differential privacy method, replaces the protected points with the new polygon centroids, and then constructs and issues the new trajectory data.

In formula (3), $Lap(\Delta Q/\varepsilon)$ is each round of independent noise following the Laplace distribution; the noise is proportional to $\Delta Q$ and inversely proportional to $\varepsilon$.

Definition 3.1 (ε-Differential Privacy): Given two adjacent data sets $D$ and $D'$ that differ in at most one data record ($|D \,\Delta\, D'| = 1$), and any algorithm $M$ whose range is $Range(M)$, if the result $S$ output by the algorithm $M$ ($S \in Range(M)$) satisfies the following formula (4) on the two adjacent data sets $D$ and $D'$, then the algorithm $M$ satisfies ε-differential privacy:

$$\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S] \qquad (4)$$

Figure 2 Building Polygon Model

Figure 3 Trajectory Deviation Angle