A Novel Method for Determining Research Groups from Co-Authorship Network and Scientific Fields of Authors

Large networks not only have a large number of vertices but also have a large number of edges. Usually such networks are dense and difficult to visualise, even locally. This paper considers the case where large weights on edges represent proximity of the corresponding end-vertices. We follow two main ideas in this paper. The first one is \emph{network pruning}, that is, removal of edges that makes the resulting network more manageable while keeping the main characteristics of the original network. The other idea is to partition the network vertex set in such a way that the induced connected components represent groups of network elements that fit together. Furthermore, we assume that the vertices of the network are labeled by \emph{types}. In this paper we apply our approach to the co-authorship network of researchers in Slovenia in order to identify research groups, find group leaders and determine the degree of inter-disciplinarity of each group. For the network pruning phase we use a pathfinder network and for the vertex partition appropriate line-cuts. Each cluster is assigned a distribution of types. A measure of inter-disciplinarity of a research group is derived from such a distribution.


Introduction
In the contemporary research community scientists collaborate within formal or informal research groups. Identifying such groups from data available in various bibliometric networks is an interesting challenge. In this note we propose a method that uses the co-authorship network on the one hand and the declared scientific fields of authors, which can be extracted from some bibliographic databases, on the other.
We propose a theoretical model that uses a network, i.e. a graph with weights on its edges and labels, called types, on its vertices. We may view labels as scientific fields or subfields. Our approach is quite general and can be applied to any weighted network with types. In this paper we apply it to co-authorship networks. Note that scientific fields are sometimes called research interests.
The method consists of two steps. In the first step the original co-authorship network is pruned in order to reduce the number of edges and increase the number of components, in our case producing research groups. In this step line-cuts are determined. In the second step a collection of induced monotype subnetworks is pruned by applying the MST-pathfinder algorithm to further reduce the number of edges while keeping the same connectivity. Our original contribution is the combination of both methods and the use of a symmetric predicate in the first step; see Algorithm 3. Note that the idea of using MST, pathfinder and MST-pathfinder has been used extensively in the past in a variety of contexts of bibliographic and other research [6,8,11,22,23].
This rough general approach may be refined in several different ways. We present in detail only one such refinement and discuss some others in the conclusion. In general, bibliographic networks are very large and allow for a variety of methods for data mining [15]. However, in this pilot study we focus our attention on a relatively small data set. The data is restricted to Slovenian researchers and is taken from the Slovenian bibliographic system SICRIS. Moreover, only researchers that are co-authors of mathematicians are considered.
Pruning of co-authorship network

Co-authorship network
For basics in graph theory, the reader is referred to [4]; for network theory, see for instance [3].
Let V be a list of authors from some bibliographic database. We say that u, v ∈ V are adjacent, written u ∼ v, if u and v are co-authors of a common work from the corresponding database. Sometimes we restrict our attention to certain types of works or certain types of co-authorships. Usually only scientific works are considered and the co-authorship graph is computed from a two-mode network WA composed of pairs (w, a) of works and authors, one pair for each co-author a of work w. Since the binary relation ∼ is irreflexive¹ and symmetric, it defines a simple graph G = (V, ∼) that we call the co-authorship graph. Let E = {uv ∈ V^2 | u ∼ v} denote the set of unordered pairs of adjacent vertices of G. Instead of G = (V, ∼) we may use the notation G = (V, E) to denote the same graph. The graph may be weighted, where the weights w on the edges represent the number of joint papers between the two authors. In this way a network N = (V, E, w) is obtained. Let w(u, v) denote this weight and let W(u, v) denote the collection of works co-authored by u and v. Then w(u, v) = |W(u, v)|. Sometimes we may weight a co-authorship differently depending on the number of co-authors of a work. For any work w let n(w) denote the number of authors of w.

¹ Sometimes one may also use loops at each vertex. The weight associated with a loop may depend on the method by which the co-authorship graph is constructed. If it is obtained by multiplication of two-mode networks [4], it represents the number of works for a given author. In the fractional approach it may represent the total contribution of an author. Loops are removed if we follow Newman's approach.
In a fractional approach [2] the weight f (u, v) is defined as: In case of Newman's normalization the weight is: A network N is a weighted graph N = (V, E, w), where w : E → R is the weight function. In our case it is positive and the value 0 means there is no edge between u and v.
A graph G = (V, ∼) is transformed into the network N = (V, E, a), where a(u, v) = 1 for all adjacent pairs of vertices u ∼ v. The same bibliographic database can produce at least three types of networks, one for each of the weight functions a, w and f defined above.

[Algorithm 1, surviving steps]
Let S_u, S_v be the corresponding sets.
10: if S_u ≠ S_v then
11: Append e to F_i.
12: for e = uv ∈ F_i do
13: if S_u ≠ S_v then
14:

There is another aspect that we have not considered in this paper. Namely, the weight of an edge e = uv between two authors u and v may also depend on the total number of papers authored by each of the two authors. In this case we may modify the network to allow loops and define w*(u, v) = w(u, v) for u ≠ v and let w*(u) = w*(u, u) denote the total number of papers having u as an author. Note that in general w* cannot be computed directly from w since we have no information about the single-authored papers. In this case the best way to compute w* is to multiply WA^T by WA, where WA represents the two-mode work-author network. The theory of two-mode networks and their applications to bibliographic data can be found, for instance, in [3].
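The product of WA^T and WA mentioned above can be illustrated on a toy example. The following is only a minimal sketch: the function name `coauthorship_weights` and the sample data are hypothetical, and each work is assumed to be given simply as a set of author names. The off-diagonal entries of the product give w(u, v), while the diagonal entries give w*(u).

```python
from collections import Counter
from itertools import combinations

def coauthorship_weights(works):
    """Return (w, w_star) for a list of works, each an iterable of authors.

    w[(u, v)] with u < v is the number of joint works of u and v,
    i.e. an off-diagonal entry of WA^T * WA.
    w_star[u] is the number of all works of u (single-authored included),
    i.e. a diagonal entry of WA^T * WA.
    """
    w = Counter()
    w_star = Counter()
    for authors in works:
        authors = sorted(set(authors))
        for u in authors:
            w_star[u] += 1
        for u, v in combinations(authors, 2):
            w[(u, v)] += 1
    return w, w_star

# Hypothetical toy data: three works, the last one single-authored.
works = [{"A", "B"}, {"A", "B", "C"}, {"A"}]
w, w_star = coauthorship_weights(works)
```

Here the single-authored work changes w*(A) but leaves every w(u, v) untouched, which is why w* cannot be recovered from w alone.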

Pruning networks
In the analysis of large networks, dense networks present a challenge. Usually one tends to partition the set of vertices and investigate the induced networks on such parts. In [3] one may find a variety of concepts that are useful in such analysis, e.g. cuts, islands, etc. Nevertheless, such subnetworks may be dense again and the role of particular vertices is not clearly visible. For this reason we prune the original network N = (V, E, w) by appropriately selecting a subset of important edges E' ⊂ E. If w' denotes the restriction of w to E', the pruned subnetwork N' = (V, E', w') is obtained.
In case of co-authorship networks large weights indicate close collaboration between authors. When considering research groups one may assume strong collaboration within each group. Hence, in such a case a natural approach to pruning would be to remove all edges of lesser weights, while keeping the same connected components. A possible solution is given by the well-known maximum cost spanning tree. More precisely, in case of a disconnected network the resulting graph is a maximum cost spanning forest.
However, the problem with a maximum cost spanning forest is that, when several edges have the same weight, the forest may not be unique. We use a Kruskal-like algorithm that produces a unique pruned network. Algorithm 1 is almost identical to the MST-pathfinder algorithm of [14] and produces the pathfinder network P n(∞, n − 1); for discussion and various aspects see also [19,5,20].
It is not hard to see that the following is true: In fact, the time complexity is the same as for Kruskal's algorithm [10]. The sorting and partitioning take O(m log m) steps. There are two loops, each with O(m) steps, and the time complexity for the UNION-FIND is of lesser order.
By applying this pruning method strong ties among the nodes remain visible.
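A Kruskal-like procedure of this kind can be sketched as follows. This is an illustration under stated assumptions, not necessarily the authors' Algorithm 1: edges are assumed to be stored in a dict mapping vertex pairs to weights, large weights are treated as strong ties, and the names `mst_pathfinder`, `find` and `kept` are ours. Edges of equal weight are tested together before any of them is merged, so the result does not depend on the order of ties.

```python
from itertools import groupby

def mst_pathfinder(vertices, edges):
    """Return the kept edges as a dict {(u, v): weight}.

    vertices: iterable of hashable vertex names
    edges: dict {(u, v): weight}, larger weight = stronger tie
    """
    parent = {v: v for v in vertices}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = {}
    by_weight = sorted(edges.items(), key=lambda item: -item[1])
    for _, group in groupby(by_weight, key=lambda item: item[1]):
        # First pass: select every edge of this weight whose endpoints
        # still lie in different sets.
        candidates = [(e, wt) for e, wt in group if find(e[0]) != find(e[1])]
        # Second pass: merge the sets only after all ties were tested.
        for (u, v), wt in candidates:
            kept[(u, v)] = wt
            parent[find(u)] = find(v)
    return kept

# Hypothetical toy network: a triangle of equal weights plus a weaker edge.
ties = {("a", "b"): 2, ("b", "c"): 2, ("a", "c"): 2, ("c", "d"): 1}
pruned_ties = mst_pathfinder("abcd", ties)
```

In the toy network all three triangle edges survive because they tie, so the output is unique but need not be a tree.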

Line-cuts
For further refining the network N = (V, E, w) one may choose a parameter t > 0, the threshold or cut parameter, and prune the edges with weights less than t. In this way the network N_t = (V, E_t, w) is obtained, where

E_t = {e ∈ E | w(e) ≥ t}.

The choice of parameter t depends on our aims. There are several obvious goals. For instance:
1. We may choose maximal value of t that guarantees at least a prescribed number of connected components, say κ.
2. An alternative is to insist that all components have at most a prescribed number of vertices, say ν.
We present the basic pruning algorithm; see Algorithm 2. It produces essentially a line-cut, see for instance [3]. The only difference is that we keep isolated vertices.
Algorithm 2 Prune the network N = (V, E, w), given threshold parameter t.

Connected components of the resulting network are called line-cuts. In Python, Algorithm 2 can be reduced to a single statement:
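One possible form of such a statement is sketched below. It assumes, as our own convention, that the edges are stored in a dict mapping vertex pairs to weights; the function name `line_cut` is hypothetical. The vertex set is not touched, so isolated vertices are kept.

```python
def line_cut(edges, t):
    """Keep only the edges of weight at least t (a single dict comprehension)."""
    return {e: wt for e, wt in edges.items() if wt >= t}

# Hypothetical toy network with threshold t = 2.
cut_edges = line_cut({("a", "b"): 3, ("b", "c"): 1, ("c", "d"): 2}, 2)
```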

Pruning networks with vertex types
Let us assume we are given a finite set of types, or colors, T, a network N = (V, E, w) and a mapping c : V → T. The structure N = (V, E, w, T, c) will be called a weighted network with vertex types. When pruning a network with vertex types, a connected component consisting of vertices of a single type will be called monotype. Additionally, we will refer to the number of types used in a connected component as its type number. The maximum of the type numbers of the network components is called the type number of the network. In particular, we are interested in networks of low type number, preferably monotype networks. Parameters of pruning may be adjusted in such a way that a monotype network is obtained. For networks with vertex types, in addition to the two goals described in Section 3, a third goal may be considered.
- One may insist that all connected components are monotype or, more generally, that each component has at most δ types (colors).
The following basic algorithm (Algorithm 3) for a given network with types removes all edges that have endpoints of different types or, more generally, all edges whose endpoint types do not satisfy a symmetric predicate P.

[Algorithm 3, surviving steps]
if P(c(u), c(v)) and w(e) ≥ t then
4: Append e to F.
5: return subnetwork Pr(N, t, P) = (V, F, w, T, c).
As we mentioned above, the predicate P is usually true if both endpoints are of the same type. However, other options are possible. Namely, we may have a similarity imposed on the types, and P signifies that two types are sufficiently similar.
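The pruning by a symmetric predicate and a threshold described above can be sketched in a few lines. The representation is an assumption of ours: edges are a dict of vertex pairs to weights, c is a dict of vertex to type, and the name `prune_with_types` is hypothetical. By default P tests equality of types.

```python
def prune_with_types(edges, c, t, P=lambda a, b: a == b):
    """Keep edge uv only if P(c(u), c(v)) holds and w(uv) >= t."""
    return {(u, v): wt for (u, v), wt in edges.items()
            if P(c[u], c[v]) and wt >= t}

# Hypothetical toy data: two mathematicians and one physicist.
c = {"a": "math", "b": "math", "x": "physics"}
edges = {("a", "b"): 5, ("a", "x"): 7, ("b", "x"): 1}
same_type = prune_with_types(edges, c, t=2)
# A predicate that is always true reduces this to a plain line-cut.
any_type = prune_with_types(edges, c, t=2, P=lambda a, b: True)
```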
We need an algorithm to analyse the network with vertex types; see Algorithm 4. Using these numbers we may select different parameters and re-run this algorithm to reduce the size of the maximal component or alternatively limit the number of different components. We may also insist that all components be composed of a single type.
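A minimal sketch of such an analysis follows. We assume that the numbers of interest are, for each connected component, its size and its type number; this is our reading of the paragraph and not necessarily the content of Algorithm 4. The name `component_stats` is hypothetical.

```python
def component_stats(vertices, edges, c):
    """Return a list of (size, type_number), one pair per connected component.

    vertices: list of vertices; edges: iterable of pairs (u, v);
    c: dict mapping each vertex to its type.
    """
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    stats = []
    for start in vertices:
        if start in seen:
            continue
        # Depth-first search collects one component.
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            x = stack.pop()
            comp.add(x)
            for y in adj[x] - seen:
                seen.add(y)
                stack.append(y)
        stats.append((len(comp), len({c[x] for x in comp})))
    return stats

# Hypothetical toy data.
types = {"a": "1.01", "b": "1.01", "c": "2.03", "d": "1.01"}
stats = component_stats(["a", "b", "c", "d"], [("a", "b"), ("b", "c")], types)
largest_component = max(size for size, _ in stats)
network_type_number = max(k for _, k in stats)
```

If `network_type_number` exceeds the desired bound, or `largest_component` is too big, the pruning can be repeated with a larger threshold or a stricter predicate.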

Interdisciplinarity of research groups and leaders of research groups
For a given network with vertex types one may perform basic statistics on it. Namely, one may compute absolute frequencies of types on the vertex set.
where n = |V| and n_i = |V_i|. We consider two measures: and for each component: Both measure the diversity of research interests in a research group. If r(V_i) < 0.5 there is no dominant discipline. If r(V_i) = 1, the group is totally homogeneous.

One way to define a leader of a research group is to determine the vertex of maximal degree in the corresponding network or, even better, the vertex with the maximal sum of weights of edges to the neighbouring vertices. There are two parameters that we are interested in. Let m be the number of edges of network N and let d be the maximal degree, attained at vertex x. Let d' be the second largest degree. Then x can be defined as the leader of the research group, while dominance is the quotient d/m and absolutism is defined by the expression 1 − d'/d. Note that it would also be interesting to explore the diversity index [21] in this context. However, we will address all of these in future work.
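The leader, dominance and absolutism of a group can be computed directly from the degrees. The sketch below uses unweighted degrees and assumes, as our own interpretation, that the second largest degree is the degree of the second-ranked vertex, so a tie for first place gives absolutism 0. The name `leader_stats` is hypothetical, and the group is assumed to have at least two vertices and one edge.

```python
def leader_stats(vertices, edges):
    """Return (leader, dominance, absolutism) for one research group.

    dominance = d / m and absolutism = 1 - d2 / d, where m is the number
    of edges, d the maximal degree and d2 the second largest degree.
    """
    deg = {v: 0 for v in vertices}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    ranked = sorted(deg, key=deg.get, reverse=True)
    leader = ranked[0]
    d = deg[leader]
    d2 = deg[ranked[1]]
    m = len(edges)
    return leader, d / m, 1 - d2 / d

# Hypothetical toy group: a star centred at "c".
leader, dominance, absolutism = leader_stats(
    ["c", "a", "b", "d"], [("c", "a"), ("c", "b"), ("c", "d")]
)
```

For a star both values are high, since every edge touches the leader and no other vertex comes close in degree.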

Example
The data used in our experiments was taken from COBIS-S/SICRIS [18]. Finally, the division of Mathematics in the Level 3 is indicated here:

Level ℓ may be interpreted as the length of the research interest code that is used to test equality: for ℓ = 0 the string is not used at all, for ℓ = 1 only the first character is compared, for ℓ = 2 the first four characters are compared, while for ℓ = 3 all seven characters are compared. Different levels can be associated with a suitable choice of predicate P in Algorithm 3. Let P_ℓ denote the predicate applicable to level ℓ. For instance, for u = 1.01.01 and v = 1.01.04 we have P_2(u, v) = ⊤ while P_3(u, v) = ⊥.
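The level predicates amount to comparing prefixes of the research interest codes. A minimal sketch, assuming codes are seven-character strings of the form 1.01.01; the names `PREFIX_LENGTH` and `level_predicate` are hypothetical:

```python
# Number of leading characters compared at each level.
PREFIX_LENGTH = {0: 0, 1: 1, 2: 4, 3: 7}

def level_predicate(level):
    """Return the symmetric predicate for the given level (0, 1, 2 or 3)."""
    k = PREFIX_LENGTH[level]
    return lambda u, v: u[:k] == v[:k]

P2 = level_predicate(2)
P3 = level_predicate(3)
same_at_level_2 = P2("1.01.01", "1.01.04")  # both codes start with 1.01
same_at_level_3 = P3("1.01.01", "1.01.04")  # full codes differ
```

At level 0 the empty prefixes always agree, so the predicate is true for every pair and the types play no role in pruning.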
Here we give an example of a pruned research group network. We intend to perform a thorough analysis on a more complete data set elsewhere. Figures 2 and 3 depict the same research group. The network in Figure 3 is tree-like and is obtained from the

In the database some researchers were assigned a research interest at level 2, e.g. 1.01 (Mathematics). For consistency, we expanded that to level three as 1.01.00. Note that the research group in Figure 3 is composed of two subgroups, one predominantly interested in graph theory and the other in algebra. There is a central triangle connecting the two subgroups.

Conclusion
Co-authorship graphs and networks are important in the study of research structure and dynamics; see for instance [7,9,12,1]. Their practical value has first been recognised by specialised systems, such as MathSciNet and Zb-Math; see [16,24]. Including them in more general bibliographic systems such as SICRIS [18] would be beneficial for most users. Potential applications are plenty. In this paper we presented only one aspect of such applications. In a recent paper [13] a completely different application is sought, namely, organising talks at a conference in such a way that speakers with similar topics are scheduled at different times.
The data that was available to us also contains authors with UNKNOWN research interest. In this preliminary study we considered it as a separate research interest. It would be interesting to repeat the study with some flexibility and consider the function c : V → T ∪ {UNKNOWN}.
Clearly line-cuts refine the vertex partition and apply only within a component. Note that in general one could take different thresholds in different components. In case we intend to have components with a given maximal size ν, then indeed different threshold values may be used. In our future, more comprehensive work we intend to address some further extensions and applications of the MST-pathfinder method as well as some of the parameters that we have introduced.