Building a coauthorship network with R
25 Jan 2019
In this post, I will share my experience creating and visualizing coauthorship networks with R. We are going to focus on a particular type of network centered around one scholar (me in this example). The nodes of the network will be my coauthors (people with whom I published at least one paper) and the weight of the link between two coauthors will be proportional to the number of papers they cosigned (if any). We will first scrape data from Google Scholar using the R package scholar to build the network and then rely on the package networkD3 to visualize it.
At the end of the process we will obtain this network.
Scraping data from Google Scholar
We take as input my Google Scholar id and those of my coauthors (the ones who have a Google Scholar account). The id is the value of the user parameter in the URL of a Google Scholar profile page.
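As a small illustration, the id can be pulled out of a profile URL with a regular expression. The URL below contains a made-up placeholder id, and scholar_id_from_url is a hypothetical helper, not a function of the scholar package.

```r
# Hypothetical helper: extract the value of the `user` query parameter
# from a Google Scholar profile URL.
scholar_id_from_url <- function(url) {
  sub(".*[?&]user=([^&#]+).*", "\\1", url)
}

# Placeholder URL with a made-up id
url <- "https://scholar.google.com/citations?user=XXXXXXXXXXXX&hl=en"
id <- scholar_id_from_url(url)
```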
I did not manage to automatically get my full list of coauthors from Google Scholar with the functions of the scholar package so I did it manually. I gathered all the needed information in a csv file Coauthors that contains five columns:
- ID: Unique integer node id. The first node is the central node (me)
- Name: Full name of the author as it will appear on the final network
- Scholar: Google Scholar id
- Group: You can define different groups of coauthors, displayed in different colors
- W: Weight of the node, used to set the size of the circles. We will update this value according to the number of publications in common with me.
We import the file into the data frame co.
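For illustration, assuming the file is named Coauthors.csv (the file name and the rows below are made up), the import could look like this. An inline copy of the file is used so that the snippet runs on its own.

```r
# Made-up content mimicking the five columns described above
csv_text <- "ID,Name,Scholar,Group,W
1,Central Author,AAAAAAAAAAAA,1,1
2,Coauthor One,BBBBBBBBBBBB,2,1
3,Coauthor Two,CCCCCCCCCCCC,2,1"

# With a real file: co <- read.csv("Coauthors.csv", stringsAsFactors = FALSE)
co <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(co)
```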
Google Scholar data can be messy, particularly if a scholar's profile is not regularly cleaned by hand. Since we rely on the number of publications in common between every pair of my coauthors to build the network, we need the comparison of titles to be as insensitive as possible to differences in case, spacing and punctuation. More specifically, if two article titles are very similar, like “Human Mobility: Models and Applications” and “Human mobility : models and applications”, we want to consider them as a single publication. For this purpose, I wrote the function simat that returns a matrix of similarities between two vectors of character strings li and lj. The element ij of the matrix is the fraction of letters in common between the ith string of li and the jth string of lj.
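The sketch below shows one way such a function could be written. It is not the original simat: as a stand-in for the fraction of letters in common, it uses a normalized edit distance computed with adist (package utils, shipped with R) on lowercased strings, so its scores may differ from those of the original function.

```r
# Stand-in similarity matrix: element [i, j] is 1 minus the edit distance
# between li[i] and lj[j], divided by the length of the longer string.
simat <- function(li, lj) {
  li <- tolower(li)
  lj <- tolower(lj)
  d <- utils::adist(li, lj)                     # length(li) x length(lj) matrix
  maxlen <- outer(nchar(li), nchar(lj), pmax)   # longer length for each pair
  maxlen[maxlen == 0] <- 1                      # avoid 0/0 for empty strings
  1 - d / maxlen
}

s <- simat(
  c("Human Mobility: Models and Applications", "A different title"),
  "Human mobility : models and applications"
)
```

With this stand-in, the two spellings of the same title differ by a single inserted space once lowercased, so their score stays above a 0.95 threshold, while the unrelated title scores far below it.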
We can use this function to create two functions, duplipubli and intersectpubli, which respectively remove duplicates from a vector of article titles and compute the number of publications in common between two authors based on their vectors of article titles. I added the possibility to adjust a threshold value that determines whether two strings correspond to the same article. After a few tests I found that a threshold of 0.95 gives satisfactory results. For example, the comparison between “Human Mobility: Models and Applications” and “Human mobility : models and applications” returns a score of 0.987.
duplipubli computes the similarity matrix and iteratively removes the duplicates (strings whose similarity is higher than the chosen threshold value).
intersectpubli computes the similarity matrix and counts the strings in common between two string vectors.
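A possible shape for these two functions is sketched below. The bodies are assumptions rather than the original code: duplipubli keeps a title only if it is not too similar to a title already kept, intersectpubli counts the titles of the first vector that have a match in the second, and the simat used here is a compact stand-in based on a normalized edit distance (adist from utils) rather than the letter-fraction score described above.

```r
# Compact stand-in for simat (normalized edit distance on lowercased strings)
simat <- function(li, lj) {
  li <- tolower(li)
  lj <- tolower(lj)
  1 - utils::adist(li, lj) / pmax(outer(nchar(li), nchar(lj), pmax), 1)
}

# Keep a title only if its similarity to every kept title is <= th
duplipubli <- function(titles, th = 0.95) {
  kept <- character(0)
  for (ti in titles) {
    if (length(kept) == 0 || all(simat(ti, kept) <= th)) kept <- c(kept, ti)
  }
  kept
}

# Count the titles of li whose best match in lj has similarity > th
intersectpubli <- function(li, lj, th = 0.95) {
  if (length(li) == 0 || length(lj) == 0) return(0L)
  sum(apply(simat(li, lj), 1, max) > th)
}

a <- c("Human Mobility: Models and Applications",
       "Human mobility : models and applications",
       "Another paper")
b <- c("Human mobility : models and applications", "Something else")

unique_a <- duplipubli(a)               # the second spelling is dropped
n_common <- intersectpubli(unique_a, b) # one title in common
```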
We can now use these functions and the function get_publications (package scholar) to build the network by computing, for each pair of scholars, their number of articles in common. get_publications takes a Google Scholar id as input. Note that I filter out entries without a year of publication.
The network is stored in the data frame net.
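The pairwise step can be sketched as follows. The snippet is an illustration under several assumptions: build_net and its column names (source, target, value) are hypothetical, the toy title vectors stand in for the titles that get_publications would return for each scholar, and a simple exact match on lowercased titles stands in for intersectpubli.

```r
# With real data, each element of `pubs` could come from something like:
#   p <- scholar::get_publications(id)
#   p$title[!is.na(p$year)]   # drop entries without a publication year
pubs <- list(
  me = c("Paper A", "Paper B", "Paper C"),
  x  = c("paper a", "paper b"),
  y  = c("Paper C", "Paper D")
)

# Stand-in for intersectpubli: exact match on lowercased titles
common_exact <- function(a, b) length(intersect(tolower(a), tolower(b)))

# Hypothetical helper: one row per pair of scholars, ordered
# (1,2), (1,3), ..., (1,n), (2,3), ... so that my own links come first.
build_net <- function(pubs, common = common_exact) {
  pairs <- t(utils::combn(length(pubs), 2))
  data.frame(
    source = pairs[, 1],
    target = pairs[, 2],
    value  = apply(pairs, 1, function(p) common(pubs[[p[1]]], pubs[[p[2]]]))
  )
}

net <- build_net(pubs)
```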
We now set the node weights W according to the number of publications in common with me. Of course, the number of publications that I have with myself is simply my total number of publications. Since we don’t have this information yet, I use the function get_num_articles (package scholar) to retrieve it. For my coauthors, this information is available in the first N - 1 weights of the network, where N is the number of rows of co (me included).
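In code, this step could look like the sketch below. The data are toy values, net is assumed to list my links first (source, target and value are hypothetical column names), and n_mine is a placeholder for the value that get_num_articles from the scholar package would return for my id.

```r
co <- data.frame(ID = 1:3, Name = c("Me", "Coauthor One", "Coauthor Two"), W = 1)
net <- data.frame(source = c(1, 1, 2), target = c(2, 3, 3), value = c(2, 1, 0))

n <- nrow(co)
n_mine <- 3  # placeholder; with real data: scholar::get_num_articles(co$Scholar[1])

co$W[1] <- n_mine                      # central node: my own publication count
co$W[-1] <- net$value[seq_len(n - 1)]  # first n - 1 rows of net are my links
```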
Design and create the network
We keep only the links with at least one publication in common.
Since the links are drawn in the order in which they appear in net, we need to reverse that order if we want my links to be drawn on top.
We define two link colors: grey for the links between my coauthors and blue for the links between my coauthors and me.
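These three steps (filtering, reversing, coloring) can be sketched in base R as follows. The column names and hex colors are assumptions chosen for the example, and the final shift to zero-based indices reflects the fact that networkD3 expects node indices starting at 0.

```r
# Toy network with four nodes; node 1 is the central node (me)
net <- data.frame(
  source = c(1, 1, 1, 2, 2, 3),
  target = c(2, 3, 4, 3, 4, 4),
  value  = c(2, 1, 0, 0, 3, 0)
)

# Keep links with at least one publication in common
links <- net[net$value > 0, ]

# Reverse the order so that my links come last and are drawn on top
links <- links[rev(seq_len(nrow(links))), ]

# Blue for links involving me (node 1), grey for the others (assumed hex codes)
links$colour <- ifelse(links$source == 1 | links$target == 1, "#3182bd", "#999999")

# networkD3 expects zero-based node indices
links$source <- links$source - 1
links$target <- links$target - 1
```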
We then build the network with forceNetwork. Many aspects of the network, such as the distance between nodes, can be adjusted with the parameters of forceNetwork.
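A minimal call might look like the following sketch, which uses toy data and only runs if networkD3 is installed. The parameter values (linkDistance, charge, opacity and so on) are arbitrary choices for illustration, not the settings used for the network shown in this post.

```r
nodes <- data.frame(
  Name  = c("Me", "Coauthor One", "Coauthor Two", "Coauthor Three"),
  Group = c(1, 2, 2, 3),
  W     = c(5, 2, 1, 1)
)
links <- data.frame(
  source = c(1, 0, 0),   # zero-based node indices
  target = c(3, 2, 1),
  value  = c(3, 1, 2),
  colour = c("#999999", "#3182bd", "#3182bd")
)

if (requireNamespace("networkD3", quietly = TRUE)) {
  network <- networkD3::forceNetwork(
    Links = links, Nodes = nodes,
    Source = "source", Target = "target", Value = "value",
    NodeID = "Name", Group = "Group", Nodesize = "W",
    linkColour = links$colour,  # one colour per link
    linkDistance = 100,         # target distance between linked nodes
    charge = -200,              # negative values push nodes apart
    opacity = 0.9, fontSize = 14, zoom = TRUE
  )
}
```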
Export the network
Finally, you can export the network to HTML with the following piece of code.
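For instance, networkD3 provides the function saveNetwork for this purpose. The sketch below wraps it in a hypothetical helper, export_network, and the file names are placeholders.

```r
# Hypothetical helper: write a networkD3 widget to a standalone HTML file
export_network <- function(widget, file = "network.html") {
  networkD3::saveNetwork(widget, file = file, selfcontained = TRUE)
  invisible(file)
}

# Usage, assuming `network` is the object returned by forceNetwork:
# export_network(network, "coauthors.html")
```

With selfcontained = TRUE the JavaScript dependencies are bundled into the single HTML file, which relies on pandoc being available on the machine.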
I inserted it on my website made with Jekyll using the code below.
The scripts are available on my website.