# Building a coauthorship network with R

25 Jan 2019

In this post, I will share with you my experience in the creation and visualization of coauthorship networks with R. We are going to focus on a particular type of network centered around one scholar (me in this example). The nodes of the network will be my coauthors (people with whom I published at least one paper) and the weight of the link between two coauthors will be proportional to the number of papers they cosigned (if any). We will first scrape data from Google Scholar using the R package **scholar** to build the network and then rely on the package **networkD3** to visualize it.

At the end of the process we will obtain this network.

### Scraping data from Google Scholar

We take as input my Google Scholar id and those of my coauthors (the ones who have a Google Scholar account). This id is the part shown in bold in the URL of a Google Scholar profile page.

https://scholar.google.com/citations?user=**VSyM8fEAAAAJ**&hl=en

I did not manage to automatically get my full list of coauthors from Google Scholar with the functions of the **scholar** package so I did it manually. I gathered all the needed information in a csv file Coauthors that contains five columns:

- **ID:** Unique integer node id. The first node is the central node (me).
- **Name:** Full name of the author as it will appear on the final network.
- **Scholar:** Google Scholar id.
- **Group:** You can define different groups of coauthors, displayed in different colors.
- **W:** Weight of the node that will be used to set the size of the circles. We will update the value according to the number of publications in common with me.

We import the file into the dataframe **co**.
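A minimal sketch of the import step, assuming the file is comma-separated and named `Coauthors.csv`. To run on its own, it first writes a three-row toy file to a temporary folder; the names and the two `PLACEHOLDER` ids are invented for illustration.

```r
# Toy stand-in for the Coauthors file; names and PLACEHOLDER ids are invented
toy <- data.frame(
  ID      = 1:3,
  Name    = c("Me", "Coauthor A", "Coauthor B"),
  Scholar = c("VSyM8fEAAAAJ", "PLACEHOLDER_A", "PLACEHOLDER_B"),
  Group   = c(1, 2, 2),
  W       = c(1, 1, 1),
  stringsAsFactors = FALSE
)
path <- file.path(tempdir(), "Coauthors.csv")
write.csv(toy, path, row.names = FALSE)

# Import the file into the data frame `co`, keeping ids and names as strings
co <- read.csv(path, stringsAsFactors = FALSE)
str(co)
```

With a real file, only the `read.csv()` call is needed, with `path` pointing to your own csv.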

Google Scholar data can be messy, particularly if the profile of a scholar is not regularly cleaned by hand. Since we will rely on the number of publications in common between every pair of my coauthors to build the network, we need a string comparison that is, as far as possible, insensitive to small differences such as case or spacing. More specifically, if two article titles are very similar, like **“Human Mobility: Models and Applications”** and *“Human mobility : models and applications”* for example, we want to consider them as a unique publication. For this purpose, I wrote the function **simat** that returns a matrix of similarities between two vectors of character strings **li** and **lj**. The element **ij** of the matrix is the fraction of letters in common between the **ith** string of **li** and the **jth** string of **lj**.
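One possible reading of "fraction of letters in common" can be sketched as follows. This is an assumption, not necessarily the original implementation: both strings are lower-cased, the characters they share are counted (with multiplicity), and the count is scaled as 2 × shared / (length of first + length of second). The helper names `char_counts` and `pair_sim` are hypothetical.

```r
# Character counts of a lower-cased string (a multiset of its characters)
char_counts <- function(s) table(strsplit(tolower(s), "", fixed = TRUE)[[1]])

# Similarity of two strings: 2 * shared characters / total characters
pair_sim <- function(a, b) {
  ca <- char_counts(a)
  cb <- char_counts(b)
  shared <- intersect(names(ca), names(cb))
  common <- sum(pmin(as.integer(ca[shared]), as.integer(cb[shared])))
  2 * common / (nchar(a) + nchar(b))
}

# Matrix of similarities: element [i, j] compares li[i] with lj[j]
simat <- function(li, lj) {
  outer(seq_along(li), seq_along(lj),
        Vectorize(function(i, j) pair_sim(li[i], lj[j])))
}

a <- "Human Mobility: Models and Applications"
b <- "Human mobility : models and applications"
round(simat(a, b), 3)  # 0.987, i.e. 2 * 39 / (39 + 40)
```

Note that this measure ignores the order of the characters, so two anagrams would score 1. It is a cheap filter for near-identical titles, not a general-purpose string distance.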

We can use this function to create two functions, **duplipubli** and **intersectpubli**, to remove duplicates from a vector of article titles and to compute the number of publications in common between two authors based on their vectors of article titles, respectively. I added the possibility to adjust a threshold value to determine whether two strings correspond to the same article. After a few tests, I found that a threshold of **0.95** gives satisfying results. For example, the comparison between **“Human Mobility: Models and Applications”** and *“Human mobility : models and applications”* returns a score of **0.987**.

**duplipubli** computes the similarity matrix and iteratively removes the duplicates (strings whose similarity with another string is higher than the defined threshold value).

**intersectpubli** computes the similarity matrix and the number of strings in common between two string vectors.
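The two helpers could look like the sketch below. It is an illustration under assumptions rather than the original code: `simat` is a compact character-overlap similarity, repeated here so the block runs on its own, the argument name `th` is hypothetical, and duplicates are dropped greedily by keeping the first occurrence of each title.

```r
# Character-overlap similarity matrix (assumed definition, see lead-in)
simat <- function(li, lj) {
  sim1 <- function(a, b) {
    ca <- table(strsplit(tolower(a), "", fixed = TRUE)[[1]])
    cb <- table(strsplit(tolower(b), "", fixed = TRUE)[[1]])
    k  <- intersect(names(ca), names(cb))
    2 * sum(pmin(as.integer(ca[k]), as.integer(cb[k]))) / (nchar(a) + nchar(b))
  }
  outer(seq_along(li), seq_along(lj),
        Vectorize(function(i, j) sim1(li[i], lj[j])))
}

# Keep a title only if it is not too similar to a title already kept
duplipubli <- function(l, th = 0.95) {
  keep <- character(0)
  for (s in l) {
    if (length(keep) == 0 || all(simat(s, keep) <= th)) keep <- c(keep, s)
  }
  keep
}

# Number of titles of li that have at least one match in lj
intersectpubli <- function(li, lj, th = 0.95) {
  if (length(li) == 0 || length(lj) == 0) return(0)
  sum(apply(simat(li, lj), 1, max) > th)
}

titles <- c("Human Mobility: Models and Applications",
            "Human mobility : models and applications",
            "A completely different title")
duplipubli(titles)                      # keeps the first and third titles
intersectpubli(titles[1], titles[2:3])  # 1
```

Running **duplipubli** on each author's titles before calling **intersectpubli** avoids counting the same article twice.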

We can now use these functions and the function **get_publications** (package **scholar**) to build the network by computing, for each pair of scholars, their number of articles in common. **get_publications** takes a Google Scholar id as input. Note that I filter out entries without a year of publication.

The network is stored in the dataframe **net**.
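The pairwise computation can be sketched as below. The column names `i`, `j` and `value` and the helpers `fetch_titles` and `build_net` are assumptions made for illustration. To keep the block runnable offline, the toy example uses hand-written titles and an exact, case-insensitive match as a stand-in for **intersectpubli**; `fetch_titles` is only defined, not called, since it needs the **scholar** package and an internet connection.

```r
# Titles of one scholar, dropping entries without a publication year.
# Assumes get_publications() returns `title` and `year` columns, with NA for missing years.
fetch_titles <- function(id) {
  pubs <- scholar::get_publications(id)
  pubs <- pubs[!is.na(pubs$year), ]
  as.character(pubs$title)
}

# One row per pair of scholars (i < j) with their number of common titles.
# `count_common` is a stand-in; a fuzzy counter can be passed instead.
build_net <- function(titles,
                      count_common = function(a, b)
                        length(intersect(tolower(a), tolower(b)))) {
  pairs <- t(combn(length(titles), 2))  # (1,2), (1,3), ..., (1,n), (2,3), ...
  value <- apply(pairs, 1, function(p)
    count_common(titles[[p[1]]], titles[[p[2]]]))
  data.frame(i = pairs[, 1], j = pairs[, 2], value = value)
}

# Toy data: element 1 is the central scholar, the others are coauthors
titles <- list(
  c("Paper A", "Paper B", "Paper C"),
  c("paper a", "Paper B"),
  c("Paper B", "Paper D")
)
net <- build_net(titles)
net
```

With real data, `titles` would be obtained with something like `lapply(co$Scholar, fetch_titles)`. Because `combn()` lists the pairs in lexicographic order, the links involving the central node come first in `net`.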

We now set the node weights **W** according to the number of publications in common with me. Of course, the number of publications that I have with myself is my actual number of publications. Since we don’t have this information yet, I use the function **get_num_articles** (package **scholar**) to retrieve it. For my coauthors, this information is available in the first **#coauthors - 1** weights of the network.
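A short sketch of this update, assuming `net` has a `value` column and lists the links of the central node first, so that its first `nrow(co) - 1` rows are my links. The helper `set_weights` and the toy numbers are hypothetical.

```r
# Set W to my own publication count for node 1 and, for each coauthor,
# to the number of publications shared with me (first n - 1 rows of net)
set_weights <- function(co, net, n_own) {
  n <- nrow(co)
  co$W[1]  <- n_own
  co$W[-1] <- net$value[seq_len(n - 1)]
  co
}

# Toy inputs (invented values)
co  <- data.frame(ID = 1:3, Name = c("Me", "Coauthor A", "Coauthor B"),
                  W = c(1, 1, 1))
net <- data.frame(i = c(1, 1, 2), j = c(2, 3, 3), value = c(2, 1, 1))

# With real data: n_own <- scholar::get_num_articles(co$Scholar[1])
co <- set_weights(co, net, n_own = 10)
co$W
```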

### Design and create the network

The function **forceNetwork** (package **networkD3**) creates a D3 JavaScript network graph based on a set of nodes and links and their attributes (name, group and size for the nodes, value for the links). We therefore need to format the two tables **co** and **net** by keeping only the name, group and size of each node, and only the links with at least one publication in common.
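The formatting step might look like this, assuming node ids start at 1 and `net` has columns `i`, `j` and `value` (hypothetical names). **forceNetwork** expects zero-based indices for the source and target of each link, hence the subtraction.

```r
# Toy inputs (invented values)
co  <- data.frame(ID = 1:3, Name = c("Me", "Coauthor A", "Coauthor B"),
                  Group = c(1, 2, 2), W = c(10, 2, 1))
net <- data.frame(i = c(1, 1, 2), j = c(2, 3, 3), value = c(2, 1, 0))

# Nodes: keep only name, group and size
nodes <- co[, c("Name", "Group", "W")]

# Links: keep pairs with at least one common publication,
# and shift the ids so that the first node has index 0
links <- net[net$value > 0, ]
links$source <- links$i - 1
links$target <- links$j - 1
links <- links[, c("source", "target", "value")]
links
```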

Since the links will be plotted in the order they appear in **net**, we need to reverse them if we want my links to be drawn on top.

We define two colors of link, **grey** for the links between my coauthors and **blue** for the links between my coauthors and me.
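Both steps, reversing the links and colouring them, fit in a few lines. The sketch assumes a `links` table with zero-based `source` and `target` columns in which the central node has index 0; the `colour` column name and the plain CSS colour names are illustrative choices.

```r
# Toy links: the first two involve the central node (index 0)
links <- data.frame(source = c(0, 0, 1), target = c(1, 2, 2),
                    value = c(2, 1, 1))

# Reverse the row order so that my links come last in the table
links <- links[rev(seq_len(nrow(links))), ]
rownames(links) <- NULL

# Blue for links that involve me, grey for links between coauthors
links$colour <- ifelse(links$source == 0 | links$target == 0, "blue", "grey")
links
```

The `colour` vector has one entry per link, which is the form accepted by the `linkColour` argument of **forceNetwork**.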

We then build the network with **forceNetwork**. Many aspects of the network, such as the distance between nodes, can be adjusted with the parameters of **forceNetwork**.
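A hedged example of the call on toy data. The parameter values (`linkDistance`, `charge`, `opacity`, `fontSize`) are arbitrary illustrations, not the settings used for the network shown in this post, and the call is skipped when **networkD3** is not installed.

```r
# Toy nodes and links (invented values; link indices are zero-based)
nodes <- data.frame(Name = c("Me", "Coauthor A", "Coauthor B"),
                    Group = c(1, 2, 2), W = c(10, 2, 1))
links <- data.frame(source = c(1, 0, 0), target = c(2, 2, 1),
                    value = c(1, 1, 2),
                    colour = c("grey", "blue", "blue"))

if (requireNamespace("networkD3", quietly = TRUE)) {
  network <- networkD3::forceNetwork(
    Links = links, Nodes = nodes,
    Source = "source", Target = "target", Value = "value",
    NodeID = "Name", Group = "Group", Nodesize = "W",
    linkColour = links$colour,  # one colour per link
    linkDistance = 100,         # target distance between linked nodes
    charge = -200,              # negative values push nodes apart
    opacity = 0.9,
    fontSize = 14,
    zoom = TRUE
  )
  # print(network) displays the widget in the viewer or browser
}
```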

### Export the network

You can finally export the network in **html** with the following piece of code.
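The export can be done with **saveNetwork** from **networkD3**; a minimal sketch on a toy network, writing to a temporary folder. The file name is an assumption, and `selfcontained = FALSE` is used here because a self-contained file (the default) requires pandoc.

```r
if (requireNamespace("networkD3", quietly = TRUE)) {
  # Tiny stand-in network (invented values)
  nodes <- data.frame(Name = c("Me", "Coauthor A"), Group = c(1, 2))
  links <- data.frame(source = 0, target = 1, value = 1)
  network <- networkD3::forceNetwork(
    Links = links, Nodes = nodes,
    Source = "source", Target = "target", Value = "value",
    NodeID = "Name", Group = "Group"
  )

  # Write the widget to an html file, with its dependencies in a side folder
  out <- file.path(tempdir(), "network.html")
  networkD3::saveNetwork(network, file = out, selfcontained = FALSE)
}
```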

I inserted it on my website made with **Jekyll** using the code below.

The scripts are available on my website.