The top 50 astronomy keywords of all time

In a previous post I took a look at the collaboration numbers among astronomers, and verified that groups of more than five coauthors have become prevalent in the last five years or so. A natural question that arises from this is “What are the most popular topics in astronomy publications?”.

To answer this question, one can look for the “Keywords” section that appears in most “recent” (i.e., from 1970 onwards) astronomy papers, and count the number of actual keywords listed. Once again, the NASA ADS Astronomy Abstract Service proves invaluable for this task. So without further ado, here are the results:

[Figure: the 50 most frequent keywords in refereed astronomy papers]

These keywords correspond to refereed articles only, and they have all been converted to lower case. The search gave 128,394 different keywords. The three least frequent keywords were “zz ceti”, “zz psc” and “zz uma”, with only one instance each. The keyword “dark energy” was number 311 in the list, “mars” was number 881, and “extrasolar planets” was number 1,410.
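For reference, the counting step can be sketched in a few lines of R. This is only an illustration: the file name and record layout below are assumptions, and it presumes the keyword strings have already been extracted from the ADS records.

```r
# Hypothetical input: one semicolon-separated keyword string per paper,
# e.g. "Stars: Atmospheres; Brown Dwarfs", previously extracted from ADS.
records <- readLines("ads_keywords.txt")

# Split each record into individual keywords, trim whitespace and
# convert everything to lower case.
keywords <- tolower(trimws(unlist(strsplit(records, ";"))))

# Frequency table, sorted from most to least common.
freq <- sort(table(keywords), decreasing = TRUE)

head(freq, 50)   # the top 50 keywords
tail(freq, 3)    # the least frequent ones
```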

It is also interesting to look at the frequency of keywords over time. I picked four keywords (“stars: atmospheres”, “brown dwarfs”, “gamma rays: observations”, and “astrochemistry”), and plotted their available yearly counts since 1990. I did this for each of three journals: The Astrophysical Journal (ApJ, shown in red below), Monthly Notices of the Royal Astronomical Society (MNRAS, blue), and Astronomy & Astrophysics (A&A, green). I also included the yearly counts for all the journals combined (shown in black):

[Figure: yearly counts for the four selected keywords, per journal and combined]

Can you explain the two peaks in the black curve for “gamma rays: observations”?

Check out this cool interactive website built by Stefano Meschiari, in which you can choose a keyword and see its behavior in time.

A quick look at salaries in Austin

The city of Austin, Texas, is well known as a center for technology and business. Many large companies, such as Dell, IBM, Google and National Instruments, are headquartered or have regional offices here. According to Wikipedia, Austin was ranked number 1 by Forbes among all big cities for jobs in 2012, and it was also ranked number 1 by the Wall Street Journal Marketwatch for growing businesses, which named it the “San Francisco of Texas”.

I wanted to learn a bit more about the types of industries that have a presence in Austin, as well as how much they pay (useful to know if you are in the job market), so I turned to the website Glassdoor.com, which contains first-hand information on companies, salaries associated with different job positions, and even employee reviews.

I did a search of all the companies listed on the website, and there were a little over 2,400. Not all of the listed companies included information about their positions or salaries (this was especially true of small companies). The first thing I scraped off the website was the distribution of companies over different industries. The results are summarized in this graph:

[Figure: number of listed companies per industry in Austin]

Austin’s character as a tech hub is obvious from this bar chart, where IT companies stand out. Keep in mind that this information is changing constantly.
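I'm not reproducing the actual scraping code here, but the flavor of such a step can be sketched with the rvest package. Treat this purely as a sketch: the URL and the CSS selector are placeholders, not Glassdoor's real (and frequently changing) page structure.

```r
library(rvest)

# Placeholder URL and CSS selector -- the real page layout differs.
page <- read_html("https://www.glassdoor.com/Explore/browse-companies.htm")
industries <- html_text(html_nodes(page, ".company-industry"), trim = TRUE)

# Number of listed companies per industry, largest first.
industry_counts <- sort(table(industries), decreasing = TRUE)
barplot(industry_counts, horiz = TRUE, las = 1, cex.names = 0.6)
```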

How are salaries distributed across industries? Gathering the available salary figures for all positions in each company, and for every industry, gave this result:

[Figure: distribution of average salaries by industry in Austin]

Note that the salaries used to produce this plot are averages for each job title. The black dot associated with each industry represents the median of the corresponding average salaries. The left and right ends of each dashed whisker mark the lowest and highest salaries in that industry, excluding the outliers, which are indicated by the empty circles. The left and right edges of the rectangular boxes denote the 1st and 3rd quartiles, respectively. Two industries in this plot (Non-Profit and Insurance) do not appear in the first plot, because not enough companies were listed in those industries on the website, which of course doesn’t mean that there aren’t many such companies in Austin.
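For the record, a box-and-whisker chart with the elements described above can be produced with base R's boxplot function. This is a hypothetical sketch, with salary_data standing in for the scraped salary averages (it is not the code behind the figure):

```r
# salary_data: hypothetical data frame with one row per (company, job title)
# average salary, and columns 'avg_salary' (numeric) and 'industry' (factor).
boxplot(avg_salary ~ industry, data = salary_data,
        horizontal = TRUE, las = 1,
        xlab = "Average salary (USD)", cex.axis = 0.6)

# Median of the average salaries in each industry (the black dots in the figure).
tapply(salary_data$avg_salary, salary_data$industry, median)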

Next, I counted the total number of different job titles. Unsurprisingly, a tech-related title stood out. I include the 15 most numerous job titles below:

[Figure: the 15 most numerous job titles]

The final count of different job titles turned out to be 13,525. This number includes cases in which several variations of a job title exist. For example, alongside Software Engineer there are titles such as Software Development Engineer I, Software Development Senior Engineer, Software Applications Engineer, etc.

Salaries for some job titles are shown in this figure:

[Figure: salary distributions for selected job titles]

It’s also interesting to look at how salaries for one job title vary across different companies. Take a peek at this graph for Electrical Engineer:

[Figure: Electrical Engineer salaries across companies]

The highest salary is approximately twice the lowest one. Now take a look at the case of Account Manager:

[Figure: Account Manager salaries across companies]

The abrupt jumps in the bars are due to my selection of companies to include in the plot; since there are many more companies than can fit in one graph (over 100 of them), I picked three groups of companies from the ones available. The highest-paying company for Account Manager offers more than five times the salary of the one that pays the least.

In the two preceding plots, it’s worth bearing in mind that the same job title may involve different duties, depending on the company, so one has to be cautious about interpreting this raw information.

How do physicists and astronomers team up to write research papers?

The way in which physicists and astronomers team up to write technical papers has changed over the years. Not only is this behavior interesting in its own right, but analyzing the data may also help us understand what role, if any, the number of authors has on the scientific impact of a paper. Likewise, such an analysis can help physics and astronomy journals make decisions about their publishing policies.

I was curious about the trends in the number of authors per refereed astronomy paper, so I set out to write an R script that would read in data from the NASA Astrophysics Data System, an online database of both refereed and non-refereed academic papers in astronomy and physics. The script counts the monthly number of refereed astronomy and physics papers between January 1967 and September 2013, as well as the number of authors in each of those papers, and plots the results. It was a good exercise in manipulation of regular expressions. It is possible that the script can be made more efficient by using different R structures than the ones I used, but I wanted to look at the results first.
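The actual script is longer, but its core counting step looks roughly like the sketch below. The input format assumed here (a plain-text ADS export with one "Authors:" line per record) is an assumption for illustration, not the real export layout.

```r
# Hypothetical plain-text ADS export with one "Authors: A; B; C" line per record.
lines <- readLines("ads_export.txt")
author_lines <- grep("^Authors:", lines, value = TRUE)

# Strip the field label with a regular expression, then count the
# semicolon-separated names in each record.
authors   <- sub("^Authors:\\s*", "", author_lines)
n_authors <- sapply(strsplit(authors, ";"), length)

mean(n_authors)   # mean number of authors in this batch of records
max(n_authors)    # maximum number of authors in this batch
```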

Here’s a plot of the monthly mean and maximum number of authors:

Figure 1: Monthly mean and maximum number of authors per refereed paper.

The upward trend in both sets of data is readily apparent. The maximum is much more variable, though. The record maximum number of authors for a single paper is 3,062, set in May 2011 by a study of proton-proton collisions performed at the Large Hadron Collider. The variability of the mean number has increased in the last three years or so.

Let’s decompose both series into their constituent parts, namely trend, seasonal and random:

[Figures: decomposition of the maximum and mean author-count series into trend, seasonal and random components]

In both cases, there is very little seasonality, and the random component is more stable for the series of mean number of authors.
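For reference, the decomposition step is essentially one call to R's decompose function. In this sketch, mean_authors is a hypothetical vector holding the monthly mean author counts (the name is mine):

```r
# mean_authors: hypothetical numeric vector of monthly mean author counts,
# one value per month from January 1967 onwards.
mean_ts <- ts(mean_authors, start = c(1967, 1), frequency = 12)

# Split the series into trend, seasonal and random components and plot them.
mean_decomp <- decompose(mean_ts)
plot(mean_decomp)
```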

As it turns out, there is a nice study by Schulman et al. (1997) reporting that the annual mean number of authors per astronomy paper, across different journals, increased by a factor of 1.5 to 2 between 1975 and 1995. We see something similar here. In the same article, those authors showed that the fraction of papers written by a single author decreased steadily over the same period by a factor of about 3. To check that, I looked at the monthly fraction of papers written by N authors, with N = 1, 2, 3, 4, 5, and also N > 5. This is what I found:

[Figure: monthly fraction of papers by number of authors]

Single-author papers have decreased by a factor of almost 4 since 1967 (black line). The period covered by Schulman et al. shows a similar trend here, even though I’m also including physics articles. Papers written by two authors (red line) grew in number until the mid-1980s, and then slowly declined to their present fraction of 20%.

Perhaps the most noteworthy behavior is that of papers written by more than five authors (orange line). From making up only 1% of all papers in 1967, they are now the most numerous group, at 28%. The trend of these N > 5 publications rose abruptly in 1990, which coincides with the launch of the Hubble Space Telescope, and it hasn’t changed significantly since. By performing a Holt-Winters fit, a method that uses exponentially weighted moving averages to update estimates of the seasonally adjusted mean, slope and seasonal components, we can forecast the fraction of papers that will be written by more than five authors over the next 24 months. R provides the HoltWinters function for the fit, and the forecast package’s forecast.HoltWinters produces the predictions. The result can be seen in this figure:

[Figure: Holt-Winters forecast of the fraction of N > 5 papers]

The blue line represents the forecast of the relative number of papers through December 2014. The dark and light shaded areas are the 85% and 95% confidence regions, respectively.
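The forecasting step can be sketched as follows, assuming the monthly fractions of N > 5 papers are stored in a ts object called frac_gt5 (the name is mine):

```r
library(forecast)

# frac_gt5: hypothetical monthly ts object holding the fraction of papers
# with more than five authors.
hw_fit <- HoltWinters(frac_gt5)

# Forecast the next 24 months with 85% and 95% prediction intervals.
hw_fore <- forecast(hw_fit, h = 24, level = c(85, 95))
plot(hw_fore)
```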

Schulman et al. (1997) have pointed out possible explanations for the decrease in the number of single-author papers and increasing numbers of multi-author articles. One is the “growth of multiwavelength astrophysics… which requires astronomers to be proficient in multiple wavebands or to collaborate with experts in other wavebands”. Another possible cause is that the increasing competition for jobs and grants forces applicants to look for collaborations that will result in a longer publication list for their CV. Also, the development and availability of rapid communication through the Internet has boosted the interactions among colleagues.

It would be interesting to look at economic indicators side by side with the data shown above, to try to elucidate if there’s a connection between number of authors and economic trends.

Reference: Schulman, E., French, J. C., Powell, A. L., Eichhorn, G., Kurtz, M. J., & Murray, S. S. 1997, Publications of the Astronomical Society of the Pacific, 109, 1278

Analyzing the ups and downs of planet factories

Planets, asteroids and comets are assembled inside mysterious gaseous structures called protoplanetary disks, which form around a newborn star (see Figure 1 below). The gas in these disks is turbulent due to a combination of magnetic, chemical and kinematic effects, and this turbulence is the driving agent behind a disk’s evolution. The stronger the turbulence, the faster the disk gas spirals into the central star.

Figure 1: Artist’s impression of a protoplanetary disk. Credit: NASA

One way in which astrophysicists can understand these systems better is through numerical simulations of small disk patches. One of the simulation outputs is a very important quantity, known as “α” (alpha) in the planet-formation field, which is basically a dimensionless measure of the intensity of the turbulence. If we are able to decipher how disk turbulence varies during the lifetime of a protoplanetary disk (which can be on the order of 1 to 10 million years), we could potentially explain many fascinating features of our own solar system and of other planetary systems.

Up until now, astronomical observations of disks have provided only indirect, order-of-magnitude estimates of α. Simulations that include the relevant physics, on the other hand, allow us to glean the random variability of a disk’s α, and this provides theorists with a vital piece of information that can be used as input for their disk models.

However, thorough numerical descriptions of disks require large amounts of computational resources, particularly computing time, as well as knowledge of the complex codes that are used. What if it were possible to obtain α-values without having to carry out full 3-D simulations of disks every time someone needs them?

One way to do this is by performing a time series analysis of α-data from previously run disk simulations, and obtaining a stochastic model that can be used in subsequent disk calculations. The statistical language R contains many packages and functions ideally suited for this task, and the analysis below uses them. Excellent introductions to the use of R in time series analysis can be found in “Introductory Time Series with R”, by P. S. P. Cowpertwait and A. V. Metcalfe (Springer), and in “Time Series Analysis with Applications in R”, by J. D. Cryer and K. Chan (also published by Springer). What follows is not intended to be a rigorous or comprehensive study, and I would welcome comments and suggestions to improve these results.

Figure 2 shows a realization of α from a typical protoplanetary disk calculation, using an astrophysical-fluid-dynamics code. The unit of time is one orbit, which is one revolution of the disk patch being simulated around the central star (this can easily be converted to years, or any other desired unit). Data points are spaced by 0.5 orbits. Here we are discarding the first 70 orbits, because it takes the turbulence some time to set in after the beginning of the simulation, and we want to make sure that the data are truly capturing the turbulent state.


Figure 2: Time series of the turbulent intensity of a protoplanetary disk, as calculated from a numerical simulation.

The correlations between α-values at time t and values at times t−k, where the lag k = 1, 10 and 200 is measured in time steps, are shown in Figure 3. Values that are separated by one lag (half an orbit) are highly correlated, as evidenced by the black symbols. On the other hand, α-values that are 200 lags (100 orbits) apart show almost no correlation, as can be seen from the blue symbols. The corresponding correlation coefficients are indicated in the figure.


Figure 3: Scatter plot showing the correlation of turbulence intensity values between time t and times t-1 (black symbols), t-10 (red symbols) and t-200 (blue symbols).
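These lagged correlations and scatter plots are easy to reproduce. In the sketch below, alpha is assumed to be the numeric vector of turbulence-intensity values from the simulation (the name is mine):

```r
# alpha: hypothetical numeric vector with the turbulence-intensity values
# of Figure 2, sampled every 0.5 orbits.
lag_cor <- function(x, k) {
  n <- length(x)
  cor(x[(k + 1):n], x[1:(n - k)])   # correlation between x_t and x_(t-k)
}
sapply(c(1, 10, 200), function(k) lag_cor(alpha, k))

# Scatter plot for one of the lags, e.g. k = 1:
n <- length(alpha)
plot(alpha[1:(n - 1)], alpha[2:n],
     xlab = expression(alpha[t - 1]), ylab = expression(alpha[t]))
```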

Another way of examining correlations between pairs of data points is to plot the sample autocorrelation function (or sample ACF). A correlogram is a helpful visual tool that can be used for this purpose. Figure 4 presents the sample ACF of the data in Figure 2; for each lag k, the correlogram shows the correlation between α1 and α1+k, α2 and α2+k, α3 and α3+k, etc.

Figure 4: Sample autocorrelation function (ACF) of data from Figure 2. See text for details.

The blue horizontal lines in Figure 4 are drawn at ±2/√N, where N is the number of data points. That is, the lines represent two approximate standard errors above and below zero. The sample ACF has a sampling distribution from which we can infer whether the data come from a completely random series. If the sample ACF falls outside the lines (as it does for most lags in Figure 4), we have evidence against the null hypothesis that the autocorrelation function equals zero, at the 5% level. In other words, the correlations are statistically significant. If the sample ACF remained within the 95% confidence band, it would be reasonable to infer that no correlations exist.

The text annotation in Figure 4 reports the result of an augmented Dickey-Fuller (ADF) test on the α-data of Figure 2. I won’t go into the details of the test (an exposition can be found in the books mentioned above), but generally speaking, it can be used to quantify the certainty about whether a time series is non-stationary (strictly speaking, whether it contains a unit root, which is the null hypothesis). A non-stationary series has a non-constant mean and/or variance, and shows significant correlations in its correlogram. Additionally, in that case the Dickey-Fuller statistic will be a negative number with a relatively small absolute value. That seems to be the case for our data, and Figure 4 supports the notion that the disk turbulence intensity from the simulation is non-stationary.
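Both diagnostics amount to two function calls; adf.test lives in the tseries package, and alpha is the same assumed vector as above:

```r
library(tseries)

# Correlogram of the raw turbulence-intensity series; the dashed lines mark
# approximately +/- 2/sqrt(N) around zero.
acf(alpha, lag.max = 400)

# Augmented Dickey-Fuller test: the null hypothesis is a unit root
# (i.e. non-stationarity).
adf.test(alpha)
```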

Now, to proceed further we need to transform the original time series into a stationary one, that is, into a process which is, in a sense, in statistical equilibrium. We can achieve that by taking the natural logarithm of, and then differencing, the series. This is shown in Figure 5, where the left panel corresponds to the transformed series, and the right panel to its sample ACF.


Figure 5: Differences of the natural logarithm of turbulent intensity (left panel), and their associated ACF (right panel). The results of an ADF test are shown in the right panel.

The ACF shows a slow decay, but after some 350 lags no correlations remain. At this point, I would be tempted to say that even after this transformation (and some others that I’m not showing here) the series still remains non-stationary, given what we see in the correlogram. However, the augmented Dickey-Fuller test gives us reason to reject the null hypothesis of a unit root, or in other words, non-stationarity. Moreover, the transformed series shows a relatively stable mean compared to the original series in Figure 2, and the variance exhibits less change, except at a few times around 250 orbits. These larger changes would need to be incorporated into a more elaborate model, but for the time being I’m going with the result of the ADF test and taking the transformed series to be stationary.
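For reference, the transformation itself is a one-liner, and checking it is just a matter of re-running the same diagnostics on the new series:

```r
# Difference the natural logarithm of the series.
dlog_alpha <- diff(log(alpha))

acf(dlog_alpha, lag.max = 400)   # correlogram of the transformed series
adf.test(dlog_alpha)             # the ADF test now rejects the unit root
```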

So, knowing that it was necessary to take the differences of the logarithms of α, we can use the “arima” function in R to obtain a model of that particular realization of α. “ARIMA” stands for Auto Regressive Integrated Moving Average, which describes the structure of a time series model consisting of recursive terms (the AR part) and terms of independent and identically distributed random variables (the MA part). The I in ARIMA refers to the fact that we had to difference the series to make it stationary. Exactly how many AR and MA terms are needed, and how many times we need to take differences, lies at the heart of the time series analysis problem. In this case, by trial and error, 5 AR terms and 5 MA terms are found to most accurately describe our stochastic process.

Using arima on the natural logarithm of the α-series, and specifying that we differenced it once, R computes the coefficients of the AR and MA components and their associated standard errors:

              AR1     AR2     AR3     AR4     AR5     MA1     MA2     MA3     MA4     MA5
Coefficient   0.295  -0.058  -0.058   0.942  -0.352  -0.555   0.017  -0.019  -0.982   0.570
S.E.          0.237   0.067   0.062   0.067   0.179   0.232   0.023   0.023   0.025   0.210
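A sketch of the call that produces these numbers, again on the assumed vector alpha; the order c(5, 1, 5) encodes 5 AR terms, one difference and 5 MA terms:

```r
# Fit an ARIMA(5,1,5) model to log(alpha); the "1" in the order asks
# arima to difference the series once.
fit <- arima(log(alpha), order = c(5, 1, 5))

fit                        # coefficients and their standard errors
sqrt(diag(fit$var.coef))   # the standard errors on their own
```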

We can see that the coefficients for terms AR2, AR3, MA2 and MA3 are not significant. The resulting model is then

$$\nabla \log \alpha_t \;=\; \sum_{i=1}^{5} \phi_i \, \nabla \log \alpha_{t-i} \;+\; w_t \;+\; \sum_{j=1}^{5} \theta_j \, w_{t-j},$$

where $\nabla \log \alpha_t = \log \alpha_t - \log \alpha_{t-1}$ is the differenced logarithm, the $\phi_i$ and $\theta_j$ are the AR and MA coefficients listed in the table above, and the $w_t$’s are white noise variables. This expression could, potentially, be used in relatively simple numerical models of disks that share physical characteristics with the one that produced α in Figure 2.

It is necessary to perform some diagnostics before choosing the model as a statistical descriptor of our data. To that end, we examine the standardized residuals obtained from our model fit. If the model is adequately specified and the estimates of the parameters are reasonably close to the true values, then the residuals should behave roughly like independent, identically distributed normal variables.  If we plot their behavior over time, they should scatter approximately equally above and below zero, and not show any trends. This appears to be the case, as shown in Figure 6:

Figure 6: Standardized residuals of the ARIMA fit to our data.
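A sketch of how the standardized residuals in Figure 6 can be extracted from the fit above:

```r
# Standardize the residuals by the estimated innovation standard deviation.
std_resid <- residuals(fit) / sqrt(fit$sigma2)

plot(std_resid, type = "h", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
```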

We can assess the normality of these residuals by means of a quantile-quantile, or Q-Q, plot. This graphical tool can be used to display the quantiles of the data versus the quantiles of a normal distribution. If the data are normally distributed, they will form an approximately straight line. They are close to doing that in Figure 7, except at the edges:

Figure 7: Normal Q-Q plot of the residuals from our ARIMA model. The results of a Shapiro-Wilk test are shown. See text for details.

A Shapiro-Wilk test for normality can be performed; its statistic is based on the correlation between the residuals and the corresponding normal quantiles. A p-value of 0.69 is obtained, so we cannot reject the null hypothesis that the residuals are normally distributed.
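The Q-Q plot and the normality test are a couple of lines in base R:

```r
# Quantile-quantile plot of the standardized residuals against a normal distribution.
qqnorm(std_resid)
qqline(std_resid)

# Shapiro-Wilk normality test; a large p-value means normality is not rejected.
shapiro.test(as.numeric(std_resid))
```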

As a final diagnostic for our model in this exercise, we look at the ACF of the residuals to establish whether any autocorrelations remain:

Figure 8: ACF of residuals obtained from the ARIMA fit to the turbulence intensity data.
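This last check is again a single call:

```r
# Correlogram of the model residuals; bars inside the confidence band
# indicate no remaining autocorrelation.
acf(as.numeric(std_resid), lag.max = 400)
```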

Figure 8 shows that this is not the case, and our model could be a good candidate for use as a proxy of the turbulent activity in a protoplanetary disk, provided its physical features are similar to those of the 3-D numerical simulation.