In the Spellcaster! project, we combine observational analysis data with S2S forecast data to form the forecast inputs.
With the first scratch code I wrote, the process was very slow: for 2000 stations, it took nearly 30 minutes to complete the combination.
I suspected that the bottleneck was in the I/O, so here I try using multiprocessing in Python to speed up the I/O.
The original hotspot code:
for idx, row in sta_df.iterrows():
    sta_num=str(int(row['区站号']))
    # print(sta_num+' '+row['省份']+' '+row['站名'])
    lat_sta=conv_deg(row['纬度(度分)'][0:-1])
    lon_sta=conv_deg(row['经度(度分)'][0:-1])
    var=var1.sel(lat=lat_sta,lon=lon_sta,method='nearest')
    clim_var = var.loc['1981-01-01':'2010-12-31'].groupby("time.month").mean()
    ano_var = (var.groupby("time.month") - clim_var)
    ano_series=np.concatenate((ano_var.values,np.array((0.0,)),(fcst_var1.sel(LAT=lat_sta, LON=lon_sta, method='nearest').values,)))
    np_time=np.append(hist_time.values, np.datetime64('now'))
    np_time=np.append(np_time, fcst_time.values)
    df =pd.DataFrame(ano_series, index=np_time, columns=['prec_ano'])
    df=df.fillna(0)
    df.to_csv(blend_outdir+sta_num+'.prec.csv')
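The anomaly step inside this loop (subtracting the 1981-2010 monthly climatology from every month) can be illustrated without xarray. Below is a minimal pure-Python sketch on a synthetic monthly series; `monthly_anomaly` and the sample data are hypothetical names made up for illustration and are not part of the project code.

```python
from collections import defaultdict


def monthly_anomaly(series, base_start=1981, base_end=2010):
    """Subtract the base-period monthly climatology from a monthly series.

    `series` is a list of (year, month, value) tuples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, month, value in series:
        if base_start <= year <= base_end:
            sums[month] += value
            counts[month] += 1
    # One climatological mean per calendar month
    clim = {month: sums[month] / counts[month] for month in sums}
    return [(year, month, value - clim[month]) for year, month, value in series]


# Synthetic data: a fixed seasonal cycle, plus 1.0 in every month after 2010
series = [(y, m, float(m) + (1.0 if y > 2010 else 0.0))
          for y in range(1979, 2021) for m in range(1, 13)]
anomalies = monthly_anomaly(series)
```

The climatology does not depend on which station is being processed, so it only needs to be computed once rather than once per loop iteration.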
Rewriting this part with multiprocessing, the main function becomes:
def main():
    # number of processes in use
    ntasks=4
    # PREC/L data
    ds = xr.open_dataset(prec_arch_fn)
    var1 = ds['precip'].loc['1979-01-01':,:,:]
    hist_time= ds['time'].loc['1979-01-01':]
    #print(var1.loc['1981-01-01':'2010-12-31',:,:])
    clim_var1 = var1.loc['1981-01-01':'2010-12-31'].groupby("time.month").mean()
    ano_var1 = (var1.groupby("time.month") - clim_var1)
    #S2S data
    ds_s2s = xr.open_dataset(s2s_fcst_file)
    fcst_var1=ds_s2s['anom'][0,0,0,:,:]
    fcst_time=ds_s2s['TIME']
    np_time=np.append(hist_time.values, np.datetime64('now'))
    np_time=np.append(np_time, fcst_time.values)
    # Get in Station meta
    sta_df=get_station_df(sta_meta_file)
    print('Parent process %s.' % os.getpid())
    # start process pool
    process_pool = Pool(ntasks)
    len_df=sta_df.shape[0]
    len_per_task=len_df//ntasks
    # open tasks ID 0 to ntasks-2
    for itsk in range(ntasks-1):
        process_pool.apply_async(combine_data, args=(itsk, sta_df[itsk*len_per_task:(itsk+1)*len_per_task], ano_var1, fcst_var1, np_time, blend_outdir,))
    # open ID ntasks-1 in case of residual
    process_pool.apply_async(combine_data, args=(ntasks-1, sta_df[(ntasks-1)*len_per_task:], ano_var1, fcst_var1, np_time, blend_outdir,))
    print('Waiting for all subprocesses done...')
    process_pool.close()
    process_pool.join()
    print('All subprocesses done.')
The parallelized function:
def combine_data(itsk, sta_df, ano_var1, fcst_var1, np_time, blend_outdir):
    print('Run task %s (%s)...' % (itsk, os.getpid()))
    start = time.time()
    for idx, row in sta_df.iterrows():
        sta_num=str(int(row['区站号']))
        # print(sta_num+' '+row['省份']+' '+row['站名'])
        lat_sta=conv_deg(row['纬度(度分)'][0:-1])
        lon_sta=conv_deg(row['经度(度分)'][0:-1])
        ano_var=ano_var1.sel(lat=lat_sta,lon=lon_sta,method='nearest')
        ano_series=np.concatenate((ano_var.values,np.array((0.0,)),(fcst_var1.sel(LAT=lat_sta, LON=lon_sta, method='nearest').values,)))
        df =pd.DataFrame(ano_series, index=np_time, columns=['prec_ano'])
        df=df.fillna(0)
        df.to_csv(blend_outdir+sta_num+'.prec.csv')
    end = time.time()
    print('Task %s runs %0.2f seconds.' % (itsk, (end - start)))
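The chunking and dispatch pattern used by main can be isolated in a small standard-library sketch. Here `chunk_bounds`, `process_chunk` and `run_parallel` are hypothetical helper names, and `process_chunk` is only a stand-in for combine_data. One assumption worth stating: the sketch collects the `AsyncResult` objects and calls `get()` on them, which the code above does not do.

```python
import os
from multiprocessing import Pool


def chunk_bounds(n_rows, ntasks):
    """Return (start, stop) index pairs; the last chunk absorbs the remainder."""
    size = n_rows // ntasks
    bounds = [(i * size, (i + 1) * size) for i in range(ntasks - 1)]
    bounds.append(((ntasks - 1) * size, n_rows))
    return bounds


def process_chunk(itsk, rows):
    """Stand-in for combine_data: report the task ID, PID and rows handled."""
    return itsk, os.getpid(), len(rows)


def run_parallel(rows, ntasks=4):
    """Dispatch one chunk per task with apply_async and collect the results."""
    with Pool(ntasks) as pool:
        pending = [
            pool.apply_async(process_chunk, args=(itsk, rows[start:stop]))
            for itsk, (start, stop) in enumerate(chunk_bounds(len(rows), ntasks))
        ]
        # get() blocks until the task finishes and re-raises any worker exception
        return [result.get() for result in pending]
```

Calling `run_parallel(list(range(2003)))` from under an `if __name__ == "__main__":` guard returns one tuple per task. Since `apply_async` reports a worker exception only when `get()` is called on its result (or through an `error_callback`), a failure inside a worker would otherwise pass silently.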
Note:
NetCDF operations are “lazy”: combining xarray operations such as groupby with multiprocessing causes HDF5 I/O errors. These conflicting operations should be kept out of the parallelized function.
sel: ~1 min; 4 tasks in parallel: 10 s.
The principle for optimization:
Updated 2020-06-30
Imagine a mock planet with only two seasons: Warm and Cold. The planet is so absurd that there is NO interannual variability in the Cold season but VERY LARGE interannual variability in the Warm season. Suppose the planet experiences a constant long-term warming trend, and we use a thermometer whose instrumental bias follows a normal distribution ~N(b, sigma) to measure 100 years of surface temperature in the Cold and Warm seasons, respectively.
In which season is the measured warming trend more trustworthy? In the Warm season, the estimated trend has a lower signal-to-noise ratio, as both interannual variability and instrumental bias contribute to the uncertainty, while in the Cold season the only uncertainty is the instrumental bias ~N(b, sigma). Thus we trust the long-term trend observed in the Cold season.
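The thought experiment can be checked with a small Monte Carlo sketch. All numbers below are assumed purely for illustration (a 0.01 K/yr trend, an N(0.2, 0.3) instrument error, and interannual noise with a standard deviation of 1.0 in the Warm season); `ols_slope` and `simulate_slopes` are hypothetical helper names.

```python
import random
import statistics


def ols_slope(y):
    """Least-squares slope of y against t = 0, 1, ..., len(y) - 1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    num = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def simulate_slopes(rng, n_years, trend, bias, sigma_inst, sigma_iav, n_trials):
    """Estimated trends from n_trials synthetic records of one season."""
    slopes = []
    for _ in range(n_trials):
        # true trend + instrument error N(bias, sigma_inst) + interannual noise
        record = [trend * t + rng.gauss(bias, sigma_inst) + rng.gauss(0.0, sigma_iav)
                  for t in range(n_years)]
        slopes.append(ols_slope(record))
    return slopes


rng = random.Random(0)
# Cold season: no interannual variability; Warm season: large interannual variability
cold = simulate_slopes(rng, 100, 0.01, 0.2, 0.3, 0.0, 500)
warm = simulate_slopes(rng, 100, 0.01, 0.2, 0.3, 1.0, 500)
print("cold: mean %.4f, spread %.4f" % (statistics.mean(cold), statistics.stdev(cold)))
print("warm: mean %.4f, spread %.4f" % (statistics.mean(warm), statistics.stdev(warm)))
```

Both seasons recover the prescribed trend on average, but the spread of the Warm-season estimates is several times larger. A constant mean bias b shifts only the intercept, so in this sketch it leaves the trend estimate untouched.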
Furthermore, what if we consider improvements in instrumental measurement? In that scenario, |b| and sigma tend to 0 as a function of time. Given a constant long-term warming trend, we could naturally use the recently measured trend to calibrate the trend estimated from earlier days.
I know this thinking is quite idealized, even naive, but it inspired me: instead of simply using the annual mean, using the mean over an episode of the seasonal cycle with the smallest power in the interannual spectral band to estimate the long-term trend could give us a larger signal-to-noise ratio. The ENSO signal happens to exhibit its smallest variability in boreal spring.
This idea came from writing up the 20C work. The simple model above was built while revising the IJC manuscript and was discussed with Francis. It could be helpful for understanding the simple, idealized theory.
However, the idea might be too difficult to express in the paper, and reviewers from both Science Bulletin and IJC asked me why springtime is so special.
Science Bulletin (rejected) Reviewer Comment: Why are the results different from Vecchi et al. (2006, doi:10.1038/nature04744)? Is it because Vecchi used annual data, but the authors focused on spring? Vecchi focused more on sea level pressure and surface wind stress, which are direct measures of the overturning circulation. This paper, however, mainly focuses on precipitation and clouds, which do not show the same information. For example, precipitation would increase with warming due to increasing water vapor, regardless of the circulation strength. Increasing cloud cover would not directly suggest stronger circulation, as there is little information about the cloud types (deep vs. shallow). In addition, what makes spring so special?
IJC Reviewer Comment: The study points out the recent intensification of convection over WEP which is seasonally dependent. What lead to this result or why this seasonal dependence, which is not very clear to me? It would be more informative to the readers if the authors quantitatively completely elaborate this point further by describing these aspects clearly from the previous studies (e.g, Lee et al 2016; take any previous study showing the climatological seasonal cycle of WEP convection).
Reply:
At least two factors contribute to the special role of boreal spring in the seasonal cycle. First, during the last several decades, precipitation has strengthened most significantly in boreal spring over the WEP-MC. The positive trend of springtime rainfall is above 2.5 [mm day−1 (35yr)−1], accounting for more than 40% of the 35-year climatological MAM mean rainfall of the WEP-MC. In contrast, other seasons show only slightly increasing trends (Li et al., 2016). This fact motivates us to investigate the question over the whole 20th century. The second reason, which was added in L123–127, is that during boreal spring there is smaller observational uncertainty in the long-term climatic measurement. The following explains how this interesting mechanism works.
Using the spring data sets is a natural way to filter out the uncertainty caused by interannual variability. We assume that the bias (residual sum of squares) of the long-term trend estimate of a climatic element originates from two parts: one is the uncertainty caused by natural variability, and the other is the instrumental bias. Specifically, imagine a virtual world with only a nearly constant long-term trend and no periodic natural oscillation: as the instrumental bias decreases with time owing to improvements in technology, we can use the recent trend to calibrate the earlier, more biased measurements. Fig. R1 is a sketch of the technique: the band within the dashed green lines shows the credible range of the long-term trend.
Figure R1. A sketch to interpret the reduction in linear trend uncertainty in a virtual world without periodical natural variability
Therefore, if we can identify an episode in the seasonal cycle with minimum natural variability, we can be more confident in the long-term trend derived from the data within this episode. Next we show that boreal spring is this special episode in this study.
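The argument can be written compactly. As a sketch, assume the record in a given episode is $y_t = \alpha + \beta t + \varepsilon_t$ for years $t = 1, \dots, N$, with independent errors of constant variance made up of a natural-variability part $\sigma_{\mathrm{nat}}^2$ and an instrumental part $\sigma_{\mathrm{inst}}^2$. These symbols and the independence assumption are introduced here for illustration and are not taken from the manuscript. The standard least-squares result for the trend estimate is then:

```latex
\begin{aligned}
\hat{\beta} &= \frac{\sum_{t=1}^{N} (t-\bar{t})\,(y_t-\bar{y})}{\sum_{t=1}^{N} (t-\bar{t})^2}, \\
\operatorname{Var}(\hat{\beta}) &= \frac{\sigma_{\mathrm{nat}}^2+\sigma_{\mathrm{inst}}^2}{\sum_{t=1}^{N} (t-\bar{t})^2}
 = \frac{12\,\bigl(\sigma_{\mathrm{nat}}^2+\sigma_{\mathrm{inst}}^2\bigr)}{N\,(N^2-1)} .
\end{aligned}
```

For a fixed record length $N$, the variance of the trend estimate shrinks only when the error variance does, so an episode with small $\sigma_{\mathrm{nat}}^2$ yields a more confident trend. The sketch ignores autocorrelated errors and an instrumental bias that changes with time.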
Fig. R2 displays the mean SLP and the SLP differences between Tahiti and Darwin. As shown in the black boxes in Figs. R2a-b, springtime exhibits the lowest variation in the annual cycle. By using springtime mean values instead of simply annual mean values to calculate the long-term trend, we can naturally filter out the uncertainty contributed by natural variability. As the dominant interannual variability in the climate system is ENSO, the unique low-variability feature of boreal spring comes from the seasonality of ENSO. Webster and Yang (1992) discussed the seasonality of ENSO and indicated that spring is the ENSO developing or decaying season, when the equatorial pressure gradient is weakest.
Figure R2. Mean sea level pressure and difference of mean sea level pressure between Tahiti and Darwin (the shaded areas show the range within ±1 standard deviation), after Yang et al. (2018).
Reference
Yang S, Deng K, and Duan W. The alternative interaction of the monsoon and ENSO: the effects of annual cycle and spring predictability barrier (in Chinese). Chinese Journal of Atmospheric Sciences, 2018. 42(3), 570-589
Updated 2020-06-20
LaTeX is very useful: it takes only about 30 minutes to learn, and you benefit from it throughout your career. Here we hope to use LaTeX instead of MS Word to create documents that need regular modification, e.g. CV and bio.
\documentclass{article}
\begin{document}
First document. This is a simple example, with no
extra parameters or packages included.
\end{document}
The first line of code declares the type of document, known as the class. The class controls the overall appearance of the document. Different types of documents require different classes, e.g. a CV/resume requires a different class than a scientific paper.
Everything in your .tex file before the \begin{document} point is called the preamble. In the preamble you define the type of document you are writing, the language you are writing in, the packages you would like to use (more on this later) and several other elements.
\documentclass[12pt, letterpaper]{article}
\usepackage[utf8]{inputenc}
Other possible values for the paper size are a4paper and legalpaper. For the document encoding, utf8 is recommended.
\documentclass{article}
\usepackage[utf8]{inputenc}
\title{Sections and Chapters}
\author{Gubert Farnsworth}
\date{ }
\begin{document}
\maketitle
\tableofcontents
\section{Introduction}
This is the first section.
Lorem ipsum dolor sit amet, consectetuer adipiscing
elit. Etiam lobortis facilisis sem. Nullam nec mi et
neque pharetra sollicitudin. Praesent imperdiet mi nec ante.
Donec ullamcorper, felis non sodales...
\addcontentsline{toc}{section}{Unnumbered Section}
\section*{Unnumbered Section}
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
Etiam lobortis facilisis sem. Nullam nec mi et neque pharetra
sollicitudin. Praesent imperdiet mi nec ante...
\section{Second Section}
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
Etiam lobortis facilisis sem. Nullam nec mi et neque pharetra
sollicitudin. Praesent imperdiet mi nec ante...
\end{document}
Updated 2020-06-20