The newly updated SCREAM repo removed the README describing how to port the code, but the older version can still be fetched by tracing back through the commit history.
The first problem comes from building Kokkos:
CMake Error at cmake/kokkos_functions.cmake:64 (MESSAGE):
Matching option found for Kokkos_ENABLE_SERIAL with the wrong case
KOKKOS_ENABLE_SERIAL. Please delete your CMakeCache.txt and change option
to -DKokkos_ENABLE_SERIAL=ON. This is now enforced to avoid hard-to-debug
CMake cache inconsistencies.
This error essentially asks you to use the camel-case spelling of the flags (Kokkos_ENABLE_*) on the command line. Just follow what it suggests.
The command changes to:
cmake \
-D CMAKE_INSTALL_PREFIX=${RUN_ROOT_DIR}/kokkos/install \
-D CMAKE_BUILD_TYPE=Debug \
-DKokkos_ENABLE_DEBUG=ON \
-DKokkos_ENABLE_AGGRESSIVE_VECTORIZATION=OFF \
-DKokkos_ENABLE_SERIAL=ON \
-DKokkos_ENABLE_OPENMP=ON \
-DKokkos_ENABLE_PROFILING=OFF \
-DKokkos_ENABLE_DEPRECATED_CODE=OFF \
-DKokkos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=OFF \
${KOKKOS_SRC_LOC}
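After the configuration succeeds, the usual build-and-install step follows (the job count here is just an example):
make -j8 install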
With this, Kokkos installs successfully. When building SCREAM, however, several errors appear:
CMake Error: File /users/b145872/project-dir/app/scream/components/scream/../cam/src/physics/rrtmgp/external/rrtmgp/data/rrtmgp-data-lw-g224-2018-12-04.nc does not exist.
CMake Error: File /users/b145872/project-dir/app/scream/components/scream/../cam/src/physics/rrtmgp/external/rrtmgp/data/rrtmgp-data-sw-g224-2018-12-04.nc does not exist.
These files can easily be downloaded with a quick search, but here I noticed something odd: some modules seem to be missing from the scream folder. They are supposed to live in the externals folder, whose structure is visible on GitHub, yet it was never cloned to the local path.
That is interesting! It turns out these folders on GitHub actually point to other repos: they are [git submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules)!
The right way to clone the repo with submodules:
git clone --recurse-submodules
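If the repo was already cloned without submodules, they can also be fetched in place:
git submodule update --init --recursive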
Next comes an MPI issue (cannot find mpi.h). Add the include paths to .bashrc:
export CPLUS_INCLUDE_PATH=$INCLUDE
export C_INCLUDE_PATH=$INCLUDE
New error:
/users/b145872/project-dir/app/scream/components/scream/ekat/src/ekat/util/scream_arch.cpp(34): error: class "Kokkos::Serial" has no member "impl_is_initialized"
ss << "ExecSpace initialized: " << (DefaultDevice::execution_space::impl_is_initialized() ? "yes" : "no") << "\n";
We then found that the default Kokkos settings are not what we want; Serial is not acceptable.
-- Final kokkos settings variable:
-- env;KOKKOS_CMAKE=yes;KOKKOS_SRC_PATH=/users/b145872/project-dir/app/scream/externals/kokkos;KOKKOS_PATH=/users/b145872/project-dir/app/scream/externals/kokkos;KOKKOS_INSTALL_PATH=/users/b145872/project-dir/app/scream_run/scream_test01/kokkos/install;KOKKOS_ARCH=None;KOKKOS_DEVICES=Serial;KOKKOS_DEBUG=no;KOKKOS_OPTIONS=disable_dualview_modify_check;KOKKOS_USE_TPLS=librt
We then re-source ~/.bashrc. The configure step now picks up the MPI settings, with OpenMP as the parallel backend. A new error appears when configuring SCREAM:
CMake Error at /users/b145872/project-dir/app/scream/externals/kokkos/cmake/kokkos_functions.cmake:49 (MESSAGE):
Matching option found for Kokkos_ENABLE_DEBUG with the wrong case
Kokkos_ENABLE_Debug. Please delete your CMakeCache.txt and change option
to -DKokkos_ENABLE_DEBUG=FALSE. This is now enforced to avoid
hard-to-debug CMake cache inconsistencies.
This is weird: when configuring Kokkos we passed exactly "-DKokkos_ENABLE_DEBUG=ON", and when configuring SCREAM there is no such option at all. Using grep, we found several occurrences of "Debug" in components/scream/ekat/cmake/Kokkos.cmake; after changing them to "DEBUG", we pass this point…
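For reference, the offending lines can be located with a plain grep on the file named in the error message:
grep -n "Debug" components/scream/ekat/cmake/Kokkos.cmake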
New error, at 26% of the build:
/users/b145872/project-dir/app/scream/components/scream/ekat/src/ekat/scream_kokkos_meta.hpp(18): error: class "Kokkos::MemoryTraits<0U>" has no member "RandomAccess"
value = ((View::traits::memory_traits::RandomAccess ? Kokkos::RandomAccess : 0) |
According to the Kokkos programming guide, RandomAccess is a memory trait tied to CUDA. Does that mean we also have to build CUDA for SCREAM? If so, SCREAM would be required to run on GPU nodes. As a simple test, let us first see what happens if we build Kokkos in Serial mode.
Serial mode shows similar errors. Moreover, it seems all memory-access traits, including "Atomic", are missing, so this may not simply be a "need CUDA" issue. Interesting.
When I used the latest Kokkos from GitHub, different errors occurred when trying "Serial" and "OpenMP". When only "Serial" is turned on, MPI errors occur while compiling SCREAM.
On Jul 20, I found a new version of the master branch; following the new instructions in build.md, the test finished successfully.
However, when I turn on CUDA (version 10.1), there are still problems.
Here is a simple example of the fork-and-pull-request workflow on GitHub.
We first fork the community repo, then clone it to our local machine. There, we create our own branch:
git checkout -b test-brch
After modification:
git add .
git commit -m "test-brch"
git push origin test-brch
This pushes the revised branch to our remote GitHub fork. Go to that repo, switch to the branch, and submit a pull request to the original repo; the community admins will then see our pull request.
If we hope to merge the branch locally:
git checkout master    # switch to the master branch
git merge test-brch
Now it is safe to delete the test-brch branch.
git branch -d test-brch
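To also remove the branch from the remote fork:
git push origin --delete test-brch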
Updated 2020-07-10
In the Spellcaster! project, we combined observational analysis data and S2S forecast data to form the forecast inputs.
With the initial scratch code I wrote, the process was very slow: for 2000 stations, it took nearly 30 minutes to complete the combination.
I speculated that the bottleneck was in the IO, so here I try Python's multiprocessing to increase the IO throughput.
The original hotspot code:
for idx, row in sta_df.iterrows():
    sta_num = str(int(row['区站号']))           # station ID
    # print(sta_num+' '+row['省份']+' '+row['站名'])  # province, station name
    lat_sta = conv_deg(row['纬度(度分)'][0:-1])  # latitude in degree-minute form
    lon_sta = conv_deg(row['经度(度分)'][0:-1])  # longitude in degree-minute form
    var = var1.sel(lat=lat_sta, lon=lon_sta, method='nearest')
    # climatology and anomaly are recomputed for every station -- the hotspot
    clim_var = var.loc['1981-01-01':'2010-12-31'].groupby("time.month").mean()
    ano_var = (var.groupby("time.month") - clim_var)
    ano_series = np.concatenate((ano_var.values, np.array((0.0,)), (fcst_var1.sel(LAT=lat_sta, LON=lon_sta, method='nearest').values,)))
    np_time = np.append(hist_time.values, np.datetime64('now'))
    np_time = np.append(np_time, fcst_time.values)
    df = pd.DataFrame(ano_series, index=np_time, columns=['prec_ano'])
    df = df.fillna(0)
    df.to_csv(blend_outdir + sta_num + '.prec.csv')
Rewriting this part with multiprocessing, the main function (with the imports the script needs) becomes:
import os
import time
from multiprocessing import Pool

import numpy as np
import pandas as pd
import xarray as xr

def main():
    # number of processes in use
    ntasks = 4
    # PREC/L data
    ds = xr.open_dataset(prec_arch_fn)
    var1 = ds['precip'].loc['1979-01-01':, :, :]
    hist_time = ds['time'].loc['1979-01-01':]
    #print(var1.loc['1981-01-01':'2010-12-31',:,:])
    clim_var1 = var1.loc['1981-01-01':'2010-12-31'].groupby("time.month").mean()
    ano_var1 = (var1.groupby("time.month") - clim_var1)
    # S2S data
    ds_s2s = xr.open_dataset(s2s_fcst_file)
    fcst_var1 = ds_s2s['anom'][0, 0, 0, :, :]
    fcst_time = ds_s2s['TIME']
    np_time = np.append(hist_time.values, np.datetime64('now'))
    np_time = np.append(np_time, fcst_time.values)
    # read in station metadata
    sta_df = get_station_df(sta_meta_file)
    print('Parent process %s.' % os.getpid())
    # start process pool
    process_pool = Pool(ntasks)
    len_df = sta_df.shape[0]
    len_per_task = len_df // ntasks
    # dispatch tasks ID 0 to ntasks-2, each with an equal-sized slice of stations
    for itsk in range(ntasks-1):
        process_pool.apply_async(combine_data, args=(itsk, sta_df[itsk*len_per_task:(itsk+1)*len_per_task], ano_var1, fcst_var1, np_time, blend_outdir,))
    # task ID ntasks-1 also takes the residual stations
    process_pool.apply_async(combine_data, args=(ntasks-1, sta_df[(ntasks-1)*len_per_task:], ano_var1, fcst_var1, np_time, blend_outdir,))
    print('Waiting for all subprocesses done...')
    process_pool.close()
    process_pool.join()
    print('All subprocesses done.')
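One detail not shown above: the script still needs an entry point, and on platforms that spawn new interpreters instead of forking, the __main__ guard is required so that child processes do not re-run the pool setup:
if __name__ == '__main__':
    main()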
The parallelized function:
def combine_data(itsk, sta_df, ano_var1, fcst_var1, np_time, blend_outdir):
    print('Run task %s (%s)...' % (itsk, os.getpid()))
    start = time.time()
    for idx, row in sta_df.iterrows():
        sta_num = str(int(row['区站号']))           # station ID
        # print(sta_num+' '+row['省份']+' '+row['站名'])  # province, station name
        lat_sta = conv_deg(row['纬度(度分)'][0:-1])  # latitude in degree-minute form
        lon_sta = conv_deg(row['经度(度分)'][0:-1])  # longitude in degree-minute form
        ano_var = ano_var1.sel(lat=lat_sta, lon=lon_sta, method='nearest')
        ano_series = np.concatenate((ano_var.values, np.array((0.0,)), (fcst_var1.sel(LAT=lat_sta, LON=lon_sta, method='nearest').values,)))
        df = pd.DataFrame(ano_series, index=np_time, columns=['prec_ano'])
        df = df.fillna(0)
        df.to_csv(blend_outdir + sta_num + '.prec.csv')
    end = time.time()
    print('Task %s runs %0.2f seconds.' % (itsk, (end - start)))
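As a side note, the manual slicing in main() could be replaced with numpy's array_split, which produces near-equal chunks and handles the residual automatically; a minimal sketch reusing the same names:
for itsk, chunk in enumerate(np.array_split(sta_df, ntasks)):
    process_pool.apply_async(combine_data, args=(itsk, chunk, ano_var1, fcst_var1, np_time, blend_outdir,))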
Note:
- NetCDF operations are "lazy": combining lazy xarray operations like groupby with multiprocessing causes HDF5 IO errors, so these conflicting operations must be kept out of the parallelized function.
- Timing for the per-station sel loop: serial, ~1 min; 4 parallel tasks, ~10 s.
- The principle for optimization: perform the lazy, whole-dataset computations (opening the files, the climatology/anomaly groupby) once in the parent process, and parallelize only the independent per-station sel and CSV writes.
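A minimal sketch of that principle, reusing the names above (an illustration, not the exact production code): force the lazy xarray computation into memory in the parent process before dispatching, so the workers never touch the underlying HDF5 file handles.
ds = xr.open_dataset(prec_arch_fn)   # lazy open, parent process only
var1 = ds['precip'].loc['1979-01-01':, :, :]
clim_var1 = var1.loc['1981-01-01':'2010-12-31'].groupby("time.month").mean()
ano_var1 = (var1.groupby("time.month") - clim_var1).load()   # eager read happens here
ds.close()   # the pool workers then receive plain in-memory arrays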
Updated 2020-06-30