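The F05 CAM runs I tested earlier on Tianhe-2 (and on the university's supercomputing cluster) never had any problems. Recently I reran CAM5 F05 on Tianhe-2 and hit an out-of-memory error. I reconfigured with CAM4 and got the same error: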
Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
PMPI_Alltoallv(540)....................: MPI_Alltoallv(sbuf=0x806f43e0, scnts=0xe6c75e0, sdispls=0xe764390, dtype=0x4c000829, rbuf=0x805f7650, rcnts=0xe765ba0, rdispls=0xe7673b0, dtype=0x4c000829, comm=0x84000006) failed
MPIR_Alltoallv_impl(378)...............:
MPIR_Alltoallv(341)....................:
MPIR_Alltoallv_intra(192)..............:
MPIC_Waitall(720)......................:
MPIR_Waitall_impl(163).................:
MPIDI_CH3I_Progress(363)...............:
MPID_nem_mpich_blocking_recv(906)......:
MPID_nem_glex_poll(234)................:
MPIDI_nem_glex_ER_progress_PUT(74).....:
MPIDI_nem_glex_ER_recv_progress(378)...:
MPID_nem_handle_pkt(635)...............:
MPIDI_CH3_PktHandler_EagerSend(640)....: failure occurred while posting a receive for message data (MPIDI_CH3_PKT_EAGER_SEND)
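Reconfiguring the PE layout, adding more nodes, and adjusting MPIBUFFER all made no difference. Shelved for now. There is a related thread on the forum: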
https://bb.cgd.ucar.edu/node/1001941
In that thread, eaton notes that checking for a memory leak "can be a very challenging exercise."
#Up to 20141108#
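After Junwen upgraded the model to 1.2.2 across the board before the winter break, this problem seems to have gone away. I have quit tinkering with high resolution anyway; reading papers and learning English matters more. Hehe.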
#Up to 20150519#
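When extracting variables from the raw model output, writing 50 years of global (144x96) ZMDT with the default output method makes the program hang for a long time, and every now and then it ends with a "write failed" error. After looking into it, I found that the default method is very inefficient (though I don't know why). The efficient way is to define the dimension names, attributes, and so on entirely by hand. In my tests this really is much faster and the problem above does not occur, but the code grows by roughly thirty to forty lines. An example: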
;************************************************
; Write the file
;************************************************

; Get the dimension sizes of one time slice (lev_p, lat, lon)
dims = dimsizes(ctrl_var1p(0,:,:,:))
nlvl = dims(0)
nlat = dims(1)
nlon = dims(2)

; Remove any old output file (-f: no error if it does not exist), then create a new one
system("rm -f " + pdata_fn)
fout = addfile(pdata_fn, "c")           ; open output netCDF file
setfileoption(fout, "DefineMode", True) ; enter define mode before declaring anything

; Set global file attributes
fileAtt = True
fileAtt@creation_date = systemfunc("date")
fileattdef(fout, fileAtt)

; Define the dimensions; time is unlimited (size -1)
dimNames = (/"time", "lev_p", "lat", "lon"/)
dimSizes = (/-1, nlvl, nlat, nlon/)
dimUnlim = (/True, False, False, False/)
filedimdef(fout, dimNames, dimSizes, dimUnlim)

; Define each variable with its type and dimensions
filevardef(fout, "time",  typeof(ctrl_var1p&time),  getvardims(ctrl_var1p&time))
filevardef(fout, "lev_p", typeof(ctrl_var1p&lev_p), getvardims(ctrl_var1p&lev_p))
filevardef(fout, "lat",   typeof(ctrl_var1p&lat),   getvardims(ctrl_var1p&lat))
filevardef(fout, "lon",   typeof(ctrl_var1p&lon),   getvardims(ctrl_var1p&lon))
filevardef(fout, "ZMDT",  typeof(ctrl_var1p),       getvardims(ctrl_var1p))

; Copy the attributes of ctrl_var1p onto ZMDT
filevarattdef(fout, "ZMDT", ctrl_var1p)

; Write values only; (/ ... /) strips metadata so nothing is redefined
fout->time  = (/ctrl_var1p&time/)
fout->lev_p = (/ctrl_var1p&lev_p/)
fout->lat   = (/ctrl_var1p&lat/)
fout->lon   = (/ctrl_var1p&lon/)
fout->ZMDT  = (/ctrl_var1p/)
#Up to 20141108#
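With this output approach, take care to write the time dimension and the coordinate dimensions to the nc file as separate variables. The coordinate variables should also carry degree units that define their direction (for example degrees_north and degrees_east); otherwise they have to be reset when plotting.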
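As a rough illustration of the point about direction units, here is a minimal standalone sketch that writes lat and lon as their own variables and attaches the units in define mode. The file name units_demo.nc and the generated 96x144 grid are assumptions made up for this example, not part of the script above. In the real script the coordinates would come from ctrl_var1p&lat and ctrl_var1p&lon.

```ncl
; Sketch only: file name and grid values are illustrative assumptions.
begin
  nlat = 96
  nlon = 144

  ; Build coordinate arrays and give them direction-defining units
  lat = fspan(-90., 90., nlat)
  lat!0 = "lat"
  lat@units = "degrees_north"
  lat@long_name = "latitude"

  lon = fspan(0., 357.5, nlon)
  lon!0 = "lon"
  lon@units = "degrees_east"
  lon@long_name = "longitude"

  ; Create the file and predefine its contents
  fn = "units_demo.nc"
  system("rm -f " + fn)
  fout = addfile(fn, "c")
  setfileoption(fout, "DefineMode", True)

  filedimdef(fout, (/"lat", "lon"/), (/nlat, nlon/), (/False, False/))
  filevardef(fout, "lat", typeof(lat), getvardims(lat))
  filevardef(fout, "lon", typeof(lon), getvardims(lon))

  ; Copy units and long_name onto the file variables
  filevarattdef(fout, "lat", lat)
  filevarattdef(fout, "lon", lon)

  ; Write values only; the attributes were already defined above
  fout->lat = (/lat/)
  fout->lon = (/lon/)
end
```

Because the values are written with (/ ... /), which strips metadata, the units only reach the file if they are defined explicitly with filevarattdef, as done here.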
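Also, this method is fairly efficient most of the time, but occasionally it stalls for far longer than usual, which tends to happen when the data volume is large.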
#Up to 20141109#
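An email arrived telling me to keep reworking things, and the endless whispering next to me has started again. Irritated and stifled, I couldn't take it any more and put on headphones to listen to Yanni. "A Walk in the Rain" is a track I hadn't noticed before, and it is quite good. HP recommended it, saying it feels like walking on grass after the rain. Yes, warm and sunny.
With one order from SARFT (the broadcasting regulator), MP3s can no longer be downloaded, so here is a link instead: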
http://www.baidu.com/s?wd=a%20walk%20in%20the%20rain&ie=utf-8&f=8&rsv_bp=1&tn=monline_5_dg&rsv_pq=f40a682a0000744f&rsv_t=782cZVhHu0LhKuwkr029%2Bsc1%2B%2BBZDvR19ANVw8dNJrzqLk7nxSrt&bs=a%20walk%20in%20%E5%A4%A9%E6%B2%B3