Here we show our first “hello world” program with TensorFlow on a CHPC GPU node. The script:
import tensorflow as tf
import numpy as np
# use mnist data
mnist = tf.keras.datasets.mnist
print('mnist.load_data')
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# normalize data
x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)
# sequential network
model = tf.keras.models.Sequential()
# input layer: flatten the 28x28 MNIST images into 784-element vectors
model.add(tf.keras.layers.Flatten())
# hidden layers
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
# output layer
model.add(tf.keras.layers.Dense(10, activation=tf.nn.softmax))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
print('model.fit')
model.fit(x_train, y_train, epochs=3)
val_loss, val_acc = model.evaluate(x_test, y_test)
print(val_loss, val_acc)
model.save('epic_num_reader.model')
new_model = tf.keras.models.load_model('epic_num_reader.model')
predictions = new_model.predict(np.array(x_test))
print(np.argmax(predictions[0]))
The execution log (hardware details omitted):
Epoch 1/3
2020-05-08 22:14:25.772225: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
60000/60000 [==============================] - 6s 93us/sample - loss: 0.2649 - acc: 0.9225
Epoch 2/3
60000/60000 [==============================] - 5s 85us/sample - loss: 0.1056 - acc: 0.9682
Epoch 3/3
60000/60000 [==============================] - 5s 86us/sample - loss: 0.0721 - acc: 0.9769
10000/10000 [==============================] - 1s 59us/sample - loss: 0.0908 - acc: 0.9721
0.09084201904330402 0.9721
7
The predicted handwritten digit is “7”.
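To sanity-check the prediction, we can display the corresponding test image (a minimal sketch, assuming matplotlib is available in the environment):
import matplotlib.pyplot as plt
# show the first test digit, which the model classifies as 7
plt.imshow(x_test[0], cmap='binary')
plt.show()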
In fact, we found that the GPU version of TensorFlow is slower than the CPU version for this case. This is likely a scalability issue: the network is too small to benefit from the GPU's parallelism, so data-transfer and kernel-launch overhead dominate the runtime.
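To verify which device is actually used, and to time the CPU path for comparison, we can check GPU visibility and optionally hide the GPU (a sketch using the TF 1.x API; TF 2.x would use tf.config.list_physical_devices('GPU')):
import os
# uncomment to hide the GPU and force the CPU path
# (must be set before TensorFlow initializes CUDA)
# os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
import tensorflow as tf
# TF 1.x: True if a GPU device is available
print(tf.test.is_gpu_available())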
Updated 2020-05-08
We start to use Python to process the WRF output. xarray is an essential package for working with NetCDF and HDF5 data.
WRF files are not exactly CF-compliant: you need a special parser for the timestamps, the coordinate names are a bit exotic and do not correspond to the dimension names, the files contain so-called staggered variables (and their corresponding coordinates), etc.
salem is needed to parse WRF data; it makes slicing WRF data in xarray straightforward:
ds = ds.sel(time=slice('2018-09-15', '2018-09-17'))
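A minimal end-to-end sketch of this workflow (the wrfout filename and the choice of the T2 variable are illustrative, not from a real run):
import salem
# open_wrf_dataset parses the WRF timestamps and fixes the coordinate
# names, so the result behaves like a normal xarray Dataset
ds = salem.open_wrf_dataset('wrfout_d01_2018-09-15_00:00:00')
# slice by time, just like any CF-compliant dataset
ds = ds.sel(time=slice('2018-09-15', '2018-09-17'))
# e.g. the time mean of 2-m temperature
t2_mean = ds['T2'].mean(dim='time')
print(t2_mean)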
Updated 2020-05-03
Please scroll to the FINAL for the ultimate solution!
Follow the TensorFlow Object Detection API procedure to install the libraries and dependencies. The last step tests the installation:
Running tests under Python 3.6.10: /users/b145872/project-dir/Anaconda3_2020/envs/tf1.14/bin/python
[ RUN ] ModelBuilderTest.test_create_experimental_model
[ OK ] ModelBuilderTest.test_create_experimental_model
[ RUN ] ModelBuilderTest.test_create_faster_rcnn_model_from_config_with_example_miner
[ OK ] ModelBuilderTest.test_create_faster_rcnn_model_from_config_with_example_miner
[ RUN ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_faster_rcnn_with_matmul
[ OK ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_faster_rcnn_with_matmul
[ RUN ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_faster_rcnn_without_matmul
[ OK ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_faster_rcnn_without_matmul
[ RUN ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_mask_rcnn_with_matmul
[ OK ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_mask_rcnn_with_matmul
[ RUN ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_mask_rcnn_without_matmul
[ OK ] ModelBuilderTest.test_create_faster_rcnn_models_from_config_mask_rcnn_without_matmul
[ RUN ] ModelBuilderTest.test_create_rfcn_model_from_config
[ OK ] ModelBuilderTest.test_create_rfcn_model_from_config
[ RUN ] ModelBuilderTest.test_create_ssd_fpn_model_from_config
[ OK ] ModelBuilderTest.test_create_ssd_fpn_model_from_config
[ RUN ] ModelBuilderTest.test_create_ssd_models_from_config
[ OK ] ModelBuilderTest.test_create_ssd_models_from_config
[ RUN ] ModelBuilderTest.test_invalid_faster_rcnn_batchnorm_update
[ OK ] ModelBuilderTest.test_invalid_faster_rcnn_batchnorm_update
[ RUN ] ModelBuilderTest.test_invalid_first_stage_nms_iou_threshold
[ OK ] ModelBuilderTest.test_invalid_first_stage_nms_iou_threshold
[ RUN ] ModelBuilderTest.test_invalid_model_config_proto
[ OK ] ModelBuilderTest.test_invalid_model_config_proto
[ RUN ] ModelBuilderTest.test_invalid_second_stage_batch_size
[ OK ] ModelBuilderTest.test_invalid_second_stage_batch_size
[ RUN ] ModelBuilderTest.test_session
[ SKIPPED ] ModelBuilderTest.test_session
[ RUN ] ModelBuilderTest.test_unknown_faster_rcnn_feature_extractor
[ OK ] ModelBuilderTest.test_unknown_faster_rcnn_feature_extractor
[ RUN ] ModelBuilderTest.test_unknown_meta_architecture
[ OK ] ModelBuilderTest.test_unknown_meta_architecture
[ RUN ] ModelBuilderTest.test_unknown_ssd_feature_extractor
[ OK ] ModelBuilderTest.test_unknown_ssd_feature_extractor
----------------------------------------------------------------------
Ran 17 tests in 0.427s
Now we test the object detection script. Since the model needs to run on the GPU cluster, we cannot simply use a Jupyter notebook, so we convert it to plain Python code:
jupyter nbconvert --to python object_detection_tutorial.ipynb
Executing the Python code gives:
ModuleNotFoundError: No module named 'object_detection'
Compile protobufs and install the object_detection package:
cd models/research/
protoc object_detection/protos/*.proto --python_out=.
pip install .
Executing again gives:
tensorflow.python.framework.errors_impl.NotFoundError: models/research/object_detection/data/mscoco_label_map.pbtxt; No such file or directory
This is a relative-path problem caused by converting the notebook and running it from a different directory; running the script from the directory the notebook assumed (or making the paths absolute) fixes it.
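One way to make the script location-independent (a hypothetical fix; PATH_TO_LABELS mirrors the tutorial's label-map variable, and anchoring to __file__ is our assumption):
import os
# anchor the label-map path to this script's own location
# instead of the current working directory
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
PATH_TO_LABELS = os.path.join(SCRIPT_DIR, 'data', 'mscoco_label_map.pbtxt')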
Executing again gives:
File "object_detection_tutorial.py", line 98, in <module>
detection_model = load_model(model_name)
File "object_detection_tutorial.py", line 62, in load_model
model = tf.saved_model.load(str(model_dir), None)
File "/users/b145872/project-dir/Anaconda3_2020/envs/tf1.14/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
TypeError: load() missing 1 required positional argument: 'export_dir'
This is a TF1/TF2 API difference: in TF 1.x, tf.saved_model.load is the session-based v1 loader, while the 2.x-style loader is exposed as tf.saved_model.load_v2. Change the call:
model = tf.saved_model.load(export_dir=str(model_dir))
# change to:
model = tf.saved_model.load_v2(export_dir=str(model_dir), tags=None)
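For reference, under TF 2.x the 2.x-style loader is the default, so the path alone should suffice (a sketch; the model_dir value is a placeholder):
import tensorflow as tf  # TF 2.x
model_dir = 'path/to/saved_model'  # placeholder for the downloaded model directory
# in TF 2.x, tf.saved_model.load is already the 2.x-style loader
model = tf.saved_model.load(model_dir)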
New error:
Traceback (most recent call last):
  File "object_detection_tutorial.py", line 191, in <module>
    show_inference(detection_model, image_path)
  File "object_detection_tutorial.py", line 172, in show_inference
    output_dict = run_inference_for_single_image(model, image_np)
  File "object_detection_tutorial.py", line 141, in run_inference_for_single_image
    num_detections = int(output_dict.pop('num_detections'))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Tensor'
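The likely cause: under TF 1.x graph mode the model outputs are symbolic Tensors with no value attached, so int() cannot convert them, while under TF 2.x eager execution (or after evaluating the tensors in a session) the conversion works. A toy sketch of the difference (this output_dict is a stand-in, not the tutorial's real output):
import tensorflow as tf
# toy stand-in for the model's output dictionary
output_dict = {'num_detections': tf.constant(3.0)}
# TF 2.x (eager): EagerTensors convert to Python numbers directly
num_detections = int(output_dict.pop('num_detections'))
print(num_detections)  # 3
# TF 1.x (graph mode) would instead require a session, e.g.:
# with tf.Session() as sess:
#     num_detections = int(sess.run(output_dict.pop('num_detections')))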
…
After multiple tests, we still could not run the script smoothly on the GPU; TF 1.9, 1.14, and 2.1 all failed. When I came back to the GitHub page, I found the updated ipynb.
…
And this time, the TF 2.1-based env can run it with the GPU! There were still some errors at first, but after reinstalling TF 2.1 with conda install --force-reinstall, everything works nicely!
Updated 2020-05-10