
Potential memory leak on reloading model #72

@jriegner

Hey guys,

We use tfgo and notice an increase in memory usage each time our model is reloaded. We run a service that periodically checks whether the model has been updated and reloads it if so. We wouldn't expect memory usage to grow, since the model in memory should be replaced by the updated one.

The code to load the model is:

	// load the model into memory
	model := tg.LoadModel(
		"path/to/our/model",
		[]string{"serve"},
		&tf.SessionOptions{},
	)

But our monitoring shows that memory usage goes up every time the model is reloaded (once per hour). I profiled the service with pprof and none of the components in our Go code showed significantly growing memory usage.

Furthermore, I built TensorFlow 2.9.1 with debug symbols and wrote a small Go app that does nothing but reload the model, so I could check for memory leaks with memleak-bpfcc from https://github.com/iovisor/bcc. That gave me the following stack trace, which I believe shows memory being leaked:

	1770048 bytes in 9219 allocations from stack
		operator new(unsigned long)+0x19 [libstdc++.so.6.0.28]
		google::protobuf::internal::GenericTypeHandler<tensorflow::NodeDef>::New(google::protobuf::Arena*)+0x1c [libtensorflow_framework.so.2]
		google::protobuf::internal::GenericTypeHandler<tensorflow::NodeDef>::NewFromPrototype(tensorflow::NodeDef const*, google::protobuf::Arena*)+0x20 [libtensorflow_framework.so.2]
		google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type* google::protobuf::internal::RepeatedPtrFieldBase::Add<google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler>(google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type*)+0xc2 [libtensorflow_framework.so.2]
		google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::Add()+0x21 [libtensorflow_framework.so.2]
		tensorflow::FunctionDef::add_node_def()+0x20 [libtensorflow_framework.so.2]
		tensorflow::FunctionDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x334 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::FunctionDef>(google::protobuf::io::CodedInputStream*, tensorflow::FunctionDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::FunctionDefLibrary::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x240 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::FunctionDefLibrary>(google::protobuf::io::CodedInputStream*, tensorflow::FunctionDefLibrary*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::GraphDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x291 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::GraphDef>(google::protobuf::io::CodedInputStream*, tensorflow::GraphDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::MetaGraphDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x325 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::MetaGraphDef>(google::protobuf::io::CodedInputStream*, tensorflow::MetaGraphDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::SavedModel::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x25b [libtensorflow_framework.so.2]
		google::protobuf::MessageLite::MergeFromCodedStream(google::protobuf::io::CodedInputStream*)+0x32 [libtensorflow_framework.so.2]
		google::protobuf::MessageLite::ParseFromCodedStream(google::protobuf::io::CodedInputStream*)+0x3e [libtensorflow_framework.so.2]
		tensorflow::ReadBinaryProto(tensorflow::Env*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::MessageLite*)+0x141 [libtensorflow_framework.so.2]
		tensorflow::(anonymous namespace)::ReadSavedModel(absl::lts_20211102::string_view, tensorflow::SavedModel*)+0x136 [libtensorflow_framework.so.2]
		tensorflow::ReadMetaGraphDefFromSavedModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::MetaGraphDef*)+0x5d [libtensorflow_framework.so.2]
		tensorflow::LoadSavedModelInternal(tensorflow::SessionOptions const&, tensorflow::RunOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::SavedModelBundle*)+0x41 [libtensorflow_framework.so.2]
		tensorflow::LoadSavedModel(tensorflow::SessionOptions const&, tensorflow::RunOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::SavedModelBundle*)+0xc0 [libtensorflow_framework.so.2]
		TF_LoadSessionFromSavedModel+0x2a8 [libtensorflow.so]
		_cgo_6ae2e7a71f9a_Cfunc_TF_LoadSessionFromSavedModel+0x6e [testapp]
		runtime.asmcgocall.abi0+0x64 [testapp]
		github.com/galeone/tensorflow/tensorflow/go._Cfunc_TF_LoadSessionFromSavedModel.abi0+0x4d [testapp]
		github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel.func2+0x14f [testapp]
		github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel+0x2b6 [testapp]
		github.com/galeone/tfgo.LoadModel+0x6d [testapp]
		main.reloadModel+0x276 [testapp]
		main.main+0x72 [testapp]
		runtime.main+0x212 [testapp]
		runtime.goexit.abi0+0x1 [testapp]


As you can see, this stack trace shows calls through tfgo into the underlying TensorFlow library. I'm not sure I'm reading it correctly, but it looks like the leak is in tfgo or in TensorFlow itself.

Is there a way to explicitly release the memory of a loaded model when we reload? Could this be a problem in tfgo?
If you need more information on this, please tell me.

Thanks in advance :)
