Feature/python.paddle.v2 by reyoung · Pull Request #1108 · PaddlePaddle/Paddle

reyoung · 2017-01-10T07:20:14Z

Rearrange the Paddle packages under paddle.v2. For now, use import paddle.v2 as paddle.

jacquesqiao · 2017-01-10T07:31:58Z

 import random
+
+import paddle.v2 as paddle
+import py_paddle.swig_paddle as api


I keep feeling that swig_paddle needs a thin wrapper around it.

I see this has already been changed further down.

jacquesqiao · 2017-01-10T07:33:18Z

-    inference = fc_layer(input=hidden2, size=10, act=SoftmaxActivation())
-    cost = classification_cost(
-        input=inference, label=data_layer(
+    imgs = paddle.config.data_layer(name='pixel', size=784)


Would it be better to keep the network config separate from the optimizer config above?

Both of them just generate protobuf messages, so I put them in one package for now.

jacquesqiao · 2017-01-10T07:36:14Z

          'paddle.trainer_config_helpers',
-          'paddle.utils']
+          'paddle.utils',
+	  'paddle.v2']


There seems to be an indentation problem here.

jacquesqiao · 2017-01-11T11:45:39Z

lgtm

wangkuiyi

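We previously discussed first rewriting all the demos with the current API, the one that "does not require defining multiple .py files", and only then looking at how to improve the API.

But I have just thought about it again: if the API has some obvious problems, could we fix those first and then rewrite the next demo after mnist? Wouldn't that be more efficient?

When writing mnist, we should not import * from the existing Python packages, but copy-n-paste the existing packages into paddle.v2. That way, while making the mnist demo self-explanatory, we can modify the copied implementation. Once we have repeated this process for every demo, won't we end up with a complete v2 API?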

wangkuiyi · 2017-01-11T23:12:47Z

+    opt_config_proto = paddle.config.parse_optimizer(optimizer_config)
+    opt_config = paddle.raw.OptimizationConfig.createFromProto(opt_config_proto)
+    _temp_optimizer_ = paddle.raw.ParameterOptimizer.create(opt_config)
    enable_types = _temp_optimizer_.getParameterTypes()


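I don't follow this part: do we need an optimizer? Why do we only create a temporary optimizer, read one property from it (enable_types), use that to build a gradient machine, and then stop? Do we actually need the concept of an optimizer at all?

We don't need the optimizer concept.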

wangkuiyi · 2017-01-11T23:14:48Z

-    _temp_optimizer_ = api.ParameterOptimizer.create(opt_config)
+    opt_config_proto = paddle.config.parse_optimizer(optimizer_config)
+    opt_config = paddle.raw.OptimizationConfig.createFromProto(opt_config_proto)
+    _temp_optimizer_ = paddle.raw.ParameterOptimizer.create(opt_config)


L73 to L75 appear to create an optimizer from some config information? If so, why do we need a protobuf message and a higher-order function parse_optimizer? Wouldn't the following be enough:

optimizer = paddle.v2.optimizer.Adam(
    learning_rate=1e-4,
    batch_size=1000,
    model_averager=paddle.v2.config.ModelAverage(average_window=0.5),
    regularizator=paddle.v2.regularizer.L2(rate=0.5))

This way we would not need to define the def optimizer_config() function either.

Sure, that is better.
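
A minimal sketch of the idea in this thread, assuming a hypothetical Adam class that is not part of any real Paddle API. Settings are passed as constructor arguments and turned into a plain dict (a stand-in for the protobuf message) only when needed, so no separate optimizer_config() function is required. The parameter names here are illustrative only.

```python
class Adam(object):
    """Hypothetical optimizer description; not the real paddle.v2 API."""

    def __init__(self, learning_rate=1e-3, batch_size=1, l2_rate=0.0):
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.l2_rate = l2_rate

    def to_config(self):
        # A plain dict stands in for the OptimizationConfig protobuf message.
        return {
            "method": "adam",
            "learning_rate": self.learning_rate,
            "batch_size": self.batch_size,
            "l2_rate": self.l2_rate,
        }


optimizer = Adam(learning_rate=1e-4, batch_size=1000, l2_rate=0.5)
config = optimizer.to_config()
print(config["method"], config["batch_size"])
```

The serialized form becomes an implementation detail of the object, which is the property the reviewer is asking for.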

wangkuiyi · 2017-01-11T23:26:40Z

-        model_config, api.CREATE_MODE_NORMAL, enable_types)
+    model_config = paddle.config.parse_network(network_config)
+    m = paddle.raw.GradientMachine.createFromConfigProto(
+        model_config, paddle.raw.CREATE_MODE_NORMAL, enable_types)


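Similarly, if the goal here is just to create a gradient machine, it seems it should be:

images = paddle.v2.layer.data(name='pixel', size=784)
hidden1 = paddle.v2.layer.fc(input=images, size=200)
hidden2 = paddle.v2.layer.fc(input=hidden1, size=200)
classes = paddle.v2.layer.fc(
    input=hidden2, size=10, act=paddle.config.SoftmaxActivation())
cost = paddle.v2.cost.classification(
    input=classes, label=paddle.v2.layer.data(name='label', size=10))
gm = paddle.gradient_machine.create(cost)

This way there would be no need to define the network_config function either?

Implementing it this way would require a fairly large-scale refactoring of Paddle.

The main issue is that Paddle currently parses a network config by calling functions such as fc, and each of those functions writes to a global variable.
Passing cost instead means that the cost variable records the whole network topology, so our return values would have to carry the information that now lives in the global variable.

There is also another problem: a neural network may have several outputs, so there may be more than one cost. Passing a function is not all that inelegant either.

In the current config_parser implementation, the config still has to be a function or a standalone file, and changing that is costly. https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer/config_parser.py#L3444
This is a file with several thousand lines of code, and it should be refactored away later.

@wangkuiyi If we are sure that passing variables is fine, shall we go ahead and refactor this parsing?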
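
To make the "return values carry the topology" idea concrete, here is a minimal sketch, assuming a hypothetical Layer class and collect_topology helper (neither is Paddle code). Each layer object remembers its inputs, so the full graph can be recovered from the outputs alone, with no global variable. Accepting a list of outputs also covers the multiple-cost case raised above.

```python
class Layer(object):
    """Hypothetical layer node that records its own inputs."""

    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)


def collect_topology(outputs):
    """Return layer names in dependency order, starting from the outputs."""
    ordered = []
    seen = set()

    def visit(layer):
        if id(layer) in seen:
            return
        seen.add(id(layer))
        for parent in layer.inputs:
            visit(parent)
        # A layer is appended only after all of its inputs.
        ordered.append(layer.name)

    for output in outputs:
        visit(output)
    return ordered


pixel = Layer("pixel")
hidden = Layer("hidden", [pixel])
label = Layer("label")
cost = Layer("cost", [hidden, label])
aux_output = Layer("aux_output", [hidden])

order = collect_topology([cost, aux_output])
print(order)
```

Shared layers such as hidden are visited once, so a network with several outputs still yields a single consistent topology.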

wangkuiyi · 2017-01-11T23:32:32Z

    # This type check is not useful. Only enable type hint in IDE.
    # Such as PyCharm
-    assert isinstance(m, api.GradientMachine)
+    assert isinstance(m, paddle.raw.GradientMachine)


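My understanding is that if m/gm were produced by a line like gm = paddle.v2.gradient_machine.create(...) above, readers would naturally trust that gm is a gradient machine, so this assertion would not be needed to improve readability.

This line can be dropped entirely. It is mainly there to give IDEs type information; common IDEs can offer code completion based on it.

The main reason for writing this line is that our Python code is generated by SWIG, and SWIG does not annotate function return types in the Python style. If we wrote it on top of a C-API, this problem would not exist.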

wangkuiyi · 2017-01-11T23:35:35Z


    # Initialize Parameter by numpy.
-    init_parameter(network=m)
+    m.randParameters()


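Is randParameters a function? According to the Python style guide, the function name should be randomize_parameters.

Yes. But because this uses C++ header files compiled through SWIG, it follows the C++ naming convention. If it were exposed through a C-API, this problem would not exist.

randParameters is an interface exposed from the cpp files through SWIG.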
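
One way to reconcile the two naming styles, sketched under the assumption that a thin Python wrapper is acceptable: keep the SWIG-generated object private and expose snake_case methods that delegate to it. RawGradientMachine below is a stand-in written for this example, not the real SWIG class.

```python
class RawGradientMachine(object):
    """Stand-in for a SWIG-generated class that keeps C++-style names."""

    def __init__(self):
        self.randomized = False

    def randParameters(self):
        self.randomized = True


class GradientMachine(object):
    """Hypothetical wrapper that exposes Python-style method names."""

    def __init__(self, raw):
        self._raw = raw

    def randomize_parameters(self):
        # Delegate to the C++-style method on the wrapped object.
        self._raw.randParameters()


raw = RawGradientMachine()
machine = GradientMachine(raw)
machine.randomize_parameters()
print(raw.randomized)
```

A wrapper like this also gives a natural place to document return types, which addresses the IDE type-hint concern from the earlier comment.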

wangkuiyi · 2017-01-11T23:48:46Z

    # in future.
-    updater = api.ParameterUpdater.createLocalUpdater(opt_config)
-    assert isinstance(updater, api.ParameterUpdater)
+    updater = paddle.raw.ParameterUpdater.createLocalUpdater(opt_config)


updater = paddle.v2.parameter_updater.create(
    optimizer=paddle.v2.optimizer.Adam(),
    learning_rate=1e-4,
    batch_size=1000,
    model_averager=paddle.v2.config.ModelAverage(average_window=0.5),
    regularizator=paddle.v2.regularizer.L2(rate=0.5))

No problem. That makes sense.

wangkuiyi · 2017-01-11T23:49:22Z

-    updater = api.ParameterUpdater.createLocalUpdater(opt_config)
-    assert isinstance(updater, api.ParameterUpdater)
+    updater = paddle.raw.ParameterUpdater.createLocalUpdater(opt_config)
+    assert isinstance(updater, paddle.raw.ParameterUpdater)


Similarly, if updater were created as in the line above, readers would not need this line to learn the type of updater.

wangkuiyi · 2017-01-11T23:51:37Z

    # Input. The input format is as same as Paddle's DataProvider.
-    converter = DataProviderConverter(
-        input_types=[dp.dense_vector(784), dp.integer_value(10)])
+    converter = paddle.data.DataProviderConverter(input_types=[


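Why does a converter need to be exposed here? According to the comment above, it converts a Python object into a C++ object. Logic like that should not appear in the API and be exposed to users, should it?

Right, it can be wrapped inside the data Iterator.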

wangkuiyi · 2017-01-11T23:54:57Z

@@ -0,0 +1,12 @@
+from paddle.trainer_config_helpers import *


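The import * here makes us lose control over what is in the paddle.v2 package: as soon as someone changes the contents of paddle.trainer_config_helpers, the symbols here change too, right?

I suggest we take this opportunity to first copy-n-paste what the mnist demo needs, and then modify the copied library, guided by the principle that the mnist demo should be self-explanatory to readers.

We then go through the demos one by one, repeating this process, and the resulting paddle.v2 should be what we want.

That said, the symbols exposed by paddle.trainer_config_helpers are in fact strictly controlled. It uses __all__ to control which symbols are exported.

Copying and pasting this package would expose far fewer things. For example, the MNIST example in our tutorial may be built with fully connected layers, and a user may want to switch to operations such as convolution. If we only copy and paste the interfaces the demo needs, convolution will most likely not have been copied over, and the user loses that flexibility.

Also, as with an earlier comment, if we really need to define the network structure with "return values" rather than "functions", then all of the config parsing has to be rewritten. In that case copy and paste works against us; it would be better to rewrite the parsing process directly.

I suggested copy-n-paste precisely in order to "rewrite", not merely to create a link to the existing symbols under the v2 package.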
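
The __all__ behaviour mentioned in this thread can be checked with a small self-contained sketch. The module fake_helpers and its functions are made up for the example and built in memory; only the import machinery is standard Python.

```python
import sys
import types

# Build a hypothetical helper module in memory instead of a real package.
helpers = types.ModuleType("fake_helpers")
exec(
    "__all__ = ['fc_layer']\n"
    "def fc_layer():\n"
    "    return 'fc'\n"
    "def conv_layer():\n"
    "    return 'conv'\n",
    helpers.__dict__,
)
sys.modules["fake_helpers"] = helpers

# A star import copies only the names listed in __all__.
namespace = {}
exec("from fake_helpers import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

So a star import is bounded by __all__, but the reviewer's concern still holds in one respect: whoever edits __all__ in the source package changes what the importing package exposes.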

emailweixu · 2017-01-12T00:19:27Z

@@ -178,7 +170,7 @@ def main():
        test_data_generator = input_order_converter(read_from_mnist(test_file))
        for data_batch in generator_to_batch(test_data_generator, 512):


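It would be easier for users to understand if it could be written like this:
data_reader = read_from_mnist(test_file)
for data_batch in data_reader:
    m.forward(data_batch, outArgs, paddle.raw.PASS_TEST)
    ...

input_order_converter and converter will be hard for users to understand. Most users will only be able to copy the code verbatim without knowing why these converters are needed. The code that creates the optimizer earlier has the same problem.

A good API should ideally make it easy for users to understand why each call is being made.

That makes sense. OK.

These two can be patched directly into the forward function.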
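
A rough sketch of how batching and conversion could be hidden behind a single reader, so the user-facing loop stays as short as the one proposed above. The names batched_reader and convert are assumptions made for this example; the real conversion to C++ arguments is replaced by a pass-through function.

```python
def batched_reader(sample_reader, batch_size, convert=lambda batch: batch):
    """Wrap a sample reader so that it yields converted batches."""

    def reader():
        batch = []
        for sample in sample_reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield convert(batch)
                batch = []
        if batch:
            # Emit the final, possibly smaller, batch.
            yield convert(batch)

    return reader


def sample_reader():
    # Hypothetical stand-in for read_from_mnist(test_file).
    for value in range(5):
        yield value


batches = list(batched_reader(sample_reader, batch_size=2)())
print(batches)
```

With this shape, the user only writes the for loop over batches, and the conversion step stays an internal detail of the reader or of forward.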

change network config in mnist/api_trian.py to v2

reyoung · 2017-01-17T05:56:13Z

+        yield items[:min(i + 1, size)]
+
+
+class IDataPool(object):


@jacquesqiao @wangkuiyi

Please review this interface.

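This is the interface through which the Paddle API returns data. It has only two functions. Note that Python does not really have the concept of an "interface"; it is written out here only to make the code a little clearer.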

wangkuiyi

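Two impressions:

Over-engineering: some classes and functions seem unnecessary to me.
The newly introduced concepts, such as Layer, Model, and DataPool, have no comments explaining their "usage", that is, the "syntax to be supported".
This PR keeps growing. It is at 10 files now, and I don't know how many there will be in the end. In general, if a PR is not to overwhelm its reviewers, about 3 files is appropriate. It would be better to split this PR into several, for example one for Model, one for Layer, one for DataPool, and one for the example.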

wangkuiyi · 2017-01-18T06:55:50Z

@@ -0,0 +1,12 @@
+from paddle.trainer_config_helpers import *


I suggested copy-n-paste precisely in order to "rewrite", not merely to create a link to the existing symbols under the v2 package.

wangkuiyi · 2017-01-18T06:57:41Z

+]
+
+
+class IDataPool(object):


As the comment itself says, this "interface" is of little use. To teach users how to write a data pool, two lines of comment seem to be enough:

# To create your data pool, please define a class with two methods: next, which returns a batch of data, and reset, which resets the pool.

wangkuiyi · 2017-01-18T06:58:10Z

+        yield tmp
+
+
+class NaiveDataPool(IDataPool):


If the meaning is "load all samples into memory", then just call it InMemoryDataPool.

wangkuiyi · 2017-01-18T07:00:01Z

+        return self.__pool__[begin:end]
+
+
+def create_data_pool(file_reader,


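Different kinds of data pool may each have their own configuration parameters, which do not need to be unified into one set. As I understand it, users can simply call the InMemoryDataPool constructor to create a pool, so a function like create_data_pool is not needed. It goes against the principle of keeping things as simple as possible.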
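
Putting the three data pool comments together, a pool might look like the sketch below. It is created directly through its constructor and has only next and reset. The batch_size argument and the choice to return None once the data is exhausted are assumptions made for this example; the PR does not specify either.

```python
class InMemoryDataPool(object):
    """Hypothetical pool that keeps all samples in memory."""

    def __init__(self, samples, batch_size):
        self._samples = list(samples)
        self._batch_size = batch_size
        self._offset = 0

    def next(self):
        """Return the next batch, or None when the pool is exhausted."""
        if self._offset >= len(self._samples):
            return None
        begin = self._offset
        end = min(begin + self._batch_size, len(self._samples))
        self._offset = end
        return self._samples[begin:end]

    def reset(self):
        """Rewind the pool to the first sample."""
        self._offset = 0


pool = InMemoryDataPool(range(5), batch_size=2)
first_pass = [pool.next(), pool.next(), pool.next(), pool.next()]
pool.reset()
after_reset = pool.next()
print(first_pass, after_reset)
```

Because construction needs no factory, a different kind of pool is free to take entirely different constructor arguments.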

wangkuiyi · 2017-01-18T07:08:52Z

+import collections
+
+
+class Layer(object):


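There should be some comments here explaining how Layer is meant to be used.

Earlier, while reviewing the Error (Status) PR, @hedaoyuan reminded us that the syntax needs to be defined up front. I think that makes a lot of sense.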

wangkuiyi · 2017-02-02T19:32:47Z

-        for offset in xrange(0, len(self.data), self.batch_size):
-            limit = min(offset + self.batch_size, len(self.data))
-            yield self.data[offset:limit]
+    batch_evaluator = model.make_evaluator()


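These two evaluators look confusing. No arguments are passed here, so the two evaluators appear to do the same thing. Later on .start and .finish are called, but it is not clear what they actually do. Going by the name (evaluator), I would guess that it evaluates the model on some test data, but I cannot tell which test data is being used.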

wangkuiyi · 2017-02-02T19:33:33Z

-    for each_item in generator:
-        yield each_item['pixel'], each_item['label']
+    # Training process.
+    model.start()


A model cannot "start", can it? Is the intent here model.start_training? What needs to be done before training begins?

wangkuiyi · 2017-02-02T19:38:55Z

+    model.start()

+    for pass_id in xrange(2):
+        model.start_pass()


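I think we discussed before that the user should not control the notion of a pass, because a pass is determined by the size of the dataset? I vaguely remember that we discussed writing it as follows:

for minibatch, last_batch in enumerate(training_data):
    ....
    if last_batch:
        log("Print model quality: %f", evaluate(model, testing_data))
    ....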
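
Note that the built-in enumerate yields an index rather than a last-batch flag, so the loop above would need a small helper. A minimal sketch, assuming a hypothetical with_last_flag generator that looks one item ahead:

```python
def with_last_flag(iterable):
    """Yield (item, is_last) pairs by looking one item ahead."""
    iterator = iter(iterable)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # Empty input: nothing to yield.
    for current in iterator:
        yield previous, False
        previous = current
    yield previous, True


training_data = ["batch0", "batch1", "batch2"]
flags = list(with_last_flag(training_data))
print(flags)
```

The end of a pass then shows up as a property of the data stream, which matches the point that a pass is determined by the dataset rather than by the user.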

wangkuiyi · 2017-02-02T19:39:17Z

+        model.finish_pass()

-    m.finish()
+    model.finish()


What does model.finish need to do?

reyoung · 2017-03-09T05:43:38Z

Closed because the work in this PR was done by us over the last month.

reyoung added 2 commits January 10, 2017 14:19

Init commit

07fe0ee

Simple extract paddle.v2 api.

1935b34

reyoung requested review from backyes, beckett1124, jacquesqiao and wangkuiyi and removed request for backyes and beckett1124 January 10, 2017 07:20

jacquesqiao reviewed Jan 10, 2017

jacquesqiao requested changes Jan 10, 2017

reyoung added 2 commits January 10, 2017 22:15

Follow comments

da97042

Add proto to paddle.v2

823eb1f

reyoung mentioned this pull request Jan 11, 2017

Feature/intro api #1122

Closed

jacquesqiao approved these changes Jan 11, 2017

reyoung mentioned this pull request Jan 11, 2017

Feature/quick start api #1126

Closed

wangkuiyi requested changes Jan 11, 2017

emailweixu reviewed Jan 12, 2017

jacquesqiao and others added 10 commits January 16, 2017 10:46

add layer abstract

3bc8f99

Refine layers in paddle.v2

26e7ca9

Support multiple input in Paddle.v2.Layer

258fc55

Add Optimizer

2b988b4

Add optimizers

5e1d187

change network config in mnist/api_trian.py to v2

a2cf635

rm gradient_machine.py

7826fdf

Merge pull request #3 from jacquesqiao/v2

d896ff4

change network config in mnist/api_trian.py to v2

Start define model api

e7da4ae

Add Data Interface

6e4086c

reyoung commented Jan 17, 2017

reyoung added 2 commits January 17, 2017 14:39

Add create_data_pool method.

286372b

Using new API refactor api_train.

9360a1f

wangkuiyi requested changes Jan 18, 2017

jacquesqiao mentioned this pull request Jan 22, 2017

Refactoring the Python config parser: optimizer #1210

Closed

wangkuiyi requested changes Feb 2, 2017

qingqing01 mentioned this pull request Feb 13, 2017

Data reader for api #1326

Closed

reyoung closed this Mar 9, 2017

lizexu123 pushed a commit to lizexu123/Paddle that referenced this pull request Feb 23, 2024

add ACT PP-MiniLM demo (PaddlePaddle#1108)

157dcb2
