[Java] Implementation of sklearn.preprocessing.Normalizer [jpmml-sklearn]

yaronskaya 2017-10-9


Hi, I've tried to implement sklearn.preprocessing.Normalizer with l1 norm as custom Transformer.

public List<Feature> encodeFeatures(List<Feature> features, SkLearnEncoder encoder) {

        List<Feature> result = new ArrayList<>();

        Apply sumExpression = PMMLUtil.createApply("+");
        for(Feature feature : features){
            sumExpression.addExpressions(feature.toContinuousFeature().ref());
        }
        FieldName name = FieldName.create("sum-of-features");
        DerivedField sumField = encoder.createDerivedField(name, sumExpression);
        ContinuousFeature sumFeature = new ContinuousFeature(encoder, sumField);

        for(int i = 0; i < features.size(); i++) {
            Feature feature = features.get(i);
            ContinuousFeature continuousFeature = feature.toContinuousFeature();
            Expression expression = continuousFeature.ref();
            expression = PMMLUtil.createApply("/", expression, sumFeature.ref());
            DerivedField derivedField = encoder.createDerivedField(createName(continuousFeature), expression);
            result.add(new ContinuousFeature(encoder, derivedField));
        }
        return result;
}

It builds PMML, but fails during evaluation with the error "Expected 2 arguments, but got 3000 arguments", where 3000 is features.size().

Am I doing something wrong?
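For reference, sklearn's Normalizer(norm='l1') divides each value in a sample by the sum of the absolute values of that sample, which is what the derived fields above are meant to reproduce. A minimal pure-Python sketch of those semantics (no sklearn or JPMML dependency; the function name is illustrative). Note that the PMML above sums the raw values rather than absolute values, so the two only agree for non-negative inputs such as term counts:

```python
def l1_normalize(row):
    """Mimic sklearn.preprocessing.Normalizer(norm='l1') for one sample:
    divide each value by the sum of absolute values of the row."""
    denom = sum(abs(x) for x in row)
    if denom == 0:
        return list(row)  # sklearn leaves all-zero rows unchanged
    return [x / denom for x in row]

print(l1_normalize([1.0, 2.0, 1.0]))   # each value divided by 4.0
print(l1_normalize([-1.0, 1.0, 2.0]))  # denominator uses absolute values
```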

Comments (8)
vruusmann 2017-10-9
1


It builds PMML, but fails during evaluation with the error "Expected 2 arguments, but got 3000 arguments", where 3000 is features.size().

The + arithmetic function is a binary function, which takes exactly two arguments:
http://dmg.org/pmml/v4-3/BuiltinFunctions.html#arith

If you want to sum any number of arguments, then you should use the sum aggregation function:
http://dmg.org/pmml/v4-3/BuiltinFunctions.html#min

So, the following code change should do the trick:

Apply sumExpression = PMMLUtil.createApply("sum");
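The distinction can be mimicked in plain Python: a strictly binary operator must be chained (in PMML, nested Apply elements), whereas an n-ary aggregate like sum accepts the whole argument list in one application. A rough analogy, not JPMML code:

```python
from functools import reduce
import operator

values = [1.0, 2.0, 3.0, 4.0]

# Binary "+": two arguments per application, so summing n values
# requires n-1 chained applications (reduce nests them).
chained = reduce(operator.add, values)

# N-ary "sum": a single application over the whole argument list,
# analogous to PMML's sum aggregation function.
aggregated = sum(values)

print(chained, aggregated)
```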
yaronskaya 2017-10-9
2


@vruusmann, thanks for the response!
Anyway, is there any implementation of normalization? I'm particularly interested in why TfIdfVectorizer doesn't support it.

vruusmann 2017-10-9
3


@vruusmann, thanks for response!

Did it solve your problem? I closed this issue because in theory the change from + to sum should fix it, but it would be nice to hear what happened in practice.

Anyway, is there any implementation of normalization? I'm particularly interested why TfIdfVectorizer doesn't support it.

Lack of time - need to focus on more important projects.

yaronskaya 2017-10-9
4


Yes, it works for me!
Can I ask one more question here? I've found that the JPMML evaluator selects equal tf-idf values for original and lowercased strings even if the transformation dictionary contains both variants. Is there a way to overcome that?

vruusmann 2017-10-9
5


I've found that the JPMML evaluator selects equal tf-idf values for original and lowercased strings even if the transformation dictionary contains both variants.

This behaviour can be controlled by customizing the value of the TextIndex@isCaseSensitive attribute:
http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndex

Please note that this attribute may be overridden by the TextIndexNormalization@isCaseSensitive attribute:
http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndexNormalization

Check your PMML document to see the actual configuration (the TextIndex element should always be present; the TextIndexNormalization element is more specific).

The default value of both attributes is false, which means that the capitalization of tokens is ignored. If you change it to true, then only tokens with the correct capitalization will be taken into consideration.

Is the (J)PMML behaviour different from Scikit-Learn behaviour? If so, then it might be worthwhile to open a new issue to implement an appropriate fix. However, be sure to accompany this issue with a reproducible Python code example (e.g. based on the "Sentiment" dataset, which is part of the JPMML-SkLearn integration test suite under the src/test/resources/ directory). I don't have the time to triangulate the potential problem myself.
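The effect of isCaseSensitive can be illustrated with a toy token matcher (pure Python; the function name is illustrative and this is not the JPMML implementation):

```python
def term_frequency(tokens, term, case_sensitive=False):
    """Count occurrences of `term` in `tokens`, mimicking the role of
    TextIndex@isCaseSensitive: when False, capitalization is ignored."""
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
        term = term.lower()
    return sum(1 for t in tokens if t == term)

tokens = ["Great", "movie", "great", "acting"]
print(term_frequency(tokens, "great", case_sensitive=False))  # matches both casings
print(term_frequency(tokens, "great", case_sensitive=True))   # matches exact casing only
```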

yaronskaya 2017-10-9
6


Thank you for the quick response.

This behaviour can be controlled by customizing the value of the TextIndex@isCaseSensitive attribute:

It helped me.

Is the (J)PMML behaviour different from Scikit-Learn behaviour?

After the fix it gives the same results.

I've implemented different ways of normalization that give the same results as the Scikit-Learn normalizer. I'm also thinking of integrating it into TfIdfVectorizer.
Can I create a pull request for it?

vruusmann 2017-10-9
7


Is the (J)PMML behaviour different from Scikit-Learn behaviour?

After the fix it gives the same results.

The goal is that (J)PMML and Scikit-Learn predictions should match by default. So, it might be necessary to revisit the converter for the CountVectorizer (or TfidfVectorizer) transformation, and make sure that all case-sensitivity attributes are properly initialized.

Can I create a pull request for it?

I don't generally accept PRs for IPR (copyrights etc.) reasons.

However, you're welcome to summarize your observations and code changes (e.g. as a patch file), and I will carry them over to the JPMML-SkLearn repository as my original work. I will credit you in the commit message.

vruusmann 2017-10-9
8


Opened a new issue about TF-IDF case-sensitivity:
#51

Please list all your relevant observations (and suggested fixes) there.
