
[Python] fit_constrained de-duplication [statsmodels]

jbrockmendel 2017-10-9 155

GLM.fit_constrained and Poisson.fit_constrained are almost identical to base._constraints.fit_constrained_wrap. The main difference is that the former two do not pass start_params=params into self.fit. Is there a compelling reason not to de-duplicate these?

There are several TODO comments in these functions; I'd rather de-dup them before trying to address those issues. Also, all three are places that do one of my least-favorite things: modify df_model & df_resid in-place, so de-duping will move us away from that pattern.
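
For context, a minimal usage sketch of the two public entry points under discussion (the data, the variable names, and the "x1 = 0.5" constraint are purely illustrative, assuming the usual string-constraint form):

    import numpy as np
    import statsmodels.api as sm

    # toy count data; columns of X get the default names const, x1, x2
    rng = np.random.RandomState(0)
    X = sm.add_constant(rng.randn(100, 2))
    y = rng.poisson(np.exp(X @ np.array([0.5, 0.5, -0.2])))

    # discrete-model path: Poisson.fit_constrained
    res_poisson = sm.Poisson(y, X).fit_constrained("x1 = 0.5")

    # GLM path: GLM.fit_constrained
    res_glm = sm.GLM(y, X, family=sm.families.Poisson()).fit_constrained("x1 = 0.5")

    # base._constraints.fit_constrained_wrap is the third, internal variant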

Latest comments (11)
josefpkt 2017-10-9
1

I doubt I will touch this unless it is a round 2 refactoring.

That is, it needs a full review, because it was never reviewed since it was merged.
Some models, like those in discrete, are waiting for an extra offset option.
I think I also have an issue to allow only homogeneous constraints, which should also work with models that don't have an offset.

Another item is that we don't have many score_tests in the models yet, which are another main TODO item and were one of the reasons for fit_constrained.
It's easier to experiment and unit test with just one model class, instead of getting it to work for all.

(It's a bit similar to cov_type: start in one model and work slowly towards full generic support.)

jbrockmendel 2017-10-9
2

I doubt I will touch this unless it is a round 2 refactoring.

The in-place adjusting of df_model and df_resid bothers me enough that I'll volunteer to take on the round 1 if you'll meet me half-way and agree to de-duplicate the following verbatim lines duplicated in all three locations:

    res._results.params = params
    res._results.normalized_cov_params = cov
    k_constr = len(q)
    res._results.df_resid += k_constr
    res._results.df_model -= k_constr
    res._results.constraints = lc
    res._results.k_constr = k_constr
    res._results.results_constrained = res_constr

The main bit that is not verbatim is the call to self.fit, where I suspect the differences are irrelevant because of the maxiter=0 (a matter we can save for another day):

# base._constraints.fit_constrained_wrap:
    # create dummy results Instance, TODO: wire up properly
    res = self.fit(start_params=params, maxiter=0,
                   warn_convergence=False) # we get a wrapper back

# Poisson.fit_constrained:
        # create dummy results Instance, TODO: wire up properly
        res = self.fit(maxiter=0, method='nm', disp=0,
                       warn_convergence=False) # we get a wrapper back

# GLM.fit_constrained:
        # create dummy results Instance, TODO: wire up properly
        res = self.fit(start_params=params, maxiter=0) # we get a wrapper back
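
A rough sketch of the shared helper this would amount to; the name _set_constrained_results is hypothetical, and the model-specific self.fit call stays with the caller, since that is the part that differs:

    def _set_constrained_results(res, params, cov, q, lc, res_constr):
        # Patch the dummy results instance returned by self.fit(..., maxiter=0)
        # with the constrained estimates; the body is the verbatim block that
        # is currently duplicated in all three locations.
        res._results.params = params
        res._results.normalized_cov_params = cov
        k_constr = len(q)
        res._results.df_resid += k_constr
        res._results.df_model -= k_constr
        res._results.constraints = lc
        res._results.k_constr = k_constr
        res._results.results_constrained = res_constr
        return res
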
josefpkt 2017-10-9
3

modify df_model & df_resid in-place,

I only looked at Poisson: It doesn't modify the model.df_xxx, only the results instance, so at least it doesn't cause stale state or out-of-sync attributes.
See #2393 for the general issue, and some ideas on how to avoid "retrofitting" the results instance.
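
(A toy sketch, not statsmodels code, of the distinction: patching the results instance leaves the model reusable, whereas mutating the model would not.)

    class ToyModel:
        def __init__(self, df_resid):
            self.df_resid = df_resid

        def fit(self):
            # the results object snapshots the model's df at fit time
            return ToyResults(self, self.df_resid)

    class ToyResults:
        def __init__(self, model, df_resid):
            self.model = model
            self.df_resid = df_resid

    model = ToyModel(df_resid=97)
    res = model.fit()

    # fit_constrained-style: adjust only the results instance
    res.df_resid += 2
    assert model.df_resid == 97   # a later, unconstrained model.fit() is unaffected

    # the pattern being warned about would be `model.df_resid += 2`, which would
    # silently change every subsequent fit of the same model instance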

jbrockmendel 2017-10-9
4

It doesn't modify the model.df_xxx, only the results instance, so at least it doesn't cause stale state or out-of-sync attributes.

Yah, it's not an especially big deal, but it still bothers me in part because it shows up when I grep for (df_model|df_resid) [\+\-]=.

josefpkt 2017-10-9
5

... and agree to de-duplicate the following verbatim lines duplicated in all three locations

The problem is that I don't want to de-dup and hide away one piece of the ugly code. First the entire method duplication needs to go away. Second, the design for (re)creating results instances is not yet clear, so I don't know yet how to remove the ugly parts of the code.

For sure there would be things that could be improved in the code first, but overall it is more the size of a topic week, where the overall design needs to be evaluated, or where at least the main problems get a better solution.

jbrockmendel 2017-10-9
6

The problem is that I don't want to de-dup and hide away one piece of the ugly code. First the entire method duplication needs to go away. Second, the design for (re)creating results instances is not yet clear, so I don't know yet how to remove the ugly parts of the code.

Please reconsider. Do not let the perfect be the enemy of the good.

There will always be more Topic Week-sized issues than weeks in which to address them. For many things, small (ideally small enough that they are trivial for a reviewer to verify correctness) incremental improvements are the only type of improvements we are going to get.

josefpkt 2017-10-9
7

Even though I agree in general, I prefer making changes when there is an actual demand for it, e.g. bug fixes, smaller extensions, or preparing for bigger extensions; at least that's where I try to put my priority.

incremental improvements are the only type of improvements we are going to get

However, these incremental improvements are not enough in some cases.
E.g., the contrast here versus MNLogit:
MNLogit is essentially settled as a model, but we need to fix the things that are currently not working because the pattern differs. Each incremental change works toward finishing up the missing pieces, and anything that is fixed now does not need to be fixed again.

In cov_type, fit_constrained and similar, another round of big refactoring is needed. Why do we now spend time on small refactorings that won't survive the next big round, in maybe a year or so?

(After a bit of issue triage, I'm back to 25 bug issues that should be checked and possibly fixed for 0.9, first column in https://github.com/statsmodels/statsmodels/projects/8 )

jbrockmendel 2017-10-9
8

OK, I'm hereby committing to this being my last post on the thread, so you'll get the last word.

I prefer making changes when there is an actual demand for it, e.g. bug fixes, smaller extensions or preparing for bigger extensions

This is a recipe for ignoring technical debt.

However, these incremental improvements are not enough in some cases.

I barely know how to parse this sentence. Do your preferences not satisfy IIA? The option space here is ["status quo", "slightly better than the status quo"].

In cov_type, fit_constrained and similar, another round of big refactoring is needed.

There's risk of a catch-22 here. I could make a big PR that de-duplicates ModelClasses.get_robustcov_results in one swoop, but that would be a big enough diff as to not be obvious at a glance. You would (correctly) respond that you have higher priorities. By contrast, you (or Kevin, or Chad [or me if it wasn't my PR]) can pretty immediately* confirm that #3974 is cut/paste, affects zero logic, and makes a small dent in this code smell. And small improvements add up, especially if your attention ceases to be the limiting factor.

* Under a non-trivial topology, the size of the PR can be reduced to satisfy any finite number of reviewers' preferred definition of "immediately".

MNLogit is essentially settled as a model, but we need to fix the things that are currently not working because the pattern differs.

Consider incentives. I've been working on MNLogit for bill-paying work, so have the Issues fresh in mind and have a pile of code for handling corner cases. But turning that into a PR(s) would be really nontrivial, and I expect that if I did so, you'd either say a) it doesn't solve everything so No or b) it is too big so Not Now.

Each incremental change works toward finishing up the missing pieces, and anything that is fixed now, does not need to be fixed again.

This should apply to fit_constrained just as well. De-duplicating and cleaning up code smells makes it more readable and easier to approach the problem when it does become a priority.

another round of big refactoring is needed. Why do we spend now time on small refactorings that won't survive the next big round, in maybe a year or so?

  1. Because small refactorings add up, making bigger refactorings either easier or unnecessary.
  2. Because of specialization of labor: #3974 doesn't require any big decision-making about internals, separates code-smell-alleviation from the Big Picture stuff that requires your Serious Attention.
  3. Because I'm a Bayesian. When I read "maybe a year or so" I look at statsmodels.interface which has been an empty __init__.py file for 6 years, discrete_model.Weibull that has been commented-out for 5, and sandbox.mle that starts with the docstring '''What's the origin of this file? It is not ours. and hasn't been touched in 4 years.

None of which is to say you're not doing a yeoman's job. The fact that TODOs are piling up faster than they can be addressed means there's an active community of users and contributors. While I am asking you to reconsider some decision-making, the closest thing I have to a complaint is that there isn't two of you. You've got a Sisyphean task.

greping for "TODO" turns up 924 matches. "FIXME" turns up another 21. flake8 statsmodels | grep undefined shows 135 NameErrors waiting to happen*. Surely some of these can be delegated so as to not require your direct attention.

* In fairness, 28 of these are complaints about "from .data import *; unable to detect undefined names", and many of the rest are in sandbox.

Removing unused code and addressing small code smells is a way that semi-literates like myself can help fight this losing battle against entropy.

josefpkt 2017-10-9
9

Ok, some general thoughts on statsmodels development and me.

First, I'm not working as much on statsmodels as in the old days. I'm not getting paid, and I need to have some fun in life and when working on statsmodels (e.g. these days I prefer a book to statsmodels maintenance in the (late) evenings).

Bringing up topic weeks was an attempt to concentrate in time my involvement in topics that are not urgent. It is very difficult for me to look at something without it getting me to think, which often takes time. Some things can be merged based on general impression and unit tests, especially new, still "experimental" features. Other parts are taken care of by Kerby, Chad, Kevin, or Tom.

Overall, statsmodels is a huge construction site, with dirt still on the floor, many sticky tags (TODOs), parts of the building plan spread over all areas, and some parts that are temporary shacks, which give us a roof but not a solid building. Additionally we have leaks all over the place, in some places just tiny ones, in other places still big holes. And in some cases, if we try to plug one hole it just opens up a hole in another part.

My problem is that when I look at decorating a shack or taping over a hole, I start to think about where we need a new window or more solid walls, or about replacing the entire shack, and the building plan for the full replacement, either the next larger or more solid shack or a permanent building.

There are two types of limitations: one where we have the design or plan but lack the wo/manpower, and a second where we don't know what design will work for a permanent solution, and we use the "shacks" in the meantime. Whether we should also clean up and decorate the shacks is a secondary issue that depends on the circumstances and time horizon.

Positive examples:
I have been working on GLM for several years, partially to figure out what the plan is. We had and have several contributors to fix and extend it and it will be close to feature complete and solid. Although there are still many things that can be added.
More count models, like zero-inflated and hurdle models, have been on the roadmap for a long time. We had a rough plan and some prototypes, and finally the GSOC provided the manpower. However, while going through or working on discrete_model, I opened more issues than I closed, because all over the place I saw leaks or designs that don't work for future extensions. (As an aside, I also fixed some of those leaks.)

Recent negative examples that I just ran into:
gofplots: A sore point for years. An innocent-looking bugfix, #3547, that most likely causes a problem somewhere else. The problem is that we need a plan (#3981).
emplike/AFT: #3629 I don't have much of an idea, except that it will take several days just to figure out what it is supposed to do, or maybe there is a quick fix. AFT/parametric survival models are another topic like the new count models. Some pieces are spread out through the code, but there is no systematic plan and development.
margins/predict: High priority in terms of usage, but they are still shacks, even if margins is a pretty solid shack with parts of the full building. (I didn't make smaller changes to margins because I have only a very vague plan, and we don't currently have another expert on margins.)

BTW: I told you in another issue that fixing MNLogit will get priority treatment.

The only real way out is to get more experienced maintainers for parts of statsmodels.
(For me this would mean that I don't have to jump around so much, and I could get lost in some of my topic weeks without blocking statsmodels development.)

josefpkt 2017-10-9
10

https://www.today.com/home/fixer-upper-season-4-finale-little-shack-prarie-t109807 :)

josefpkt 2017-10-9
11

(I wrote this yesterday but got distracted before hitting the button.)
One thing I wanted to mention before I got distracted by "shacks":

My main criterion for code style and cleanliness is how long it takes me to understand a piece of code or function or method, to get an idea about the underlying structure, and to be able to spot possible bugs. Uninformative names, having to jump around a long time to find the necessary information, and a lack of basic formatting can make this take much longer than needed. The jumping around is often unavoidable, but it has to be justified by satisfying a specific need in the algorithm or code.

It is one of my basic tasks to come up with good, quick guesses when somebody has an example with a problem, whether it's a bug or a feature/limitation, about what the most likely bug candidates are or which pieces of code make something possible or impossible.
(e.g. VAR has overall a good structure, but names are quite confusing/obfuscating.)
