@beeender (Collaborator)

No description provided.

@liming01 liming01 force-pushed the feature_parallelcursor branch from bf4a724 to 6297e2f Compare November 1, 2019 00:27
linwen pushed a commit that referenced this pull request Apr 17, 2020
This reverts commit b3e4249.

This PR comprises the following commits:
1. Commit 01c9755 - This is from an existing approved PR
https://github.com/greenplum-db/gporca/pull/428, which was merged but later
reverted because of a regression caused by cardinality under-estimation.
2. Commit 01c9755 - This commit fixes the regression caused by the previous
commit and adds tests for it.

So for this PR, the 2nd commit is primarily the one to review.

Repro:
```
create table vt1(a varchar, b varchar);
insert into vt1 select 'a' || i, 'b' || i from generate_series(1, 10000000) i;
analyze vt1;
gnw_edw_dev=# explain with cte as (select * from vt1) select * from cte c1, cte c2, cte c3 where c1.a::text = c2.a::text and c2.a::text = c3.a::text;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..6194.47 rows=10000 width=30)
   ->  Sequence  (cost=0.00..6193.35 rows=3334 width=30)
         ->  Shared Scan (share slice:id 1:0)  (cost=0.00..569.67 rows=3333334 width=1)
               ->  Materialize  (cost=0.00..569.67 rows=3333334 width=1)
                     ->  Table Scan on vt1  (cost=0.00..504.33 rows=3333334 width=10)
         ->  Hash Join  (cost=0.00..5623.58 rows=3334 width=30) ==> #2
               Hash Cond: share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text
               ->  Hash Join  (cost=0.00..2945.63 rows=3333334 width=20)
                     Hash Cond: share0_ref4.a::text = share0_ref3.a::text #1
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
                     ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                           ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
               ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
 Optimizer status: PQO version 3.29.0
```

Check the cardinality:
```
 Hash Join  (cost=0.00..5623.58 rows=3334 width=30)
               Hash Cond: share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text
```
Since all the tuple values are distinct, the cardinality of the above join must
be `3333334` (per segment). But after commit 01c9755 started generating implied
predicates, the existing code did not remove binary-coercible casted comparisons
prior to cardinality estimation, so the estimate dropped to `3334`. That can
result in a Broadcast Motion being placed on top of this node if it is a child
of another join, significantly reducing performance.

After the fix, the cardinality is `3333334` for the join predicate
`share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text`:
```
explain with cte as (select * from vt1) select * from cte c1, cte c2, cte c3 where c1.a::text = c2.a::text and c2.a::text = c3.a::text;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..7760.90 rows=10000000 width=30)
   ->  Sequence  (cost=0.00..6642.90 rows=3333334 width=30)
         ->  Shared Scan (share slice:id 1:0)  (cost=0.00..569.67 rows=3333334 width=1)
               ->  Materialize  (cost=0.00..569.67 rows=3333334 width=1)
                     ->  Table Scan on vt1  (cost=0.00..504.33 rows=3333334 width=10)
         ->  Hash Join  (cost=0.00..5973.23 rows=3333334 width=30) #1
               Hash Cond: share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text
               ->  Hash Join  (cost=0.00..2945.63 rows=3333334 width=20)
                     Hash Cond: share0_ref4.a::text = share0_ref3.a::text
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
                     ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                           ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
               ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
 Optimizer status: PQO version 3.29.0
```

Bump ORCA version 3.42.0
linwen pushed a commit that referenced this pull request Apr 17, 2020
To achieve that, refactor GetScCmpMdIdApplyCasts by splitting it into two functions:
 1. GetScCmpMdIdConsiderCasts, which looks up a scalar comparison operator in the CMDCache;
 2. ApplyCastsForScCmp, which applies casts to conform to the chosen scalar comparison operator.

CTranslatorExprToDXL::PdxlnHashJoin() needs to call only #2, and so
preserves the scalar comparison operator. This commit also applies the same changes for INDF.
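The shape of the split can be sketched as follows (Python pseudocode with hypothetical names and structures; the real functions are in ORCA's C++ translator):

```python
# Sketch: separate the operator lookup from the cast application, so a
# caller that already knows the comparison operator (e.g. a hash join
# translator) can apply casts without re-running the lookup and losing
# the operator it chose.

def get_sc_cmp_mdid_consider_casts(cache, left_type, right_type, cmp_kind):
    """Look up the scalar comparison operator id in a metadata cache."""
    return cache[(left_type, right_type, cmp_kind)]

def apply_casts_for_sc_cmp(expr_left, expr_right, op_left_type, op_right_type):
    """Wrap operands in casts so they conform to the chosen operator."""
    if expr_left["type"] != op_left_type:
        expr_left = {"cast_to": op_left_type, "child": expr_left,
                     "type": op_left_type}
    if expr_right["type"] != op_right_type:
        expr_right = {"cast_to": op_right_type, "child": expr_right,
                      "type": op_right_type}
    return expr_left, expr_right
```

With this split, the hash-join path calls only the second function, which is what preserves the previously chosen comparison operator.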
linwen pushed a commit that referenced this pull request Apr 17, 2020
Overview
--------

Manage the allocation of AO segfiles in QE nodes, instead of the master.
That simplifies a lot of things. We no longer need to send the decision
on which segfile to use from QD to QE, we no longer need to keep up-to-date
tuple counts in the master, and so forth.

This gets rid of the in-memory hash table to track AO segment status. To
lock a segfile for insertion or other operations, we now rely on tuple
locks on the AO segment table instead.

VACUUM code has been refactored. I tried to make the code structure closer
to upstream, to reduce merge conflicts as we merge with upstream. The way
the VACUUM phases are dispatched, and the phases needed, are different in
the new regime. Some phases could use local transactions while others still
have to use distributed transactions, but for simplicity we use distributed
transactions for all of the phases.

Now that QE nodes choose the insertion target segfile independently, we
could use the same algorithm for choosing the target segfile in utility
mode, and wouldn't need to reserve segfile #0 for special operations
anymore. But to keep things familiar, and to avoid any more churn in the
regression tests, this keeps the historical behavior and still reserves
segfile #0 for utility mode and CREATE TABLE AS.
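The reservation described above can be sketched as a simple rule (illustrative only; `candidate_segnos` and the segno limit are hypothetical, not the actual GPDB code):

```python
RESERVED_SEGNO = 0  # kept reserved for utility mode and CREATE TABLE AS

def candidate_segnos(in_utility_mode, max_segno=127):
    """Which AO segfile numbers a backend may pick as an insertion target.

    Normal QE-managed allocation never hands out segfile #0; utility mode
    uses only segfile #0, preserving the historical behavior.
    """
    if in_utility_mode:
        return [RESERVED_SEGNO]
    return list(range(1, max_segno + 1))
```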

Previously, SERIALIZABLE (and REPEATABLE READ) transactions were treated
differently from transactions in READ COMMITTED mode. If there was an
old serializable transaction running, VACUUM refrained from recycling
old AO segfiles after compaction, because the serializable transaction
might lock the table later and try to read the rows that were compacted
away. But that was not completely waterproof; a READ COMMITTED transaction
that's still holding onto an old snapshot is no different from a
serializable transaction. There's a gap between acquiring a snapshot and
locking a table, so even if a READ COMMITTED backend is not holding a lock
on a table right now, it might open it later. To fix that, remove the
distinction between serializable and other transactions. Any open
transaction will now prevent AO segfiles in awaiting-drop state from being
removed. This commit adds an isolation2 test case called
'vacuum_self_function' to test one such case; it used to fail before this
commit.

That is a user-visible change in behavior that's worth calling out
explicitly. If there are concurrent transactions running, even in
READ COMMITTED mode, and even if they don't access the table at all,
VACUUM will no longer be able to recycle compacted segfiles. The next VACUUM
will clean them up, once the old concurrent transactions have finished.
Meanwhile, the awaiting-drop segfiles will use up disk space and increase the
risk that an insertion fails because all segfiles are in use.
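The new recycling rule can be sketched like this (a simplification with hypothetical names; the real check compares transaction ids from the segfile's pg_aoseg tuple against the running snapshot):

```python
AWAITING_DROP = "awaiting_drop"

def can_recycle(segfile, oldest_running_xid):
    """An awaiting-drop segfile may be recycled only when every transaction
    that started before the compaction has finished. Isolation level no
    longer matters: any open transaction, even READ COMMITTED ones that
    never touch the table, holds back recycling.
    """
    return (segfile["state"] == AWAITING_DROP
            and segfile["drop_xid"] < oldest_running_xid)
```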

This work was originally submitted as github PR #790; you might see references
to that in the discussions.

Code change details
-------------------

This gets rid of the in-memory hash table to track AO segment status. To
lock a segfile for insertion or other operations, we now rely on tuple
locks on the AO segment table instead. When an AO segment is chosen as
insertion target, its AO segment table row is locked for the transaction
(by heap_lock_tuple, i.e. by stamping its XMAX). Later, when the insertion
finishes, the tuple is updated with the new EOF (no change there). When
choosing a new insertion target segfile, we scan the pg_aoseg table instead
of the hash table, and use the xmin/xmax to determine which ones are safe
to use. The process of choosing a target segment itself is not
concurrency-safe, so the relation extension lock is held to protect that.
See new "Locking and snapshots" section in
src/backend/access/appendonly/README.md for details.
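The selection loop described above can be sketched roughly as follows (hypothetical names and simplified state; the actual logic is in the C code and the README section mentioned above):

```python
import threading

# Stand-in for the relation extension lock that serializes target selection.
_extension_lock = threading.Lock()

def choose_segno_for_write(aoseg_rows, try_lock_tuple):
    """Scan the pg_aoseg rows under the relation extension lock and claim
    the first segfile with no concurrent writer.

    try_lock_tuple stands in for heap_lock_tuple: it attempts to lock the
    segment's pg_aoseg row (stamping its XMAX) and returns True on success.
    """
    with _extension_lock:
        for row in aoseg_rows:
            if row["in_use"]:  # xmin/xmax indicate a concurrent writer
                continue
            if try_lock_tuple(row):
                return row["segno"]
    return None  # no segfile available; caller may create a new one
```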

tupcount and modcount are no longer kept up-to-date in the master. Remove all
the code needed to send the totals back from QE to QD at the end of a command,
including the GPDB-specific 'o' protocol FE/BE message.

Remove the machinery to track which PROCs are in a serializable transaction.
It was not completely waterproof anyway, and there were some existing bugs
in that area. A READ COMMITTED transaction that's still holding onto an old
snapshot is no different from a serializable transaction. This commit adds
an isolation2 test case called 'vacuum_self_function' to test one such case;
it used to fail before this commit.

Move responsibility of creating AO segfile to callers of
appendonly_insert_init(). The caller must have locked, and created if
necessary, the segfile by calling ChooseSegnoForWrite() or
LockSegnoForWrite(). Previously, the pg_aoseg was not locked in rewriting
ALTER TABLE and other such commands that always used segfile #0. That was
OK, because they held an AccessExclusiveLock on the table, but let's be
consistent.

When selecting the insertion target for VACUUM compaction, prefer creating
a new segfile over using an existing non-empty segfile. Existing empty
segfiles are still choice #1, but let's avoid writing to non-empty
existing files, because otherwise we cannot vacuum those segfiles in the
same VACUUM cycle.
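That preference order can be sketched as follows (illustrative only; names and the segfile representation are hypothetical):

```python
def choose_compaction_target(segfiles, can_create_new):
    """Pick an insertion target for VACUUM compaction.

    Preference order: (1) an existing empty segfile, (2) a brand-new
    segfile, (3) an existing non-empty segfile as a last resort. Avoiding
    non-empty files means they can still be compacted in this VACUUM cycle.
    """
    empty = [s for s in segfiles if s["tupcount"] == 0 and not s["in_use"]]
    if empty:
        return empty[0]["segno"]
    if can_create_new():
        return max((s["segno"] for s in segfiles), default=0) + 1
    nonempty = [s for s in segfiles if not s["in_use"]]
    return nonempty[0]["segno"] if nonempty else None
```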

Warn, if compaction fails because no insertion segfile could be chosen.
This used to be an ERROR in the old code, but I think a WARNING is more
appropriate.

In targetid_get_partition(), don't create duplicate ResultRelInfo entries.
It's possible that the "result relation" is a partition of a partitioned
table, rather than the root. Don't create a new ResultRelInfo hash entry
for it in that case, use the parent ResultRelInfo. This was harmless
before, but now the logic of when we need to increment the modcount was
getting confused. Per the 'qp_dropped_cols' regression test.

Test change details
-------------------

src/test/regress/greenplum_schedule

    Run uao[cs]_catalog_tables tests separately.

    Because we are now more lazy with dropping AWAITING_DROP segments, the
    issue mentioned in the comment with serializable transactions now
    applies to all transactions.

uao_compaction/threshold
uaocs_compaction/threshold

    Adjust the uao[cs]_compaction/threshold test cases to make them pass.

    In the first few changed outputs, I'm not entirely sure why one segment
    chooses to insert the new tuples to segfile #2 while others pick
    segfile #1, but there's also nothing wrong with that. The point of this
    PR is that QE nodes can make the decision independently.

    In the last test with gp_toolkit.__gp_aovisimap_compaction_info(),
    all the test data is inserted to one QE node, so segfile 2 isn't
    allocated on others.

isolation2/vacuum_self_function

    Add test case for the non-SERIALIZABLE case discussed earlier.

    See also discussion at
    https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/49MW3tBoH8s/W_bdyxZMAgAJ

    This test errored out before this commit:

    ERROR:  read beyond eof in table "ao" file "base/16384/16387.1", read position 0 (small offset 0), actual read length 0 (large read length 192) (cdbbufferedread.c:211)

    Works with this PR, but we don't drop old AO segments as aggressively
    in vacuum anymore. Partly because we're now more careful and correct.
    But also because we don't detect the case that the VACUUM is the oldest
    transaction in the system, and nothing else is running; we miss the
    opportunity to drop compacted AO segments immediately in that case. The
    next VACUUM should drop them though.

    Furthermore, this test is written so that if we tried to perform AO
    compaction phase in a local rather than distributed transaction, it
    would fail. (I know because I tried to do it that way at first.) This
    is "Problem 3" that I mentioned on that gpdb-dev thread.

isolation2/fsync_ao

  Adjust "hit counts" in fsync_ao test.

  The number of times the fsyncs happen has changed. To be honest, I'm not
  sure why, nor what the principled way to get the correct value would be,
  but as long as the counts are > 0, this looks like success to me.

isolation2/gdd/planner_insert_while_vacuum_drop

  Remove GDD test that's not relevant anymore.

  The test was testing a deadlock between QD and QE, involving AO vacuum.
  We no longer take an AccessExclusiveLock in the QD, so the test case
  doesn't work. I'm not sure what an equivalent deadlock might be with the
  new AO vacuum, so just remove the test.

isolation2/vacuum_drop_phase_ao
isolation2/uao/insert_should_not_use_awaiting_drop

    These tests don't make sense in the QE-managed regime anymore. I
    couldn't figure out what these should look like with this commit, so
    remove.

isolation2/uao/insert_policy

    Update test case to cover a scenario where two transactions try to
    create the initial segment for a relation. We had a bug in that scenario
    during development, and it wasn't covered by any existing tests.

isolation2/uao/select_while_vacuum_serializable2

    The behavior is different now: a serializable transaction doesn't
    prevent the compaction phase from happening, only the recycling of the
    segfile. Adjust expected output accordingly.

isolation2/uao/compaction_utility

    VACUUM in utility mode can compact now.

Co-authored-by: Abhijit Subramanya <asubramanya@pivotal.io>
Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>
Co-authored-by: Asim R P <apraveen@pivotal.io>
Co-authored-by: Xin Zhang <xzhang@pivotal.io>
Reviewed-by: Georgios Kokolatos <gkokolatos@pivotal.io>
Reviewed-by: Taylor Vesely <tvesely@pivotal.io>

Discussion: https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/49MW3tBoH8s/W_bdyxZMAgAJ
Discussion: https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/lUwtNxaT3f0/Cg39CP8uAwAJ