@beeender (Collaborator)

No description provided.

@liming01 liming01 force-pushed the feature_parallelcursor branch from bf4a724 to 6297e2f Compare November 1, 2019 00:27
linwen pushed a commit that referenced this pull request Apr 17, 2020
This reverts commit b3e4249.

This PR comprises the following commits:
1. Commit 01c9755 - This is from an existing approved PR
https://github.com/greenplum-db/gporca/pull/428, which was merged but later
reverted because of a regression caused by cardinality under-estimation.
2. Commit 01c9755 - This commit fixes the regression caused by the previous
commit and adds tests for it.

So for this PR, the 2nd commit is primarily the one to review.

Repro:
```
create table vt1(a varchar, b varchar);
insert into vt1 select 'a' || i, 'b' || i from generate_series(1, 10000000) i;
analyze vt1;
gnw_edw_dev=# explain with cte as (select * from vt1) select * from cte c1, cte c2, cte c3 where c1.a::text = c2.a::text and c2.a::text = c3.a::text;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..6194.47 rows=10000 width=30)
   ->  Sequence  (cost=0.00..6193.35 rows=3334 width=30)
         ->  Shared Scan (share slice:id 1:0)  (cost=0.00..569.67 rows=3333334 width=1)
               ->  Materialize  (cost=0.00..569.67 rows=3333334 width=1)
                     ->  Table Scan on vt1  (cost=0.00..504.33 rows=3333334 width=10)
         ->  Hash Join  (cost=0.00..5623.58 rows=3334 width=30) ==> #2
               Hash Cond: share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text
               ->  Hash Join  (cost=0.00..2945.63 rows=3333334 width=20)
                     Hash Cond: share0_ref4.a::text = share0_ref3.a::text #1
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
                     ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                           ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
               ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
 Optimizer status: PQO version 3.29.0
```

Check the cardinality:
```
 Hash Join  (cost=0.00..5623.58 rows=3334 width=30)
               Hash Cond: share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text
```
Since all the tuple values are distinct, the cardinality of the above join must
be `3333334` (per segment). But after commit 01c9755 started generating implied
predicates, the existing code did not remove binary-coercible casted comparisons
prior to cardinality estimation, so the estimate dropped to `3334`. That can
result in a Broadcast Motion being placed on top of this node if it is a child
of another join, significantly reducing performance.

After the fix, the cardinality is `3333334` for the join predicate
`share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text`:
```
explain with cte as (select * from vt1) select * from cte c1, cte c2, cte c3 where c1.a::text = c2.a::text and c2.a::text = c3.a::text;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..7760.90 rows=10000000 width=30)
   ->  Sequence  (cost=0.00..6642.90 rows=3333334 width=30)
         ->  Shared Scan (share slice:id 1:0)  (cost=0.00..569.67 rows=3333334 width=1)
               ->  Materialize  (cost=0.00..569.67 rows=3333334 width=1)
                     ->  Table Scan on vt1  (cost=0.00..504.33 rows=3333334 width=10)
         ->  Hash Join  (cost=0.00..5973.23 rows=3333334 width=30) #1
               Hash Cond: share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text
               ->  Hash Join  (cost=0.00..2945.63 rows=3333334 width=20)
                     Hash Cond: share0_ref4.a::text = share0_ref3.a::text
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
                     ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                           ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
               ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
 Optimizer status: PQO version 3.29.0
```

Bump ORCA version 3.42.0
linwen pushed a commit that referenced this pull request Apr 17, 2020
To achieve that, refactor GetScCmpMdIdApplyCasts by splitting it into two functions:
 1. GetScCmpMdIdConsiderCasts, which looks up a scalar comparison operator in the CMDCache;
 2. ApplyCastsForScCmp, which applies casts to conform to the chosen scalar comparison operator.

CTranslatorExprToDXL::PdxlnHashJoin() needs to call only #2, and so
preserves the scalar comparison operator. This commit also applies the same changes for INDF.
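The shape of the split can be sketched as follows (Python pseudocode with hypothetical names and structures; the real functions are in ORCA's C++ translator):

```python
# Sketch: separate the operator lookup from the cast application, so a
# caller that already knows the comparison operator (e.g. a hash join
# translator) can apply casts without re-running the lookup and losing
# the operator it chose.

def get_sc_cmp_mdid_consider_casts(cache, left_type, right_type, cmp_kind):
    """Look up the scalar comparison operator id in a metadata cache."""
    return cache[(left_type, right_type, cmp_kind)]

def apply_casts_for_sc_cmp(expr_left, expr_right, op_left_type, op_right_type):
    """Wrap operands in casts so they conform to the chosen operator."""
    if expr_left["type"] != op_left_type:
        expr_left = {"cast_to": op_left_type, "child": expr_left,
                     "type": op_left_type}
    if expr_right["type"] != op_right_type:
        expr_right = {"cast_to": op_right_type, "child": expr_right,
                      "type": op_right_type}
    return expr_left, expr_right
```

With this split, the hash-join path calls only the second function, which is what preserves the previously chosen comparison operator.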
linwen pushed a commit that referenced this pull request Apr 17, 2020
Overview
--------

Manage the allocation of AO segfiles in QE nodes, instead of the master.
That simplifies a lot of things. We no longer need to send the decision
on which segfile to use from QD to QE, we no longer need to keep up-to-date
tuple counts in the master, and so forth.

This gets rid of the in-memory hash table to track AO segment status. To
lock a segfile for insertion or other operations, we now rely on tuple
locks on the AO segment table instead.

VACUUM code has been refactored. I tried to make the code structure closer
to upstream, to reduce merge conflicts as we merge with upstream. The way
the VACUUM phases are dispatched, and the phases needed, are different in
the new regime. Some phases could use local transactions while others still
have to use distributed transactions, but for simplicity we use distributed
transactions for all of the phases.

Now that QE nodes choose the insertion target segfile independently, we
could use the same algorithm for choosing the target segfile in utility
mode, and wouldn't need to reserve segfile #0 for special operations
anymore. But to keep things familiar, and to avoid any more churn in the
regression tests, this keeps the historical behavior and still reserves
segfile #0 for utility mode and CREATE TABLE AS.
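The reservation described above can be sketched as a simple rule (illustrative only; `candidate_segnos` and the segno limit are hypothetical, not the actual GPDB code):

```python
RESERVED_SEGNO = 0  # kept reserved for utility mode and CREATE TABLE AS

def candidate_segnos(in_utility_mode, max_segno=127):
    """Which AO segfile numbers a backend may pick as an insertion target.

    Normal QE-managed allocation never hands out segfile #0; utility mode
    uses only segfile #0, preserving the historical behavior.
    """
    if in_utility_mode:
        return [RESERVED_SEGNO]
    return list(range(1, max_segno + 1))
```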

Previously, SERIALIZABLE (and REPEATABLE READ) transactions were treated
differently from transactions in READ COMMITTED mode. If there was an
old serializable transaction running, VACUUM refrained from recycling
old AO segfiles after compaction, because the serializable transaction
might lock the table later and try to read the rows that were compacted
away. But that was not completely waterproof; a READ COMMITTED transaction
that's still holding onto an old snapshot is no different from a
serializable transaction. There's a gap between acquiring a snapshot and
locking a table, so even if a READ COMMITTED backend is not holding a lock
on a table right now, it might open it later. To fix that, remove the
distinction between serializable and other transactions. Any open
transaction will now prevent AO segfiles in awaiting-drop state from being
removed. This commit adds an isolation2 test case called
'vacuum_self_function' to test one such case; it used to fail before this
commit.

That is a user-visible change in behavior that's worth calling out
explicitly. If there are concurrent transactions running, even in
READ COMMITTED mode, and even if they don't access the table at all,
VACUUM will no longer be able to recycle compacted segfiles. The next VACUUM
will clean them up, once the old concurrent transactions have finished.
Meanwhile, the awaiting-drop segfiles will use up disk space and increase the
risk that an insertion fails because all segfiles are in use.
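The new recycling rule can be sketched like this (a simplification with hypothetical names; the real check compares transaction ids from the segfile's pg_aoseg tuple against the running snapshot):

```python
AWAITING_DROP = "awaiting_drop"

def can_recycle(segfile, oldest_running_xid):
    """An awaiting-drop segfile may be recycled only when every transaction
    that started before the compaction has finished. Isolation level no
    longer matters: any open transaction, even READ COMMITTED ones that
    never touch the table, holds back recycling.
    """
    return (segfile["state"] == AWAITING_DROP
            and segfile["drop_xid"] < oldest_running_xid)
```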

This work was originally submitted as github PR #790; you might see references
to that in the discussions.

Code change details
-------------------

This gets rid of the in-memory hash table to track AO segment status. To
lock a segfile for insertion or other operations, we now rely on tuple
locks on the AO segment table instead. When an AO segment is chosen as
insertion target, its AO segment table row is locked for the transaction
(by heap_lock_tuple, i.e. by stamping its XMAX). Later, when the insertion
finishes, the tuple is updated with the new EOF (no change there). When
choosing a new insertion target segfile, we scan the pg_aoseg table instead
of the hash table, and use the xmin/xmax to determine which ones are safe
to use. The process of choosing a target segment itself is not
concurrency-safe, so the relation extension lock is held to protect that.
See new "Locking and snapshots" section in
src/backend/access/appendonly/README.md for details.
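The selection loop described above can be sketched roughly as follows (hypothetical names and simplified state; the actual logic is in the C code and the README section mentioned above):

```python
import threading

# Stand-in for the relation extension lock that serializes target selection.
_extension_lock = threading.Lock()

def choose_segno_for_write(aoseg_rows, try_lock_tuple):
    """Scan the pg_aoseg rows under the relation extension lock and claim
    the first segfile with no concurrent writer.

    try_lock_tuple stands in for heap_lock_tuple: it attempts to lock the
    segment's pg_aoseg row (stamping its XMAX) and returns True on success.
    """
    with _extension_lock:
        for row in aoseg_rows:
            if row["in_use"]:  # xmin/xmax indicate a concurrent writer
                continue
            if try_lock_tuple(row):
                return row["segno"]
    return None  # no segfile available; caller may create a new one
```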

tupcount and modcount are no longer kept up-to-date in the master. Remove all
the code needed to send the totals back from QE to QD at the end of a command,
including the GPDB-specific 'o' protocol FE/BE message.

Remove the machinery to track which PROCs are in a serializable transaction.
It was not completely waterproof anyway, and there were some existing bugs
in that area. A READ COMMITTED transaction that's still holding onto an old
snapshot is no different from a serializable transaction. This commit adds
an isolation2 test case called 'vacuum_self_function' to test one such case;
it used to fail before this commit.

Move responsibility of creating AO segfile to callers of
appendonly_insert_init(). The caller must have locked, and created if
necessary, the segfile by calling ChooseSegnoForWrite() or
LockSegnoForWrite(). Previously, the pg_aoseg was not locked in rewriting
ALTER TABLE and other such commands that always used segfile #0. That was
OK, because they held an AccessExclusiveLock on the table, but let's be
consistent.

When selecting the insertion target for VACUUM compaction, prefer creating
a new segfile over using an existing non-empty segfile. Existing empty
segfiles are still choice #1, but let's avoid writing to non-empty
existing files, because otherwise we cannot vacuum those segfiles in the
same VACUUM cycle.
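That preference order can be sketched as follows (illustrative only; names and the segfile representation are hypothetical):

```python
def choose_compaction_target(segfiles, can_create_new):
    """Pick an insertion target for VACUUM compaction.

    Preference order: (1) an existing empty segfile, (2) a brand-new
    segfile, (3) an existing non-empty segfile as a last resort. Avoiding
    non-empty files means they can still be compacted in this VACUUM cycle.
    """
    empty = [s for s in segfiles if s["tupcount"] == 0 and not s["in_use"]]
    if empty:
        return empty[0]["segno"]
    if can_create_new():
        return max((s["segno"] for s in segfiles), default=0) + 1
    nonempty = [s for s in segfiles if not s["in_use"]]
    return nonempty[0]["segno"] if nonempty else None
```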

Warn, if compaction fails because no insertion segfile could be chosen.
This used to be an ERROR in the old code, but I think a WARNING is more
appropriate.

In targetid_get_partition(), don't create duplicate ResultRelInfo entries.
It's possible that the "result relation" is a partition of a partitioned
table, rather than the root. Don't create a new ResultRelInfo hash entry
for it in that case, use the parent ResultRelInfo. This was harmless
before, but now the logic of when we need to increment the modcount was
getting confused. Per the 'qp_dropped_cols' regression test.

Test change details
-------------------

src/test/regress/greenplum_schedule

    Run uao[cs]_catalog_tables tests separately.

    Because we are now more lazy with dropping AWAITING_DROP segments, the
    issue mentioned in the comment with serializable transactions now
    applies to all transactions.

uao_compaction/threshold
uaocs_compaction/threshold

    Adjust the uao[cs]_compaction/threshold test cases to make them pass.

    In the first few changed outputs, I'm not entirely sure why one segment
    chooses to insert the new tuples to segfile #2 while others pick
    segfile #1, but there's also nothing wrong with that. The point of this
    PR is that QE nodes can make the decision independently.

    In the last test with gp_toolkit.__gp_aovisimap_compaction_info(),
    all the test data is inserted to one QE node, so segfile 2 isn't
    allocated on others.

isolation2/vacuum_self_function

    Add test case for the non-SERIALIZABLE case discussed earlier.

    See also discussion at
    https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/49MW3tBoH8s/W_bdyxZMAgAJ

    This test errored out before this commit:

    ERROR:  read beyond eof in table "ao" file "base/16384/16387.1", read position 0 (small offset 0), actual read length 0 (large read length 192) (cdbbufferedread.c:211)

    Works with this PR, but we don't drop old AO segments as aggressively
    in vacuum anymore. Partly because we're now more careful and correct.
    But also because we don't detect the case that the VACUUM is the oldest
    transaction in the system, and nothing else is running; we miss the
    opportunity to drop compacted AO segments immediately in that case. The
    next VACUUM should drop them though.

    Furthermore, this test is written so that if we tried to perform AO
    compaction phase in a local rather than distributed transaction, it
    would fail. (I know because I tried to do it that way at first.) This
    is "Problem 3" that I mentioned on that gpdb-dev thread.

isolation2/fsync_ao

  Adjust "hit counts" in fsync_ao test.

  The number of times the fsyncs happen has changed. To be honest, I'm not
  sure why, nor what the principled way to get the correct value would be,
  but as long as the counts are > 0, this looks like success to me.

isolation2/gdd/planner_insert_while_vacuum_drop

  Remove GDD test that's not relevant anymore.

  The test was testing a deadlock between QD and QE, involving AO vacuum.
  We no longer take an AccessExclusiveLock in the QD, so the test case
  doesn't work. I'm not sure what an equivalent deadlock might be with the
  new AO vacuum, so just remove the test.

isolation2/vacuum_drop_phase_ao
isolation2/uao/insert_should_not_use_awaiting_drop

    These tests don't make sense in the QE-managed regime anymore. I
    couldn't figure out what these should look like with this commit, so
    remove.

isolation2/uao/insert_policy

    Update test case to cover a scenario where two transactions try to
    create the initial segment for a relation. We had a bug in that scenario
    during development, and it wasn't covered by any existing tests.

isolation2/uao/select_while_vacuum_serializable2

    The behavior is different now: a serializable transaction doesn't
    prevent the compaction phase from happening, only the recycling of the
    segfile. Adjust expected output accordingly.

isolation2/uao/compaction_utility

    VACUUM in utility mode can compact now.

Co-authored-by: Abhijit Subramanya <asubramanya@pivotal.io>
Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>
Co-authored-by: Asim R P <apraveen@pivotal.io>
Co-authored-by: Xin Zhang <xzhang@pivotal.io>
Reviewed-by: Georgios Kokolatos <gkokolatos@pivotal.io>
Reviewed-by: Taylor Vesely <tvesely@pivotal.io>

Discussion: https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/49MW3tBoH8s/W_bdyxZMAgAJ
Discussion: https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/lUwtNxaT3f0/Cg39CP8uAwAJ