Hello libCEED developers,
Thank you for the library; it's very impressive and extremely enabling for Palace, where we're using it for almost all of our linear algebra at this stage. Given that, we've been investigating some data structure optimizations and have some questions about how the padded backends work:
- Where is the padding for a blocked backend defined, and how does the "padded data" get computed from the valid data while not being accumulated/used for calculations in the output vector?
- How is this padding affected by the choice of stride pattern a user provides? Are users allowed to specify strides other than CEED_STRIDES_BACKEND in CeedElemRestrictionCreateStrided (see the sketch just after this list) and still work with blocked backends?
- Is changing the data storage pattern for a passive vector from "all of the i-th component, then all of the (i+1)-th, etc." to "all components for quad point j, then all for quad point j+1" a bad idea for a non-CPU backend?
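To make the second question concrete, here is roughly the kind of call we are making (a simplified, standalone sketch with placeholder sizes, not the actual Palace code, and assuming the CeedElemRestrictionCreateStrided signature from the current docs):

#include <ceed.h>

int main(void) {
  Ceed ceed;
  CeedInit("/cpu/self/ref/blocked", &ceed);

  // Placeholder sizes for illustration only.
  const CeedInt  num_elem = 14362;  // elements in the mesh
  const CeedInt  num_qpts = 11;     // quadrature points per element
  const CeedInt  num_comp = 11;     // stored geometry-data components
  const CeedSize l_size   = (CeedSize)num_elem * num_qpts * num_comp;

  // Option A: let the backend choose its preferred layout.
  CeedElemRestriction rstr_default;
  CeedElemRestrictionCreateStrided(ceed, num_elem, num_qpts, num_comp, l_size,
                                   CEED_STRIDES_BACKEND, &rstr_default);

  // Option B: request an interleaved layout, all components of a quadrature
  // point adjacent, via offset = qpt*strides[0] + comp*strides[1] + elem*strides[2].
  CeedInt strides[3] = {num_comp, 1, num_comp * num_qpts};
  CeedElemRestriction rstr_interleaved;
  CeedElemRestrictionCreateStrided(ceed, num_elem, num_qpts, num_comp, l_size,
                                   strides, &rstr_interleaved);

  CeedElemRestrictionDestroy(&rstr_default);
  CeedElemRestrictionDestroy(&rstr_interleaved);
  CeedDestroy(&ceed);
  return 0;
}

The question is essentially whether Option B is supported when the backend is one of the blocked ones.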
Long version:
We have QFunctions that use a passive vector in which we have stored the geometric factor data, computed in advance, along with two additional pieces of scalar data. We do this because many kernels reuse the geometric factor data, so we have traded storage for time. This data is stored as a Q-vector, given there are nelem * nquad_per_elem * n_component (= 11) pieces of data. An issue we've noticed through some CPU profiling, however, is that with CEED_STRIDES_BACKEND the 11 components of the data are stored with a stride of Q in the data vector, which for a large enough Q (88 for /cpu/self/ref/blocked, for instance) causes a cache miss for each value of the data drawn (this was all found using VTune, so we're moderately convinced it's a real effect). Namely, all the 0th-component terms come first, then all the 1st-component terms, then all the 2nd, etc., meaning we get ~10 L3 cache misses per quadrature point (resulting in a lot of DRAM accesses).
Given this, we wanted to try changing the data storage to be interleaved by component (I'm not sure of the libCEED terminology for this): basically have all 11 components of data that are needed adjacent in the data vector, rather than all of the first components, then all of the second components, etc. (thinking about the data vector as a Q x 11 matrix, this amounts to row-major vs. column-major storage). Then a load for the first data component would almost certainly pull all the data for the given quadrature point into the cache in one go, and we'd no longer need a DRAM access (which is what we have observed with VTune).
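As a sketch of the access-pattern difference inside the QFunction loop (plain C for illustration, not the actual Palace kernels; qd points at the qdata seen by the QFunction, i is the quadrature point index within the block, Q is the block size, and NUM_COMP = 11):

#include <ceed.h>

#define NUM_COMP 11

// Layout we currently get with CEED_STRIDES_BACKEND on the blocked CPU
// backend: component-major, stride Q between components, so the 11 reads
// for one quadrature point are Q doubles apart and tend to land on 11
// different cache lines.
static void load_component_major(const CeedScalar *qd, CeedInt i, CeedInt Q,
                                 CeedScalar data[NUM_COMP]) {
  for (CeedInt c = 0; c < NUM_COMP; c++) data[c] = qd[i + c * Q];
}

// Layout we would like: interleaved, stride 1 between components, so the
// 11 values for one quadrature point are contiguous and typically covered
// by one or two cache lines.
static void load_interleaved(const CeedScalar *qd, CeedInt i,
                             CeedScalar data[NUM_COMP]) {
  for (CeedInt c = 0; c < NUM_COMP; c++) data[c] = qd[i * NUM_COMP + c];
}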
Changing the strides from CEED_STRIDES_BACKEND to
CeedInt strides[3] = {
geom_data_size, // Stride between nodes
1, // Stride between components
geom_data_size * num_qpts // Stride between elements
};
and adjusting how we write into the storage vector gives the expected results for the non-blocked backends, for a range of polynomial orders and various different QFunctions that all store this data, so I am fairly convinced we are populating the qdata vector appropriately (though there's some question of what happens for blocking in the final block, which I'll get to).
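For completeness, the flat-index mapping we are assuming when we populate the vector under these strides (our own bookkeeping, sketched as a small helper; libCEED only sees the resulting buffer and the strides array):

#include <ceed.h>

// Our assumed meaning of the strides array:
//   strides[0] = stride between quadrature points (nodes)
//   strides[1] = stride between components
//   strides[2] = stride between elements
static CeedSize qdata_offset(CeedInt elem, CeedInt qpt, CeedInt comp,
                             const CeedInt strides[3]) {
  return (CeedSize)qpt * strides[0] + (CeedSize)comp * strides[1] +
         (CeedSize)elem * strides[2];
}

so component c of quadrature point q in element e is written to qdata[qdata_offset(e, q, c, strides)].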
The mesh I am testing with has 14362 elements and 11 quad points per element (tets), making 157982 total quadrature points (so 157982 * 11 entries in the qvector). The Q = 88 fed into the QFunction comes from /cpu/self/ref/blocked, for instance, and rem(157982, 88) = 22, so my mental model is that on the last block of 88 quadrature points being processed there are only 22 valid quadrature points; the rest are padded with data that shouldn't be accumulated (I'm surmising this; I've not been able to disentangle it from the source so far). Presumably this is made possible by filling the active vector with zero for these quadrature points, making the padded data irrelevant at the cost of a few flops (but crucially with no branching required).
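A quick sanity check of that remainder arithmetic, with the zero-padding argument spelled out (this is my reasoning about what the backend might be doing, not code from libCEED):

#include <stdio.h>

int main(void) {
  const long num_elem   = 14362;
  const long qpts_per_e = 11;
  const long total_qpts = num_elem * qpts_per_e;  // 157982
  const long Q          = 88;                     // block size seen in the QFunction

  const long full_blocks = total_qpts / Q;  // 1795 full blocks
  const long last_valid  = total_qpts % Q;  // 22 valid quadrature points left over
  printf("full blocks: %ld, valid qpts in last block: %ld, padded lanes: %ld\n",
         full_blocks, last_valid, Q - last_valid);

  // If the active input is zeroed on the 66 padded lanes, any junk in the
  // padded qdata multiplies a zero and contributes nothing to the output
  // (this is the surmise above, at the cost of a few wasted flops).
  return 0;
}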
When I write out the contents of in[0] (the passive vector we are storing the data in) to the terminal as a Q x 11 matrix (each column being one of the 11 components of data I want), I see there are 22 rows with a valid 0-column entry (in this particular example it is easy to tell whether these entries are valid: they should be 1.000e+00, while the other values are all smaller floating-point numbers). The data for the first few rows looks like
[0-th row] 1.000e+00, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.066e+01, -2.607e+02, -2.607e+02,
...
[2-th row] 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, -4.300e+01, -2.118e+01, -2.118e+01, -2.118e+01,
...
[8-th row] 1.000e+00, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.642e-09, 2.066e+01, -2.607e+02,
...
[11-th row] 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00, 1.000e+00*, -4.300e+01, -2.118e+01, -2.118e+01, -2.118e+01,
I've skipped the rows that are less interesting, but two things jump out here: 1) 1.000e+00 gets repeated a lot on the (8 * i + 2)-th rows, and 2) 2.642e-09 a lot on the (8 * i)-th rows. From this it seems like there is some copying of valid quadrature point data to create "padded" quadrature point data, but the copying into the block obeys a stride pattern that isn't the one provided when assembling the qvector originally, nor the one in the element restriction used to access it later; I think it corresponds to the CEED_STRIDES_BACKEND default, which I believe maps to {1, geom_data_size, geom_data_size * num_qpts} in this case. If I write the data in the other stride pattern, there are whole rows with negative values repeated, things like:
-2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02,
-2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02,
-2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02, -2.607e+02,
Does a negative value indicate the value was padded in?
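To illustrate the stride-mismatch hypothesis above with a toy example (purely my own reasoning, not libCEED's actual copy logic): if a buffer is written interleaved (component stride 1) but later read or copied assuming a component stride of Q, the apparent Q x n_comp matrix is a reshuffled view of the same data, which would produce exactly this kind of regular repetition of values:

#include <stdio.h>

#define Q  8  // toy block size (the real one is 88)
#define NC 3  // toy component count (the real one is 11)

int main(void) {
  double flat[Q * NC];

  // Write interleaved: component stride 1, quadrature point stride NC.
  for (int q = 0; q < Q; q++)
    for (int c = 0; c < NC; c++) flat[q * NC + c] = 100.0 * q + c;

  // Read back assuming component stride Q (a backend-default-style layout):
  // the printed matrix is the same data viewed through the wrong strides.
  for (int q = 0; q < Q; q++) {
    for (int c = 0; c < NC; c++) printf("%7.1f ", flat[q + c * Q]);
    printf("\n");
  }
  return 0;
}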
How can we alter this padding behaviour to respect the stride pattern, or is this a fundamental limitation, i.e. one can't use a custom stride pattern for a data vector if blocked backends are desired? We have been looking through https://libceed.org/en/latest/ and the examples, but to no avail (searching for terms like "block"); is there any documentation on how the blocked backends transform relative to the non-blocked backends?
Appendices:
- https://github.com/awslabs/palace/blob/main/palace/fem/qfunctions/33/hcurl_33_qf.h an example QFunction; adjJt is the matrix variable.
- https://github.com/awslabs/palace/blob/main/palace/fem/qfunctions/33/geom_33_qf.h the existing writing of the quadrature point data to the vector for later access.
- https://github.com/awslabs/palace/blob/main/palace/fem/qfunctions/33/utils_33_qf.h#L39 unpacking the adjugate data into a stack variable; note the stride is Q, which we're changing to effectively be 1 (and skipping this copying entirely).
- https://github.com/awslabs/palace/blob/main/palace/fem/mesh.cpp#L180-L189 calls to libCEED with the default stride (wrapped in a helper for checking error codes, which can be ignored).