Skip to content

Conversation

@nk1506
Copy link
Contributor

@nk1506 nk1506 commented Mar 19, 2024

Hive considers iceberg-view just another type of Hive-Table. So Iceberg-View on Hive should have same strategy for commit as Iceberg-Table on Hive. As part of this refactoring all the common code for table which can be used by view too has been moved to common places. Now HiveOperationsBase has all the common code for committing the iceberg-table and iceberg-view.

This change also define a new abstract class BaseMetastoreOperations which can contains common code from BaseMetastoreTableOperations and BaseViewOperations .

Motivation behind this change to have less diff for implementing Iceberg-view on Hive.

protected void doCommit(TableMetadata base, TableMetadata metadata) {
boolean newTable = base == null;
String newMetadataLocation = writeNewMetadataIfRequired(newTable, metadata);
boolean hiveEngineEnabled = hiveEngineEnabled(metadata, conf);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this common code has been moved to HiveOperationsBase .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment in https://github.com/apache/iceberg/pull/10001/files#r1533430710. Please revert these changes here. Once view support is in, we can revisit what we'll extract out

@nk1506 nk1506 marked this pull request as ready for review March 19, 2024 10:57
&& e.getMessage().contains(String.format("new table %s already exists", to))) {
throw new org.apache.iceberg.exceptions.AlreadyExistsException(
"Table already exists: %s", to);
throwErrorForExistedToContent(fromIdentifier, removeCatalogName(toIdentifier));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to do this here, as this would do another call to retrieve the table. I would leave the code here as it was before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Motivation behind this changes is
ViewCatalogTests expects error message like Cannot rename %s to %s. %s already exists. It should explicitly tell the upstream TO tableType.

Also we are making only one call to retrieve TO content.
Nevertheless, What do you propose to match the error messages for ViewCatalogTests ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I mean is that we shouldn't be doing whather it's doing in throwErrorForExistedToContent() at this point and just throw the AlreadyExistsException as it was done previously. We don't want do add new functionality in this PR IMO

Maps.newHashMap(),
null,
null,
properties.getOrDefault(HiveCatalog.VIEW_ORIGINAL_TEXT, null),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this adding view-specific things?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might not be specific to Hive-View. This is another field in Hive-Table . IMO, This field either we can ignore for both iceberg-table and view or make it optional with properties.
Ref: https://github.com/apache/hive/blame/df45194268ffb10f9f7aae0ab4e3aec35a7b31d8/standalone-metastore/metastore-common/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/Table.java#L23

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR should be a pure refactoring and not add new things/functionality. The current code on main doesn't look at this property, so why do we need to check this property as part of this PR?

@nk1506
Copy link
Contributor Author

nk1506 commented Mar 21, 2024

@nastra , Please let me know if there are more comments ?

* @param newMetadataLocation newly written metadata location
* @return true if the new metadata location presents with current or previous metadata files.
*/
protected boolean checkCurrentMetadataLocation(String newMetadataLocation) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this have to be protected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not Private , bcoz It should be available outside of the core-package. Not public as it should be only scoped too child class. Do we need public privilege here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be private. Why would it need to be available for subclasses? I don't think subclasses would call this code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends how the checkCommitStatus is being called. It it is using
BaseMetastoreTableOperations.checkCommitStatus it can be private . it it is using BaseMetastoreOperations.checkCommitStatus it should be protected.

* Checks the new metadata location presents or not after refreshing the table.
*
* @param newMetadataLocation newly written metadata location
* @return true if the new metadata location presents with current or previous metadata files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @return true if the new metadata location presents with current or previous metadata files.
* @return true if the new metadata location is the current metadata location or present within previous metadata files.

please also update the javadoc in L317 with a similar sentence as here

TableIdentifier from,
TableIdentifier originalTo,
HiveOperationsBase.ContentType contentType) {
Preconditions.checkArgument(isValidIdentifier(from), "Invalid identifier: %s", from);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is changing existing behavior. Previously it would throw NoSuchTableException

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intent here was to have common exception for Table/View. But I think with View impl we will have to revisit this.

}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
default void commitWithHiveLock(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we revert any changes in this class starting from L276 to L473? I feel like we're extracting too much functionality into a common place and some things are slightly different for tables/views later.
That's why I think it would be better to revert L276 to L473 at this point. In the first iteration of view support it's probably ok if some things are duplicated and once View support for Hive is in, we can do another round of refactoring to extract what's really common

return null;
}
}
public void setHmsParameters(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been split into two parts , one is specific to Table like snapshot,stats, etc. Other pat is common parameters as schema, location, uuid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert this to how it was before

@nk1506 nk1506 force-pushed the hive_common_for_view branch from 90e8aaa to ac26860 Compare March 21, 2024 12:12
tableName,
e);
commitStatus = checkCommitStatus(newMetadataLocation, metadata);
commitStatus =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we can use the original method from BaseMetastoreTableOperations.checkCommitStatus too.

…us instead of BaseMetastoreOperations.checkCommitStatus for Hive-Table.
@nk1506 nk1506 requested a review from nastra March 22, 2024 10:01
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BaseMetastoreOperations {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public class BaseMetastoreOperations {
public abstract class BaseMetastoreOperations {

public void fullTableNameWithDifferentValues() {
String uriTypeCatalogName = "thrift://host:port/db.table";
String nameSpaceWithOneLevel = "ns";
String nameSpaceWithTwoLevel = "ns.l2";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String nameSpaceWithTwoLevel = "ns.l2";
String namespaceWithTwoLevels = "ns.l2";

@Test
public void fullTableNameWithDifferentValues() {
String uriTypeCatalogName = "thrift://host:port/db.table";
String nameSpaceWithOneLevel = "ns";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String nameSpaceWithOneLevel = "ns";
String namespace = "ns";

try {
Table table = clients.run(client -> client.getTable(fromDatabase, fromName));
HiveOperationsBase.validateTableIsIceberg(table, fullTableName(name, from));
validateTableIsIcebergTableOrView(contentType, name, table, from);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than passing the catalog name / table / from separately, it's probably better to just pass the table and the fullTableName: fullTableName(name, from)


private void validateTableIsIcebergTableOrView(
HiveOperationsBase.ContentType contentType,
String catalogName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String catalogName,

HiveOperationsBase.ContentType contentType,
String catalogName,
Table table,
TableIdentifier identifier) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TableIdentifier identifier) {
String fullTableName) {


String table();

String catalogName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be introducing catalogName()/ contentType() to this API just for printing it in a TRACE log inside loadHmsTable(). Please remove catalogName()/ contentType() from this API

return metaClients().run(client -> client.getTable(database(), table()));
} catch (NoSuchObjectException nte) {
LOG.trace(
"{} not found {}", contentType(), catalogName() + "." + database() + "." + table(), nte);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"{} not found {}", contentType(), catalogName() + "." + database() + "." + table(), nte);
"{} not found {}", contentType(), database() + "." + table(), nte);

this is just TRACE logging, so it should be ok to not have the catalog. I'm not even sure that logging stmt should even exist

Optional.ofNullable(tbl.getParameters()).orElseGet(Maps::newHashMap);

if (!obsoleteProps.contains(TableProperties.UUID) && uuid != null) {
parameters.put(TableProperties.UUID, uuid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In https://github.com/apache/iceberg/pull/9852/files#diff-db46657b84d66e084e15f31b8dab21577efb2ae7102863f94c6c9477782de676R206 you're passing the View UUID and I wonder whether this is actually correct to do so?

This is another example where it can be dangerous to have a common method to set something that seems could be common across tables/views.

I would suggest to remove setCommonHmsParameters() from this API

return maxHiveTablePropertySize() > 0;
}

default void setSchema(TableMetadata metadata, Map<String, String> parameters) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to keep the original method (for backwards compatibiliy), but the method can refer to the new setSchema() method:

default void setSchema(TableMetadata metadata, Map<String, String> parameters) {
    setSchema(metadata.schema(), parameters);
  }

}
}

static StorageDescriptor storageDescriptor(TableMetadata metadata, boolean hiveEngineEnabled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, we need to keep StorageDescriptor storageDescriptor(TableMetadata metadata, boolean hiveEngineEnabled) but it can internally call static StorageDescriptor storageDescriptor(Schema schema, String location, boolean hiveEngineEnabled)

public void setHmsParameters(
TableMetadata metadata, Table tbl, String newMetadataLocation, Set<String> obsoleteProps) {

private void setHmsTableParameters(
Copy link
Contributor

@nastra nastra Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert these changes in this method

}

@Override
public String catalogName() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove these two methods

.doesNotContainKey(CURRENT_SNAPSHOT_TIMESTAMP);

ops.setSchema(metadata, parameters);
ops.setSchema(metadata.schema(), parameters);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ops.setSchema(metadata.schema(), parameters);
ops.setSchema(metadata, parameters);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to change this, as we need to keep the original method signature

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants