The Design of ObjectInspector in Hive

Published: 2020-07-26 11:48:52 Source: web Views: 1096 Author: choulanlan Category: Big Data

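ObjectInspector is a concept in Hive that looks rather confusing at first glance; when I first read the Hive source code, it took me a long time to understand it. Once I did, I found that ObjectInspector plays a major role: it decouples how data is used from how data is formatted, which increases code reuse. Put simply, the ObjectInspector interface frees Hive from being tied to one particular data format, so that a data flow can 1) switch between different input/output formats at the input and output ends, and 2) use different data formats on different Operators.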

This is the ObjectInspector interface:
public interface ObjectInspector extends Cloneable {
public static enum Category {
    PRIMITIVE, LIST, MAP, STRUCT, UNION
};

String getTypeName();

Category getCategory();
}

This interface provides only the most general methods, getTypeName and getCategory. Now let's look at its abstract subclasses and sub-interfaces:
StructObjectInspector
MapObjectInspector
ListObjectInspector
PrimitiveObjectInspector
UnionObjectInspector

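Of these, PrimitiveObjectInspector handles the parsing of primitive data types, while StructObjectInspector handles the parsing of a row of data and is itself composed of a group of ObjectInspectors. Because Hive supports nested data structures, a StructObjectInspector can nest arbitrary ObjectInspectors, one or more levels deep. Struct, Map, List, and Union are the four collection data types Hive supports. For example, a column can be declared as a Struct type, in which case the StructObjectInspector for the row contains another StructObjectInspector for that column.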
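The nesting described above can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions, not Hive's actual classes: `Inspector`, `PrimitiveInspector`, `StructInspector`, and `render` are hypothetical names, only two categories are modeled, and a struct row is assumed to be stored as an `Object[]`.

```java
import java.util.Arrays;
import java.util.List;

class NestedInspectorSketch {
    enum Category { PRIMITIVE, STRUCT }

    /** Hypothetical, pared-down stand-in for Hive's ObjectInspector. */
    interface Inspector {
        Category getCategory();
    }

    /** Leaf inspector: the data is assumed to be a plain Java value already. */
    static class PrimitiveInspector implements Inspector {
        public Category getCategory() { return Category.PRIMITIVE; }
    }

    /** Struct inspector: one child inspector per field; rows are Object[] in this sketch. */
    static class StructInspector implements Inspector {
        final List<Inspector> fieldInspectors;

        StructInspector(Inspector... fieldInspectors) {
            this.fieldInspectors = Arrays.asList(fieldInspectors);
        }

        public Category getCategory() { return Category.STRUCT; }

        Object getFieldData(Object row, int index) { return ((Object[]) row)[index]; }
    }

    /** Walks data of any nesting depth by dispatching on the inspector's category. */
    static String render(Object data, Inspector oi) {
        if (oi.getCategory() == Category.PRIMITIVE) {
            return String.valueOf(data);
        }
        StructInspector soi = (StructInspector) oi;
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < soi.fieldInspectors.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(render(soi.getFieldData(data, i), soi.fieldInspectors.get(i)));
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Inspector prim = new PrimitiveInspector();
        // Row schema: (int, struct<string, double>)
        Inspector rowOi = new StructInspector(prim, new StructInspector(prim, prim));
        Object row = new Object[] {789, new Object[] {"hive", 5.3}};
        System.out.println(render(row, rowOi)); // prints {789,{hive,5.3}}
    }
}
```

The caller of `render` never inspects the row directly; the tree of inspectors mirrors the shape of the data, which is roughly how a nested Struct column would be handled.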

Now let's look at a small example of how ObjectInspector works. This is test-case code for a Hive SerDe:

/**
   * Test the LazySimpleSerDe class.
   */
public void testLazySimpleSerDe() throws Throwable {
    try {
      // Create the SerDe
      LazySimpleSerDe serDe = new LazySimpleSerDe();
      Configuration conf = new Configuration();
      Properties tbl = createProperties();
      // Initialize the SerDe with the Properties
      serDe.initialize(conf, tbl);

      // Data
      Text t = new Text("123\t456\t789\t1000\t5.3\thive and hadoop\t1.\tNULL");
      String s = "123\t456\t789\t1000\t5.3\thive and hadoop\tNULL\tNULL";
      Object[] expectedFieldsData = {new ByteWritable((byte) 123),
          new ShortWritable((short) 456), new IntWritable(789),
          new LongWritable(1000), new DoubleWritable(5.3),
          new Text("hive and hadoop"), null, null};

      // Test
      deserializeAndSerialize(serDe, t, s, expectedFieldsData);
    } catch (Throwable e) {
      e.printStackTrace();
      throw e;
    }
}

   private void deserializeAndSerialize(LazySimpleSerDe serDe, Text t, String s,
      Object[] expectedFieldsData) throws SerDeException {
    // Get the row ObjectInspector
    StructObjectInspector oi = (StructObjectInspector) serDe
        .getObjectInspector();
    // Get the column (field) information
    List<? extends StructField> fieldRefs = oi.getAllStructFieldRefs();
    assertEquals(8, fieldRefs.size());

    // Deserialize
    Object row = serDe.deserialize(t);
    for (int i = 0; i < fieldRefs.size(); i++) {
      Object fieldData = oi.getStructFieldData(row, fieldRefs.get(i));
      if (fieldData != null) {
        fieldData = ((LazyPrimitive) fieldData).getWritableObject();
      }
      assertEquals("Field " + i, expectedFieldsData[i], fieldData);
    }
    // Serialize
    assertEquals(Text.class, serDe.getSerializedClass());
    Text serializedText = (Text) serDe.serialize(row, oi);
    assertEquals("Serialized data", s, serializedText.toString());
}

// Create the schema and store it in Properties
private Properties createProperties() {
    Properties tbl = new Properties();

    // Set the configuration parameters
    tbl.setProperty(Constants.SERIALIZATION_FORMAT, "9");
    tbl.setProperty("columns",
        "abyte,ashort,aint,along,adouble,astring,anullint,anullstring");
    tbl.setProperty("columns.types",
        "tinyint:smallint:int:bigint:double:string:int:string");
    tbl.setProperty(Constants.SERIALIZATION_NULL_FORMAT, "NULL");
    return tbl;
}

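From this example it is easy to see that Hive decouples reading the columns of a row from the way the row is stored. Only the ObjectInspector knows how a row and its columns are stored and accessed; the consumer knows nothing about the storage details. A consumer of the data needs only the row Object and the corresponding ObjectInspector to read out the object for each column.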

This snippet makes it as clear as can be: the ObjectInspector oi controls access to the columns:
for (int i = 0; i < fieldRefs.size(); i++) {
      Object fieldData = oi.getStructFieldData(row, fieldRefs.get(i));
      if (fieldData != null) {
        fieldData = ((LazyPrimitive) fieldData).getWritableObject();
      }
      assertEquals("Field " + i, expectedFieldsData[i], fieldData);
}
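The same idea can be shown without any Hive dependency. In the sketch below, `RowInspector`, `DelimitedRowInspector`, `ArrayRowInspector`, and `describe` are hypothetical names invented for illustration. The point is only that one consumer reads rows stored in two different ways, because it touches the row solely through the inspector it is given.

```java
class InspectorSketch {
    /** Hypothetical stand-in for StructObjectInspector: reads fields from an opaque row. */
    interface RowInspector {
        int fieldCount(Object row);
        Object getField(Object row, int index);
    }

    /** Rows stored as a single tab-delimited String. */
    static class DelimitedRowInspector implements RowInspector {
        public int fieldCount(Object row) {
            return ((String) row).split("\t", -1).length;
        }
        public Object getField(Object row, int index) {
            return ((String) row).split("\t", -1)[index];
        }
    }

    /** Rows stored as an Object[] of already-parsed values. */
    static class ArrayRowInspector implements RowInspector {
        public int fieldCount(Object row) {
            return ((Object[]) row).length;
        }
        public Object getField(Object row, int index) {
            return ((Object[]) row)[index];
        }
    }

    /** Consumer: needs only the (row, inspector) pair, never the storage format. */
    static String describe(Object row, RowInspector oi) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < oi.fieldCount(row); i++) {
            if (i > 0) sb.append('|');
            sb.append(oi.getField(row, i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Both calls print 123|hive even though the rows are stored differently.
        System.out.println(describe("123\thive", new DelimitedRowInspector()));
        System.out.println(describe(new Object[] {123, "hive"}, new ArrayRowInspector()));
    }
}
```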

This snippet deserializes a row and then serializes it again:
    Object row = serDe.deserialize(t);
    Text serializedText = (Text) serDe.serialize(row, oi);
From this it is easy to see that, given different SerDe objects, a record can easily be deserialized and then serialized into a different format, which makes switching data formats very convenient.
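A rough sketch of such a format switch follows. It assumes a much simpler contract than Hive's real SerDe classes: `SimpleSerDe`, `DelimitedSerDe`, `RowInspector`, and `convert` are hypothetical, and every field is treated as a string. What it keeps from the test above is the shape of the call: the output side serializes a row it did not produce, guided by the input side's inspector.

```java
import java.util.regex.Pattern;

class FormatSwitchSketch {
    /** Hypothetical inspector: reads string fields out of an opaque row. */
    interface RowInspector {
        int fieldCount(Object row);
        String getField(Object row, int index);
    }

    /** Hypothetical, simplified SerDe contract (not Hive's actual interface). */
    interface SimpleSerDe {
        Object deserialize(String line);
        RowInspector getInspector();
        String serialize(Object row, RowInspector oi);
    }

    /** Text format with a configurable field delimiter; rows are String[] internally. */
    static class DelimitedSerDe implements SimpleSerDe {
        private final String delimiter;

        DelimitedSerDe(String delimiter) { this.delimiter = delimiter; }

        public Object deserialize(String line) {
            return line.split(Pattern.quote(delimiter), -1);
        }

        public RowInspector getInspector() {
            return new RowInspector() {
                public int fieldCount(Object row) { return ((String[]) row).length; }
                public String getField(Object row, int index) { return ((String[]) row)[index]; }
            };
        }

        public String serialize(Object row, RowInspector oi) {
            // Reads the row only through the inspector, so any row type works.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < oi.fieldCount(row); i++) {
                if (i > 0) sb.append(delimiter);
                sb.append(oi.getField(row, i));
            }
            return sb.toString();
        }
    }

    /** Deserialize with one SerDe, serialize with another. */
    static String convert(String line, SimpleSerDe in, SimpleSerDe out) {
        Object row = in.deserialize(line);
        return out.serialize(row, in.getInspector());
    }

    public static void main(String[] args) {
        String result = convert("123\t456\thive and hadoop",
                new DelimitedSerDe("\t"), new DelimitedSerDe(","));
        System.out.println(result); // prints 123,456,hive and hadoop
    }
}
```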

With the example above understood, it is easy to see why every Hive ExprNodeEvaluator, UDF, UDAF, and UDTF needs an (Object, ObjectInspector) pair. Separating the storage details of data from its use means Hive does not have to implement a different version of the same UDF, UDAF, or UDTF for each data format; all these functions see is the WritableObject!

Below is the expression evaluator, ExprNodeEvaluator:
/**
* ExprNodeEvaluator.
*
*/
public abstract class ExprNodeEvaluator {

/**
   * Initialize should be called once and only once. Return the ObjectInspector
   * for the return value, given the rowInspector.
   */
public abstract ObjectInspector initialize(ObjectInspector rowInspector) throws HiveException;

/**
   * Evaluate the expression given the row. This method should use the
   * rowInspector passed in from initialize to inspect the row object. The
   * return value will be inspected by the return value of initialize.
   */
public abstract Object evaluate(Object row) throws HiveException;

}

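initialize is where the ObjectInspector is set up: it receives the ObjectInspector for the input row and returns the ObjectInspector for the output data (which is responsible for interpreting the object returned by the evaluate method). Each evaluate call is then passed one row Object, which is interpreted through the row ObjectInspector.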
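The initialize-once, evaluate-per-row pattern might look like the following for a simple column reference. This is a sketch under stated assumptions, not Hive's implementation: `RowInspector`, `ArrayRowInspector`, and `ColumnEvaluator` are hypothetical, and unlike the real initialize method, this simplified one does not return an inspector for the result.

```java
import java.util.Arrays;
import java.util.List;

class EvaluatorSketch {
    /** Hypothetical stand-in for a row ObjectInspector with named columns. */
    interface RowInspector {
        int indexOf(String column);
        Object getField(Object row, int index);
    }

    /** Rows stored as Object[]; column names are supplied up front. */
    static class ArrayRowInspector implements RowInspector {
        private final List<String> columns;

        ArrayRowInspector(String... columns) { this.columns = Arrays.asList(columns); }

        public int indexOf(String column) { return columns.indexOf(column); }

        public Object getField(Object row, int index) { return ((Object[]) row)[index]; }
    }

    /** Column-reference evaluator: resolves the column once, then reads it for each row. */
    static class ColumnEvaluator {
        private final String column;
        private RowInspector rowInspector;
        private int index;

        ColumnEvaluator(String column) { this.column = column; }

        /** Called once: remember the inspector and do the name lookup up front. */
        void initialize(RowInspector rowInspector) {
            this.rowInspector = rowInspector;
            this.index = rowInspector.indexOf(column);
            if (index < 0) {
                throw new IllegalArgumentException("Unknown column: " + column);
            }
        }

        /** Called per row: the row is opaque and is read only through the inspector. */
        Object evaluate(Object row) {
            return rowInspector.getField(row, index);
        }
    }

    public static void main(String[] args) {
        ColumnEvaluator eval = new ColumnEvaluator("aint");
        eval.initialize(new ArrayRowInspector("abyte", "aint"));
        System.out.println(eval.evaluate(new Object[] {(byte) 123, 789})); // prints 789
    }
}
```

Doing the column lookup in initialize keeps the per-row evaluate call cheap, which is one plausible reason for splitting the two steps.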

Next is the GenericUDF abstract class:
public abstract class GenericUDF {

/**
   * A Defered Object allows us to do lazy-evaluation and short-circuiting.
   * GenericUDF use DeferedObject to pass arguments.
   */
public static interface DeferredObject {
    Object get() throws HiveException;
};

/**
   * The constructor.
   */
public GenericUDF() {
}

/**
   * Initialize this GenericUDF. This will be called once and only once per
   * GenericUDF instance.
   *
   * @param arguments
   *          The ObjectInspector for the arguments
   * @throws UDFArgumentException
   *           Thrown when arguments have wrong types, wrong length, etc.
   * @return The ObjectInspector for the return value
   */
public abstract ObjectInspector initialize(ObjectInspector[] arguments)
      throws UDFArgumentException;

/**
   * Evaluate the GenericUDF with the arguments.
   *
   * @param arguments
   *          The arguments as DeferedObject, use DeferedObject.get() to get the
   *          actual argument Object. The Objects can be inspected by the
   *          ObjectInspectors passed in the initialize call.
   * @return The
   */
public abstract Object evaluate(DeferredObject[] arguments)
      throws HiveException;

/**
   * Get the String to be displayed in explain.
   */
public abstract String getDisplayString(String[] children);

}

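Its mechanism is very similar to the evaluator's. Initialization fixes the array of ObjectInspectors responsible for interpreting the inputs and returns the ObjectInspector for the output data (that is, the Object returned by the evaluate method); each evaluate call is passed an array of DeferredObject arguments and returns one value.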
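To illustrate why a single function implementation can serve several storage formats, here is a small sketch of an upper-casing function. It does not extend Hive's GenericUDF; `StringInspector`, `Deferred`, `UpperUdf`, and `call` are hypothetical stand-ins, and the two storage forms (a Java String and UTF-8 bytes) are assumptions chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class UdfSketch {
    /** Hypothetical stand-in for a primitive ObjectInspector for string data. */
    interface StringInspector {
        String asString(Object stored);
    }

    /** Hypothetical stand-in for DeferredObject: the argument is fetched only on demand. */
    interface Deferred {
        Object get();
    }

    /** Storage form 1: the value is already a Java String. */
    static final StringInspector JAVA_STRING = stored -> (String) stored;

    /** Storage form 2: the value is kept as raw UTF-8 bytes. */
    static final StringInspector UTF8_BYTES =
            stored -> new String((byte[]) stored, StandardCharsets.UTF_8);

    /** One implementation of upper(), independent of how the argument is stored. */
    static class UpperUdf {
        private StringInspector argInspector;

        /** Called once: validate the arguments and keep their inspectors. */
        void initialize(StringInspector[] arguments) {
            if (arguments.length != 1) {
                throw new IllegalArgumentException("upper() takes exactly one argument");
            }
            argInspector = arguments[0];
        }

        /** Called per row: fetch the argument lazily and read it through its inspector. */
        Object evaluate(Deferred[] arguments) {
            Object stored = arguments[0].get();
            return stored == null ? null : argInspector.asString(stored).toUpperCase(Locale.ROOT);
        }
    }

    /** Convenience helper for the demo: initialize, then evaluate a single value. */
    static Object call(StringInspector oi, Object stored) {
        UpperUdf udf = new UpperUdf();
        udf.initialize(new StringInspector[] {oi});
        return udf.evaluate(new Deferred[] {() -> stored});
    }

    public static void main(String[] args) {
        System.out.println(call(JAVA_STRING, "hive")); // prints HIVE
        System.out.println(call(UTF8_BYTES, "hive".getBytes(StandardCharsets.UTF_8))); // prints HIVE
    }
}
```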

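Hive supports different data formats such as LazySimple, LazyBinary, and Thrift, and within a single query plan the format of the data flow can be switched between operators. A common arrangement is to use LazySimpleSerDe on the Mapper side and LazyBinarySerDe for the Mapper's output, because the binary format is more compact and so reduces network traffic during repartitioning. To see exactly which SerDe format each step of a query plan uses, run the query with "Explain Extended".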
