Advantech HDD PMQ

From ESS-WIKI

HDD Failure Prediction Model

1. Get the HDD raw data

2. Get useful HDD information


3. Flow of training the mathematical model

The Logistic Regression algorithm is used to classify the samples: the false samples are shown in red and the true samples in blue.
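
As a rough illustration of how a logistic regression classifier turns S.M.A.R.T. values into a prediction, the sketch below applies the logistic (sigmoid) function to a weighted sum of features. The weights, the bias, and the choice of features are hypothetical placeholders, because the trained coefficients are not given on this page. The decision threshold 0.385 is borrowed from the "hddpredict" entry in the data format example, on the assumption that it plays this role.

```python
import math

# Hypothetical coefficients for illustration only; the coefficients of the
# trained model are not published in this document.
WEIGHTS = {"smart5": 0.30, "smart187": 0.80, "smart197": 0.50}
BIAS = -3.0

# Assumed decision threshold, taken from the "hddpredict" example entry.
THRESHOLD = 0.385


def sigmoid(z):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def failure_probability(features):
    """Return the model output for a dict of S.M.A.R.T. feature values."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return sigmoid(z)


def is_predicted_failure(features):
    """Classify a sample as failing when the output reaches the threshold."""
    return failure_probability(features) >= THRESHOLD
```

With these placeholder coefficients, a disk with all counters at zero falls well below the threshold, while a disk with several non-zero error counters is classified as failing.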

4. Benchmark for PMQ accuracy 

5. Alert and Suggestion of HDD PMQ 

  No. | Event Message                      | Suggestion ( Action ) Message    | Condition
  1   | Over temperature                   | Lower system temperature (<40°C) | smart5 >= 10 || smart197 >= 2
  2   | Disk aging.                        | Backup data to new disk.         | smart9 >= 26280
  3   | Disk read/writes error frequently. | Backup data to new disk.         | smart187 >= 1
  4   | Power failure                      | Check power source               | smart192 >= 190
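
The event conditions above can be expressed as a small rule table. The following is a minimal sketch, not the agent's actual implementation: the function name triggered_events and the input format (a dict keyed by "smart5", "smart9", and so on, holding raw attribute values) are assumptions made for illustration.

```python
# Each rule mirrors one event listed above:
# (event number, event message, suggestion message, condition).
RULES = [
    (1, "Over temperature", "Lower system temperature (<40°C)",
     lambda s: s.get("smart5", 0) >= 10 or s.get("smart197", 0) >= 2),
    (2, "Disk aging.", "Backup data to new disk.",
     lambda s: s.get("smart9", 0) >= 26280),
    (3, "Disk read/writes error frequently.", "Backup data to new disk.",
     lambda s: s.get("smart187", 0) >= 1),
    (4, "Power failure", "Check power source",
     lambda s: s.get("smart192", 0) >= 190),
]


def triggered_events(smart):
    """Return (number, message, suggestion) for every rule whose condition holds.

    `smart` is a hypothetical dict of raw S.M.A.R.T. values; attributes that
    are missing are treated as 0.
    """
    return [(no, msg, action) for no, msg, action, cond in RULES if cond(smart)]


if __name__ == "__main__":
    sample = {"smart5": 0, "smart9": 26280, "smart187": 1, "smart192": 10, "smart197": 0}
    for no, msg, action in triggered_events(sample):
        print(no, msg, "->", action)
```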

Visualization and Maintenance with WISE-PaaS/RMM

If the Advantech PMQ service is running on an edge device, a PMQ icon (marked with a red frame in the picture below) appears in the WISE-PaaS/RMM device manager. Click this icon to open the PMQ dialog, which shows "Predictive Good" or "Predictive Failure".

WISE-PaaSToPMQ

Predictive Good

Predictive Failure

Install / Upgrade

Install HDD_PMQ Agent in Windows System

Execute "Agent_HDD_PMQ.exe" to install the HDD_PMQ Agent, and follow the steps to finish the installation.

Install HDD_PMQ Agent in RMM 3.3 Agent Plus version in Windows System

The RMM 3.3 Agent Plus installer provides three installation types: Typical, Custom, and Complete.

The RMM 3.3 Agent Plus includes three program features:

1. MQTT Broker for the internal MQTT bus.

2. RMM 3.3 Agent to communicate between the WISE-PaaS/RMM Server and the MQTT bus.

3. HDD_PMQ Agent to retrieve HDD S.M.A.R.T. data, calculate the PMQ result, and report it to the MQTT bus.

In the remaining steps, the installer launches the MQTT Broker, RMM 3.3 Agent, and HDD_PMQ Agent installers sequentially.

The details of the RMM 3.3 Agent installation are skipped here to focus on the RMM 3.3 Agent Plus installation.

Update HDD_PMQ Agent

Upload the Agent_HDD_PMQ-V1.0.xxx...zip OTA upgrade package to the OTA Server.

Upload PMQ.png

Upload PMQ 1.png

Select the HDD_PMQ update package to upgrade the target devices.

Data Format / Event / Action

    Advantech defines six data categories for the PMQ service: info, data, predict, event, action, and param. They use a general JSON format that suits any PMQ solution. Customers can design their own PMQ data format by following these JSON rules, which makes it easy to integrate with Advantech EIS and WISE-PaaS/RMM.


Syntax for PMQ Data & Predictive Result

    info:
        type             : PMQ ( fixed value, must )
        name             : name of this service ( must )
        description      : description of this service ( must )
        version          : version of this service ( must )
        confidence level : confidence level of the predictive algorithm ( must )
        update           : receives the update command ( optional )
        *User-Defined    : users can define their own tag : value pairs

    data: raw data of the PMQ service
        - *User-Defined ( in JSON Object )

    predict: prediction result
        - Failure rate: failure rate of the prediction result, in 0 ~ 100 % ( must )

          Remap your predicted failure rate to the normalized ranges below.

          Level: Good ( Green ): 0 ~ 54%, Warning ( Yellow ): 55 ~ 66%, Bad ( Red ): 67 ~ 100%

    event: events of the PMQ service
        - {"n":"e1","sv":"Hard disk in long-term operation above 40°C or in a vibrating environment.","actionlist":"a1", "asm":"r"}

    action: actions of the PMQ service
        - {"n":"a1", "sv":"Please reduce the ambient temperature to 40°C or less, or operate in a stable environment.", "asm":"r"}
        - {"n":"ActionLog", "sv":"reboot + backup", "asm":"r"}

    param: parameters of the PMQ service
        - {"n":"predict period", "v":60, "asm":"r", "u":"sec", "min":10, "max":86400}
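
The failure-rate levels listed under predict can be mapped with a few comparisons. This is a minimal sketch: the function name failure_level is hypothetical, and treating fractional rates between 54 and 55 as Good (and between 66 and 67 as Warning) is an assumption, since the ranges above are stated only for whole percentages.

```python
def failure_level(rate_percent):
    """Map a normalized failure rate (0 ~ 100 %) to its level name."""
    if not 0 <= rate_percent <= 100:
        raise ValueError("failure rate must be within 0 ~ 100 %")
    if rate_percent < 55:
        return "Good"     # Green: 0 ~ 54%
    if rate_percent < 67:
        return "Warning"  # Yellow: 55 ~ 66%
    return "Bad"          # Red: 67 ~ 100%


if __name__ == "__main__":
    for rate in (20, 60, 90):
        print(rate, failure_level(rate))
```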


Pre-defined Tags:

    n: name of the resource
    bn: base name
    v: value
    bv: boolean value
    sv: string value
    u: unit of the resource
    list: attribute with a value of type array
    threshold: a level, rate, or amount at which something comes into effect
    min: minimum value of the resource
    max: maximum value of the resource
    msg: description of the resource
    asm: read / write access ( e.g. "r", "rw" )

          

Example of HDD PMQ Data Format

{
   "HDD_PMQ": {
      "info":{
             "e":[{"n":"type", "sv":"PMQ", "asm":"r"},
                  {"n":"name", "sv":"HDD_PMQ", "asm":"r"},
                  {"n":"description", "sv":"This service is HDD PMQ Service", "asm":"r"},
                  {"n":"version", "sv":"1.0.2", "asm":"r"},
                  {"n":"confidence level", "v":83.12, "asm":"r", "u":"%"},
                  {"n":"update", "sv":"", "asm":"rw"},
                  {"n":"eventNotify", "bv":true, "asm":"r"}],
          "bn":"info"
      },

      "data":{
             "list":[{"bn":"WDC WD3200BUCT-63TWBY0","e":[{"n":"Smart 5","v":0,"min":0, "max":20, "threshold":10, "u":"count", "msg":"Reallocated Sector Count", "asm":"r"},
                                                         {"n":"Smart 9", "v":128, "min":0, "max":35000, "threshold":26280, "u":"hr", "msg":"Power-On Hours", "asm":"r"},
                                                         {"n":"Smart 187", "v":0, "min":0, "max":5, "threshold":1, "u":"count", "msg":"Reported Uncorrectable Errors", "asm":"r"},
                                                         {"n":"Smart 192", "v":10, "min":0, "max":400, "threshold":190, "u":"number", "msg":"Power-off Retract Count", "asm":"r"},
                                                         {"n":"Smart 197", "v":0, "min":0, "max":10, "threshold":2, "u":"count", "msg":"Current Pending Sector Count", "asm":"r"},
                                                         {"n":"Smart 198", "v":2, "min":0, "max":40, "threshold":10, "u":"count", "msg":"Uncorrectable Sector Count", "asm":"r"}]},

                     {"bn":"ST3500320AS0","e":[{"n":"Smart 5","v":1,"min":0, "max":20, "threshold":10, "u":"count", "msg":"Reallocated Sector Count", "asm":"r"},
                                               {"n":"Smart 9", "v":8832, "min":0, "max":35000, "threshold":26280, "u":"hr", "msg":"Power-On Hours", "asm":"r"},
                                               {"n":"Smart 187", "v":0, "min":0, "max":5, "threshold":1, "u":"count", "msg":"Reported Uncorrectable Errors", "asm":"r"},
                                               {"n":"Smart 192", "v":100, "min":0, "max":400, "threshold":190, "u":"number", "msg":"Power-off Retract Count", "asm":"r"},
                                               {"n":"Smart 197", "v":0, "min":0, "max":10, "threshold":2, "u":"count", "msg":"Current Pending Sector Count", "asm":"r"},
                                               {"n":"Smart 198", "v":5, "min":0, "max":40, "threshold":10, "u":"count", "msg":"Uncorrectable Sector Count", "asm":"r"}]}],
           "bn":"data"
      },

      "predict":{
                "list":[{"bn":"WDC WD3200BUCT-63TWBY0","e":[{"n":"Failure rate","v":20,"min":0, "max":100,"asm":"r"},
                                                            {"n":"hddpredict", "v":0.15, "min":0, "max":1, "threshold":0.385, "asm":"r"}]},
                        {"bn":"ST3500320AS0","e":[{"n":"Failure rate","v":40,"min":0, "max":100,"asm":"r"}, 
                                                  {"n":"hddpredict", "v":0.25, "min":0, "max":1, "threshold":0.385, "asm":"r"}]}],
          "bn":"predict"
      },

      "event":{
                "e":[{"n":"e1","sv":"HDD back to Normal", "actionlist":"", "asm":"r"},
                     {"n":"e2","sv":"Over temperature.","actionlist":"a1", "asm":"r"},
                     {"n":"e3","sv":"Disk aging.","actionlist":"a2", "asm":"r"},
                     {"n":"e4","sv":"Disk read/writes error frequently.","actionlist":"a2", "asm":"r"},
                     {"n":"e5","sv":"Power failure.","actionlist":"a3", "asm":"r"}],
          "bn":"event"
      },

      "action": {
                 "e":[{"n":"a1", "bv":false, "msg":"Lower system temperature ( < 40 Celsius ).", "asm":"r"},
                      {"n":"a2", "bv":false, "msg":"Backup data to new disk", "asm":"r"},
                      {"n":"a3", "bv":false, "msg":"Check power source.", "asm":"r"},
                      {"n":"ActionLog", "sv":"", "asm":"r"}],
           "bn":"action"
      },

      "param": {
                 "e":[{"n":"report interval", "v":60, "min":10, "max":3600, "asm":"rw", "u":"sec"},
                      {"n":"enable report", "bv":true, "asm":"rw"}],
           "bn":"param"
      },
      
      "opTS":{"$date":1494554251000}
   },

   "bn":"HDD_PMQ"

}
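
A consumer of this format can read the prediction results with any JSON parser. The sketch below assumes the payload layout shown in the example above and uses a trimmed copy of it (only the "predict" section); the function name failure_rates is hypothetical.

```python
import json

# Trimmed copy of the example above: only the "predict" section is kept.
SAMPLE = """
{
  "HDD_PMQ": {
    "predict": {
      "list": [
        {"bn": "WDC WD3200BUCT-63TWBY0",
         "e": [{"n": "Failure rate", "v": 20, "min": 0, "max": 100, "asm": "r"},
               {"n": "hddpredict", "v": 0.15, "min": 0, "max": 1, "threshold": 0.385, "asm": "r"}]},
        {"bn": "ST3500320AS0",
         "e": [{"n": "Failure rate", "v": 40, "min": 0, "max": 100, "asm": "r"},
               {"n": "hddpredict", "v": 0.25, "min": 0, "max": 1, "threshold": 0.385, "asm": "r"}]}
      ],
      "bn": "predict"
    }
  },
  "bn": "HDD_PMQ"
}
"""


def failure_rates(payload):
    """Map each disk name ("bn") to the value of its "Failure rate" entry."""
    rates = {}
    for disk in payload["HDD_PMQ"]["predict"]["list"]:
        for entry in disk["e"]:
            if entry["n"] == "Failure rate":
                rates[disk["bn"]] = entry["v"]
    return rates


rates = failure_rates(json.loads(SAMPLE))
print(rates)
```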

Syntax for EventNotify

PMQ severity: 4 ( Warning )

severity: 
      Severity_Emergency = 0, 
      Severity_Alert = 1, 
      Severity_Critical = 2, 
      Severity_Error = 3, 
      Severity_Warning = 4, 
      Severity_Informational = 5, 
      Severity_Debug = 6, 

subtype: predict

Example: Predict Failure event

{
    "susiCommData": {
        "commCmd": 2059,
        "requestID": 2001,
        "agentID": "AAAAA",
        "handlerName": "general",
        "sendTS": 1453356274,
        "eventnotify": {
            "subtype": "predict",
            "msg": "Over temperature.",
            "severity": 4,
            "handler": "HDD_PMQ",
            "extMsg": {
                "n": "WDC WD3200BUCT-63TWBY0",
                "eventID":"e2"
            }
        }
    }
}

Example: Predict back to Good event

{
    "susiCommData": {
        "commCmd": 2059,
        "requestID": 2001,
        "agentID": "AAAAA",
        "handlerName": "general",
        "sendTS": 1453356274,
        "eventnotify": {
            "subtype": "predict",
            "msg": "HDD back to Normal",
            "severity": 5,
            "handler": "HDD_PMQ",
            "extMsg": {
                "n": "WDC WD3200BUCT-63TWBY0",
                "eventID":"e1"
            }
        }
    }
}
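
Both EventNotify examples share the same envelope, so a message can be assembled from a handful of fields. The sketch below is an illustration only: the function name build_event_notify is hypothetical, the fixed values of commCmd, requestID, and handlerName are copied from the examples above without knowing whether they vary in practice, and sendTS is assumed to be a Unix timestamp in seconds.

```python
import json
import time

SEVERITY_WARNING = 4
SEVERITY_INFORMATIONAL = 5


def build_event_notify(agent_id, disk_name, event_id, msg, severity, send_ts=None):
    """Build an EventNotify message shaped like the examples above."""
    return {
        "susiCommData": {
            "commCmd": 2059,           # copied from the examples
            "requestID": 2001,         # copied from the examples
            "agentID": agent_id,
            "handlerName": "general",  # copied from the examples
            # Assumption: Unix time in seconds.
            "sendTS": int(time.time()) if send_ts is None else send_ts,
            "eventnotify": {
                "subtype": "predict",
                "msg": msg,
                "severity": severity,
                "handler": "HDD_PMQ",
                "extMsg": {"n": disk_name, "eventID": event_id},
            },
        }
    }


if __name__ == "__main__":
    message = build_event_notify(
        "AAAAA", "WDC WD3200BUCT-63TWBY0", "e2", "Over temperature.", SEVERITY_WARNING
    )
    print(json.dumps(message, indent=4))
```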