سایت تخصصی در مورد گرید کامپیوتینگ-grid computing-globus و سیستم های توزیع شده و علوم جديد كامپيوتر
Before using the techniques described in this article, make sure you are familiar with the six strategies for grid application enablement described in the following "Six strategies for grid application enablement" series of articles:
In this article, we introduce the basic architectural pattern that fits most cases when you enable existing code. Next, we discuss a few strategies to enable existing code based on the most common architectures we have encountered over the years. We introduce two basic scenarios for platform-specific distributed applications, and we will study the most common architectural variations of each scenario and the most common way to make these architectures fit into the enablement pattern. We conclude with two scenarios for Web-enabled applications (servlet-centric and database-centric).
Most enablement efforts for existing code using batch-oriented grid infrastructure software are similar. There's a pattern you can follow. You can use this pattern to achieve the first three strategies of grid adoption.
The base case for these three strategies for grid adoption is a program that takes command-line parameters and uses files or databases as specified in the command-line parameters. If the program is licensed, the grid infrastructure will require license management capabilities.
In general, the first three strategies to enable an existing application to run on a grid all require the user to send requests to a client application, which acts as a requester to the grid infrastructure. The client to the client application can be the actual user or a portal. The grid infrastructure takes care of deploying the actual application, which becomes a provider to the grid infrastructure. See Figure 1.
Figure 1. Integration pattern for enabling existing code
The key point is that the client program (job submission driver) should talk to the grid infrastructure as if it is talking directly to the application. The simplest way to do this is by having the client program issue command-line type instructions to the virtualized application.
Implementing this scenario is easy when the application is a stand-alone program with minimal deployment requirements. But when you're dealing with integrated applications, the whole thing might require a little creativity.
Enabling existing code using batch-oriented grid infrastructure software involves a finite number of known scenarios because of the state of the software industry today. Two scenarios exist, both involving essentially the same type of application:
|
Enabling platform-specific distributed applications
As previously mentioned, this scenario applies to client/server, transactional, and batch-oriented applications. Two architectural tendencies prevail in these three types of applications: monolithic and modular.
The case of monolithic applications
In general, deploying platform-specific monolithic applications on batch-oriented grid infrastructure software is as simple as installing the application on all grid nodes and writing some "glue code" that will integrate user requests, parameter passing, and program calls from the grid infrastructure. See Figure 2.
Figure 2. Deploying monolithic applications using the standard enablement pattern
This "glue code" we're talking about is what makes this an integration job. In most cases, you can integrate a monolithic application and a batch-oriented grid infrastructure product through scripting. You can use Perl, Python, or ordinary shell scripts to integrate user requests, parameter passing, and application calls within the context of the grid infrastructure software.
The caveat has to do with what a monolithic application does and how it does it. Monolithic applications tend to try to be all things to all people. That's one reason they're monolithic (modules are not trustworthy to some people). Sometimes, monolithic applications have, among other things, embedded grid functionality. Embedded items and the way in which such functionality was implemented will determine whether the application can run on a grid.
For instance, a monolithic application that does not have any built-in grid infrastructure functionality will be easier to enable than others. An example would be a similar application that, on top of doing what it is supposed to do, also takes care of tasks such as scheduling which instance processes what request, or which database tables need to be locked on behalf of a given user, or when transaction affinity needs to be enforced. However, we still need to look into how the built-in grid functionality was implemented.
If the built-in grid functionality can be turned off from within the application, then this monolithic piece of code will be able to run on top of batch-oriented grid infrastructure software with no problem. If, on the other hand, the grid functionality is built deep into the application and it cannot be turned off, then we have a problem.
In general, the extent of the work needed to turn off built-in grid infrastructure functionality is very high. When programmers were learning to reuse code, they also learned to abstract functionality both ways: up and down. Embedding grid infrastructure features in business logic frameworks is an example of abstracting functionality down.
We can't blame programmers for doing this. At the time when most platform-specific distributed applications were written, very few people were thinking about grid computing. At the time, nobody even considered the possibility of finding ready-made containers for their applications, much less complete grid infrastructures where they could just deploy their code and move on with their lives.
The case of modular applications
In general, enabling modular platform-specific distributed applications will be easier than enabling monolithic applications. The reason I say this is that modular applications give you choices on how to deploy them. There are caveats just as in the previous case, but with modular applications, easier ways to get around them.
One of the main advantages modular applications have is the possibility of turning off modules when necessary. This way, any environment-related functionality can be rendered to the grid infrastructure.
As in the case of monolithic applications, the existence of built-in grid features and whether they can be turned off or stripped out will also determine the degree of difficulty for the grid enablement effort.
The difference is that most modular applications, when they have any built-in grid features, will most likely concentrate that functionality in a single module, or a group of specialized modules. This should make it easier (in theory) to turn those modules off or to just eliminate them altogether.
This caveat is inter-module communication. The degree of difficulty in turning off, or stripping out, application modules depends on how the designers implemented inter-module communication. In general, the simpler the transport, the lower the degree of difficulty.
For instance, it is common for applications of this kind to have a dedicated module to handle all database calls. In some cases, the module not only acts as a universal database client by supporting ODBC or JDBC drivers for several vendors but also does something we can call "table access scheduling," which is sort of an intra-application table-locking mechanism that allows the application to handle table locking independent of the database.
Having a universal database client is a good idea. However, if the application is to be grid-enabled, it is better to leave table locking to the data grid infrastructure (let's assume that's what we're doing with the application). So, all we need to do is substitute the module for a regular database client, deploy the RDBMS into the data grid, and we've got ourselves a grid-enabled application.
Most database clients and listeners rely on TCP/IP sockets to get their orders from a program. The DB2® client listens by default on port 50000, for example. But what if the application designers decided that the tried-and-true way of TCP/IP sockets was not fancy enough for their application? What if they decided to go with a proprietary mechanism for inter-module? Then the problem is not so straightforward anymore.
There is another aspect to inter-module communication that can turn out to be a show-stopper. If the application is to be deployed on a computational grid, there can be several instances of several modules running concurrently on the grid. If the grid infrastructure software cannot relay the transport mechanism, or if the transport mechanism itself cannot function on a grid environment, the application simply will not work as expected.
Then, the degree of difficulty for the grid enablement project will be directly proportional to the effort of replacing the inter-module communication mechanism.
Deployment strategies: best-case scenario
An ideal modular application should handle inter-module communication with a dispatcher -- or broker -- module. This plan would allow modules to be deployed anywhere on the grid because inter-module communication will always happen through the broker. See Figure 3.
Figure 3. Ideal deployment for modular applications
The best-case scenario has two very desirable behaviors. First, application modules should be atomic to allow independent deployment from one another. The dispatcher-broker module should take care of all inter-module communication and data exchange.
As for shared libraries, the grid infrastructure should be able to handle them if they're installed as part of a system-wide installation. If not, they can be included as part of the provisioning policy for all modules so that all nodes have a local copy.
Second, application modules should be granular enough to allow for multiple instances of the same module to run concurrently (at least) on the same machine. A module should take care of its own results aggregation. Application-level results aggregation can be handled by the dispatcher-broker module or by through the database.
Deployment strategies: most-common scenario
Unfortunately, most modular applications are not that well behaved. In some cases, module encapsulation is not atomic enough to allow for true independent deployment. In other cases, the dispatcher-broker module doesn't take care of all inter-module communication and data exchange. Instead, some modules call on each other directly, which forces them to reside on the same machine.
Results aggregation also represents a problem, especially when the dispatcher-broker doesn't fully own the task of managing inter-module communication. Some modules might feed their results into other modules instead of just passing them back to the broker. Whatever the situation, the most common scenario is to deploy the entire application on all grid nodes as shown in Figure 4.
Figure 4. Most common scenario for deploying modular applications
To an extent, the most common scenario means that a modular application can be deployed as a monolithic application if worse comes to worst. It should work but the advantage of being modular will be lost because it won't be exploited by the grid infrastructure as in the ideal case.
In the same vein, think of a monolithic application as a single-module modular application and treat it as such when devising a strategy for managing results aggregation in the case of multiple concurrent instances.
These two cases, monolithic and modular applications, represent the simplest scenario when it comes to grid enabling platform-specific distributed applications. The situation changes dramatically when we deal with Web-enabled applications, as you'll see in the next section.
|
Enabling Web-enabled applications
A Web-enabled application is not a true J2EE application. We call "Web-enabled" those applications that were written originally as platform-specific distributed applications but run as Web applications, thanks to a Web front end.
The most common architectures are known as servlet-centric and database-centric, and each poses its own challenges to grid enablement.
Servlet-centric applications, in general, follow the architectural pattern illustrated in Figure 5.
Figure 5. Most common Servlet-centric architectural pattern
The platform-specific application in Figure 5 is, in some cases, patched up to support things such as XML and other technologies. It is enabled to talk to a Java™ Virtual Machine via JNI or a proprietary connector framework. As for database support, it is common to use ODBC-to-JDBC bridges or to just stay with ODBC.
When "ported" to run on J2EE application servers, servlet-centric applications interact with clients via a gateway servlet, which relays requests to the connector and thus to the actual application. A typical porting to an application server, such as IBM® WebSphere® Application Server, looks like the one illustrated in Figure 6.
Figure 6. Typical servlet-centric Web enablement strategy
Enabling an application that follows this execution pattern run on a grid would require at least the following steps:
The resulting scenario is illustrated in Figure 7.
Figure 7. Typical servlet-centric grid deployment
It might be necessary to change some of the original assumptions when implementing this scenario. For instance, a deployment like the one shown in Figure 7 would probably be easier to manage if the gateway Servlet ran on WebSphere Application Server Express, which doesn't include an EJB Container, as opposed to WebSphere Application Server Advanced Edition, or Apache Tomcat. A change of this nature would actually benefit your customers because it could lower the total cost of the application.
Keep in mind this is just one way of doing it. There may be better ways to architect the deployment pattern depending on the characteristics of the application. The best solution should provide the best returns in terms of feasibility, effort, usability, and administration.
The issues affecting monolithic and modular applications can also affect servlet-centric grid enablements. In addition, issues related to application performance can also arise.
Servlet-centric applications, in most cases, will experience performance problems at the Web container level. The use of a gateway servlet can create a sometimes nasty bottleneck that stems from, among other reasons:
These issues have nothing to do with the grid infrastructure. They can cause problems even if the application is not running on a grid. You need to be aware that you will have to tune the application server under the new conditions once the application is deployed on a grid. Other issues can stem from the interaction of the application server and the grid infrastructure. For instance:
Sometimes these problems, when they all surface at the same time, make the whole grid enablement exercise too complicated and too costly to be worth the effort. Sometimes it's better to consider the possibility of deploying the platform-specific piece of the application as a regular monolithic or modular application (whichever applies), or to rewrite the platform-specific piece as a J2EE-compliant set of components and look for an SOA-based grid infrastructure.
Database-centric applications became the de-facto standard back in the day of the client/server paradigm. Tools such as PowerBuilder, Oracle 2000, PacBase, Progress, and other products were widely used to create monolithic, footprint-heavy, and large applications that required proprietary languages and specialized skills to understand them.
Some products of this type managed to adapt to the new distributed paradigm and gave us these Web-enablement hybrids we know now as database-centric applications. Some vendors claim having re-engineered their products to be truly distributed and Web-native, but for some products, under the wraps, the old client/server, monolithic architecture remains untouched.
This situation is understandable given that most vendors even invented their own languages to describe highly complex frameworks aimed to facilitate what was called in those days "Rapid Application Prototyping and Development."
Regardless of the technical value of these products, vendors need to preserve this intellectual capital for one simple reason: A lot of money was invested in their development. Therefore, database-centric applications are going to be around for a long time.
Common implementations of database-centric applications revolve around proprietary frameworks. In most cases, these frameworks have been "patched" to support Java technology, XML, and JMS, on other now-popular industry standards. Figure 8 illustrates the most common flavor of this architecture.
Figure 8. Most common database-centric architectural pattern
In most cases, the application and the database are bound together in a single package, and there's no difference between business logic and database operational logic. Because of this, added support modules do not interact natively with the original application.
Porting scenarios for database-centric applications are usually done through the Web container, as shown in Figure 9.
Figure 9. Typical database-centric Web enablement scenario
The strategy is similar to the one used for servlet-centric applications. It involves writing a gateway servlet that accesses the proprietary framework through the add-on Java support modules as dependent classes, or through XML files.
Treat it as a monolithic application
The easiest way to grid-enable a database-centric application is to treat it as a monolithic application. In this case, you would need to implement the deployment pattern shown in Figure 9.
But there's an interesting characteristic about database-centric applications that might provide a more efficient way to deploy certain applications on a grid.
Database-centric applications use the database not just to store data but also to keep configuration information, workflow data, and even presentation metadata. This dependency on the database is sometimes so tight that it is impossible to separate the database from the runtime environment. In some cases, the application actually runs on top of the database engine.
In cases where the database run time is the application run time, what can be virtualized is not the business logic but the data. You need to use a data grid instead of a computational grid as in the previous scenarios.
By virtualizing a database-centric application on a data grid, you would be indirectly virtualizing the application run time and, thus, the application itself. All the data the application needs to run will be made available locally on all nodes on the grid and, whenever the application changes, the modifications will be propagated automatically.
What happens is that you get location independence for the requests going to the database while the data is propagated all across the grid. To the requesting program, the data seems to be local all the time when, in reality, it can be located anywhere on the grid. So, what runs as a single-instance batch job is the request broker (the Java interface and a driver for the thick client) while the data is virtualized by the grid infrastructure.
Depending on how the business logic is written, you might be able to virtualize the data and the business logic as shown in Figure 10.
Figure 10. Data virtualization scenario for database-centric applications
Note that you can virtualize data and business logic on a data grid only if the business logic runs in processes triggered by the database engine, which is the case of most proprietary frameworks.
If, on the other hand, the business logic can be triggered by an outside process, such as the job submission driver, you might be able to actually separate data from business logic in what would be a combination of a computational grid for the business logic and a data grid for the database. This approach, however, might not be a feasible solution in some cases because it could introduce too much complexity into the deployment model.
Database-centric applications usually have a characteristic of being heavy on CPU and memory usage. This demand becomes especially visible when the database run time also runs applications.
In some cases, a grid deployment exercise might not work as expected due to the overhead created by the data grid itself, plus the overhead caused by a resource-hungry database runtime environment. In other cases, when it is possible to separate the data and the application, the situation resembles the case of monolithic applications and the same issues will apply.
|
The information in this article should give you enough ammunition to start brainstorming about the best way to grid-enable your product. If you want to grid-enable platform-specific distributed applications (both monolithic and modular) and Web-enabled applications (both servlet-centric and database-centric), this article shows you what to think about.
In this section, we describe at a high level the primary components of a grid environment. Depending on the grid design and its expected use, some of these components may or may not be required, and in some cases they may be combined to form a hybrid component. However, understanding the roles of the components as we describe them here will help you understand the considerations when developing grid-enabled applications.
Just as a consumer sees the power grid as a receptacle in the wall, a grid user should not see all of the complexities of the computing grid. Although the user interface can come in many forms and be application-specific, for the purposes of our discussion, let's think of it as a portal. Most users today understand the concept of a Web portal, where their browser provides a single interface to access a wide variety of information sources. A grid portal provides the interface for a user to launch applications that will use the resources and services provided by the grid. From this perspective, the user sees the grid as a virtual computing resource just as the consumer of power sees the receptacle as an interface to a virtual generator.
The current Globus Toolkit does not provide any services or tools to generate a portal, but this can be accomplished with tools such as WebSphere® Portal and WebSphere Application Server.
A major requirement for grid computing is security. At the base of any grid environment, there must be mechanisms to provide security, including authentication, authorization, data encryption, and so on. The Grid Security Infrastructure (GSI) component of the Globus Toolkit provides robust security mechanisms. The GSI includes an OpenSSL implementation. It also provides a single sign-on mechanism, so that once a user is authenticated, a proxy certificate is created and used when performing actions within the grid. When designing your grid environment, you may use the GSI sign-in to grant access to the portal, or you may have your own security for the portal. The portal will then be responsible for signing in to the grid, either using the user's credentials or using a generic set of credentials for all authorized users of the portal.
Once authenticated, the user will be launching an application. Based on the application, and possibly on other parameters provided by the user, the next step is to identify the available and appropriate resources to use within the grid. This task could be carried out by a broker function. Although there is no broker implementation provided by Globus, there is an LDAP-based information service. This service is called the Grid Information Service (GIS), or more commonly the Monitoring and Discovery Service (MDS). This service provides information about the available resources within the grid and their status. A broker service could be developed that utilizes MDS.
Once the resources have been identified, the next logical step is to schedule the individual jobs to run on them. If a set of stand-alone jobs are to be executed with no interdependencies, then a specialized scheduler may not be required. However, if you want to reserve a specific resource or ensure that different jobs within the application run concurrently (for instance, if they require inter-process communication), then a job scheduler should be used to coordinate the execution of the jobs. The Globus Toolkit does not include such a scheduler, but there are several schedulers available that have been tested with and can be used in a Globus grid environment. It should also be noted that there could be different levels of schedulers within a grid environment. For instance, a cluster could be represented as a single resource. The cluster may have its own scheduler to help manage the nodes it contains. A higher level scheduler (sometimes called a meta scheduler) might be used to schedule work to be done on a cluster, while the cluster's scheduler would handle the actual scheduling of work on the cluster's individual nodes.
If any data -- including application modules -- must be moved or made accessible to the nodes where an application's jobs will execute, then there needs to be a secure and reliable method for moving files and data to various nodes within the grid. The Globus Toolkit contains a data management component that provides such services. This component, known as Grid Access to Secondary Storage (GASS), includes facilities such as GridFTP. GridFTP is built on top of the standard FTP protocol, but adds additional functions and utilizes the GSI for user authentication and authorization. Therefore, once a user has an authenticated proxy certificate, he can use the GridFTP facility to move files without having to go through a login process to every node involved. This facility provides third-party file transfer so that one node can initiate a file transfer between two other nodes.
With all the other facilities we have just discussed in place, we now get to the core set of services that help perform actual work in a grid environment. The Grid Resource Allocation Manager (GRAM) provides the services to actually launch a job on a particular resource, check its status, and retrieve its results when it is complete.
There are other facilities that may need to be included in your grid environment and considered when designing and implementing your application. For instance, inter-process communication and accounting/chargeback services are two common facilities that are often required.