Project Records — Troubleshooting
Here I will record some of the more instructive bugs I have encountered in enterprise development, along with how I troubleshot them.
SQL DML Blocking Issue (Oracle)
Bug Description
Recently, our project needed to receive data from the Big Data department's interface and perform some update operations on the data before inserting it into a table, which I will refer to as the "business table" below. During this operation, it was unexpectedly observed that the execution time exceeded twenty minutes.
At first it looked like a problem with the data volume, but that turned out not to be the case. So I had to put on my DBA hat and track down the root cause of the blocking. Here I will document the experience of resolving this problem.
Troubleshooting
First, we can use the following SQL query to analyze and confirm whether the statement is being blocked.
```sql
select s1.username || '@' || s1.machine || ' (SID=' || s1.sid || ')' as waiting_session,
       s2.username || '@' || s2.machine || ' (SID=' || s2.sid || ')' as blocking_session,
       s1.sql_id,
       s1.sql_child_number,
       s1.sql_exec_start,
       s2.event
  from v$session s1,
       v$session s2
 where s1.blocking_session = s2.sid;
```
This SQL query detects blocking situations in an Oracle database. It retrieves information from the `v$session` view about the sessions that are waiting and the sessions blocking them.

- `s1` and `s2` are aliases for the `v$session` view, representing two different sessions.
- `s1.username || '@' || s1.machine || ' (SID=' || s1.sid || ')'` and `s2.username || '@' || s2.machine || ' (SID=' || s2.sid || ')'` build identification strings for the sessions, including the username, machine name, and session identifier (SID).
- `s1.sql_id` and `s1.sql_child_number` are the SQL ID and child number of the query the blocked session (`s1`) is executing.
- `s1.sql_exec_start` is the start time of that query's execution.
- `s2.event` is the event the blocking session (`s2`) is itself currently waiting on.
- `where s1.blocking_session = s2.sid` is the query's filter condition: it matches each waiting session (`s1`) to the session that is blocking it (`s2`).
With the above SQL, we can identify the SID of the blocking session. Then we can use the following SQL to confirm the reason for the blocking:
```sql
select s.sid,
       s.username,
       s.event,
       q.sql_text
  from v$session s,
       v$sql q
 where s.sql_id = q.sql_id
   and s.sql_child_number = q.child_number
   and s.sid = <blocking_session_id>;
```
This SQL query finds detailed information about a specified session (`sid`), including its username, the SQL it is currently executing, its wait event, and other relevant information.

- `v$session s` and `v$sql q` are aliases for the `v$session` and `v$sql` views, holding session information and SQL statement information respectively.
- `q.sql_text` selects the text of the SQL statement the session is running.
- `s.sql_id = q.sql_id` and `s.sql_child_number = q.child_number` join the session to its SQL statement by matching the session's SQL ID and child number against those in `v$sql`.
- `where s.sid = <blocking_session_id>` is the query's filter condition; substitute the blocking SID found previously.
From the above query, I checked `s.event` and the SQL text, and identified that the issue was related to an update statement.
Now, I have two options:
(1) Terminate the blocking session with the `alter system kill session '<sid>,<serial#>' immediate` command, which will roll back its transaction.
(2) Continue investigating to find the root cause of the problem and make direct modifications to the problematic code.
Clearly, the second option is the right course of action: if we simply kill the session and run the job again, the same error will likely recur. So let's proceed with the second option and continue by querying `v$lock` to examine the lock information.
```sql
select l.sid,
       s.username,
       l.type,
       l.id1,
       l.id2,
       l.lmode,
       l.request
  from v$lock l,
       v$session s
 where l.sid = s.sid;
```
This SQL query examines lock information in an Oracle database, helping to identify which sessions are holding or requesting locks, along with related session information.

- `v$lock l` and `v$session s` are aliases for the `v$lock` and `v$session` views, representing lock information and session information respectively.
- `l.sid = s.sid` joins each lock entry to the session that holds or requests it.
- `l.sid`: the session ID of the session holding or requesting the lock. Each session has a unique session ID.
- `s.username`: the username of that session, identifying which user is performing operations in the database.
- `l.type`: the lock type. Different lock types protect different resources or objects, such as tables, rows, and transactions.
- `l.id1` and `l.id2`: these two columns work together as the lock's resource identifier; their specific meaning depends on the lock type. For a table (TM) lock, for example, `id1` is the object ID of the table.
- `l.lmode`: the mode in which the lock is held, such as row-exclusive, share, or exclusive.
- `l.request`: the mode in which the session is requesting the lock, or 0 if it is not waiting for one.
The query results are matched against the SIDs found earlier, and the relevant lock entries are as follows:

| TYPE | ID1 | ID2 | LMODE | REQUEST |
|---|---|---|---|---|
| AE | 100 | 0 | 4 | 0 |
| TM | 168385 | 0 | 3 | 0 |
| TX | 393225 | 221646 | 0 | 6 |
The above results indicate:

The first row (AE lock):
- `type` is "AE", an edition (Application Edition) lock, which every session holds on the current edition; it is unrelated to our blocking.
- `ID1` is 100, an identifier for the edition.
- `ID2` is 0.
- `LMODE` is 4, meaning the lock is held in share mode.
- `REQUEST` is 0, meaning no new lock is being requested.

The second row (TM lock):
- `type` is "TM", a table-level lock.
- `ID1` is 168385, the object ID of the locked table.
- `ID2` is 0, which is typical for table-level locks.
- `LMODE` is 3, meaning the table lock is held in row-exclusive mode, the normal mode for DML.
- `REQUEST` is 0, meaning no new lock is being requested.

The third row (TX lock):
- `type` is "TX", a transaction lock.
- `ID1` is 393225 and `ID2` is 221646; together they identify the transaction (undo segment, slot, and sequence number).
- `LMODE` is 0, meaning no TX lock is held here.
- `REQUEST` is 6, meaning the session is requesting the transaction lock in exclusive mode; in other words, it is waiting for a row lock held by another transaction.
Next, let's identify which row the waiting session (the TX request in the third entry) is trying to lock:
```sql
select do.object_name,
       s.row_wait_obj#,
       do.data_object_id,
       s.row_wait_file#,
       s.row_wait_block#,
       s.row_wait_row#,
       dbms_rowid.rowid_create(1,
                               do.data_object_id,
                               s.row_wait_file#,
                               s.row_wait_block#,
                               s.row_wait_row#) as waited_rowid
  from v$session s,
       dba_objects do
 where s.sid = <sessionId>
   and s.row_wait_obj# = do.object_id;
```
This SQL query finds information about the row-level lock a specific session (`sid`) is waiting for. Here are the meanings of the various parts of the query:

- `do.object_name`: the name of the data object (for example, a table) whose row the session is waiting on.
- `row_wait_obj#`: the object number of the object whose row lock the session is waiting for; it identifies the data object involved.
- `do.data_object_id`: the unique data-object identifier of that object.
- `row_wait_file#`, `row_wait_block#`, `row_wait_row#`: the exact location of the waited-on row: the file number, block number, and row number.
- `dbms_rowid.rowid_create(1, do.data_object_id, row_wait_file#, row_wait_block#, row_wait_row#)`: a function call that assembles these pieces into the ROWID (row identifier) of the row being waited on, so the row can be located precisely.
- `v$session s, dba_objects do`: the query retrieves information from the `v$session` view and the `dba_objects` view.
- `where sid = <sessionId>`: the query's filter condition, specifying the `sid` of the session to be examined.
- `and s.row_wait_obj# = do.object_id`: the query's join condition, matching the session's `row_wait_obj#` with the data object's `object_id` to retrieve the object whose row is being waited on.
Next, by plugging the obtained ROWID (AABK/5AAFAAAAHHAAA) into the following SQL query, we can identify the specific row that is causing the issue:
```sql
select *
  from business_table
 where rowid = 'AABK/5AAFAAAAHHAAA';
```
Please note that `business_table` here stands for the "business table" mentioned at the beginning; substitute the actual table name.
Solution
After pinpointing the problematic rows, I found that the real problem was duplicate data: the insert following the update was executing so quickly that two operations contended for the same row lock at the same time, producing the blockage. Consequently, I modified the original Java code to introduce a 50 ms delay before each update statement, which effectively resolved the problem.
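As a rough illustration of that fix (a minimal sketch only: `updateRow` is a hypothetical stand-in for the project's real UPDATE call, which is omitted here), the change amounts to pacing consecutive updates:

```java
import java.util.List;

public class ThrottledUpdates {
    // Counts how many updates ran; the real code would execute the UPDATE here.
    static int executed = 0;

    // Hypothetical stand-in for one UPDATE against the business table.
    static void updateRow(String key) {
        executed++;
    }

    public static void main(String[] args) throws InterruptedException {
        for (String key : List.of("row-1", "row-2")) {
            updateRow(key);
            // Pause 50 ms so consecutive operations do not contend for the same row lock
            Thread.sleep(50);
        }
        System.out.println("executed=" + executed);
    }
}
```

A delay like this is a pragmatic workaround rather than a guarantee; deduplicating the incoming data would remove the contention at its source.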
Wrapper Object Reference Issue
Bug Description
The issue of ==
and equals
with wrapper objects is a well-discussed topic, but I recently encountered a pitfall.
While completing a recent project, I faced a problem: in a tree-shaped dropdown menu, some menus had their second- and third-level submenus, while others that were supposed to have third-level submenus were missing them.
Troubleshooting
So, I delved into investigating how this bug occurred.
Starting with the frontend code, I found that the dropdown is rendered by the `el-tree` component of the Element-UI library. The binding logic there was straightforward, and there shouldn't have been any issues.
Therefore, I shifted my focus to the backend code. The serviceImpl
of the backend’s list/tree
method is a recursive tree lookup method:
```java
@Override
public List<CategoryEntity> listWithTree() {
    // 1. Fetch all categories
    List<CategoryEntity> entities = baseMapper.selectList(null);

    // 2. Keep the top-level categories and recursively attach their children
    List<CategoryEntity> level1Menus = entities.stream()
            .filter(item -> item.getParentCid() == 0)
            .map(menu -> {
                menu.setChildren(getChildren(menu, entities));
                return menu;
            })
            .collect(Collectors.toList());
    return level1Menus;
}

// Recursively find the subcategories of root among all categories
private List<CategoryEntity> getChildren(CategoryEntity root, List<CategoryEntity> all) {
    return all.stream()
            .filter(categoryEntity -> categoryEntity.getParentCid() == root.getCatId())
            .map(categoryEntity -> {
                categoryEntity.setChildren(getChildren(categoryEntity, all));
                return categoryEntity;
            })
            .collect(Collectors.toList());
}
```
The baseMapper.selectList(null)
is used to fetch all the data.
The `CategoryEntity` has several core attributes: `catId` is the current category's ID; `parentCid` is the ID of the parent category (for a top-level category, `parentCid` is 0); and `children` is a list used to store the subcategories of the current category, each represented by a `CategoryEntity`.
In the listWithTree
method, item.getParentCid() == 0
is used to identify all top-level categories. Then, it sets the children
for each of these categories. But how does it retrieve the subcategories for a particular category?
Next, the `getChildren` method is called with the current category and the full data set. It uses `categoryEntity.getParentCid() == root.getCatId()` to find all entries whose parent category points to the current category, then recurses in the same way until a category has no further children.
Initially, when studying the code logic, I didn’t find any issues. However, during breakpoint debugging, I noticed that the “E-book” category has its own children
, but why does the “sas” category have an empty children
array?
I then checked the database and confirmed that there are indeed child data entries pointing to the “sas” category. So, the question arose: why can some categories find their children while others cannot?
The breakpoint debugging information from the stream pipeline was somewhat limited, so I resorted to using log.info
to inspect each step. After patiently investigating, I finally discovered the root of the problem:
The filter `categoryEntity.getParentCid() == root.getCatId()` was producing an empty result (`[]`), which is strange. It means no matching object was found in the data, which shouldn't happen, because I had already verified the rows in the database.
This led me to ponder a crucial question: What does getParentCid()
and getCatId()
actually retrieve?
```java
public class CategoryEntity {
    private Long catId;      // current category's ID
    private Long parentCid;  // parent category's ID

    public Long getCatId() {
        return catId;
    }

    public Long getParentCid() {
        return parentCid;
    }
}
```
So, in the backend code, I noticed the use of Long
instead of long
. This realization led me to the heart of the issue:
For `Long` objects, values between -128 and 127 are served from a cache, meaning identical values in that range share the same object. Values outside this range get a new object each time. As a result, comparing with `==` can return `false` for values beyond the cache range even when their contents are equal.
This explains why some categories were displayed correctly while others were not. The ones that were displayed correctly hadn’t exceeded the cache range for ID values (from -128 to 127).
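The caching behavior described above can be demonstrated in a few lines (the exact sharing behavior outside the -128..127 range is JVM-dependent; the cached range itself is guaranteed by the language specification):

```java
public class LongCacheDemo {
    public static void main(String[] args) {
        // Autoboxing goes through Long.valueOf, which caches -128..127
        Long a = 127L;
        Long b = 127L;
        System.out.println(a == b);      // true: both refer to the same cached object

        Long c = 128L;
        Long d = 128L;
        System.out.println(c == d);      // false on standard JVMs: two distinct objects
        System.out.println(c.equals(d)); // true: equals compares the values
    }
}
```

This is exactly why categories with small IDs rendered correctly while those with larger IDs lost their children.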
Solution
- Once the problem is understood, the solution is relatively straightforward: replace `==` with `equals` in the following code:
```java
// Before: reference comparison, unreliable outside the Long cache range
.filter(categoryEntity -> categoryEntity.getParentCid() == root.getCatId())

// After: value comparison
.filter(categoryEntity -> categoryEntity.getParentCid().equals(root.getCatId()))
```
EasyPoi Null Pointer When Parsing Excel Pictures
Bug Description
A very interesting bug that kept me stuck for a whole day.
There are two Excel files that are almost exactly the same. When I import and parse them, one file triggers a null pointer exception, while the other is parsed and imported normally.
The business logic here is:
- The business user first downloads the Excel template
- The user fills in the information to import according to the template; one of the columns holds image information
- The user pastes the pictures into Excel
- Finally, the user clicks Import; the backend parses the Excel file and imports the information into the database
The business user hit this bug in this process, with two almost identical Excel files whose text information is exactly the same; the only difference is the pictures.
Moreover, even when the picture from the file that imports normally is pasted into the problematic file, the null pointer exception is still thrown, which is very strange.
Troubleshooting
At first I thought that the picture might cover other cells, so I tried to stretch the picture, but it didn’t work.
Then I carefully studied the backend code and found that this part is parsed by calling the `importExcelMore` method of EasyPoi's `ExcelImportUtil` class. Drilling down further, the problem occurs, more precisely, in the `getSheetPictures07` method of EasyPoi's `PoiPublicUtil` class.
To be exact, the error is thrown when executing the code shown below:
```java
// Inside getSheetPictures07 (abridged): iterate the sheet's drawing shapes
for (XSSFShape shape : drawing.getShapes()) {
    XSSFPicture pic = (XSSFPicture) shape;
    // getPreferredSize() needs both cell1 and cell2 of the picture's anchor
    XSSFClientAnchor anchor = pic.getPreferredSize();
    CTMarker marker = anchor.getFrom();
    String picIndex = marker.getRow() + "_" + marker.getCol();
    sheetIndexPicMap.put(picIndex, pic.getPictureData());
}
```
Breakpoint debugging shows that this code needs the anchor's `cell1` and `cell2` attributes, and here the `cell2` attribute is null, hence the null pointer exception.
Knowing where the error comes from, the next question is: why is `cell2` null, and where is it constructed?
After a long search, I finally found that it is constructed in the `getAnchorFromParent` method of the POI library's `XSSFDrawing` class.
The method looks like this:
```java
private XSSFAnchor getAnchorFromParent(XmlObject obj) {
    XSSFAnchor anchor = null;

    XmlObject parentXbean = null;
    XmlCursor cursor = obj.newCursor();
    if (cursor.toParent()) {
        parentXbean = cursor.getObject();
    }
    cursor.dispose();

    if (parentXbean != null) {
        if (parentXbean instanceof CTTwoCellAnchor) {
            // "Move and size with cells": the anchor gets both cell1 (from) and cell2 (to)
            CTTwoCellAnchor ct = (CTTwoCellAnchor) parentXbean;
            anchor = new XSSFClientAnchor(/* sheet */ ..., ct.getFrom(), ct.getTo());
        } else if (parentXbean instanceof CTOneCellAnchor) {
            // "Move but don't size with cells": only cell1 (from) is set
            CTOneCellAnchor ct = (CTOneCellAnchor) parentXbean;
            anchor = new XSSFClientAnchor(/* sheet */ ..., ct.getFrom(), null);
        }
    }
    return anchor;
}
```
There is no need to fully understand the code; just look at the middle: depending on the type of `parentXbean`, the anchor is created with or without its `cell2` property.
Breakpoint debugging shows that a problematic Excel file takes the branch `parentXbean instanceof CTOneCellAnchor`, while a normal Excel file takes the branch `parentXbean instanceof CTTwoCellAnchor`.
So the difference lies in `CTTwoCellAnchor` versus `CTOneCellAnchor`.
What are these two classes? Apparently something to do with pictures; thanks to Greg Woolsey for explaining it.
In short, `CTTwoCellAnchor` means the picture moves and sizes with cells, while `CTOneCellAnchor` means it moves but doesn't size with cells.
This corresponds directly to the picture's placement properties in Excel.
Solution
- At this point the problem is solved: we only need to change the picture properties in Excel to "Move and size with cells". As for the deeper root cause, the project schedule was tight and I haven't had time to dig into it; friends who understand it are welcome to leave a message in the comments.
JSON Default Serialization Behavior
Bug Description
A very normal operation: I call a backend method that returns a List result set to the frontend, which receives it as a list of JSON objects.
The data I’m going to return has these columns:
- DESCR - Department name
- userId - ID
- name - name
- jobType - job type
Among them, userId, name, and jobType are displayed normally on the frontend, but the DESCR column shows no data:

| DESCR | userId | name | jobType |
|---|---|---|---|
|  | 1 | a | A |
|  | 2 | b | B |
|  | 3 | c | C |
On the backend, breakpoint debugging showed that the data was passed normally: the DESCR column does have a value, so I didn't know why it was not displayed on the frontend.
Troubleshooting
Breakpoint debugging on the frontend shows that the JSON data received looks like this (values elided):

```json
[
  { "descr": "...", "userId": 1, "name": "a", "jobType": "A" },
  { "descr": "...", "userId": 2, "name": "b", "jobType": "B" },
  { "descr": "...", "userId": 3, "name": "c", "jobType": "C" }
]
```
This is strange: why is the DESCR column name lowercased in the JSON?
It turned out that not only DESCR was affected; in fact, the leading capital letters of every column name were being lowercased. The other columns were already lowercase, so the change was simply invisible for them.
To understand the reason, we need the principle behind it: all JSON handling in Spring goes through HttpMessageConverter, a message conversion tool that mainly implements two functions:
- Serialize the object returned by the server into a JSON string
- Deserialize the JSON string sent from the front end into a Java object
The relevant dependencies are imported automatically by the Spring Boot starter packages; Spring MVC auto-configures HttpMessageConverters for Jackson and Gson.
By default, our company's project uses Jackson to handle JSON. During serialization, Jackson derives property names from the getters and lowercases the leading capital letters, so a getter like `getDESCR()` produces the property name `descr`.
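To make the naming rule concrete, here is a small sketch that imitates it (this is a simplified stand-in, not Jackson's actual source; `mangle` is a hypothetical helper that lowercases the run of capital letters at the start of a getter-derived name):

```java
public class PropertyNameDemo {
    // Simplified rule: lowercase each leading uppercase letter,
    // stopping at the first lowercase letter.
    static String mangle(String nameWithoutGet) {
        StringBuilder sb = new StringBuilder(nameWithoutGet);
        for (int i = 0; i < sb.length() && Character.isUpperCase(sb.charAt(i)); i++) {
            sb.setCharAt(i, Character.toLowerCase(sb.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mangle("DESCR"));  // descr  (whole name is a capital run)
        System.out.println(mangle("UserId")); // userId (only the first letter changes)
        System.out.println(mangle("Name"));   // name
    }
}
```

This is why only the all-caps DESCR column visibly changed: its entire name is one leading capital run.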
Solution
- Add the `@JsonProperty("DESCR")` annotation to properties whose names must preserve their case
- Or annotate the class with `@JsonAutoDetect(getterVisibility = JsonAutoDetect.Visibility.NONE)` so Jackson stops deriving property names from the getters
BatchUpdateException: ORA-12899
Oracle SQL table design problem
Bug Description
The product manager asked me to look into a bug that occurred in an unknown environment (I didn't ask whether it was development, test, or production). The frontend failed to upload an Excel file to update data, prompting "Database exception, please contact the administrator".
Troubleshooting
Obviously, this is usually an error reported by the backend, so I reproduced it following the same steps and looked directly at the output in IDEA, which showed:

Exception in thread "main" java.sql.BatchUpdateException: ORA-12899: value too large for column "xx" (actual: 208, maximum: 200)
Taken literally, this error means some data exceeds the byte limit of a column. So, using the "frontend code mapping backend code" method from our Project Record - Enterprise Requirements blog, I found the corresponding backend code, then the corresponding entity class, and from it the data table it maps to. This directly located the problem table in the company's huge database.
After locating the problem table, clicking "Design Table" in Navicat shows the byte limit of each field. Matching those limits against the contents of the failing Excel file, I easily found the offending column with the 200-byte limit. The strange thing is that if a Chinese character occupies 2 bytes, the data in this column should not, by my count, reach 208 bytes.
Then query:
```sql
select userenv('language') from dual;
```
The result is `AMERICAN_AMERICA.AL32UTF8`.
This shows that our Oracle database uses `AL32UTF8` as its character set, a UTF-8 encoding in which a common Chinese character occupies three bytes. To verify this guess, query:
```sql
select lengthb('xx') from dual;
```
Here 'xx' stands for the over-limit data content. The answer is 208 bytes, which proves the guess correct.
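The three-bytes-per-character point can also be checked outside the database, for example in Java (the strings here are illustrative placeholders, not the actual column data):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        String zh = "数据";  // two common Chinese characters
        String ascii = "ab"; // two ASCII characters

        // In UTF-8 (the encoding behind AL32UTF8), common Chinese characters
        // take 3 bytes each, while ASCII characters take 1 byte each.
        System.out.println(zh.getBytes(StandardCharsets.UTF_8).length);    // 6
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length); // 2
    }
}
```

So a 100-character Chinese string already needs 300 bytes in this column, comfortably exceeding a 200-byte limit.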
Solution
- Change the database character set to one where a Chinese character occupies two bytes, such as `ZHS16GBK` (i.e. `SIMPLIFIED CHINESE_CHINA.ZHS16GBK`); but obviously this may affect other tables
- Don't allow the corresponding Excel data column to carry too much content
- Expand the byte limit of the field, or declare it with character-length semantics (e.g. `VARCHAR2(200 CHAR)`)